1
Karakaya O, Kilimci ZH. An efficient consolidation of word embedding and deep learning techniques for classifying anticancer peptides: FastText+BiLSTM. PeerJ Comput Sci 2024; 10:e1831. PMID: 38435607; PMCID: PMC10909209; DOI: 10.7717/peerj-cs.1831.
Abstract
Anticancer peptides (ACPs) are a group of peptides that exhibit antineoplastic properties. The utilization of ACPs in cancer prevention can present a viable substitute for conventional cancer therapeutics, as they possess a higher degree of selectivity and safety. Recent scientific advancements have generated interest in peptide-based therapies, which offer the advantage of efficiently treating intended cells without negatively impacting normal cells. However, as the number of peptide sequences continues to increase rapidly, developing a reliable and precise prediction model becomes a challenging task. In this work, our motivation is to advance an efficient model for categorizing anticancer peptides by consolidating word embedding and deep learning models. First, Word2Vec, GloVe, FastText, and one-hot encoding are evaluated as embedding techniques for representing peptide sequences. Then, the outputs of the embedding models are fed into the deep learning approaches CNN, LSTM, and BiLSTM. To demonstrate the contribution of the proposed framework, extensive experiments are carried out on two widely used datasets from the literature, ACPs250 and Independent. Experimental results show that the proposed model enhances classification accuracy compared to state-of-the-art studies. The proposed combination, FastText+BiLSTM, achieves 92.50% accuracy on the ACPs250 dataset and 96.15% on the Independent dataset, thereby establishing a new state of the art.
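As a minimal illustration of one of the representation schemes compared in this abstract, a one-hot encoding of a peptide sequence over the 20 standard amino acids might look like the following sketch (the alphabet ordering and function name are illustrative choices, not taken from the paper):

```python
# One-hot encode a peptide sequence over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(peptide):
    """Return a list of 20-dimensional one-hot vectors, one per residue."""
    vectors = []
    for residue in peptide.upper():
        vec = [0] * len(AMINO_ACIDS)
        vec[AA_INDEX[residue]] = 1  # single 1 at the residue's alphabet index
        vectors.append(vec)
    return vectors

encoded = one_hot_encode("ACDK")  # 4 residues -> 4 one-hot vectors
```

In a full pipeline, such per-residue vectors (or learned FastText/Word2Vec vectors) would be stacked into a sequence and passed to a CNN or (Bi)LSTM classifier.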
Affiliation(s)
- Onur Karakaya
- Research and Development Inc., Turkcell Technology, İstanbul, Turkey
- Zeynep Hilal Kilimci
- Department of Information Systems Engineering, Kocaeli University, Kocaeli, Turkey
2
Di Natale A, Garcia D. LEXpander: Applying colexification networks to automated lexicon expansion. Behav Res Methods 2024; 56:952-967. PMID: 36897503; PMCID: PMC10000354; DOI: 10.3758/s13428-023-02063-y.
Abstract
Recent approaches to text analysis from social media and other corpora rely on word lists to detect topics, measure meaning, or select relevant documents. These lists are often generated by applying computational lexicon expansion methods to small, manually curated sets of seed words. Despite the wide use of this approach, we still lack an exhaustive comparative analysis of the performance of lexicon expansion methods and how they can be improved with additional linguistic data. In this work, we present LEXpander, a method for lexicon expansion that leverages novel data on colexification, i.e., semantic networks connecting words with multiple meanings according to shared senses. We evaluate LEXpander in a benchmark including widely used methods for lexicon expansion based on word embedding models and synonym networks. We find that LEXpander outperforms existing approaches in terms of both precision and the trade-off between precision and recall of generated word lists in a variety of tests. Our benchmark includes several linguistic categories, such as words relating to the financial domain or to the concept of friendship, and sentiment variables in English and German. We also show that the expanded word lists constitute a high-performing text analysis method when applied to various English corpora. In this way, LEXpander offers a systematic, automated solution for expanding short lists of words into exhaustive and accurate word lists that closely approximate word lists generated by experts in psychology and linguistics.
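The core mechanic of network-based lexicon expansion — walking outward from seed words along the edges of a semantic network and collecting neighbors — can be sketched in a few lines. The toy graph and function names below are illustrative only; LEXpander's actual colexification networks are far larger:

```python
from collections import deque

def expand_lexicon(seed_words, network, max_hops=1):
    """Breadth-first expansion of a seed word list over a word network.

    network: dict mapping each word to a set of linked words
    (e.g. colexification or synonymy edges).
    """
    expanded = set(seed_words)
    frontier = deque((w, 0) for w in seed_words)
    while frontier:
        word, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not walk past the hop budget
        for neighbor in network.get(word, ()):
            if neighbor not in expanded:
                expanded.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return expanded

toy_network = {
    "money": {"cash", "currency"},
    "cash": {"money", "coin"},
}
result = expand_lexicon(["money"], toy_network, max_hops=2)
```

With `max_hops=2` the toy example yields `{"money", "cash", "currency", "coin"}`; the benchmark in the paper compares such network walks against embedding-based nearest-neighbor expansion.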
Affiliation(s)
- Anna Di Natale
- Institute of Interactive Systems and Data Science, Graz University of Technology, Inffeldgasse 16c/I, Graz, 8010, Austria.
- Section for Science of Complex Systems, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria.
- Complexity Science Hub Vienna, Josefstädter Straße 39, 1080, Vienna, Austria.
- David Garcia
- Institute of Interactive Systems and Data Science, Graz University of Technology, Inffeldgasse 16c/I, Graz, 8010, Austria
- Section for Science of Complex Systems, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria
- Complexity Science Hub Vienna, Josefstädter Straße 39, 1080, Vienna, Austria
- Department of Politics and Public Administration, University of Konstanz, Universitätsstraße 10, 78464, Konstanz, Germany
3
Koltcov S, Surkov A, Filippov V, Ignatenko V. Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics. PeerJ Comput Sci 2024; 10:e1758. PMID: 38196953; PMCID: PMC10773852; DOI: 10.7717/peerj-cs.1758.
Abstract
Topic modeling is a widely used instrument for the analysis of large text collections. In the last few years, neural topic models and models with word embeddings have been proposed to increase the quality of topic solutions. However, these models have not been extensively tested in terms of stability and interpretability. Moreover, selecting the number of topics (a model parameter) remains a challenging task. We aim to partially fill this gap by testing four well-known topic models that are available to a wide range of users: the embedded topic model (ETM), the Gaussian Softmax distribution model (GSM), Wasserstein autoencoders with a Dirichlet prior (W-LDA), and Wasserstein autoencoders with a Gaussian mixture prior (WTM-GMM). We demonstrate that W-LDA, WTM-GMM, and GSM possess poor stability, which complicates their application in practice. The ETM with additionally trained embeddings demonstrates high coherence and rather good stability for large datasets, but the question of the number of topics remains unsolved for this model. We also propose a new topic model based on granulated sampling with word embeddings (GLDAW), which demonstrates the highest stability and good coherence compared to the other models considered. Moreover, the optimal number of topics in a dataset can be determined for this model.
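One common way to quantify the topic stability discussed above is to match each topic from one run to its best counterpart in another run and average the overlap of their top-word lists. This sketch uses Jaccard similarity over top words; it is a generic stability measure, not the specific metric from the paper:

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topic_stability(run1, run2):
    """Mean best-match Jaccard similarity between two topic solutions.

    run1, run2: lists of top-word lists, one list per topic.
    For each topic in run1, find its closest topic in run2 and average.
    """
    scores = [max(jaccard(topic, other) for other in run2) for topic in run1]
    return sum(scores) / len(scores)

run_a = [["price", "market", "stock"], ["game", "team", "score"]]
run_b = [["team", "game", "coach"], ["market", "price", "trade"]]
stability = topic_stability(run_a, run_b)
```

A stability near 1.0 means the two runs recovered essentially the same topics; unstable models (as reported for W-LDA, WTM-GMM, and GSM) score much lower across re-runs.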
Affiliation(s)
- Sergei Koltcov
- Laboratory for Social and Cognitive Informatics, National Research University Higher School of Economics, Saint-Petersburg, Russia
- Anton Surkov
- Laboratory for Social and Cognitive Informatics, National Research University Higher School of Economics, Saint-Petersburg, Russia
- Vladimir Filippov
- Scientific Research Institute for Optoelectronic Instrument Engineering, Sosnovy Bor, Leningrad Region, Russia
- Vera Ignatenko
- Laboratory for Social and Cognitive Informatics, National Research University Higher School of Economics, Saint-Petersburg, Russia
4
Ash E, Stammbach D, Tobia K. What is (and was) a person? Evidence on historical mind perceptions from natural language. Cognition 2023; 239:105501. PMID: 37480835; DOI: 10.1016/j.cognition.2023.105501.
Abstract
An important philosophical tradition identifies persons as those entities that have minds, such that mind perception is a window into person perception. Psychological research has found that human perceptions of mind consist of at least two distinct dimensions: agency (e.g. planning, deciding) and experience (e.g. feeling, hungering). Taking this insight into the semantic space of natural language, we develop a generalizable, scalable computational-linguistics method for measuring variation in perceived agency and experience in large archives of plain-text documents. The resulting text-based rankings of entities along these dimensions correspond to human judgments of perceived agency and experience assessed in blind surveys. We then map both dimensions of mind in historical English-language corpora over the last 200 years and identify two salient trends. First, we find that while women are now described as having similar levels of agency as men, they are still described as more experience-oriented. Second, we find that domesticated animals have gained higher attributions of experience (but not agency) relative to wild animals, especially since the rise of the global animal rights movement in the 1980s.
Affiliation(s)
- Kevin Tobia
- Georgetown University, United States of America.
5
Sepahpour-Fard M, Quayle M, Schuld M, Yasseri T. Using word embeddings to analyse audience effects and individual differences in parenting Subreddits. EPJ Data Sci 2023; 12:38. PMID: 37745193; PMCID: PMC10511593; DOI: 10.1140/epjds/s13688-023-00412-7.
Abstract
This paper explores how individuals' language use in gender-specific groups ("mothers" and "fathers") compares to their interactions when addressed as "parents." Language adaptation based on the audience is well documented, yet large-scale studies of naturally occurring audience effects are rare. To address this, we investigate audience and gender effects in the context of parenting, where gender plays a significant role. We focus on interactions within Reddit, particularly in the parenting Subreddits r/Daddit, r/Mommit, and r/Parenting, which cater to distinct audiences. By analyzing user posts with word embeddings, we measure similarities between user-tokens and word-tokens, also considering differences between high and low self-monitors. Results reveal that in mixed-gender contexts, mothers and fathers behave similarly in discussing a wide range of topics, although fathers place more emphasis on education and family advice. Single-gender Subreddits see more focused discussions: mothers in r/Mommit distinguish themselves by discussing medical care, sleep, potty training, and food. In terms of individual differences, we found that, especially on r/Parenting, high self-monitors tend to conform more to the norms of the Subreddit by discussing more of the topics associated with it.
Affiliation(s)
- Melody Sepahpour-Fard
- Science Foundation Ireland Centre for Research Training in Foundations of Data Science, Limerick, Ireland
- Department of Mathematics and Statistics, University of Limerick, Castletroy, Limerick, Ireland
- Michael Quayle
- Centre for Social Issues Research, University of Limerick, Castletroy, Limerick, Ireland
- Department of Psychology, University of Limerick, Castletroy, Limerick, Ireland
- Department of Psychology, School of Applied Human Sciences, University of KwaZulu-Natal, Durban, KwaZulu-Natal South Africa
- Maria Schuld
- Department of Psychology, University of Johannesburg, Johannesburg, South Africa
- Taha Yasseri
- School of Sociology, University College Dublin, Dublin, Ireland
- Geary Institute for Public Policy, University College Dublin, Dublin, Ireland
6
Campillos-Llanos L. MedLexSp - a medical lexicon for Spanish medical natural language processing. J Biomed Semantics 2023; 14:2. PMID: 36732862; PMCID: PMC9892682; DOI: 10.1186/s13326-022-00281-5.
Abstract
BACKGROUND Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, together with linguistic data for part-of-speech (PoS) tagging, lemmatization, or natural language generation. To date, there has been no such resource for Spanish. CONSTRUCTION AND CONTENT This article describes MedLexSp, a unified medical lexicon for natural language processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System (UMLS) semantic types, groups, and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, Online Mendelian Inheritance in Man (OMIM), and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp: first, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials; second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved the scores for PoS tagging and lemmatization compared to the default Spacy and Stanza Python libraries.
CONCLUSIONS The lexicon is distributed in a delimiter-separated value file; an XML file with the Lexical Markup Framework; a lemmatizer module for the Spacy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the Spacy and Stanza lemmatizers enriched with medical terms are provided in a public repository.
Affiliation(s)
- Leonardo Campillos-Llanos
- Instituto de Lengua, Literatura y Antropología (ILLA), CSIC (Spanish National Research Council), Albasanz 26-28, 28037, Madrid, Spain.
7
Botarleanu RM, Dascalu M, Watanabe M, Crossley SA, McNamara DS. Age of Exposure 2.0: Estimating word complexity using iterative models of word embeddings. Behav Res Methods 2022; 54:3015-3042. PMID: 35167112; DOI: 10.3758/s13428-022-01797-5.
Abstract
Age of acquisition (AoA) is a measure of word complexity which refers to the age at which a word is typically learned. AoA measures have shown strong correlations with reading comprehension, lexical decision times, and writing quality. AoA scores based on both adult and child data have limitations that allow for error in measurement, and increase the cost and effort to produce. In this paper, we introduce Age of Exposure (AoE) version 2, a proxy for human exposure to new vocabulary terms that expands AoA word lists through training regressors to predict AoA scores. Word2vec word embeddings are trained on cumulatively increasing corpora of texts, word exposure trajectories are generated by aligning the word2vec vector spaces, and features of words are derived for modeling AoA scores. Our prediction models achieve low errors (from 13% with a corresponding R2 of .35 up to 7% with an R2 of .74), can be uniformly applied to different AoA word lists, and generalize to the entire vocabulary of a language. Our method benefits from using existing readability indices to define the order of texts in the corpora, while the performed analyses confirm that the generated AoA scores accurately predicted the difficulty of texts (R2 of .84, surpassing related previous work). Further, we provide evidence of the internal reliability of our word trajectory features, demonstrate the effectiveness of the word trajectory features when contrasted with simple lexical features, and show that the exclusion of features that rely on external resources does not significantly impact performance.
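The cumulative-corpus setup behind the word exposure trajectories above can be conveyed with plain relative frequencies: texts are ordered from easiest to hardest (e.g. by a readability index), and a word's statistics are recomputed after each increment. The paper instead tracks aligned word2vec vectors across the growing corpora; this stdlib sketch only illustrates the cumulative bookkeeping:

```python
from collections import Counter

def exposure_trajectory(word, ordered_texts):
    """Relative frequency of `word` in cumulatively growing corpora.

    ordered_texts: texts sorted from easiest to hardest, mimicking
    the order in which a reader would plausibly encounter them.
    """
    counts = Counter()
    total = 0
    trajectory = []
    for text in ordered_texts:
        tokens = text.lower().split()
        counts.update(tokens)       # corpus grows cumulatively
        total += len(tokens)
        trajectory.append(counts[word] / total)
    return trajectory

traj = exposure_trajectory(
    "dog", ["the dog ran", "a dog and a cat", "quantum field theory"]
)
```

Features derived from such per-word trajectories (here a declining frequency; in the paper, shifts in aligned embedding space) are then fed to a regressor that predicts AoA scores.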
Affiliation(s)
- Mihai Dascalu
- University Politehnica of Bucharest, Bucharest, Romania.
- Academy of Romanian Scientists, Bucharest, Romania.
8
Hörberg T, Larsson M, Olofsson JK. The Semantic Organization of the English Odor Vocabulary. Cogn Sci 2022; 46:e13205. PMID: 36334010; DOI: 10.1111/cogs.13205.
Abstract
The vocabulary for describing odors in English natural language is not well understood, as prior studies of odor descriptions have often relied on preselected descriptors and odor ratings. Here, we present a data-driven approach that automatically identifies English odor descriptors based on their degree of olfactory association, and derive their semantic organization from their distributions in natural texts, using a distributional-semantic language model. We identify 243 descriptors that are much more strongly associated with olfaction than English words in general. We then derive the semantic organization of these olfactory descriptors, and find that it is captured by four clusters that we name Offensive, Malodorous, Fragrant, and Edible. The semantic space derived from our model primarily differentiates descriptors in terms of pleasantness and edibility along which our four clusters are positioned, and is similar to a space derived from perceptual data. The semantic organization of odor vocabulary can thus be mapped using natural language data (e.g., online text), without the limitations of odor-perceptual data and preselected descriptors. Our method may thus facilitate research on olfaction, a sensory system known to often elude verbal description.
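A toy version of scoring words by their degree of association with an anchor domain — here pointwise mutual information with a single olfactory anchor term over sentence-level co-occurrence — could look like the following. The corpus, anchor choice, and function name are illustrative; the paper uses a full distributional-semantic language model rather than raw PMI:

```python
import math
from collections import Counter

def pmi_with_anchor(corpus_sentences, anchor):
    """PMI(word, anchor) from sentence-level co-occurrence counts."""
    word_counts = Counter()
    co_counts = Counter()
    n = len(corpus_sentences)
    for sentence in corpus_sentences:
        tokens = set(sentence.lower().split())
        word_counts.update(tokens)
        if anchor in tokens:
            co_counts.update(tokens - {anchor})
    scores = {}
    for word, co in co_counts.items():
        p_joint = co / n
        p_word = word_counts[word] / n
        p_anchor = word_counts[anchor] / n
        scores[word] = math.log2(p_joint / (p_word * p_anchor))
    return scores

sentences = ["the smell was fragrant", "a fragrant smell", "the car was fast"]
scores = pmi_with_anchor(sentences, "smell")
```

Words with high association scores against olfactory anchors would be kept as candidate odor descriptors, then clustered by their distributional similarity.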
9
Jiang H, Frank MC, Kulkarni V, Fourtassi A. Exploring Patterns of Stability and Change in Caregivers' Word Usage Across Early Childhood. Cogn Sci 2022; 46:e13177. PMID: 35820173; DOI: 10.1111/cogs.13177.
Abstract
The linguistic input children receive across early childhood plays a crucial role in shaping their knowledge about the world. To study this input, researchers have begun applying distributional semantic models to large corpora of child-directed speech, extracting various patterns of word use and co-occurrence. However, previous work using these models has not measured how these patterns may change throughout development. In this work, we leverage natural language processing methods, originally developed to study historical language change, to compare caregivers' use of words when talking to younger versus older children. Some words' usage changed more than others; this variability could be predicted based on the word's properties at both the individual and category levels. These findings suggest that caregivers' changing patterns of word use may play a role in scaffolding children's acquisition of conceptual structure in early development.
Affiliation(s)
- Hang Jiang
- Symbolic Systems Program, Stanford University
10
Goldberg DM. Characterizing accident narratives with word embeddings: Improving accuracy, richness, and generalizability. J Safety Res 2022; 80:441-455. PMID: 35249625; DOI: 10.1016/j.jsr.2021.12.024.
Abstract
INTRODUCTION Ensuring occupational health and safety is an enormous concern for organizations, as accidents not only harm workers but also result in financial losses. Analysis of accident data has the potential to reveal insights that may improve capabilities to mitigate future accidents. However, because accident data are often transcribed textually, analyzing these narratives proves difficult. This study contributes to a recent stream of literature utilizing machine learning to automatically label accident narratives, converting them into more easily analyzable fields. METHOD First, a large dataset of accident narratives in which workers were injured is collected from the U.S. Occupational Safety and Health Administration (OSHA). Text mining based on word embeddings is implemented; compared to past work, this methodology offers excellent performance. Second, to improve the richness of analyses, each record is assessed across five dimensions: the machine learning models classify the body part(s) injured, the source of the injury, the type of event causing the injury, whether a hospitalization occurred, and whether an amputation occurred. Finally, demonstrating generalizability, the trained models are deployed to analyze two additional datasets of accident narratives in the construction industry and the mining and metals industry (transfer learning). PRACTICAL APPLICATIONS These contributions improve organizations' capacities to rapidly analyze textual accident narratives.
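As a minimal stand-in for the kind of narrative classifier described above — the paper uses word-embedding-based models, whereas this stdlib sketch substitutes a multinomial naive Bayes bag-of-words model, and the tiny training narratives are invented — automatic labeling of injury narratives might be prototyped like this:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        best_label, best_score = None, float("-inf")
        n_docs = sum(self.label_counts.values())
        for label in self.label_counts:
            score = math.log(self.label_counts[label] / n_docs)  # prior
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in text.lower().split():
                # add-one smoothed likelihood of each token
                score += math.log((self.word_counts[label][token] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayes().fit(
    ["worker fell from ladder", "fall from scaffold roof",
     "hand caught in press machine"],
    ["fall", "fall", "caught-in"],
)
```

An embedding-based model, as in the paper, would replace the bag-of-words likelihoods with dense vector features but keep the same label-the-narrative framing.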
Affiliation(s)
- David M Goldberg
- San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, United States.
11
Flamholz ZN, Crane-Droesch A, Ungar LH, Weissman GE. Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information. J Biomed Inform 2022; 125:103971. PMID: 34920127; PMCID: PMC8766939; DOI: 10.1016/j.jbi.2021.103971.
Abstract
OBJECTIVE Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings. MATERIALS AND METHODS We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested embeddings in six clinically relevant tasks, including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively. RESULTS Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45) while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%) de-identified notes. Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS). DISCUSSION Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks, so the optimal training corpus depends on the intended use. CONCLUSION Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
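The scaled Brier score used as the evaluation metric above compares a model's Brier score against a reference model that always predicts the event's base rate, so 0 means "no better than the base rate" and 1 means perfect. A small sketch, following the standard definition rather than any code from the paper:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def scaled_brier_score(probs, outcomes):
    """SBS = 1 - Brier / Brier_ref, reference always predicts the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    return 1 - brier_score(probs, outcomes) / reference

# Well-calibrated predictions on a balanced toy outcome vector.
sbs = scaled_brier_score([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

On the toy example the Brier score is 0.025 against a base-rate reference of 0.25, giving an SBS of 0.9; the mortality results above (SBS 0.12 to 0.30) are on this same 0-to-1 scale.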
Affiliation(s)
- Zachary N. Flamholz
- Medical Scientist Training Program, Albert Einstein College of Medicine, Bronx, New York, USA
- Andrew Crane-Droesch
- Penn Medicine Predictive Healthcare, University of Pennsylvania Health System, Philadelphia, Pennsylvania, USA
- Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Lyle H. Ungar
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Gary E. Weissman
- Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Leonard Davis Institute of Health Economics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Pulmonary, Allergy, and Critical Care Division, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
12
López-Úbeda P, Díaz-Galiano MC, Ureña-López LA, Martín-Valdivia MT. Combining word embeddings to extract chemical and drug entities in biomedical literature. BMC Bioinformatics 2021; 22:599. PMID: 34920708; PMCID: PMC8684055; DOI: 10.1186/s12859-021-04188-3.
Abstract
BACKGROUND Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. METHODS In this paper we evaluate two important tasks in NLP: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve the results obtained in the PharmaCoNER challenge. RESULTS For the NER task we present a neural network composed of a BiLSTM with a sequential CRF layer, where different word embeddings are combined as input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. In the supervised model, we use the training set to find previously seen concepts, while the unsupervised model is based on a six-step architecture that uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. CONCLUSION On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature; we achieved 91.41% precision, 90.14% recall, and a 90.77% F1-score using micro-averaging. On the other hand, our indexing system achieves a 92.67% F1-score, with 92.44% recall and 92.91% precision. These results would place us first in the final ranking of the challenge.
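The Levenshtein distance at the core of the unsupervised indexing step above — matching a recognized mention to the closest dictionary synonym and taking that synonym's code — is a standard dynamic program. This is a generic sketch, not the authors' code; the synonym dictionary and placeholder codes are invented for illustration:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest_synonym_code(mention, synonym_to_code):
    """Assign the code whose synonym is nearest in edit distance."""
    best = min(synonym_to_code, key=lambda s: levenshtein(mention, s))
    return synonym_to_code[best]

code = closest_synonym_code(
    "paracetamol", {"paracetamol": "code-1", "ibuprofen": "code-2"}
)
```

In practice the dictionary lookup is only one step of the six-step pipeline, applied after exact and normalized matches have failed.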
Affiliation(s)
- Pilar López-Úbeda
- Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain.
- Manuel Carlos Díaz-Galiano
- Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
- L Alfonso Ureña-López
- Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
- M Teresa Martín-Valdivia
- Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
13
Balakrishnan V, Shi Z, Law CL, Lim R, Teh LL, Fan Y. A deep learning approach in predicting products' sentiment ratings: a comparative analysis. J Supercomput 2021; 78:7206-7226. PMID: 34754140; PMCID: PMC8569508; DOI: 10.1007/s11227-021-04169-6.
Abstract
We present a benchmark comparison of several deep learning models, including Convolutional Neural Networks, Recurrent Neural Networks, and Bi-directional Long Short-Term Memory, assessed with various word embedding approaches, including Bi-directional Encoder Representations from Transformers (BERT) and its variants, FastText, and Word2Vec. Data augmentation was performed using the Easy Data Augmentation approach, resulting in two datasets (original versus augmented). All the models were assessed in two setups: 5-class versus 3-class (i.e., a compressed version). Findings show that the best prediction models were neural network-based models using Word2Vec, with CNN-RNN-Bi-LSTM producing the highest accuracy (96%) and F-score (91.1%). Individually, RNN was the best model, with an accuracy of 87.5% and an F-score of 83.5%, while RoBERTa had the best F-score of 73.1%. The study shows that deep learning is better than classical supervised machine learning for analyzing sentiment in text, and provides a direction for future work and research.
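Two of the four Easy Data Augmentation operations referenced above — random swap and random deletion — are simple to sketch with the standard library (the other two, synonym replacement and random insertion, additionally require a thesaurus). This is a generic illustration of EDA, not the study's code:

```python
import random

def random_swap(tokens, n_swaps, rng):
    """Swap the tokens at two random positions, n_swaps times."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        i = rng.randrange(len(tokens))
        j = rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p_delete, rng):
    """Drop each token with probability p_delete, keeping at least one."""
    kept = [t for t in tokens if rng.random() >= p_delete]
    return kept or [rng.choice(list(tokens))]

rng = random.Random(0)  # seeded for reproducible augmentation
augmented = random_swap("this phone has a great camera".split(), 2, rng)
```

Each review in the training set would be augmented this way several times, producing the "augmented" dataset that the benchmark contrasts with the original one.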
Affiliation(s)
- Vimala Balakrishnan
- Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
- Zhongliang Shi
- Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
- Regine Lim
- Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
- Yue Fan
- Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
| |
Collapse
|
14
|
Abstract
Natural language processing (NLP) is a subfield of computer science and linguistics that can be applied to extract meaningful information from radiology reports. Symbolic NLP is rule-based and well suited to problems that can be explicitly defined by a set of rules. Statistical NLP is better suited to problems that cannot be well defined and requires annotated or labeled examples from which machine learning algorithms can infer the rules. Both symbolic and statistical NLP have found success in a variety of radiology use cases. More recently, deep learning approaches, including transformers, have gained traction and demonstrated good performance.
Affiliation(s)
- Jackson Steinkamp
- Department of Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Tessa S Cook
- Perelman School of Medicine at the University of Pennsylvania, 3400 Spruce Street, 1 Silverstein Radiology, Philadelphia, PA 19104, USA.
15
Abstract
BACKGROUND Recent natural language processing (NLP) research is dominated by neural network methods that employ word embeddings as basic building blocks. Pre-training with neural methods that capture local and global distributional properties (e.g., skip-gram, GloVe) using free text corpora is often used to embed both words and concepts. Pre-trained embeddings are typically leveraged in downstream tasks using various neural architectures that are designed to optimize task-specific objectives that might further tune such embeddings. OBJECTIVE Despite advances in contextualized language model based embeddings, static word embeddings still form an essential starting point in BioNLP research and applications. They are useful in low-resource settings and in lexical semantics studies. Our main goal is to build improved biomedical word embeddings and make them publicly available for downstream applications. METHODS We jointly learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the transformer-based BERT architecture in the two-sentence input mode with a classification objective that captures MeSH pair co-occurrence. We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts. RESULTS Both in qualitative and quantitative evaluations we demonstrate that our methods produce improved biomedical embeddings in comparison with other static embedding efforts. Without selectively culling concepts and terms (as was pursued by previous efforts), we believe we offer the most exhaustive evaluation of biomedical embeddings to date with clear performance improvements across the board.
CONCLUSION We repurposed a transformer architecture (typically used to generate dynamic embeddings) to improve static biomedical word embeddings using concept correlations. We provide our code and embeddings for public use for downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings.
Affiliation(s)
- Jiho Noh
- Department of Computer Science, University of Kentucky, United States of America.
- Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States of America; Department of Computer Science, University of Kentucky, United States of America.
16
Hasni S, Faiz S. Word embeddings and deep learning for location prediction: tracking Coronavirus from British and American tweets. Soc Netw Anal Min 2021; 11:66. [PMID: 34335992 DOI: 10.1007/s13278-021-00777-5]
Abstract
With the propagation of the Coronavirus pandemic, determining its individual and societal impacts becomes increasingly important. Recent research grants special attention to the Coronavirus social-network infodemic to study such impacts. To this end, we consider a geolocation step crucial before proceeding to infodemic management. In fact, the spread of reported events and news on social networks makes identifying infected areas, or the locations of information owners, more challenging, especially at a state level. In this paper, we focus on linguistic features to encode regional variations from short and noisy texts such as tweets to track this disease. We pay particular attention to contextual information for a better encoding of these features. We use neural network-based models to capture relationships between words according to their contexts. As examples of these models, we evaluate several word embedding models to determine the most effective feature combination, i.e., the one carrying the most spatial evidence. Then, we ensure a sequential modeling of words for a better understanding of contextual information using recurrent neural networks. Without defining restricted sets of local words related to the Coronavirus disease, our framework, called DeepGeoloc, demonstrates its ability to geolocate both tweets and twitterers. It also makes it possible to capture the geosemantics of nonlocal words and to delimit the sparse use of local ones, particularly in retweets and reported events. Compared to some baselines, DeepGeoloc achieved competitive results. It also proved its scalability in handling large amounts of data and geolocating new tweets, even those describing new topics related to this disease.
17
Hassan J, Tahir MA, Ali A. Natural language understanding of map navigation queries in Roman Urdu by joint entity and intent determination. PeerJ Comput Sci 2021; 7:e615. [PMID: 34395860 PMCID: PMC8323726 DOI: 10.7717/peerj-cs.615]
Abstract
Navigation-based task-oriented dialogue systems provide users with a natural way of communicating with maps and navigation software. Natural language understanding (NLU) is the first step for a task-oriented dialogue system. It extracts the important entities (slot tagging) from the user's utterance and determines the user's objective (intent determination). Word embeddings are the distributed representations of the input sentence, and encompass the sentence's semantic and syntactic representations. We created the word embeddings using different methods such as FastText, ELMo, BERT and XLNet, and studied their effect on the natural language understanding output. Experiments are performed on the Roman Urdu navigation utterances dataset. The results show that for the intent determination task XLNet-based word embeddings outperform other methods, while for the task of slot tagging FastText- and XLNet-based word embeddings have much better accuracy in comparison to other approaches.
Affiliation(s)
- Javeria Hassan
- National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Muhammad Ali Tahir
- National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Adnan Ali
- University of Science and Technology of China, Hefei, Anhui, China
18
Wright AP, Jones CM, Chau DH, Matthew Gladden R, Sumner SA. Detection of emerging drugs involved in overdose via diachronic word embeddings of substances discussed on social media. J Biomed Inform 2021; 119:103824. [PMID: 34048933 PMCID: PMC10901232 DOI: 10.1016/j.jbi.2021.103824]
Abstract
Substances involved in overdose deaths have shifted over time and continue to undergo transition. Early detection of emerging drugs involved in overdose is a major challenge for traditional public health data systems. While novel social media data have shown promise, there is a continued need for robust natural language processing approaches that can identify emerging substances. Consequently, we developed a new metric, the relative similarity ratio, based on diachronic word embeddings to measure movement in the semantic proximity of individual substance words to 'overdose' over time. Our analysis of 64,420,376 drug-related posts made between January 2011 and December 2018 on Reddit, the largest online forum site, reveals that this approach successfully identified fentanyl, the most significant emerging substance in the overdose epidemic, >1 year earlier than traditional public health data systems. Use of diachronic word embeddings may enable improved identification of emerging substances involved in drug overdose, thereby improving the timeliness of prevention and treatment activities.
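The core signal in this approach, tracking how close a substance word sits to "overdose" in each time period's embedding space, can be sketched in a few lines. The abstract does not give the exact relative-similarity-ratio formula, so the ratio below (each period's cosine similarity normalized by the first period's) is a hypothetical stand-in, and the per-period vectors are toy values:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similarity_trajectory(embeddings_by_period, term, anchor="overdose"):
    """Similarity of `term` to the anchor word in each period's embedding space."""
    return [cosine(emb[term], emb[anchor]) for emb in embeddings_by_period]

def relative_similarity_ratio(trajectory):
    """Hypothetical ratio: each period's similarity relative to the first period."""
    return [s / trajectory[0] for s in trajectory]

# Toy per-period embeddings (2-D for illustration): 'fentanyl' drifts toward 'overdose'.
periods = [
    {"overdose": [1.0, 0.0], "fentanyl": [0.1, 0.9]},  # early period: far apart
    {"overdose": [1.0, 0.0], "fentanyl": [0.5, 0.5]},  # converging
    {"overdose": [1.0, 0.0], "fentanyl": [0.9, 0.1]},  # near-synonymous with the anchor
]
traj = similarity_trajectory(periods, "fentanyl")
ratios = relative_similarity_ratio(traj)  # a rising ratio flags an emerging substance
```

In practice each period's vectors would come from embeddings trained (and aligned) on that period's posts; a consistently rising trajectory is the kind of early-warning signal the study reports for fentanyl.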
Affiliation(s)
- Austin P Wright
- School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, USA; Office of Strategy and Innovation, National Center for Injury Prevention and Control, Centers for Disease Control and Prevention, Atlanta, USA
- Christopher M Jones
- Office of Strategy and Innovation, National Center for Injury Prevention and Control, Centers for Disease Control and Prevention, Atlanta, USA
- Duen Horng Chau
- School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, USA
- R Matthew Gladden
- Division of Overdose Prevention, National Center for Injury Prevention and Control, Centers for Disease Control and Prevention, Atlanta, USA
- Steven A Sumner
- Office of Strategy and Innovation, National Center for Injury Prevention and Control, Centers for Disease Control and Prevention, Atlanta, USA.
19
Bauer C, Herwig R, Lienhard M, Prasse P, Scheffer T, Schuchhardt J. Large-scale literature mining to assess the relation between anti-cancer drugs and cancer types. J Transl Med 2021; 19:274. [PMID: 34174885 PMCID: PMC8236166 DOI: 10.1186/s12967-021-02941-z]
Abstract
Background There is a huge body of scientific literature describing the relation between tumor types and anti-cancer drugs. The vast amount of scientific literature makes it impossible for researchers and physicians to extract all relevant information manually. Methods In order to cope with the large amount of literature we applied an automated text mining approach to assess the relations between the 30 most frequent cancer types and 270 anti-cancer drugs. We applied two different approaches, a classical text mining approach based on named entity recognition and an AI-based approach employing word embeddings. The consistency of literature mining results was validated with 3 independent methods: first, using data from FDA approvals; second, using experimentally measured IC-50 cell line data; and third, using clinical patient survival data. Results We demonstrated that the automated text mining was able to successfully assess the relation between cancer types and anti-cancer drugs. All validation methods showed a good correspondence between the results from literature mining and independent confirmatory approaches. The relations between the most frequent cancer types and the drugs employed for their treatment were visualized in a large heatmap. All results are accessible in an interactive web-based knowledge base using the following link: https://knowledgebase.microdiscovery.de/heatmap. Conclusions Our approach is able to assess the relations between compounds and cancer types in an automated manner. Both cancer types and compounds could be grouped into different clusters. Researchers can use the interactive knowledge base to inspect the presented results and follow their own research questions, for example the identification of novel indication areas for known drugs. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-021-02941-z.
Affiliation(s)
- Chris Bauer
- MicroDiscovery GmbH, Marienburger Straße 1, 10405, Berlin, Germany.
- Ralf Herwig
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63, 14195, Berlin, Germany
- Matthias Lienhard
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63, 14195, Berlin, Germany
- Paul Prasse
- Department of Informatics, University of Potsdam, August-Bebel-Str. 89, 14482, Potsdam, Germany
- Tobias Scheffer
- Department of Informatics, University of Potsdam, August-Bebel-Str. 89, 14482, Potsdam, Germany
20
Ding X, Mower J, Subramanian D, Cohen T. Augmenting aer2vec: Enriching distributed representations of adverse event report data with orthographic and lexical information. J Biomed Inform 2021; 119:103833. [PMID: 34111555 DOI: 10.1016/j.jbi.2021.103833]
Abstract
Adverse Drug Events (ADEs) are prevalent, costly, and sometimes preventable. Post-marketing drug surveillance aims to monitor ADEs that occur after a drug is released to market. Reports of such ADEs are aggregated by reporting systems, such as the Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS). In this paper, we consider how best to represent data derived from FAERS reports for the purpose of detecting post-marketing surveillance signals, in order to inform regulatory decision making. In our previous work, we developed aer2vec, a method for deriving distributed representations (concept embeddings) of drugs and side effects from ADE reports, establishing the utility of distributional information for pharmacovigilance signal detection. In this paper, we advance this line of research by evaluating the utility of encoding orthographic and lexical information. We do so by adapting two Natural Language Processing methods, subword embedding and vector retrofitting, which were developed to encode such information into word embeddings. Models were compared on their ability to distinguish between positive and negative examples in a set of manually curated drug/ADE relationships, with both aer2vec enhancements offering performance advantages over baseline models, and the best performance obtained when retrofitting and subword embeddings were applied in concert. In addition, this work demonstrates that models leveraging distributed representations do not require extensive manual preprocessing to perform well on this pharmacovigilance signal detection task, and may even benefit from information that would otherwise be lost during the normalization and standardization process.
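Vector retrofitting, one of the two techniques named above, has a compact iterative form (after Faruqui et al.): each embedding is pulled toward its lexicon neighbours' current embeddings while staying anchored to its original value. A minimal pure-Python sketch, with toy 2-D vectors and a hypothetical neighbour list standing in for a curated drug/ADE lexicon:

```python
def retrofit(vectors, neighbors, alpha=1.0, beta=1.0, iters=10):
    """Iteratively average each vector with its neighbours' current vectors,
    weighted against the original embedding (alpha anchors, beta attracts)."""
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iters):
        for w, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            for d in range(len(new[w])):
                num = alpha * vectors[w][d] + beta * sum(new[n][d] for n in nbrs)
                new[w][d] = num / (alpha + beta * len(nbrs))
    return new

def dist(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Toy embeddings: 'nausea' and 'vomiting' are lexicon neighbours, 'aspirin' is not.
vecs = {"nausea": [1.0, 0.0], "vomiting": [0.0, 1.0], "aspirin": [-1.0, 0.0]}
nbrs = {"nausea": ["vomiting"], "vomiting": ["nausea"]}
fitted = retrofit(vecs, nbrs)
# Neighbours end up closer together than they started; 'aspirin' is untouched.
```

The anchoring term (`alpha`) prevents the retrofitted space from collapsing: linked terms converge, but each vector stays near the distributional position learned from the reports.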
Affiliation(s)
- Xiruo Ding
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA.
- Justin Mower
- Department of Computer Science, Rice University, Houston, TX, USA.
- Trevor Cohen
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA.
21
Koutsomitropoulos DA, Andriopoulos AD. Thesaurus-based word embeddings for automated biomedical literature classification. Neural Comput Appl 2021; 34:937-950. [PMID: 33994670 PMCID: PMC8111057 DOI: 10.1007/s00521-021-06053-z]
Abstract
The special nature, volume and broadness of biomedical literature pose barriers for automated classification methods. On the other hand, manual indexing is time-consuming, costly and error-prone. We argue that current word embedding algorithms can be efficiently used to support the task of biomedical text classification even in a multilabel setting, with many distinct labels. The ontology representation of Medical Subject Headings provides machine-readable labels and specifies the dimensionality of the problem space. Both deep- and shallow-network approaches are implemented. Predictions are determined by the similarity between features extracted from contextualized representations of abstracts and headings. The addition of a separate classifier for transfer learning is also proposed and evaluated. Large datasets of biomedical citations are harvested for their metadata and used for training and testing. These automated approaches are still far from entirely substituting for human experts, yet they can be useful as a mechanism for validation and recommendation. Dataset balancing, distributed processing and training parallelization in GPUs all play an important part in the effectiveness and performance of the proposed methods.
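The prediction step described here, scoring each heading by the similarity between an abstract's feature vector and the heading's vector, reduces to a nearest-neighbour ranking. A minimal sketch with made-up 2-D vectors (real systems would use contextualized embeddings of abstracts and MeSH heading labels, and a threshold or top-k cutoff for the multilabel decision):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def rank_headings(abstract_vec, heading_vecs, top_k=3):
    """Return the top_k heading labels most similar to the abstract vector."""
    ranked = sorted(heading_vecs,
                    key=lambda h: cosine(abstract_vec, heading_vecs[h]),
                    reverse=True)
    return ranked[:top_k]

# Toy heading vectors: the abstract embedding sits closest to 'Neoplasms'.
headings = {"Neoplasms": [1.0, 0.1], "Genetics": [0.0, 1.0], "Epidemiology": [-0.5, 0.5]}
labels = rank_headings([0.9, 0.2], headings, top_k=2)
```

A separate classifier (the transfer-learning variant mentioned above) would replace the raw cosine ranking with a learned decision over the same similarity features.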
Affiliation(s)
- Andreas D Andriopoulos
- Department of Computer Engineering and Informatics, School of Engineering, University of Patras, Patras, Greece
22
Ramos-Vargas RE, Román-Godínez I, Torres-Ramos S. Comparing general and specialized word embeddings for biomedical named entity recognition. PeerJ Comput Sci 2021; 7:e384. [PMID: 33817030 PMCID: PMC7959609 DOI: 10.7717/peerj-cs.384]
Abstract
Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl), or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, considering that they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. The latter, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) with respect to two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as unique features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. In the first, we set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. 
The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with respect to several gold standards, while the fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better on the DrugBank corpus, despite having less word coverage and a lower internal semantic relationship than the classic specific word embedding, Pyysalo PM + PMC, while for the contextualized word embeddings the best results were obtained with the specific versions. We conclude, therefore, that when using classic word embeddings as features on the BioNER task, the general ones can be considered a good option. On the other hand, when using contextualized word embeddings, the specific ones are the best option.
23
Chen TL, Emerling M, Chaudhari GR, Chillakuru YR, Seo Y, Vu TH, Sohn JH. Domain specific word embeddings for natural language processing in radiology. J Biomed Inform 2021; 113:103665. [PMID: 33333323 PMCID: PMC7856086 DOI: 10.1016/j.jbi.2020.103665]
Abstract
BACKGROUND There has been increasing interest in machine learning based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora due to the lack of a radiology-specific corpus. PURPOSE We examined the potential of Radiopaedia to serve as a general radiology corpus to produce radiology-specific word embeddings that could be used to enhance performance on an NLP task on radiological text. MATERIALS AND METHODS Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using a GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar's test with continuity correction and the Benjamini-Hochberg correction and a 5×2 cross-validation paired two-tailed t-test were used to assess statistical significance. RESULTS For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p < 0.05) and organ adjectives (p < 0.01), whereas WG embeddings tended to outperform on inflammation location and bone vs. muscle analogies (p < 0.01). The two embeddings had comparable performance on other subcategories. In the labeling task, the Radiopaedia-based model outperformed the WG-based model at 50, 100, 200, and 300-D for exact match accuracy (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively) and Hamming loss (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively). CONCLUSION We have developed a set of word embeddings from Radiopaedia and shown that they can preserve relevant medical semantics and augment performance on a radiology NLP task.
Our results suggest that the cultivation of a radiology-specific corpus can benefit radiology NLP models in the future.
Affiliation(s)
- Timothy L Chen
- University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA; University of Illinois College of Medicine, 1853 W Polk St, Chicago, IL 60612, USA
- Max Emerling
- University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA; University of California Berkeley, 2626 Hearst Ave, Berkeley, CA 94720, USA
- Gunvant R Chaudhari
- University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA
- Yeshwant R Chillakuru
- University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA; George Washington School Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052, USA
- Youngho Seo
- University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA
- Thienkhai H Vu
- University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA
- Jae Ho Sohn
- University of California San Francisco (UCSF), Radiology and Biomedical Imaging, 505 Parnassus Ave, San Francisco, CA 94143, USA.
24
Wang H, Li Y, Khan SA, Luo Y. Prediction of breast cancer distant recurrence using natural language processing and knowledge-guided convolutional neural network. Artif Intell Med 2020; 110:101977. [PMID: 33250149 PMCID: PMC7983067 DOI: 10.1016/j.artmed.2020.101977]
Abstract
Distant recurrence of breast cancer results in high lifetime risks and low 5-year survival rates. Early prediction of distant recurrence could facilitate intervention and improve patients' quality of life. In this study, we designed an EHR-based predictive model to estimate the probability of distant recurrence for breast cancer patients. We studied the pathology reports and progress notes of 6,447 patients who were diagnosed with breast cancer at Northwestern Memorial Hospital between 2001 and 2015. Clinical notes were mapped to Concept Unique Identifiers (CUIs) using natural language processing tools. Bag-of-words and pre-trained embeddings were employed to vectorize words and CUI sequences. These features, integrated with clinical features from structured data, were passed downstream to conventional machine learning classifiers and a Knowledge-guided Convolutional Neural Network (K-CNN). The best configuration of our model yielded an AUC of 0.888 and an F1-score of 0.5. Our work provides an automated method to predict breast cancer distant recurrence using natural language processing and deep learning approaches. We expect that through advanced feature engineering, better predictive performance could be achieved.
Affiliation(s)
- Hanyin Wang
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Yikuan Li
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Seema A Khan
- Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Yuan Luo
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA.
25
Mensa E, Colla D, Dalmasso M, Giustini M, Mamo C, Pitidis A, Radicioni DP. Violence detection explanation via semantic roles embeddings. BMC Med Inform Decis Mak 2020; 20:263. [PMID: 33059690 PMCID: PMC7559980 DOI: 10.1186/s12911-020-01237-4]
Abstract
Background Emergency room reports pose specific challenges to natural language processing techniques. In this setting, violence episodes against women, the elderly and children are often under-reported. Categorizing textual descriptions as containing violence-related injuries (V) vs. non-violence-related injuries (NV) is thus a relevant task to the end of devising alerting mechanisms to track (and prevent) violence episodes. Methods We present ViDeS (so dubbed after Violence Detection System), a system to detect episodes of violence from narrative texts in emergency room reports. It employs a deep neural network for categorizing textual ER report data, and complements such output by making explicit which elements corroborate the interpretation of the record as reporting violence-related injuries. To these ends we designed a novel hybrid technique for filling semantic frames that employs distributed representations of the terms therein, along with syntactic and semantic information. The system has been validated on real data annotated with two sorts of information: the presence vs. absence of violence-related injuries, and some semantic roles that can be interpreted as major cues for violent episodes, such as the agent that committed violence, the victim, the body district involved, etc. The employed dataset contains over 150K records annotated with class (V, NV) information, and 200 records with finer-grained information on the aforementioned semantic roles. Results We used data coming from an Italian branch of the EU-Injury Database (EU-IDB) project, compiled by hospital staff. Categorization figures approach full precision and recall for negative cases, and .97 precision and .94 recall on positive cases. As regards the recognition of semantic roles, we recorded an accuracy varying from .28 to .90 according to the semantic roles involved. Moreover, the system allowed unveiling annotation errors committed by hospital staff.
Conclusions Explaining systems’ results, so to make their output more comprehensible and convincing, is today necessary for AI systems. Our proposal is to combine distributed and symbolic (frame-like) representations as a possible answer to such pressing request for interpretability. Although presently focused on the medical domain, the proposed methodology is general and, in principle, it can be extended to further application areas and categorization tasks.
Collapse
Affiliation(s)
- Enrico Mensa
- Department of Computer Science, University of Turin, Corso Svizzera 185, Turin, 10149, Italy
- Davide Colla
- Department of Computer Science, University of Turin, Corso Svizzera 185, Turin, 10149, Italy
- Marco Dalmasso
- Servizio sovrazonale di Epidemiologia dell'ASL TO3 della Regione Piemonte, Via Sabaudia 164, Grugliasco (TO), 10095, Italy
- Marco Giustini
- Reparto Epidemiologia ambientale e sociale Dipartimento Ambiente e Salute (DAMSA) Istituto Superiore di Sanità, Viale Regina Elena, 299, Roma, 00161, Italy
- Carlo Mamo
- Servizio sovrazonale di Epidemiologia dell'ASL TO3 della Regione Piemonte, Via Sabaudia 164, Grugliasco (TO), 10095, Italy
- Alessio Pitidis
- Reparto Epidemiologia ambientale e sociale Dipartimento Ambiente e Salute (DAMSA) Istituto Superiore di Sanità, Viale Regina Elena, 299, Roma, 00161, Italy; Data Analysis Services, B2C Innovation Inc. - Digital Services, Corso Magenta 69/A, Milan, PO Box 20123, Italy
- Daniele P Radicioni
- Department of Computer Science, University of Turin, Corso Svizzera 185, Turin, 10149, Italy.
26
Colla D, Mensa E, Radicioni DP. Sense identification data: A dataset for lexical semantics. Data Brief 2020; 32:106267. [PMID: 32984463 PMCID: PMC7494475 DOI: 10.1016/j.dib.2020.106267] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2020] [Revised: 08/25/2020] [Accepted: 08/28/2020] [Indexed: 11/25/2022] Open
Abstract
Sense Identification is a newly proposed task: in considering a pair of terms to assess their conceptual similarity, human raters are postulated to preliminarily select a pair of senses, and it is these senses that are actually subject to the similarity rating. The sense identification task consists in searching for the senses selected during the similarity rating. This task is important for investigating the strategies and sense inventories underlying human lexical access and, moreover, it is a relevant complement to the semantic similarity task. Individuating which senses are involved in the similarity rating is also crucial in order to fully assess those ratings: if we have no idea which two senses were retrieved, on what basis can we assess the score expressing their semantic proximity? The Sense Identification Dataset (SID) has been built to provide a common experimental ground for systems and approaches dealing with the sense identification task. It is the first dataset specifically designed for experimenting on this task. The SID dataset was created by manually annotating with sense identifiers the term pairs from an existing dataset, the SemEval-2017 Task 2 English dataset. That dataset was originally conceived for experimenting on the semantic similarity task, and it contains a score expressing the human similarity rating for each term pair. For each such term pair we added a pair of annotated senses: in particular, senses were annotated so as to be compatible with (that is, explicative of) the existing similarity ratings. The SID dataset contains BabelNet sense identifiers. This sense inventory is a broadly adopted 'naming convention' for word senses, and such identifiers can be easily mapped onto further resources such as WordNet and WikiData, thereby enabling further processing tasks and usages in the Natural Language Processing pipeline.
Affiliation(s)
- Davide Colla
- Computer Science Department, University of Turin, Italy
- Enrico Mensa
- Computer Science Department, University of Turin, Italy
27
Abstract
This paper introduces a novel collection of word embeddings, numerical representations of lexical semantics, in 55 languages, trained on a large corpus of pseudo-conversational speech transcriptions from television shows and movies. The embeddings were trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm. Performance comparable with (and in some cases exceeding) embeddings trained on non-conversational (Wikipedia) text is reported on standard benchmark evaluation datasets. A novel evaluation method of particular relevance to psycholinguists is also introduced: prediction of experimental lexical norms in multiple languages. The models, as well as code for reproducing the models and all analyses reported in this paper (implemented as a user-friendly Python package), are freely available at: https://github.com/jvparidon/subs2vec.
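Since these embeddings were trained with the fastText implementation of skipgram, a minimal sketch of fastText's defining idea, representing each word as a bag of character n-grams, may be helpful (the boundary markers and the 3-6 n-gram range follow the fastText defaults; the function name is mine, not from the paper or library):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams fastText uses as subword units.

    The word is padded with '<' and '>' boundary markers; a word's
    vector is then the sum of the vectors of these n-grams, which is
    what lets fastText produce vectors for out-of-vocabulary words.
    """
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams
```

For example, `char_ngrams("where")` includes `<wh`, `her`, and `where>`, so morphologically related words share many subword units.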
28
Lenzi A, Maranghi M, Stilo G, Velardi P. The social phenotype: Extracting a patient-centered perspective of diabetes from health-related blogs. Artif Intell Med 2019; 101:101727. [PMID: 31813490 DOI: 10.1016/j.artmed.2019.101727] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2019] [Revised: 06/18/2019] [Accepted: 09/10/2019] [Indexed: 10/25/2022]
Abstract
MOTIVATIONS It has recently been argued [1] that the effectiveness of a cure depends on the doctor-patient shared understanding of an illness and its treatment. Although better communication between doctor and patient can be pursued through dedicated training programs, or by collecting patients' experiences and symptoms by means of questionnaires, the impact of these actions is limited by time and resources. In this paper we suggest that a patient-centered view of a disease - as well as potential misalignment between patient and doctor perspectives - can be inferred at a larger scale through automated textual analysis of health-related forums. People are generating an enormous amount of social data to describe their health care experiences, and continuously search for information about diseases, symptoms, diagnoses, doctors, treatment options and medicines. By automatically collecting, analyzing and exploiting this information, it is possible to obtain a more detailed and nuanced vision of patients' experience, which we call the "social phenotype" of diseases. MATERIALS AND METHODS As a use-case for our analysis, we consider diabetes, a widespread disease in most industrialized countries. We create a high quality data sample of diabetic patients' messages in Italy, extracted from popular medical forums spanning more than 10 years. Next, we use a state-of-the-art topic extraction technique based on generative statistical models improved with word embeddings, to identify the main complications, the frequently reported symptoms and the common concerns of these patients. Finally, in order to detect differences in focus, we compare the results of our analysis with available quality of life (QoL) assessments obtained with standard methodologies, such as questionnaires and survey studies.
RESULTS We show that patients with diabetes, when accessing on-line forums, express a perception of their disease that might differ noticeably from what is inferred from published QoL assessments on diabetes. In our study, we found that issues reported to have a daily impact on these patients are diet, glycemic control, drugs and clinical tests. These problems are not commonly considered in QoL assessments, since they are not perceived by doctors as representing severe limitations. Although limited to the case of Italian diabetic patients, we suggest that the methodology described in this paper, which is language and disease agnostic, could be applied to other diseases and countries, since the misalignment between doctors and patients, and the importance of collecting unbiased patient perceptions, have been emphasized in many studies ([2,3] inter alia). Extracting the social phenotype of a disease might help acquire patient-centered information on health care experiences on a much wider scale.
Affiliation(s)
- Andrea Lenzi
- Computer Science Department, Sapienza University of Rome, Via Salaria 113, 00198 Rome, Italy
- Marianna Maranghi
- Department of Translational and Precision Medicine, Sapienza University of Rome, Viale del Policlinico 151, 00198 Rome, Italy
- Giovanni Stilo
- Department of Engineering and Information Science and Mathematics, University of L'Aquila, Via Vetoio, 67100 L'Aquila, Italy
- Paola Velardi
- Computer Science Department, Sapienza University of Rome, Via Salaria 113, 00198 Rome, Italy.
29
Jiang A, Zubiaga A. Leveraging aspect phrase embeddings for cross-domain review rating prediction. PeerJ Comput Sci 2019; 5:e225. [PMID: 33816878 PMCID: PMC7924723 DOI: 10.7717/peerj-cs.225] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2019] [Accepted: 09/06/2019] [Indexed: 06/12/2023]
Abstract
Online review platforms are a popular way for users to post reviews expressing their opinions towards a product or service, and they are valuable for other users and companies to find out the overall opinions of customers. These reviews tend to be accompanied by a rating, where the star rating has become the most common approach for users to give their feedback in a quantitative way, generally as a Likert scale of 1-5 stars. On other social media platforms like Facebook or Twitter, an automated review rating prediction system can be useful to determine the rating that a user would have given to the product or service. Existing work on review rating prediction focuses on specific domains, such as restaurants or hotels. This, however, ignores the fact that some review domains which are less frequently rated, such as dentists, lack sufficient data to build a reliable prediction model. In this paper, we experiment on 12 datasets pertaining to 12 different review domains of varying popularity to assess the performance of predictions across different domains. We introduce a model that leverages aspect phrase embeddings extracted from the reviews, which enables the development of both in-domain and cross-domain review rating prediction systems. Our experiments show that both of our review rating prediction systems outperform all other baselines. The cross-domain review rating prediction system is particularly significant for the least popular review domains, where leveraging training data from other domains leads to remarkable improvements in performance. The in-domain review rating prediction system is instead more suitable for popular review domains, given that a model built from training data pertaining to the target domain performs better when such data is abundant.
Affiliation(s)
- Aiqi Jiang
- University of Warwick, Coventry, United Kingdom
30
Trivedi G, Hong C, Dadashzadeh ER, Handzel RM, Hochheiser H, Visweswaran S. Identifying incidental findings from radiology reports of trauma patients: An evaluation of automated feature representation methods. Int J Med Inform 2019; 129:81-87. [PMID: 31445293 PMCID: PMC6717529 DOI: 10.1016/j.ijmedinf.2019.05.021] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2018] [Revised: 03/07/2019] [Accepted: 05/21/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Radiologic imaging of trauma patients often uncovers findings that are unrelated to the trauma. These are termed incidental findings, and identifying them in radiology examination reports is necessary for appropriate follow-up. We developed and evaluated an automated pipeline to identify incidental findings at sentence and section levels in radiology reports of trauma patients. METHODS We created an annotated dataset of 4,181 reports and investigated automated feature representations including traditional word and clinical concept (such as SNOMED CT) representations, as well as word and concept embeddings. We evaluated these representations by using them with traditional classifiers such as logistic regression and with deep learning methods such as convolutional neural networks (CNNs). RESULTS The best performance was observed using word embeddings with CNNs, with F1 scores of 0.66 and 0.52 at the section and sentence levels, respectively. The F1 score was statistically significantly higher for sections compared to sentences (Wilcoxon; Z < 0.001, p < 0.05). Compared to using words alone, the addition of SNOMED CT concepts did not improve performance. At the sentence level, the F1 score improved significantly from 0.46 to 0.52 when using pre-trained embeddings (Wilcoxon; Z < 0.001, p < 0.05). CONCLUSION The results show that the best performance was achieved by using embeddings with CNNs at both sentence and section levels. This provides evidence that such a pipeline is capable of accurately identifying incidental findings in radiology reports in an automated manner.
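The CNNs in this pipeline consume the per-token embedding vectors directly; for the traditional classifiers also evaluated, pre-trained embeddings must first be turned into a fixed-length feature vector. A common minimal way to do this, shown here with toy vectors as an illustration rather than the paper's actual code, is to average the vectors of known tokens:

```python
def sentence_vector(tokens, embeddings, dim=4):
    """Average the word vectors of known tokens; zeros if none are known.

    This is the simplest way to turn pre-trained word embeddings into a
    fixed-length feature vector for a downstream classifier such as
    logistic regression.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

With toy vectors `{"pleural": [1,0,0,0], "effusion": [0,1,0,0]}`, the sentence "pleural effusion" maps to `[0.5, 0.5, 0, 0]`.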
Affiliation(s)
- Gaurav Trivedi
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, United States.
- Charmgil Hong
- School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, United States.
- Esmaeel R Dadashzadeh
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States; Department of Surgery, University of Pittsburgh, Pittsburgh, PA, United States.
- Robert M Handzel
- Department of Surgery, University of Pittsburgh, Pittsburgh, PA, United States.
- Harry Hochheiser
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States.
- Shyam Visweswaran
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States.
31
Dai HJ, Wang CK. Classifying adverse drug reactions from imbalanced twitter data. Int J Med Inform 2019; 129:122-132. [PMID: 31445246 DOI: 10.1016/j.ijmedinf.2019.05.017] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2018] [Revised: 04/07/2019] [Accepted: 05/21/2019] [Indexed: 10/26/2022]
Abstract
BACKGROUND Nowadays, social media are often used by the general public to create and share messages related to their health. With the global increase in social media usage, there is a trend of posting information related to adverse drug reactions (ADR). Mining social media data for this type of information will be helpful for pharmacological post-marketing surveillance and monitoring. Although the concept of using social media to facilitate pharmacovigilance is convincing, construction of automatic ADR detection systems remains a challenge because the corpora compiled from social media tend to be highly imbalanced, posing a major obstacle to the development of classifiers with reliable performance. METHODS Several methods have been proposed to address the challenge of imbalanced corpora. However, we are not aware of any studies that investigated the effectiveness of strategies for dealing with imbalanced data in the context of ADR detection from social media. In light of this, we evaluated a variety of imbalance-handling techniques and proposed a novel word embedding-based synthetic minority over-sampling technique (WESMOTE), which synthesizes new training examples from the sentence representation based on word embeddings. We compared the performance of all methods on two large imbalanced datasets released for the purpose of detecting ADR posts. RESULTS In comparison with the state-of-the-art approaches, the classifiers that incorporated imbalanced classification techniques achieved comparable or better F-scores. All of our best performing configurations combined random under-sampling with techniques including the proposed WESMOTE, boosting and ensembles, implying that an integration of these approaches with under-sampling provides a reliable solution for large imbalanced social media datasets.
Furthermore, ensemble-based methods like vote-based under-sampling (VUE) and random under-sampling boosting can be alternatives for the hybrid synthetic methods because both methods increase the diversity of the created weak classifiers, leading to better recall and overall F-scores for the minority classes. CONCLUSIONS Data collected from the social media are usually very large and highly imbalanced. In order to maximize the performance of a classifier trained on such data, applications of imbalanced strategies are required. We considered several practical methods for handling imbalanced Twitter data along with their performance on the binary classification task with respect to ADRs. In conclusion, the following practical insights are gained: 1) When dealing with text classification, the proposed word embedding-based synthetic minority over-sampling technique is more effective than traditional synthetic-based over-sampling methods. 2) In cases where large amounts of training data are available, the imbalanced strategies combined with under-sampling techniques are preferred. 3) Finally, employment of advanced methods does not guarantee better performance than simpler ones such as VUE, which achieved high performance with advantages like faster building time and ease of development.
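WESMOTE's core step, synthesizing a new minority-class example by interpolating between an embedding-based sentence vector and a near neighbour, follows the classic SMOTE recipe. A pure-Python sketch on toy vectors (the paper's exact neighbour-selection and sampling parameters are not reproduced here):

```python
import random

def smote_like(minority_vecs, seed=0):
    """Synthesize one new minority-class vector by interpolating between
    a randomly chosen sample and its nearest neighbour -- SMOTE's core
    step, here applied to embedding-based sentence vectors as WESMOTE
    does for text.
    """
    rng = random.Random(seed)
    a = rng.choice(minority_vecs)
    # nearest neighbour of a by squared Euclidean distance (excluding a)
    others = [v for v in minority_vecs if v is not a]
    b = min(others, key=lambda v: sum((x - y) ** 2 for x, y in zip(a, v)))
    lam = rng.random()  # random point on the segment between a and b
    return [x + lam * (y - x) for x, y in zip(a, b)]
```

Each synthetic vector lies on the line segment between a real minority example and its neighbour, so the over-sampled class stays inside its own region of the embedding space.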
Affiliation(s)
- Hong-Jie Dai
- Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, Republic of China; Post Baccalaureate Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China.
- Chen-Kai Wang
- Big Data laboratories of Chunghwa Telecom Laboratories, Taoyuan, Taiwan, Republic of China.
32
Nguyen TT, Le NQ, Ho QT, Phan DV, Ou YY. Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters. Anal Biochem 2019; 577:73-81. [PMID: 31022378 DOI: 10.1016/j.ab.2019.04.011] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2019] [Revised: 04/02/2019] [Accepted: 04/12/2019] [Indexed: 02/08/2023]
Abstract
Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, thus being an important problem for bioinformatics researchers. In this study, we applied a word embedding approach, one of the main drivers of the recent breakthroughs in natural language processing, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the lengths of the protein sequence's constituent biological words to find the optimal length which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 and 0.99, respectively, on the 5-fold cross validation and the independent test. With this result, our study can help biologists identify transporters based on substrate specificities, as well as provide a basis for further research enriching the field of applying natural language processing techniques in bioinformatics.
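The "biological words" the features are built on can be illustrated as overlapping length-k substrings (k-mers) of the protein sequence; k=3 below is an illustrative choice, since the paper tunes this length rather than fixing it:

```python
def biological_words(sequence, k=3):
    """Split a protein sequence into overlapping length-k 'biological
    words', the unit on which word embeddings and word frequencies are
    computed for transporter classification.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

For instance, `biological_words("MKVLA")` yields `["MKV", "KVL", "VLA"]`; these tokens play the role that words play in ordinary text, so off-the-shelf embedding tools can be trained on them.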
33
Abstract
Background Clinical text classification is a fundamental problem in medical natural language processing. Existing studies have conventionally focused on rule- or knowledge-source-based feature engineering, but only a limited number of studies have exploited the effective representation learning capability of deep learning methods. Methods In this study, we propose a new approach which combines rule-based features and knowledge-guided deep learning models for effective disease classification. Critical steps of our method include recognizing trigger phrases, predicting classes with very few examples using trigger phrases, and training a convolutional neural network (CNN) with word embeddings and Unified Medical Language System (UMLS) entity embeddings. Results We evaluated our method on the 2008 Integrating Informatics with Biology and the Bedside (i2b2) obesity challenge. The results demonstrate that our method outperforms the state-of-the-art methods. Conclusion We showed that the CNN model is powerful for learning effective hidden features, and CUI embeddings are helpful for building clinical text representations. This shows that integrating domain knowledge into CNN models is promising.
34
Karadeniz İ, Özgür A. Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinformatics 2019; 20:156. [PMID: 30917789 PMCID: PMC6437991 DOI: 10.1186/s12859-019-2678-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Accepted: 02/13/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Although there is an enormous number of textual resources in the biomedical domain, currently, manually curated resources cover only a small part of the existing knowledge. The vast majority of this information is in unstructured form and contains nonstandard naming conventions. The task of named entity recognition, which is the identification of entity names in text, is not adequate without a standardization step. Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities. This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary. We propose an approach for the normalization of biomedical entities through an ontology/dictionary by using word embeddings to represent semantic spaces, and a syntactic parser to give higher weight to the most informative word in the named entity mentions. RESULTS We applied the proposed method to two different normalization tasks: the normalization of bacteria biotope entities through the Onto-Biotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data, and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data. CONCLUSIONS The core contribution of this paper is a syntax-based way of combining the individual word vectors to form vectors for the named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.
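The central idea, a weighted combination of word vectors in which the syntactically most informative word gets a higher weight, followed by cosine similarity against concept vectors, can be sketched as follows (toy vectors; the head weight of 2.0 is my illustrative assumption, not the paper's value):

```python
def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def mention_vector(tokens, head, embeddings, head_weight=2.0):
    """Weighted average of token vectors, up-weighting the syntactic
    head (the 'most informative word') identified by a parser, as the
    paper proposes for building mention and concept vectors.
    """
    dim = len(next(iter(embeddings.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for t in tokens:
        if t not in embeddings:
            continue
        w = head_weight if t == head else 1.0
        total = [a + w * b for a, b in zip(total, embeddings[t])]
        weight_sum += w
    return [x / weight_sum for x in total] if weight_sum else total
```

Linking then amounts to choosing the ontology concept whose vector has the highest cosine similarity with the mention vector, with no labeled data required.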
Affiliation(s)
- İlknur Karadeniz
- Department of Computer Engineering, Boğaziçi University, İstanbul, 34342, Turkey
- Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, İstanbul, 34342, Turkey.
35
Abstract
Biomedical question answering (QA) is a challenging task that has not yet been successfully solved, according to results on international benchmarks such as BioASQ. Recent progress on deep neural networks has led to promising results in domain-independent QA, but the lack of large datasets with biomedical question-answer pairs hinders their successful application to the domain of biomedicine. We propose a novel machine-learning based answer processing approach that exploits neural networks in an unsupervised way through word embeddings. Our approach first combines biomedical and general purpose tools to identify the candidate answers from a set of passages. Candidates are then represented using a combination of features based on both biomedical external resources and input textual sources, including features based on word embeddings. Candidates are then ranked based on the score given at the output of a binary classification model, trained from candidates extracted from a small number of question, related passage and correct answer triplets from the BioASQ challenge. Our experimental results show that the use of word embeddings, combined with other features, improves the performance of answer processing in biomedical question answering. In addition, our results show that the use of several annotators improves the identification of answers in passages. Finally, our approach has participated in the last two versions (2017, 2018) of the BioASQ challenge, achieving competitive results.
36
Workman TE, Shao Y, Divita G, Zeng-Treitler Q. An efficient prototype method to identify and correct misspellings in clinical text. BMC Res Notes 2019; 12:42. [PMID: 30658682 DOI: 10.1186/s13104-019-4073-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2018] [Accepted: 01/11/2019] [Indexed: 11/17/2022] Open
Abstract
Objective Misspellings in clinical free text present challenges to natural language processing. With the objective of identifying misspellings and their corrections, we developed a prototype spelling analysis method that implements Word2Vec, Levenshtein edit distance constraints, a lexical resource, and corpus term frequencies. We used the prototype method to process two different corpora, surgical pathology reports, and emergency department progress and visit notes, extracted from Veterans Health Administration resources. We evaluated performance by measuring positive predictive value and performing an error analysis of false positive output, using four classifications. We also performed an analysis of spelling errors in each corpus, using common error classifications. Results In this small-scale study utilizing a total of 76,786 clinical notes, the prototype method achieved positive predictive values of 0.9057 and 0.8979, respectively, for the surgical pathology reports, and emergency department progress and visit notes, in identifying and correcting misspelled words. False positives varied by corpus. Spelling error types were similar between the two corpora; however, the authors of emergency department progress and visit notes made over four times as many errors. Overall, the results of this study suggest that this method could also perform sufficiently in identifying misspellings in other clinical document types. Electronic supplementary material The online version of this article (10.1186/s13104-019-4073-y) contains supplementary material, which is available to authorized users.
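The Levenshtein-constraint and corpus-frequency parts of such a pipeline can be sketched in a few lines; the Word2Vec candidate-generation step is omitted, and the candidate list and frequencies below are toy data, not the paper's:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, candidates, freq, max_dist=2):
    """Pick the most frequent candidate within the edit-distance
    constraint; return the word unchanged if no candidate qualifies.
    In the full pipeline, candidates would come from Word2Vec
    neighbourhoods and a lexical resource.
    """
    near = [c for c in candidates if edit_distance(word, c) <= max_dist]
    return max(near, key=lambda c: freq.get(c, 0)) if near else word
```

For example, `correct("pateint", ["patient", "patent", "paint"], {"patient": 500, "patent": 20})` returns `"patient"`: both "patient" and "patent" satisfy the distance constraint, and corpus frequency breaks the tie.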
37
Wang Y, Liu S, Afzal N, Rastegar-Mojarad M, Wang L, Shen F, Kingsbury P, Liu H. A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform 2018; 87:12-20. [PMID: 30217670 DOI: 10.1016/j.jbi.2018.09.008] [Citation(s) in RCA: 132] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2018] [Revised: 07/18/2018] [Accepted: 09/10/2018] [Indexed: 10/28/2022]
Abstract
BACKGROUND Word embeddings have been prevalently used in biomedical Natural Language Processing (NLP) applications due to the ability of the vector representations to capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpora) have been utilized in biomedical NLP to train word embeddings, and these word embeddings have been commonly leveraged as feature input to downstream machine learning models. However, there has been little work on evaluating the word embeddings trained from different textual resources. METHODS In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by the embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. For the intrinsic evaluation, we evaluated the word embeddings' ability to capture medical semantics by measuring the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS.
For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks. RESULTS The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts' judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task. CONCLUSION Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit can capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts' judgments than those trained from GloVe and Google News. Second, there does not exist a consistent global ranking of word embeddings for all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from the biomedical domain corpora do not necessarily have better performance than those trained from the general domain corpora for any downstream biomedical NLP task.
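Intrinsic evaluations of this kind typically report a rank correlation between model similarities and human similarity judgments; a minimal Spearman implementation (no tie handling, which is adequate for illustration):

```python
def spearman(xs, ys):
    """Spearman rank correlation between two equal-length score lists,
    e.g. cosine similarities from an embedding model vs. human expert
    similarity ratings over the same term pairs.
    """
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Perfectly agreeing rankings score 1.0 and perfectly reversed rankings score -1.0, so "closer to human experts' judgments" translates to a higher Spearman coefficient on datasets like UMNSRS.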
Affiliation(s)
- Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
- Sijia Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
- Naveed Afzal
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
- Liwei Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
- Feichen Shen
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
- Paul Kingsbury
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, USA.
38
Kolyvakis P, Kalousis A, Smith B, Kiritsis D. Biomedical ontology alignment: an approach based on representation learning. J Biomed Semantics 2018; 9:21. [PMID: 30111369 PMCID: PMC6094585 DOI: 10.1186/s13326-018-0187-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2018] [Accepted: 07/16/2018] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance. RESULTS An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results. CONCLUSIONS Our proposed representation learning approach leverages terminological embeddings to capture semantic similarity. Our results provide evidence that the approach produces embeddings that are especially well tailored to the ontology matching task, demonstrating a novel pathway for the problem.
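The "phrase retrofitting" the authors describe builds on the classic retrofitting idea: nudge pre-trained vectors toward the vectors of their known synonyms while anchoring them to their original positions. The sketch below implements the generic closed-form retrofitting update (in the style of Faruqui et al.), not the paper's exact variant; the toy vocabulary and synonym list are invented.

```python
import numpy as np

def retrofit(vectors, synonyms, iters=10, alpha=1.0, beta=1.0):
    """Iteratively pull each vector toward its synonyms' vectors
    while staying close to the original (classic retrofitting update)."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iters):
        for w, neighbors in synonyms.items():
            nbrs = [new[n] for n in neighbors if n in new]
            if not nbrs:
                continue
            # Closed-form update: weighted average of the original vector
            # and the current neighbor vectors.
            new[w] = (alpha * vectors[w] + beta * sum(nbrs)) / (alpha + beta * len(nbrs))
    return new

vecs = {"cancer": np.array([1.0, 0.0]),
        "neoplasm": np.array([0.0, 1.0]),
        "kidney": np.array([-1.0, 0.0])}
syn = {"cancer": ["neoplasm"], "neoplasm": ["cancer"]}
out = retrofit(vecs, syn)
```

After a few iterations the two synonymous terms end up much closer than their pre-trained vectors were, while "kidney", which has no synonym constraints, is untouched.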
Affiliation(s)
- Prodromos Kolyvakis
- École Polytechnique Fédérale de Lausanne (EPFL), Route Cantonale, Lausanne, 1015 Switzerland
- Alexandros Kalousis
- Business Informatics Department, University of Applied Sciences, HES-SO, Western Switzerland, Carouge, Switzerland
- Barry Smith
- Department of Philosophy and Department of Biomedical Informatics, 104 Park Hall, University at Buffalo, Buffalo, 14260 NY USA
- Dimitris Kiritsis
- École Polytechnique Fédérale de Lausanne (EPFL), Route Cantonale, Lausanne, 1015 Switzerland

39
Tsakalidis A, Papadopoulos S, Voskaki R, Ioannidou K, Boididou C, Cristea AI, Liakata M, Kompatsiaris Y. Building and evaluating resources for sentiment analysis in the Greek language. Lang Resour Eval 2018; 52:1021-1044. [PMID: 30930705 PMCID: PMC6411313 DOI: 10.1007/s10579-018-9420-4]
Abstract
Sentiment lexicons and word embeddings constitute well-established sources of information for sentiment analysis in online social media. Although their effectiveness has been demonstrated in state-of-the-art sentiment analysis and related tasks in the English language, such publicly available resources are much less developed and evaluated for the Greek language. In this paper, we tackle the problems arising when analyzing text in such an under-resourced language. We present and make publicly available a rich set of such resources, ranging from a manually annotated lexicon, to semi-supervised word embedding vectors and annotated datasets for different tasks. Our experiments using different algorithms and parameters on our resources show promising results over standard baselines; on average, we achieve a 24.9% relative improvement in F-score on the cross-domain sentiment analysis task when training the same algorithms with our resources, compared to training them on more traditional feature sources, such as n-grams. Importantly, while our resources were built with the primary focus on the cross-domain sentiment analysis task, they also show promising results in related tasks, such as emotion analysis and sarcasm detection.
Affiliation(s)
- Adam Tsakalidis
- Department of Computer Science, University of Warwick, Coventry, UK; The Alan Turing Institute, London, UK
- Kyriaki Ioannidou
- Laboratory of Translation and Natural Language Processing, Aristotle University of Thessaloniki, Thessaloníki, Greece
- Alexandra I Cristea
- Department of Computer Science, University of Warwick, Coventry, UK; Department of Computer Science, University of Durham, Durham, UK
- Maria Liakata
- Department of Computer Science, University of Warwick, Coventry, UK; The Alan Turing Institute, London, UK

40
Abstract
Background Biomedical named entity recognition (BNER) is a crucial initial step of information extraction in the biomedical domain. The task is typically modeled as a sequence labeling problem. Various machine learning algorithms, such as Conditional Random Fields (CRFs), have been successfully used for this task. However, these state-of-the-art BNER systems largely depend on hand-crafted features. Results We present a recurrent neural network (RNN) framework based on word embeddings and character representation. On top of the neural network architecture, we use a CRF layer to jointly decode labels for the whole sentence. In our approach, contextual information from both directions and long-range dependencies in the sequence, both useful for this task, are modeled by the bidirectional variant and long short-term memory (LSTM) units, respectively. Although our models use word embeddings and character embeddings as the only features, the bidirectional LSTM-RNN (BLSTM-RNN) model achieves state-of-the-art performance: 86.55% F1 on the BioCreative II gene mention (GM) corpus and 73.79% F1 on the JNLPBA 2004 corpus. Conclusions Our neural network architecture can be successfully used for BNER without any manual feature engineering. Experimental results show that domain-specific pre-trained word embeddings and character-level representation can improve the performance of the LSTM-RNN models. On the GM corpus, we achieve performance comparable with other systems that use complex hand-crafted features. On the JNLPBA corpus, our model achieves the best results, outperforming the previously top-performing systems. The source code of our method is freely available under GPL at https://github.com/lvchen1989/BNER.
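The CRF layer on top of the BiLSTM decodes the whole label sequence jointly, typically with the Viterbi algorithm. A minimal numpy sketch, assuming per-token emission scores from the network and a learned tag-transition matrix (both invented toy values here):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the highest-scoring tag sequence for one sentence.
    emissions: (T, K) per-token tag scores from the BiLSTM.
    transitions: (K, K) score of moving from tag i to tag j."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # For each current tag j, the best previous tag i.
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    best = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return best[::-1]

tags = ["O", "B-GENE", "I-GENE"]
emissions = np.array([[0.1, 2.0, 0.0],
                      [0.2, 0.1, 1.5],
                      [1.0, 0.2, 0.3]])
# Forbid O -> I-GENE with a large negative transition score.
transitions = np.array([[0.0, 0.0, -10.0],
                        [0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.5]])
path = viterbi(emissions, transitions)
print([tags[i] for i in path])
```

Joint decoding is what lets the transition scores veto locally plausible but globally invalid taggings such as O followed by I-GENE.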
Affiliation(s)
- Chen Lyu
- School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China
- Bo Chen
- Department of Chinese Language & Literature, Hubei University of Art & Science, Xiangyang, 24105, Hubei, China
- Yafeng Ren
- Guangdong Collaborative Innovation Center for Language Research & Services, Guangdong University of Foreign Studies, Guangzhou, 510420, Guangdong, China
- Donghong Ji
- School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.

41
Kim S, Fiorini N, Wilbur WJ, Lu Z. Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents. J Biomed Inform 2017; 75:122-127. [PMID: 28986328 DOI: 10.1016/j.jbi.2017.09.014]
Abstract
The main approach of traditional information retrieval (IR) is to count how many words from a query appear in a document. A drawback of this approach, however, is that it may fail to detect relevant documents in which few or none of the query words appear. Semantic analysis methods such as LSA (latent semantic analysis) and LDA (latent Dirichlet allocation) have been proposed to address this issue, but their performance has not surpassed that of common IR approaches. Here we present a query-document similarity measure motivated by the Word Mover's Distance. Unlike other similarity measures, the proposed method relies on neural word embeddings to compute the distance between words. This process helps identify related words when no direct matches are found between a query and a document. Our method is efficient and straightforward to implement. The experimental results on TREC Genomics data show that our approach outperforms the BM25 ranking function by an average of 12% in mean average precision. Furthermore, for a real-world dataset collected from the PubMed® search logs, we combine the semantic measure with BM25 using a learning-to-rank method, which improves ranking scores by up to 25%. This experiment demonstrates that the proposed approach and BM25 complement each other well and together produce superior performance.
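A similarity measure motivated by the Word Mover's Distance can be approximated in its relaxed form: match each query word to its closest document word in embedding space and average the similarities, so related terms count even with no exact match. A toy sketch (the vectors and the word glosses in the comments are invented, not the paper's trained embeddings):

```python
import numpy as np

def soft_match_score(query_vecs, doc_vecs):
    """For each query word, take its best cosine similarity against any
    document word, then average (a relaxed Word Mover's Distance)."""
    def unit(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = unit(query_vecs) @ unit(doc_vecs).T  # (|query|, |doc|) cosine matrix
    return float(np.mean(np.max(sims, axis=1)))

# Toy vectors: the query matches the related document via embedding
# proximity even though no word is shared.
query = np.array([[1.0, 0.1], [0.2, 1.0]])        # e.g. "tumor", "growth"
doc_related = np.array([[0.9, 0.2], [0.3, 0.9]])  # e.g. "neoplasm", "proliferation"
doc_unrelated = np.array([[-1.0, 0.0], [0.0, -1.0]])
print(soft_match_score(query, doc_related), soft_match_score(query, doc_unrelated))
```

In a full ranker this score would be one feature combined with BM25, as the abstract describes for the learning-to-rank experiment.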
Affiliation(s)
- Sun Kim, Nicolas Fiorini, W John Wilbur, Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

42
Abstract
Background Drug Package Leaflets (DPLs) provide information for patients on how to safely use medicines. Pharmaceutical companies are responsible for producing these documents. However, several studies have shown that patients usually have problems understanding the sections describing posology (dosage quantity and prescription), contraindications and adverse drug reactions. The ultimate goal of this work is to provide an automatic approach that helps these companies write drug package leaflets in easy-to-understand language. Natural language processing has become a powerful tool for improving patient care and advancing medicine because it makes it possible to automatically process the large amount of unstructured information needed for patient care. However, to the best of our knowledge, no research has been done on the automatic simplification of drug package leaflets. In previous work, we proposed using domain terminological resources to gather a set of synonyms for a given target term. A potential drawback of this approach is that it depends heavily on the existence of dictionaries; however, these are not always available for every domain and language, and when they do exist their coverage is often very limited. To overcome this limitation, we propose the use of word embeddings to identify the simplest synonym for a given term. Word embedding models represent each word in a corpus with a vector in a semantic space. Our approach is based on the assumption that synonyms should have close vectors because they occur in similar contexts. Results In our evaluation, we used the EasyDPL (Easy Drug Package Leaflets) corpus, a collection of 306 leaflets written in Spanish and manually annotated with 1400 adverse drug effects and their simplest synonyms. We focus on leaflets written in Spanish because it is the second most widely spoken language in the world, yet in terms of terminological resources it is usually less well served than English. Our experiments show an accuracy of 38.5% using word embeddings. Conclusions This work provides a promising approach to simplifying DPLs without using terminological resources or parallel corpora. Moreover, it could easily be adapted to different domains and languages. However, more research is needed to improve the embedding-based approach, as it does not yet outperform our previous dictionary-based work.
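The core idea, keeping candidate synonyms whose vectors lie close to the target's and then choosing the simplest one, can be sketched as follows. Using corpus frequency as the simplicity proxy is our assumption for illustration, not necessarily the paper's criterion, and the Spanish terms, vectors and counts are invented.

```python
import numpy as np

def simplest_synonym(target, candidates, vectors, frequency, min_sim=0.7):
    """Keep candidates whose embeddings are cosine-close to the target
    (synonyms occur in similar contexts), then pick the most frequent one
    as a crude 'simplest word' heuristic (an assumption, see lead-in)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    close = [c for c in candidates if cos(vectors[target], vectors[c]) > min_sim]
    if not close:
        return target  # no plausible synonym: leave the term unchanged
    return max(close, key=lambda c: frequency[c])

vectors = {"cefalea": np.array([0.9, 0.4]),
           "dolor de cabeza": np.array([0.8, 0.5]),
           "fiebre": np.array([-0.2, 0.9])}
frequency = {"dolor de cabeza": 900, "fiebre": 400}
print(simplest_synonym("cefalea", ["dolor de cabeza", "fiebre"], vectors, frequency))
```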
Affiliation(s)
- Isabel Segura-Bedmar
- Computer Science Department, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, Madrid, Spain.
- Paloma Martínez
- Computer Science Department, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, Madrid, Spain.

43
Tao C, Filannino M, Uzuner Ö. Prescription extraction using CRFs and word embeddings. J Biomed Inform 2017; 72:60-66. [PMID: 28684255 DOI: 10.1016/j.jbi.2017.07.002]
Abstract
In medical practice, doctors detail patients' care plans in discharge summaries written as unstructured free text, which, among other information, contains medication names and prescription details. Extracting prescriptions from discharge summaries is challenging due to the way these documents are written. Handwritten rules and medical gazetteers have proven useful for this purpose but come with limitations in performance, scalability, and generalizability. We instead present a machine learning approach to extract and organize medication names and prescription information into individual entries. Our approach utilizes word embeddings and tackles the task in two extraction steps, both of which are treated as sequence labeling problems. When evaluated on the 2009 i2b2 Challenge official benchmark set, the proposed approach achieves a horizontal phrase-level F1-measure of 0.864, which to the best of our knowledge represents an improvement over the current state of the art.
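Treating extraction as sequence labeling means the model emits BIO tags that must then be grouped into prescription entries. A minimal sketch of that grouping step (the tag set shown is illustrative, not the paper's exact schema):

```python
def bio_to_entries(tokens, tags):
    """Collect B-/I- tagged spans into (label, phrase) pairs."""
    entries, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; flush any span in progress.
            if current:
                entries.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continuation of the open span
        else:
            if current:
                entries.append((label, " ".join(current)))
            current, label = [], None
    if current:
        entries.append((label, " ".join(current)))
    return entries

tokens = ["Take", "aspirin", "81", "mg", "daily"]
tags = ["O", "B-DRUG", "B-DOSE", "I-DOSE", "B-FREQ"]
print(bio_to_entries(tokens, tags))
```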
Affiliation(s)
- Carson Tao
- Department of Information Science, State University of New York at Albany, NY, USA.
- Michele Filannino
- Department of Computer Science, State University of New York at Albany, NY, USA.
- Özlem Uzuner
- Department of Computer Science, State University of New York at Albany, NY, USA.

44
Abstract
This paper concerns the generation of distributed vector representations of biomedical concepts from structured knowledge, in the form of subject-relation-object triplets known as semantic predications. Specifically, we evaluate the extent to which a representational approach we have developed for this purpose previously, known as Predication-based Semantic Indexing (PSI), might benefit from insights gleaned from neural-probabilistic language models, which have enjoyed a surge in popularity in recent years as a means to generate distributed vector representations of terms from free text. To do so, we develop a novel neural-probabilistic approach to encoding predications, called Embedding of Semantic Predications (ESP), by adapting aspects of the Skipgram with Negative Sampling (SGNS) algorithm to this purpose. We compare ESP and PSI across a number of tasks including recovery of encoded information, estimation of semantic similarity and relatedness, and identification of potentially therapeutic and harmful relationships using both analogical retrieval and supervised learning. We find advantages for ESP in some, but not all of these tasks, revealing the contexts in which the additional computational work of neural-probabilistic modeling is justified.
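ESP adapts Skipgram with Negative Sampling (SGNS) to predications. Below is a stripped-down sketch of one SGNS-style update in which a context vector built from subject and relation is pushed toward the observed object and away from sampled negatives. The simple vector sum of subject and relation is an illustrative assumption; ESP's actual binding operation differs.

```python
import numpy as np

def sgns_step(ctx, out_pos, out_negs, lr=0.1):
    """One negative-sampling update: raise the score of the observed
    object vector, lower it for sampled negative objects."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    grad_ctx = (sigmoid(ctx @ out_pos) - 1.0) * out_pos
    out_pos += lr * (1.0 - sigmoid(ctx @ out_pos)) * ctx
    for neg in out_negs:
        grad_ctx += sigmoid(ctx @ neg) * neg
        neg -= lr * sigmoid(ctx @ neg) * ctx
    return ctx - lr * grad_ctx

rng = np.random.default_rng(0)
dim = 16
# Hypothetical predication "aspirin TREATS headache":
# context = subject vector combined (here: summed) with relation vector.
subj, rel = rng.normal(size=dim), rng.normal(size=dim)
obj, neg = rng.normal(size=dim), rng.normal(size=dim)
ctx = subj + rel
before = float(ctx @ obj)
for _ in range(20):
    ctx = sgns_step(ctx, obj, [neg])
after = float(ctx @ obj)
```

After a few updates the observed object scores higher against the context than it did initially, which is exactly the property the analogical-retrieval experiments rely on.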
Affiliation(s)
- Trevor Cohen
- School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX, United States.

45
Mehryary F, Kaewphan S, Hakala K, Ginter F. Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification. J Biomed Semantics 2016; 7:27. [PMID: 27175227 PMCID: PMC4864999 DOI: 10.1186/s13326-016-0070-4]
Abstract
Background Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by focusing solely on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Methods Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature and containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classifications produced by the unsupervised clustering method and manual annotation. Results The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, greatly improving the quality of the data. The method is not bound solely to the EVEX resource and can thus be used to improve the quality of any event extraction system or database.
Availability The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/. Electronic supplementary material The online version of this article (doi:10.1186/s13326-016-0070-4) contains supplementary material, which is available to authorized users.
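The paper clusters trigger words hierarchically and prunes the resulting tree; the greedy centroid-threshold sketch below is a deliberately simplified stand-in that only illustrates the underlying step of grouping candidate trigger words by embedding similarity (words, vectors and threshold invented).

```python
import numpy as np

def cluster_by_threshold(words, vectors, threshold=0.8):
    """Greedy single-pass clustering: put each word in the first cluster
    whose centroid it is cosine-similar to, else start a new cluster."""
    def unit(v):
        return v / np.linalg.norm(v)
    clusters = []  # list of (member_indices, running centroid sum)
    for i, _ in enumerate(words):
        v = unit(vectors[i])
        for members, centroid in clusters:
            if float(v @ unit(centroid)) >= threshold:
                members.append(i)
                centroid += vectors[i]  # update centroid sum in place
                break
        else:
            clusters.append(([i], vectors[i].copy()))
    return [[words[i] for i in members] for members, _ in clusters]

# Event-like verbs group together; non-trigger nouns form their own cluster,
# which a curator could then mark as "never a trigger".
words = ["activates", "induces", "chromosome", "membrane"]
vecs = np.array([[0.9, 0.1], [0.85, 0.2], [0.1, 0.9], [0.05, 0.95]])
print(cluster_by_threshold(words, vecs))
```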
Affiliation(s)
- Farrokh Mehryary
- Department of Information Technology, University of Turku, Turku, Finland; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland
- Suwisa Kaewphan
- Department of Information Technology, University of Turku, Turku, Finland; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland; Turku Centre for Computer Science (TUCS), Turku, Finland
- Kai Hakala
- Department of Information Technology, University of Turku, Turku, Finland; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland
- Filip Ginter
- Department of Information Technology, University of Turku, Turku, Finland

46
Nguyen NTH, Miwa M, Tsuruoka Y, Tojo S. Identifying synonymy between relational phrases using word embeddings. J Biomed Inform 2015; 56:94-102. [PMID: 26004792 DOI: 10.1016/j.jbi.2015.05.010]
Abstract
Many text mining applications in the biomedical domain benefit from automatic clustering of relational phrases into synonymous groups, since it alleviates the problem of spurious mismatches caused by the diversity of natural language expressions. Most of the previous work that has addressed this task of synonymy resolution uses similarity metrics between relational phrases based on textual strings or dependency paths, which, for the most part, ignore the context around the relations. To overcome this shortcoming, we employ a word embedding technique to encode relational phrases. We then apply the k-means algorithm on top of the distributional representations to cluster the phrases. Our experimental results show that this approach outperforms state-of-the-art statistical models including latent Dirichlet allocation and Markov logic networks.
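The pipeline, embedding each relational phrase and then running k-means over the vectors, can be sketched directly. The toy 2-D "phrase vectors" below stand in for real embeddings; in practice the phrase representations would come from a trained embedding model.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid, then nearest assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

phrases = ["is activated by", "is induced by", "binds to", "interacts with"]
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = kmeans(X, k=2)
```

With well-separated vectors, near-synonymous phrases land in the same cluster, which is the grouping the paper evaluates against LDA and Markov logic network baselines.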
Affiliation(s)
- Nhung T H Nguyen
- University of Science, Vietnam National University, Ho Chi Minh City, 227 Nguyen Van Cu St., Ward 4, Dist. 5, Ho Chi Minh City, Viet Nam; Japan Advanced Institute of Science and Technology, 1-8 Asahidai, Nomi-shi, Ishikawa 923-1292, Japan.
- Makoto Miwa
- Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya 468-8511, Japan.
- Satoshi Tojo
- Japan Advanced Institute of Science and Technology, 1-8 Asahidai, Nomi-shi, Ishikawa 923-1292, Japan.