Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020;16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open

Cited by Other Article(s)

Yamagiwa H, Hashimoto R, Arakane K, Murakami K, Soeda S, Oyama M, Zhu Y, Okada M, Shimodaira H. Predicting drug-gene relations via analogy tasks with word embeddings. Sci Rep 2025;15:17240. [PMID: 40383732 PMCID: PMC12086191 DOI: 10.1038/s41598-025-01418-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2024] [Accepted: 05/06/2025] [Indexed: 05/20/2025] Open

Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer MB, Ai X, Lai PT, Wang Z, Keloth VK, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng WJ, Adelman RA, Lu Z, Xu H. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat Commun 2025;16:3280. [PMID: 40188094 PMCID: PMC11972378 DOI: 10.1038/s41467-025-56989-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 02/07/2025] [Indexed: 04/07/2025] Open

Affiliation(s)

Qingyu Chen Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA; National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Yan Hu McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Xueqing Peng Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Qianqian Xie Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Qiao Jin National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Aidan Gilson Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Maxwell B Singer Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Xuguang Ai Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Po-Ting Lai National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Zhizheng Wang National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Vipina K Keloth Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Kalpana Raja Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Jimin Huang Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Huan He Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Fongci Lin Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Jingcheng Du McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Rui Zhang Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota, Minneapolis, MN, USA; Center for Learning Health System Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
W Jim Zheng McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
Ron A Adelman Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Zhiyong Lu National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Hua Xu Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA

Chen Y, Zou J. Simple and effective embedding model for single-cell biology built from ChatGPT. Nat Biomed Eng 2025;9:483-493. [PMID: 39643729 DOI: 10.1038/s41551-024-01284-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2023] [Accepted: 10/16/2024] [Indexed: 12/09/2024]

Brown GS, Wengler J, Fabelico AJS, Muir A, Tubbs A, Warren A, Millett AN, Yu XX, Pavlidis P, Rogic S, Piccolo SR. Using semantic search to find publicly available gene-expression datasets. bioRxiv 2025:2025.03.13.643153. [PMID: 40161731 PMCID: PMC11952526 DOI: 10.1101/2025.03.13.643153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]

Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024;16:333-344. [PMID: 38340264 PMCID: PMC11289304 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]

Kang H, Hou L, Gu Y, Lu X, Li J, Li Q. Drug-disease association prediction with literature based multi-feature fusion. Front Pharmacol 2023;14:1205144. [PMID: 37284317 PMCID: PMC10239876 DOI: 10.3389/fphar.2023.1205144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Accepted: 05/09/2023] [Indexed: 06/08/2023] Open

Liu Y, Elsworth BL, Gaunt TR. Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets. Bioinformatics 2023;39:btad169. [PMID: 37010521 PMCID: PMC10097433 DOI: 10.1093/bioinformatics/btad169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2021] [Revised: 02/12/2023] [Accepted: 03/19/2023] [Indexed: 04/04/2023] Open

Hussain MJ, Bai H, Wasti SH, Huang G, Jiang Y. Evaluating semantic similarity and relatedness between concepts by combining taxonomic and non-taxonomic semantic features of WordNet and Wikipedia. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2023]

Shtar G, Greenstein-Messica A, Mazuz E, Rokach L, Shapira B. Predicting drug characteristics using biomedical text embedding. BMC Bioinformatics 2022;23:526. [PMID: 36476573 PMCID: PMC9730627 DOI: 10.1186/s12859-022-05083-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 11/25/2022] [Indexed: 12/13/2022] Open

Abstract

BACKGROUND

Drug-drug interactions (DDIs) are preventable causes of medical injuries and often result in doctor and emergency room visits. Previous research demonstrates the effectiveness of using matrix completion approaches based on known drug interactions to predict unknown DDIs. However, in the case of a new drug, where there is limited or no knowledge regarding the drug's existing interactions, such an approach is unsuitable, and other properties of the drug can be used instead to accurately predict new DDIs.

METHODS

We propose adjacency biomedical text embedding (ABTE) to address this limitation by using a hybrid approach that combines known drugs' interactions and the drug's biomedical text embeddings to predict the DDIs of both new and well-known drugs.

RESULTS

Our evaluation demonstrates the superiority of this approach compared to recently published DDI prediction models and matrix factorization-based approaches. Furthermore, we compared the use of different text embedding methods in ABTE, and found that the concept embedding approach, which involves biomedical information in the embedding process, provides the highest performance for this task. Additionally, we demonstrate the effectiveness of leveraging biomedical text embeddings for additional drug-related biomedical prediction tasks by presenting the contribution of text embeddings to a multi-modal pregnancy drug safety classification.

CONCLUSION

Text and concept embeddings created by analyzing a domain-specific, large-scale biomedical corpus can be used for predicting drug-related properties such as DDIs and drug safety. Prediction models based on the embeddings achieved results comparable to those of hand-crafted features; however, text embeddings do not require manual categorization or data collection and rely solely on the published literature.

Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art. PLoS One 2022;17:e0276539. [PMID: 36409715 PMCID: PMC9678326 DOI: 10.1371/journal.pone.0276539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Accepted: 10/08/2022] [Indexed: 11/22/2022] Open

Abstract

This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. 
Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.

Turki H, Jemielniak D, Hadj Taieb MA, Labra Gayo JE, Ben Aouicha M, Banat M, Shafee T, Prud’hommeaux E, Lubiana T, Das D, Mietchen D. Using logical constraints to validate statistical information about disease outbreaks in collaborative knowledge graphs: the case of COVID-19 epidemiology in Wikidata. PeerJ Comput Sci 2022;8:e1085. [PMID: 36262159 PMCID: PMC9575845 DOI: 10.7717/peerj-cs.1085] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/07/2022] [Accepted: 08/15/2022] [Indexed: 06/16/2023]

Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM Trans Comput Biol Bioinform 2022;19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]

Chen Q, Rankine A, Peng Y, Aghaarabi E, Lu Z. Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study. JMIR Med Inform 2021;9:e27386. [PMID: 34967748 PMCID: PMC8759018 DOI: 10.2196/27386] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/22/2021] [Revised: 08/06/2021] [Accepted: 08/06/2021] [Indexed: 01/23/2023] Open

Abstract

Background

Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 on the official test set of the 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and ranked second.

Objective

Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications.

Methods

We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures.

Results

Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications.

Conclusions

Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness.
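The effectiveness measures named in the Methods above (Pearson correlation as the official metric, plus R² and mean squared error) can be written out directly. What follows is a minimal sketch in plain Python, not the authors' evaluation code: the function names and the toy scores are illustrative assumptions, and R² is taken here to be the coefficient of determination, which the abstract does not specify.

```python
import math


def pearson(gold, pred):
    """Pearson correlation between gold and predicted similarity scores."""
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    sd_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    return cov / (sd_g * sd_p)


def mean_squared_error(gold, pred):
    """Average squared difference between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def r_squared(gold, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_g = sum(gold) / len(gold)
    ss_res = sum((g - p) ** 2 for g, p in zip(gold, pred))
    ss_tot = sum((g - mean_g) ** 2 for g in gold)
    return 1.0 - ss_res / ss_tot


# Hypothetical annotator scores on a 0-5 scale and hypothetical model outputs.
toy_gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
toy_pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # perfectly correlated but shifted by 1

print(pearson(toy_gold, toy_pred))
print(mean_squared_error(toy_gold, toy_pred))
print(r_squared(toy_gold, toy_pred))
```

The shifted toy predictions illustrate why the abstract argues for looking beyond Pearson correlation: a constant offset leaves the correlation untouched while the squared error and R² both degrade.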

Alachram H, Chereda H, Beißbarth T, Wingender E, Stegmaier P. Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks. PLoS One 2021;16:e0258623. [PMID: 34653224 PMCID: PMC8519453 DOI: 10.1371/journal.pone.0258623] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/04/2020] [Accepted: 10/01/2021] [Indexed: 11/18/2022] Open

Abstract

Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth of the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step consisted of substituting synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (CNNs) on large breast cancer gene expression data and on other cancer datasets. Performances of resulting models were compared to Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well for the metastatic event prediction tasks compared to other networks. Such performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations as produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.

Holmgren SD, Boyles RR, Cronk RD, Duncan CG, Kwok RK, Lunn RM, Osborn KC, Thessen AE, Schmitt CP. Catalyzing Knowledge-Driven Discovery in Environmental Health Sciences through a Community-Driven Harmonized Language. Int J Environ Res Public Health 2021;18:8985. [PMID: 34501574 PMCID: PMC8430534 DOI: 10.3390/ijerph18178985] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/07/2021] [Revised: 08/13/2021] [Accepted: 08/19/2021] [Indexed: 01/10/2023]

Grissette H, Nfaoui EH. Affective Concept-Based Encoding of Patient Narratives via Sentic Computing and Neural Networks. Cognit Comput 2021;14:274-299. [PMID: 34422122 PMCID: PMC8371039 DOI: 10.1007/s12559-021-09903-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2020] [Accepted: 06/23/2021] [Indexed: 11/30/2022]

Abstract

The automatic generation of features without human intervention is the most critical task for biomedical sentiment analysis. Regarding the high dynamicity of shared patient narrative data, the lack of formal medical language sentiment dictionaries prevents retrieval of the appropriate sentiment, which is unapproachable and can be prone to annotator bias. We propose a novel affective biomedical concept-based encoding via sentic computing and neural networks. The main contributions include four aspects. First, a biomedical embedding, in which a medical entity is defined, normalized, and synthesized from a text, is built using online patient narratives after being combined with label propagation from a widely used comprehensive biomedical vocabulary. Second, considering the dependence on biomedical definitions, drug reaction sample selection based on general matching is suggested. These feature settings are then used to build and recognize affective semantics and sentics based on an extreme learning machine. Finally, a semisupervised LSTM-BiLSTM model for biomedical sentiment analysis is constructed. There was a massive influx of patient self-reports related to the COVID-19 pandemic. A study was conducted in this direction, and we tested the validity, medical language familiarity, and transferability of our approach by analyzing millions of COVID-19 tweets. Comparisons to affective lexicons also indicate that integrating extreme learning machine cognitive capabilities has advantages over biomedical sentiment analysis. By considering sentics vectors on top of the formed embeddings, our semisupervised LSTM-BiLSTM achieved an accuracy of 87.5%. The evaluations of unsupervised learning approximated the results of the previous model when dealing with a serious loss of biomedical data. 
In this paper, we demonstrate the effectiveness of integrating deep-learning-based cognitive capabilities for both enhancing distributed biomedical definitions and inferring sentiment compositions from many patient self-reports on social networks. The relevant encoding of affective information conveyed regarding medication subjects clearly reveals defined roles and expectations that can have a positive impact on public health.

Grissette H, Nfaoui EH. Deep associative learning approach for bio-medical sentiment analysis utilizing unsupervised representation from large-scale patients' narratives. Pers Ubiquitous Comput 2021;27:1-15. [PMID: 34393692 PMCID: PMC8355270 DOI: 10.1007/s00779-021-01595-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2020] [Accepted: 06/29/2021] [Indexed: 06/13/2023]

Song B, Li F, Liu Y, Zeng X. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Brief Bioinform 2021;22:6326536. [PMID: 34308472 DOI: 10.1093/bib/bbab282] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 04/27/2021] [Revised: 06/07/2021] [Accepted: 07/02/2021] [Indexed: 11/13/2022] Open

Pfrieger FW. TeamTree analysis: A new approach to evaluate scientific production. PLoS One 2021;16:e0253847. [PMID: 34288914 PMCID: PMC8294527 DOI: 10.1371/journal.pone.0253847] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/10/2020] [Accepted: 06/14/2021] [Indexed: 11/18/2022] Open

Majewska O, Collins C, Baker S, Björne J, Brown SW, Korhonen A, Palmer M. BioVerbNet: a large semantic-syntactic classification of verbs in biomedicine. J Biomed Semantics 2021;12:12. [PMID: 34266499 PMCID: PMC8280585 DOI: 10.1186/s13326-021-00247-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/28/2021] [Accepted: 07/01/2021] [Indexed: 11/10/2022] Open

Abstract

Background

Recent advances in representation learning have enabled large strides in natural language understanding; however, verbal reasoning remains a challenge for state-of-the-art systems. External sources of structured, expert-curated verb-related knowledge have been shown to boost model performance in different Natural Language Processing (NLP) tasks where accurate handling of verb meaning and behaviour is critical. The cost and time required for manual lexicon construction have been a major obstacle to porting the benefits of such resources to NLP in specialised domains, such as biomedicine. To address this issue, we combine a neural classification method with expert annotation to create BioVerbNet. This new resource comprises 693 verbs assigned to 22 top-level and 117 fine-grained semantic-syntactic verb classes. We make this resource available complete with semantic roles and VerbNet-style syntactic frames.

Results

We demonstrate the utility of the new resource in boosting model performance in document- and sentence-level classification in biomedicine. We apply an established retrofitting method to harness the verb class membership knowledge from BioVerbNet and transform a pretrained word embedding space by pulling together verbs belonging to the same semantic-syntactic class. The BioVerbNet knowledge-aware embeddings surpass the non-specialised baseline by a significant margin on both tasks.

Conclusion

This work introduces the first large, annotated semantic-syntactic classification of biomedical verbs, providing a detailed account of the annotation process, the key differences in verb behaviour between the general and biomedical domain, and the design choices made to accurately capture the meaning and properties of verbs used in biomedical texts. The demonstrated benefits of leveraging BioVerbNet in text classification suggest the resource could help systems better tackle challenging NLP tasks in biomedicine.
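The Results above describe retrofitting: transforming a pretrained embedding space by pulling together verbs that share a class. The abstract does not say which retrofitting method was used, so the sketch below assumes the commonly cited iterative update of Faruqui et al. (2015), in which each vector is moved toward the average of its class neighbours while staying anchored to its original position. The verbs, vectors, and class membership are hypothetical toy data.

```python
import math


def retrofit(vectors, classes, iterations=10, alpha=1.0):
    """Pull vectors of same-class words together (assumed Faruqui-style update).

    vectors: dict mapping word -> list of floats (the pretrained embedding)
    classes: list of word lists; words in one list are treated as neighbours
    alpha:   weight that keeps each vector close to its original value
    """
    # Build the neighbour sets from shared class membership.
    neighbours = {word: set() for word in vectors}
    for members in classes:
        present = [w for w in members if w in vectors]
        for w in present:
            neighbours[w].update(v for v in present if v != w)

    updated = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, nbrs in neighbours.items():
            if not nbrs:
                continue  # words without class neighbours keep their vector
            beta = 1.0 / len(nbrs)  # neighbour weights sum to 1
            updated[word] = [
                (alpha * vectors[word][d] + beta * sum(updated[n][d] for n in nbrs))
                / (alpha + beta * len(nbrs))
                for d in range(len(vectors[word]))
            ]
    return updated


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical two-dimensional embeddings and one hypothetical verb class.
toy_vectors = {
    "inhibit": [1.0, 0.0],
    "suppress": [0.0, 1.0],
    "bind": [5.0, 5.0],
}
toy_classes = [["inhibit", "suppress"]]

retrofitted = retrofit(toy_vectors, toy_classes)
before = euclidean(toy_vectors["inhibit"], toy_vectors["suppress"])
after = euclidean(retrofitted["inhibit"], retrofitted["suppress"])
print(before, after)
```

Anchoring each vector to its original value (the `alpha` term) is what distinguishes this from simply averaging class members: the class knowledge is injected without discarding the distributional information in the pretrained space.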

Liu Z, Roberts RA, Lal-Nag M, Chen X, Huang R, Tong W. AI-based language models powering drug discovery and development. Drug Discov Today 2021;26:2593-2607. [PMID: 34216835 PMCID: PMC8604259 DOI: 10.1016/j.drudis.2021.06.009] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 09/21/2020] [Revised: 04/28/2021] [Accepted: 06/25/2021] [Indexed: 02/08/2023]

Yum Y, Lee JM, Jang MJ, Kim Y, Kim JH, Kim S, Shin U, Song S, Joo HJ. A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation. JMIR Med Inform 2021;9:e29667. [PMID: 34185005 PMCID: PMC8277378 DOI: 10.2196/29667] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/16/2021] [Revised: 05/08/2021] [Accepted: 05/16/2021] [Indexed: 01/16/2023] Open

Abstract

Background

The fact that medical terms require special expertise and are becoming increasingly complex makes it difficult to employ natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, there are very few reference standards in non-English languages. In addition, because the existing reference standards were developed a long time ago, there is a need to develop an updated standard to represent recent findings in medical sciences.

Objective

We propose a new Korean word pair reference set to verify embedding models.

Methods

From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set with 607 word pairs.

Results

The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness of the word pair showed a high correlation (ρ=0.70, P<.001). The intraclass correlation coefficients to assess the interrater agreements of the word pair sets were 0.47 on the similarity task and 0.53 on the relatedness task. The final reference standard was 604 word pairs for the similarity task and 599 word pairs for relatedness, excluding word pairs with answers corresponding to outliers and word pairs that were answered by fewer than 50% of all the respondents. When FastText models were applied to the final reference standard word pair sets, the embedding models trained on medical documents showed a higher correlation between the calculated cosine similarity scores and the human-judged similarity and relatedness scores (similarity task: namu, ρ=0.12 vs with medical text, ρ=0.47; relatedness task: namu, ρ=0.02 vs with medical text, ρ=0.30).

Conclusions

Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. It is expected that our word pair reference sets will be actively utilized in the development of medical and multilingual natural language processing technology in the future.
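The evaluation reported in the Results above scores an embedding model by correlating its cosine similarities with human judgments over a set of word pairs. The following is a minimal sketch of that procedure in plain Python, assuming Spearman's rank correlation (the abstract reports ρ without naming the coefficient). The function names, the English placeholder terms, the three-dimensional vectors, and the human scores are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


def evaluate(embeddings, scored_pairs):
    """Correlate model cosine similarities with human scores for word pairs."""
    model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in scored_pairs]
    human_scores = [score for _, _, score in scored_pairs]
    return spearman(model_scores, human_scores)


# Hypothetical embeddings and hypothetical human similarity ratings.
toy_embeddings = {
    "fever": [0.9, 0.1, 0.0],
    "pyrexia": [0.8, 0.2, 0.1],
    "cough": [0.5, 0.5, 0.2],
    "fracture": [0.0, 0.1, 0.9],
}
toy_pairs = [
    ("fever", "pyrexia", 4.8),
    ("fever", "cough", 3.0),
    ("fever", "fracture", 1.2),
]

rho = evaluate(toy_embeddings, toy_pairs)
print(rho)
```

A rank correlation is a natural fit here because cosine similarities and human ratings live on different scales; only the ordering of the pairs is compared.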

Newman-Griffis D, Sivaraman V, Perer A, Fosler-Lussier E, Hochheiser H. TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora. PROCEEDINGS OF THE CONFERENCE. ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. NORTH AMERICAN CHAPTER. MEETING 2021;2021:106-115. [PMID: 34151319 PMCID: PMC8212692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]

Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Protocol for a reproducible experimental survey on biomedical sentence similarity. PLoS One 2021;16:e0248663. [PMID: 33760855 PMCID: PMC7990182 DOI: 10.1371/journal.pone.0248663] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/09/2020] [Accepted: 03/02/2021] [Indexed: 11/28/2022]

Abstract

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced for multiple reasons, including the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly set. In addition, there are other significant gaps in the literature on biomedical sentence similarity: (1) the evaluation of several unexplored sentence similarity methods which deserve to be studied; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Having identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, updated, and for the first time, reproducible experimental survey on biomedical sentence similarity. Our experimental survey will be based on our own software replication and the evaluation of all methods being studied on the same software platform, which will be specially developed for this work, and it will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.

Chen Q, Allot A, Lu Z. LitCovid: an open database of COVID-19 literature. Nucleic Acids Res 2021;49:D1534-D1540. [PMID: 33166392 PMCID: PMC7778958 DOI: 10.1093/nar/gkaa952] [Citation(s) in RCA: 144] [Impact Index Per Article: 36.0] [Received: 08/15/2020] [Revised: 10/02/2020] [Accepted: 10/08/2020] [Indexed: 12/22/2022]

Joachimiak MP. Zinc against COVID-19? Symptom surveillance and deficiency risk groups. PLoS Negl Trop Dis 2021;15:e0008895. [PMID: 33395417 PMCID: PMC7781367 DOI: 10.1371/journal.pntd.0008895] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Indexed: 01/08/2023]

Abstract

A wide variety of symptoms is associated with Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection, and these symptoms can overlap with other conditions and diseases. Knowing the distribution of symptoms across diseases and individuals can support clinical actions on timelines shorter than those for drug and vaccine development. Here, we focus on zinc deficiency symptoms, symptom overlap with other conditions, as well as zinc effects on immune health and mechanistic zinc deficiency risk groups. There are well-studied beneficial effects of zinc on the immune system including a decreased susceptibility to and improved clinical outcomes for infectious pathogens including multiple viruses. Zinc is also an anti-inflammatory and anti-oxidative stress agent, relevant to some severe Coronavirus Disease 2019 (COVID-19) symptoms. Unfortunately, zinc deficiency is common worldwide and not exclusive to the developing world. Lifestyle choices and preexisting conditions alone can result in zinc deficiency, and we compile zinc risk groups based on a review of the literature. It is also important to distinguish chronic zinc deficiency from deficiency acquired upon viral infection and immune response and their different supplementation strategies. Zinc is being considered as prophylactic or adjunct therapy for COVID-19, with 12 clinical trials underway, highlighting the relevance of this trace element for global pandemics. Using the example of zinc, we show that there is a critical need for a deeper understanding of essential trace elements in human health, and the resulting deficiency symptoms and their overlap with other conditions. This knowledge will directly support human immune health for decreasing susceptibility, shortening illness duration, and preventing progression to severe cases in the current and future pandemics.

Yeganova L, Kim S, Chen Q, Balasanov G, Wilbur WJ, Lu Z. Better synonyms for enriching biomedical search. J Am Med Inform Assoc 2020;27:1894-1902. [PMID: 33083825 PMCID: PMC7727334 DOI: 10.1093/jamia/ocaa151] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/23/2020] [Revised: 05/20/2020] [Accepted: 08/20/2020] [Indexed: 01/12/2023]

Makrodimitris S, van Ham RCHJ, Reinders MJT. Automatic Gene Function Prediction in the 2020's. Genes (Basel) 2020;11:E1264. [PMID: 33120976 PMCID: PMC7692357 DOI: 10.3390/genes11111264] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 09/28/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 02/06/2023]