1
Klein AZ, Weissenbacher D, O’Connor K, Elyaderani A, Amaro IF, Onishi T, Golder S, Spiegel K, Scotch M, Gonzalez-Hernandez G. Detection of patient metadata in published articles for genomic epidemiology using machine learning and large language models. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2025:2025.04.25.25326298. [PMID: 40343027 PMCID: PMC12060954 DOI: 10.1101/2025.04.25.25326298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2025]
Abstract
Objective Patient metadata exist in published articles, but are often disconnected from genome sequences in databases, limiting their utility for genomic epidemiology. The objective of this study was to develop and evaluate natural language processing methods to facilitate the large-scale detection of patient metadata associated with reports of genome sequencing in published articles, drawing on the case of SARS-CoV-2. Methods We applied filters to select a sample of 245 PubMed articles (50,918 sentences) in LitCovid for manual annotation of sentences that reported generating SARS-CoV-2 sequences. We trained, deployed, and validated a BERT-based classifier, and selected a sample of 150 predicted articles (22,147 sentences) for manual annotation of sentences that reported patient metadata associated with the sequences. In addition to training BERT-based classifiers, we experimented with a generative AI approach, prompting the Llama-3-70B LLM using zero-shot, role-based, few-shot, chain-of-thought, and reasoning-eliciting prompting. Results BERT-based models that were pre-trained on corpora in biomedical or, more specifically, COVID-19 domains outperformed those that were pre-trained on corpora in general domains for detecting reports of patient metadata associated with SARS-CoV-2 sequences, achieving the best performance with a classifier based on a BiomedBERT-Large-Abstract model (F1-score = 0.776). While the best performance of our generative AI approach was achieved using role-based, few-shot, and chain-of-thought prompting (F1-score = 0.558), it was nonetheless outperformed by all of our machine learning-based classifiers. Conclusion Our methods were applied to more than 350,000 published articles and can be used to advance the utility and efficiency of genomic epidemiology for public health responses to virus outbreaks.
Affiliation(s)
- Ari Z. Klein
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Davy Weissenbacher
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Karen O’Connor
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Amir Elyaderani
  - College of Health Solutions, Arizona State University, Phoenix, AZ, USA
- Ivan Flores Amaro
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Takeshi Onishi
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Su Golder
  - Department of Health Sciences, University of York, York, UK
- Kaelen Spiegel
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Matthew Scotch
  - College of Health Solutions, Arizona State University, Phoenix, AZ, USA
2
Xie J, Zhang Z, Zeng S, Hilliard J, An G, Tang X, Jiang L, Yu Y, Wan X, Xu D. Leveraging Large Language Models for Infectious Disease Surveillance-Using a Web Service for Monitoring COVID-19 Patterns From Self-Reporting Tweets: Content Analysis. J Med Internet Res 2025; 27:e63190. [PMID: 39977859 PMCID: PMC11888100 DOI: 10.2196/63190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 12/09/2024] [Accepted: 01/16/2025] [Indexed: 02/22/2025] Open
Abstract
BACKGROUND The emergence of new SARS-CoV-2 variants, the resulting reinfections, and post-COVID-19 condition continue to impact many people's lives. Tracking websites like the one at Johns Hopkins University no longer report the daily confirmed cases, posing challenges to accurately determine the true extent of infections. Many COVID-19 cases with mild symptoms are self-assessed at home and reported on social media, which provides an opportunity to monitor and understand the progression and evolving trends of the disease. OBJECTIVE We aim to build a publicly available database of COVID-19-related tweets and extracted information about symptoms and recovery cycles from self-reported tweets. We have presented the results of our analysis of infection, reinfection, recovery, and long-term effects of COVID-19 on a visualization website that refreshes data on a weekly basis. METHODS We used Twitter (subsequently rebranded as X) to collect COVID-19-related data, from which 9 native English-speaking annotators annotated a training dataset of COVID-19-positive self-reporters. We then used large language models to identify positive self-reporters from other unannotated tweets. We used the Hilbert transform to calculate the lead of the prediction curve ahead of the reported curve. Finally, we presented our findings on symptoms, recovery, reinfections, and long-term effects of COVID-19 on the Covlab website. RESULTS We collected 7.3 million tweets related to COVID-19 between January 1, 2020, and April 1, 2024, including 262,278 self-reported cases. The predicted number of infection cases by our model is 7.63 days ahead of the official report. In addition to common symptoms, we identified some symptoms that were not included in the list from the US Centers for Disease Control and Prevention, such as lethargy and hallucinations. Repeat infections were common, with rates of second and third infections at 7.49% (19,644/262,278) and 1.37% (3593/262,278), respectively, whereas 0.45% (1180/262,278) also reported that they had been infected >5 times. We identified 723 individuals who shared detailed recovery experiences through tweets, indicating a substantial reduction in recovery time over the years. Specifically, the average recovery period decreased from around 30 days in 2020 to approximately 12 days in 2023. In addition, geographic information collected from confirmed individuals indicates that the temporal patterns of confirmed cases in states such as California and Texas closely mirror the overall trajectory observed across the United States. CONCLUSIONS Despite some biases and limitations, self-reported tweet data serves as a valuable complement to clinical data, especially in the postpandemic era dominated by mild cases. Our web-based analytic platform can play a significant role in continuously tracking COVID-19, finding new uncommon symptoms, detecting and monitoring the manifestation of long-term effects, and providing necessary insights to the public and decision-makers.
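The reinfection rates quoted in this abstract are plain proportions of the 262,278 self-reported cases. The following is a minimal sketch that recomputes them from the counts given above; the helper name `pct` and the dictionary labels are hypothetical and exist only for this illustration.

```python
# Counts come from the abstract; `pct` is a hypothetical helper name.
TOTAL_SELF_REPORTED = 262_278


def pct(count, total=TOTAL_SELF_REPORTED):
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


rates = {
    "second infection": pct(19_644),
    "third infection": pct(3_593),
    "more than 5 infections": pct(1_180),
}

for label, value in rates.items():
    print(f"{label}: {value}%")
```

Rounded to two decimals, the recomputed values agree with the 7.49%, 1.37%, and 0.45% reported in the abstract.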
Affiliation(s)
- Jiacheng Xie
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Ziyang Zhang
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Shuai Zeng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Joel Hilliard
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
- Guanghui An
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - School of Acupuncture-Moxibustion and Tuina, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Xiaoting Tang
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Shanghai Pudong New Area Wanggang Community Health Service Center, Shanghai, China
- Lei Jiang
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Yang Yu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Xiufeng Wan
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - NextGen Center for Influenza and Emerging Infectious Diseases, University of Missouri, Columbia, United States
  - Department of Molecular Microbiology and Immunology, University of Missouri, Columbia, United States
- Dong Xu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
3
Zhou X, Zhou J, Wang C, Xie Q, Ding K, Mao C, Liu Y, Cao Z, Chu H, Chen X, Xu H, Larson HJ, Luo Y. PH-LLM: Public Health Large Language Models for Infoveillance. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2025:2025.02.08.25321587. [PMID: 39990576 PMCID: PMC11844576 DOI: 10.1101/2025.02.08.25321587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2025]
Abstract
Background The effectiveness of public health interventions, such as vaccination and social distancing, relies on public support and adherence. Social media has emerged as a critical platform for understanding and fostering public engagement with health interventions. However, the lack of real-time surveillance on public health issues leveraging social media data, particularly during public health emergencies, leads to delayed responses and suboptimal policy adjustments. Methods To address this gap, we developed PH-LLM (Public Health Large Language Models for Infoveillance), a novel suite of large language models (LLMs) specifically designed for real-time public health monitoring. We curated a multilingual training corpus comprising 593,100 instruction-output pairs from 36 datasets, covering 96 public health infoveillance tasks and 6 question-answering datasets based on social media data. PH-LLM was trained using quantized low-rank adapters (QLoRA) and LoRA plus, leveraging Qwen 2.5, which supports 29 languages. The PH-LLM suite includes models of six different sizes: 0.5B, 1.5B, 3B, 7B, 14B, and 32B. To evaluate PH-LLM, we constructed a benchmark comprising 19 English and 20 multilingual public health tasks using 10 social media datasets (totaling 52,158 unseen instruction-output pairs). We compared PH-LLM's performance against leading open-source models, including Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407, and Qwen2.5-72B-Instruct, as well as proprietary models such as GPT-4o. Findings Across 19 English and 20 multilingual evaluation tasks, PH-LLM consistently outperformed baseline models of similar and larger sizes, including instruction-tuned versions of Qwen2.5, Llama3.1/3.2, Mistral, and bloomz, with PH-LLM-32B achieving the state-of-the-art results. Notably, PH-LLM-14B and PH-LLM-32B surpassed Qwen2.5-72B-Instruct, Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407, and GPT-4o in both English tasks (>=56.0% vs. <=52.3%) and multilingual tasks (>=59.6% vs. <=59.1%). The only exception was PH-LLM-7B, with slightly suboptimal average performance (48.7%) in English tasks compared to Qwen2.5-7B-Instruct (50.7%), although it outperformed GPT-4o mini (46.9%), Mistral-Small-Instruct-2409 (45.8%), Llama-3.1-8B-Instruct (45.4%), and bloomz-7b1-mt (27.9%). Interpretation PH-LLM represents a significant advancement in real-time public health infoveillance, offering state-of-the-art multilingual capabilities and cost-effective solutions for monitoring public sentiment on health issues. By equipping global, national, and local public health agencies with timely insights from social media data, PH-LLM has the potential to enhance rapid response strategies, improve policy-making, and strengthen public health communication during crises and beyond. Funding This study is supported in part by NIH grant R01LM013337 (YL).
Affiliation(s)
- Xinyu Zhou
  - Division of Biostatistics and Informatics, Department of Preventive Medicine, Northwestern University, Chicago, IL 60611, USA
  - Health Science Integrated PhD Program, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
- Jiaqi Zhou
  - Division of Biostatistics and Informatics, Department of Preventive Medicine, Northwestern University, Chicago, IL 60611, USA
  - Health Science Integrated PhD Program, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
- Chiyu Wang
  - Department of Computer Science, Yale University, New Haven, CT 06511, USA
- Qianqian Xie
  - Department of Biomedical Informatics & Data Science, Yale School of Medicine, CT 06510, USA
- Kaize Ding
  - Department of Statistics and Data Science, Northwestern University, Evanston, IL 60208, USA
- Chengsheng Mao
  - Division of Biostatistics and Informatics, Department of Preventive Medicine, Northwestern University, Chicago, IL 60611, USA
- Yuntian Liu
  - Department of Biomedical Informatics & Data Science, Yale School of Medicine, CT 06510, USA
- Zhiyuan Cao
  - Department of Biomedical Informatics & Data Science, Yale School of Medicine, CT 06510, USA
- Huangrui Chu
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA
- Xi Chen
  - Department of Health Policy and Management, Yale School of Public Health, New Haven, CT 06510, USA
  - Department of Economics, Yale University, New Haven, CT 06511, USA
- Hua Xu
  - Department of Biomedical Informatics & Data Science, Yale School of Medicine, CT 06510, USA
- Heidi J. Larson
  - Department of Infectious Disease Dynamics, London School of Hygiene and Tropical Medicine, London W1E 7HT, UK
  - Institute for Health Metrics and Evaluation, University of Washington, Seattle, WA 98195, USA
- Yuan Luo
  - Division of Biostatistics and Informatics, Department of Preventive Medicine, Northwestern University, Chicago, IL 60611, USA
  - Center for Collaborative AI in Healthcare, Institute for AI in Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
4
Lin S, Garay L, Hua Y, Guo Z, Li W, Li M, Zhang Y, Xu X, Yang J. Analysis of longitudinal social media for monitoring symptoms during a pandemic. J Biomed Inform 2025; 162:104778. [PMID: 39832606 DOI: 10.1016/j.jbi.2025.104778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Revised: 12/28/2024] [Accepted: 01/15/2025] [Indexed: 01/22/2025]
Abstract
OBJECTIVE Current studies leveraging social media data for disease monitoring face challenges like noisy colloquial language and insufficient tracking of user disease progression in longitudinal data settings. This study aims to develop a pipeline for collecting, cleaning, and analyzing large-scale longitudinal social media data for disease monitoring, with a focus on the COVID-19 pandemic. MATERIALS AND METHODS The pipeline begins by screening COVID-19 cases from tweets spanning February 1, 2020, to April 30, 2022. Longitudinal data is collected for each patient, two months before and three months after self-reporting. Symptoms are extracted using Named Entity Recognition (NER), followed by denoising with a combination of Graph Convolutional Network (GCN) and Bidirectional Encoder Representations from Transformers (BERT) model to retain only User-experienced Symptom Mentions (USM). Subsequently, symptoms are mapped to standardized medical concepts using the Unified Medical Language System (UMLS). Finally, this study conducts symptom pattern analysis and visualization to illustrate temporal changes in symptom prevalence and co-occurrence. RESULTS This study identified 191,096 self-reported COVID-19-positive cases from COVID-19-related tweets and retrospectively collected 811,398,280 historical tweets, of which 2,120,964 contained symptom information. After denoising, 39% (832,287) of symptom-sharing tweets reflected user-experienced mentions. The trained USM model achieved an average F1 score of 0.927. Further analysis revealed a higher prevalence of upper respiratory tract symptoms during the Omicron period compared to the Delta and Wild-type periods. Additionally, there was a pronounced co-occurrence of lower respiratory tract and nervous system symptoms in the Wild-type strain and Delta variant. CONCLUSION This study established a robust framework for analyzing longitudinal social media data to monitor symptoms during a pandemic. By integrating denoising of user-experienced symptom mentions, our findings reveal the duration of different symptoms over time and by variant within a cohort of nearly 200,000 patients, providing critical insights into symptom trends that are often difficult to capture through traditional data sources.
Affiliation(s)
- Shixu Lin
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Lucas Garay
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Yining Hua
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA
- Zhijiang Guo
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
- Wanxin Li
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Minghui Li
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Yujie Zhang
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Xiaolin Xu
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Jie Yang
  - School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China; Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA.
5
Wen J, Xue H, Rush E, Panickan VA, Cai T, Zhou D, Ho YL, Costa L, Begoli E, Hong C, Gaziano JM, Cho K, Liao KP, Lu J, Cai T. DOME: Directional medical embedding vectors from Electronic Health Records. J Biomed Inform 2025; 162:104768. [PMID: 39755324 PMCID: PMC12040072 DOI: 10.1016/j.jbi.2024.104768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 12/07/2024] [Accepted: 12/26/2024] [Indexed: 01/06/2025]
Abstract
MOTIVATION The increasing availability of Electronic Health Record (EHR) systems has created enormous potential for translational research. Recent developments in representation learning techniques have led to effective large-scale representations of EHR concepts along with knowledge graphs that empower downstream EHR studies. However, most existing methods require training with patient-level data, limiting their abilities to expand the training with multi-institutional EHR data. On the other hand, scalable approaches that only require summary-level data do not incorporate temporal dependencies between concepts. METHODS We introduce a DirectiOnal Medical Embedding (DOME) algorithm to encode temporally directional relationships between medical concepts, using summary-level EHR data. Specifically, DOME first aggregates patient-level EHR data into an asymmetric co-occurrence matrix. Then it computes two Positive Pointwise Mutual Information (PPMI) matrices to correspondingly encode the pairwise prior and posterior dependencies between medical concepts. Following that, a joint matrix factorization is performed on the two PPMI matrices, which results in three vectors for each concept: a semantic embedding and two directional context embeddings. They collectively provide a comprehensive depiction of the temporal relationship between EHR concepts. RESULTS We highlight the advantages and translational potential of DOME through three sets of validation studies. First, DOME consistently improves existing direction-agnostic embedding vectors for disease risk prediction in several diseases, for example achieving a relative gain of 5.5% in the area under the receiver operating characteristic (AUROC) for lung cancer. Second, DOME excels in directional drug-disease relationship inference by successfully differentiating between drug side effects and indications, correspondingly achieving relative AUROC gain over the state-of-the-art methods by 10.8% and 6.6%. 
Finally, DOME effectively constructs directional knowledge graphs, which distinguish disease risk factors from comorbidities, thereby revealing disease progression trajectories. The source code is available at https://github.com/celehs/Directional-EHR-embedding.
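The abstract describes aggregating EHR data into an asymmetric co-occurrence matrix and then computing Positive Pointwise Mutual Information (PPMI). As a rough illustration of that one step, here is a minimal sketch of the standard PPMI formula applied to a small directed count matrix. The toy counts, the function name `ppmi`, and the reading of entry `[i][j]` as "concept i observed before concept j" are assumptions for this example; how DOME derives its two separate PPMI matrices and performs the joint factorization is not specified in the abstract and is not shown here.

```python
import math


def ppmi(counts):
    """Standard PPMI for a square matrix of directed co-occurrence counts.

    counts[i][j] is assumed to count concept i occurring before concept j.
    PPMI(i, j) = max(0, log(C_ij * N / (row_i * col_j))).
    """
    n = len(counts)
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c = counts[i][j]
            if c > 0:  # zero counts keep PPMI at 0 and avoid log(0)
                pmi = math.log(c * total / (row_sums[i] * col_sums[j]))
                result[i][j] = max(0.0, pmi)
    return result


# Hypothetical toy counts for three concepts; not data from the paper.
toy_counts = [
    [0, 8, 1],
    [2, 0, 1],
    [1, 1, 0],
]
toy_ppmi = ppmi(toy_counts)
```

Because the input counts are asymmetric, `toy_ppmi[0][1]` and `toy_ppmi[1][0]` differ, which is the property a direction-aware embedding needs to preserve.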
Affiliation(s)
- Jun Wen
  - Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA
- Hao Xue
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Everett Rush
  - Department of Energy, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Vidul A Panickan
  - Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA
- Tianrun Cai
  - VA Boston Healthcare System, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
- Doudou Zhou
  - Department of Statistics and Data Science, National University of Singapore, Singapore
- Yuk-Lam Ho
  - VA Boston Healthcare System, Boston, MA, USA
- Edmon Begoli
  - Department of Energy, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- J Michael Gaziano
  - Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
- Kelly Cho
  - Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
- Katherine P Liao
  - Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA
- Junwei Lu
  - VA Boston Healthcare System, Boston, MA, USA; Harvard T.H. Chan School of Public Health, Boston, MA, USA.
- Tianxi Cai
  - Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA; Harvard T.H. Chan School of Public Health, Boston, MA, USA.
6
Chai H, Cui J, Tang S, Ding Y, Liu X, Fang B, Liao Q. MG-SIN: Multigraph Sparse Interaction Network for Multitask Stance Detection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:3111-3125. [PMID: 37956011 DOI: 10.1109/tnnls.2023.3328659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Stance detection on social media aims to identify if an individual is in support of or against a specific target. Most existing stance detection approaches primarily rely on modeling the contextual semantic information in sentences and neglect the pragmatics dependency information of words, thus degrading performance. Although several single-task learning methods have been proposed to capture richer semantic representation information, they still suffer from semantic sparsity problems caused by short texts on social media. This article proposes a novel multigraph sparse interaction network (MG-SIN) that uses multitask learning (MTL) to identify the stances and classify the sentiment polarities of tweets simultaneously. Our basic idea is to explore the pragmatics dependency relationship between tasks at the word level by constructing two types of heterogeneous graphs, including task-specific and task-related graphs (tr-graphs), to boost the learning of task-specific representations. A graph-aware module is proposed to adaptively facilitate information sharing between tasks via a novel sparse interaction mechanism among heterogeneous graphs. In extensive experiments on two real-world datasets, MG-SIN achieves competitive improvements over state-of-the-art baselines of up to 2.1% and 2.42% for the stance detection task, and 5.26% and 3.93% for the sentiment analysis task, respectively.
7
Li X, Peng L, Wang YP, Zhang W. Open challenges and opportunities in federated foundation models towards biomedical healthcare. BioData Min 2025; 18:2. [PMID: 39755653 DOI: 10.1186/s13040-024-00414-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Accepted: 12/09/2024] [Indexed: 01/06/2025] Open
Abstract
This survey explores the transformative impact of foundation models (FMs) in artificial intelligence, focusing on their integration with federated learning (FL) in biomedical research. Foundation models such as ChatGPT, LLaMa, and CLIP, which are trained on vast datasets through methods including unsupervised pretraining, self-supervised learning, instructed fine-tuning, and reinforcement learning from human feedback, represent significant advancements in machine learning. These models, with their ability to generate coherent text and realistic images, are crucial for biomedical applications that require processing diverse data forms such as clinical reports, diagnostic images, and multimodal patient interactions. The incorporation of FL with these sophisticated models presents a promising strategy to harness their analytical power while safeguarding the privacy of sensitive medical data. This approach not only enhances the capabilities of FMs in medical diagnostics and personalized treatment but also addresses critical concerns about data privacy and security in healthcare. This survey reviews the current applications of FMs in federated settings, underscores the challenges, and identifies future research directions including scaling FMs, managing data diversity, and enhancing communication efficiency within FL frameworks. The objective is to encourage further research into the combined potential of FMs and FL, laying the groundwork for healthcare innovations.
Affiliation(s)
- Xingyu Li
  - Department of Computer Science, Tulane University, New Orleans, LA, USA
- Lu Peng
  - Department of Computer Science, Tulane University, New Orleans, LA, USA.
- Yu-Ping Wang
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA, USA
- Weihua Zhang
  - School of Computer Science, Fudan University, Shanghai, China
8
Saleh SN. Enhancing multilabel classification for unbalanced COVID-19 vaccination hesitancy tweets using ensemble learning. Comput Biol Med 2025; 184:109437. [PMID: 39581124 DOI: 10.1016/j.compbiomed.2024.109437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Revised: 11/12/2024] [Accepted: 11/12/2024] [Indexed: 11/26/2024]
Abstract
PURPOSE Vaccination plays a crucial role in public health strategies, particularly in combating contagious diseases like COVID-19. Despite developing and authorizing numerous vaccines, achieving widespread immunity remains a challenge. Understanding individuals' willingness to receive vaccines is essential, and recent studies have explored acceptance rates across various regions. However, vaccine hesitancy is still considered a significant global health concern, even after the pandemic's peak. METHODS This article explores the topic of classification in social media networks, specifically in tweets, focusing on the challenges of class imbalance and multilabel multiclass tweet classification. It investigates the effectiveness of oversampling in addressing unbalanced datasets and proposes an ensemble-based framework comprising various BERT models fine-tuned on diverse text corpora. RESULTS The proposed model was applied to a benchmark dataset containing 12 classes with a highly imbalanced distribution. Compared to recent literature, the model achieved significant improvements in performance metrics. We observed a 2% increase in micro, macro, and weighted average F1 scores, a 4% increase in accuracy, and a 3% increase in average Jaccard index when using the proposed framework on this unbalanced multi-labeled benchmark. CONCLUSION The study investigated fine-tuning BERT models for improved classification in tweets on unbalanced datasets related to public health. The proposed ensemble approach with oversampling techniques yielded significant improvements in multiple metrics compared to existing methods. This suggests a promising approach to analyzing vaccine discourse on social media and potentially aiding public health efforts.
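The results above are reported as micro and macro F1 scores and an average Jaccard index on a multilabel task. As a reminder of how those metrics differ, here is a minimal sketch using their standard definitions on a tiny made-up example. The function names and the toy label matrices are hypothetical; this is not the paper's evaluation code, and the sample-averaged form of the Jaccard index is an assumption about which variant was used.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred):
    """Micro F1, macro F1, and sample-averaged Jaccard for 0/1 label rows."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    jaccards = []
    for truth, pred in zip(y_true, y_pred):
        inter = union = 0
        for k in range(n_labels):
            if truth[k] and pred[k]:
                tp[k] += 1
                inter += 1
            elif pred[k]:
                fp[k] += 1
            elif truth[k]:
                fn[k] += 1
            if truth[k] or pred[k]:
                union += 1
        # A sample with no true and no predicted labels counts as a perfect match.
        jaccards.append(inter / union if union else 1.0)
    return {
        # Micro: pool counts over all labels, so frequent labels dominate.
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        # Macro: average per-label F1, so rare labels weigh as much as common ones.
        "macro_f1": sum(f1(tp[k], fp[k], fn[k]) for k in range(n_labels)) / n_labels,
        "jaccard": sum(jaccards) / len(jaccards),
    }


# Hypothetical toy data: 4 tweets, 3 labels; label 1 is rare and misclassified.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]]
scores = multilabel_scores(y_true, y_pred)
```

On this toy data the macro F1 falls well below the micro F1 because the single rare label is never predicted correctly, which is why imbalanced benchmarks usually report both.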
Affiliation(s)
- Sherine Nagy Saleh
  - Computer Engineering Department, Arab Academy for Science, Technology and Maritime Transport, Abukir, Alexandria, 1029, Egypt.
9
Wang X, Lin Z, Shi J, Sun Y. Opinion Leadership and Sharing Positive and Negative Information About Vaccines on Social Media: A Mixed-Methods Approach. JOURNAL OF HEALTH COMMUNICATION 2024; 29:693-701. [PMID: 39557511 DOI: 10.1080/10810730.2024.2426810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2024]
Abstract
In our research, we examined how three dimensions of opinion leadership-connectivity, maven, and persuasiveness-are associated with sharing positive and negative information about vaccines among parents in Hong Kong through a mixed-methods approach. In two studies, we assessed opinion leadership following a sociometric approach that involved using data from social media (Study 1) and a self-assessment using survey data (Study 2), which yielded largely consistent results. In particular, whereas connectivity and maven were significantly associated with sharing positive information about vaccines, all three dimensions were significantly associated with sharing negative information about vaccines. Those findings suggest that different dimensions of opinion leadership play different roles in information sharing depending on the information's valence. Moreover, the similar pattern of findings from both studies suggested that the sociometric approach and self-assessment may capture the multidimensional nature of opinion leadership equally well. In sum, the findings advance theoretical discussions on the role of opinion leadership in information sharing and offer practical insights into promoting vaccination for children among parents.
Affiliation(s)
- Xiaohui Wang
- Department of Media and Communication, City University of Hong Kong, Hong Kong, China
- Zhihuai Lin
- Hussman School of Journalism and Media, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jingyuan Shi
- Department of Interactive Media, Hong Kong Baptist University, Hong Kong, China
- Ye Sun
- Department of Media and Communication, City University of Hong Kong, Hong Kong, China
10
Lu Y, Aleta A, Du C, Shi L, Moreno Y. LLMs and generative agent-based models for complex systems research. Phys Life Rev 2024; 51:283-293. [PMID: 39486377 DOI: 10.1016/j.plrev.2024.10.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Accepted: 10/23/2024] [Indexed: 11/04/2024]
Abstract
The advent of Large Language Models (LLMs) promises to transform research across the natural and social sciences, offering new paradigms for understanding complex systems. In particular, Generative Agent-Based Models (GABMs), which integrate LLMs to simulate human behavior, have attracted increasing public attention due to their potential to model complex interactions in a wide range of artificial environments. This paper briefly reviews the disruptive role LLMs are playing in fields such as network science, evolutionary game theory, social dynamics, and epidemic modeling. We assess recent advancements, including the use of LLMs for predicting social behavior, enhancing cooperation in game theory, and modeling disease propagation. The findings demonstrate that LLMs can reproduce human-like behaviors, such as fairness, cooperation, and social norm adherence, while also introducing unique advantages such as cost efficiency, scalability, and ethical simplification. However, the results reveal inconsistencies in their behavior tied to prompt sensitivity, hallucinations, and even the model characteristics, pointing to challenges in controlling these AI-driven agents. Despite their potential, the effective integration of LLMs into decision-making processes, whether in government, societal, or individual contexts, requires addressing biases, prompt design challenges, and understanding the dynamics of human-machine interactions. Future research must refine these models, standardize methodologies, and explore the emergence of new cooperative behaviors as LLMs increasingly interact with humans and each other, potentially transforming how decisions are made across various systems.
Affiliation(s)
- Yikang Lu
- School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming, 650221, China
- Alberto Aleta
- Institute for Biocomputation and Physics of Complex Systems, University of Zaragoza, Zaragoza, 50018, Spain; Department of Theoretical Physics, University of Zaragoza, Zaragoza, 50009, Spain
- Chunpeng Du
- School of Mathematics, Kunming University, Kunming, Yunnan 650214, China
- Lei Shi
- School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming, 650221, China; School of Statistics and Mathematics, Shanghai Lixin University of Accounting and Finance, Shanghai, 201209, China.
- Yamir Moreno
- Institute for Biocomputation and Physics of Complex Systems, University of Zaragoza, Zaragoza, 50018, Spain; Department of Theoretical Physics, University of Zaragoza, Zaragoza, 50009, Spain; Centai Institute, Turin, Italy.
11
Cho HN, Jun TJ, Kim YH, Kang H, Ahn I, Gwon H, Kim Y, Seo J, Choi H, Kim M, Han J, Kee G, Park S, Ko S. Task-Specific Transformer-Based Language Models in Health Care: Scoping Review. JMIR Med Inform 2024; 12:e49724. [PMID: 39556827 PMCID: PMC11612605 DOI: 10.2196/49724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 07/10/2023] [Accepted: 10/21/2024] [Indexed: 11/20/2024] Open
Abstract
BACKGROUND Transformer-based language models have shown great potential to revolutionize health care by advancing clinical decision support, patient interaction, and disease prediction. However, despite their rapid development, the implementation of transformer-based language models in health care settings remains limited. This is partly due to the lack of a comprehensive review, which hinders a systematic understanding of their applications and limitations. Without clear guidelines and consolidated information, both researchers and physicians face difficulties in using these models effectively, resulting in inefficient research efforts and slow integration into clinical workflows. OBJECTIVE This scoping review addresses this gap by examining studies on medical transformer-based language models and categorizing them into 6 tasks: dialogue generation, question answering, summarization, text classification, sentiment analysis, and named entity recognition. METHODS We conducted a scoping review following the Cochrane scoping review protocol. A comprehensive literature search was performed across databases, including Google Scholar and PubMed, covering publications from January 2017 to September 2024. Studies involving transformer-derived models in medical tasks were included. Data were categorized into 6 key tasks. RESULTS Our key findings revealed both advancements and critical challenges in applying transformer-based models to health care tasks. For example, models like MedPIR involving dialogue generation show promise but face privacy and ethical concerns, while question-answering models like BioBERT improve accuracy but struggle with the complexity of medical terminology. The BioBERTSum summarization model aids clinicians by condensing medical texts but needs better handling of long sequences. 
CONCLUSIONS This review attempted to provide a consolidated understanding of the role of transformer-based language models in health care and to guide future research directions. By addressing current challenges and exploring the potential for real-world applications, we envision significant improvements in health care informatics. Addressing the identified challenges and implementing proposed solutions can enable transformer-based language models to significantly improve health care delivery and patient outcomes. Our review provides valuable insights for future research and practical applications, setting the stage for transformative advancements in medical informatics.
Affiliation(s)
- Ha Na Cho
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Tae Joon Jun
- Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea
- Young-Hak Kim
- Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Heejun Kang
- Division of Cardiology, Asan Medical Center, Seoul, Republic of Korea
- Imjin Ahn
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Hansle Gwon
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Yunha Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Jiahn Seo
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Heejung Choi
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Minkyoung Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Jiye Han
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Gaeun Kee
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Seohyun Park
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Soyoung Ko
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
12
He Z, Bauch CT. Effect of homophily on coupled behavior-disease dynamics near a tipping point. Math Biosci 2024; 376:109264. [PMID: 39097225 DOI: 10.1016/j.mbs.2024.109264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Revised: 06/18/2024] [Accepted: 07/26/2024] [Indexed: 08/05/2024]
Abstract
Understanding the interplay between social activities and disease dynamics is crucial for effective public health interventions. Recent studies using coupled behavior-disease models assumed homogeneous populations. However, heterogeneity in population, such as different social groups, cannot be ignored. In this study, we divided the population into social media users and non-users, and investigated the impact of homophily (the tendency for individuals to associate with others similar to themselves) and online events on disease dynamics. Our results reveal that homophily hinders the adoption of vaccinating strategies, hastening the approach to a tipping point after which the population converges to an endemic equilibrium with no vaccine uptake. Furthermore, we find that online events can significantly influence disease dynamics, with early discussions on social media platforms serving as an early warning signal of potential disease outbreaks. Our model provides insights into the mechanisms underlying these phenomena and underscores the importance of considering homophily in disease modeling and public health strategies.
Affiliation(s)
- Zitao He
- Department of Applied Mathematics, University of Waterloo, Waterloo, ON, N2L 3G1, Canada.
- Chris T Bauch
- Department of Applied Mathematics, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
13
Zhang Z, Hua Y, Zhou P, Lin S, Li M, Zhang Y, Zhou L, Liao Y, Yang J. Sexual and Gender-Diverse Individuals Face More Health Challenges during COVID-19: A Large-Scale Social Media Analysis with Natural Language Processing. HEALTH DATA SCIENCE 2024; 4:0127. [PMID: 39247070 PMCID: PMC11378377 DOI: 10.34133/hds.0127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 07/25/2024] [Indexed: 09/10/2024]
Abstract
Background: The COVID-19 pandemic has caused a disproportionate impact on the sexual and gender-diverse (SGD) community. Compared with non-SGD populations, their social relations and health status are more vulnerable, whereas public health data regarding SGD are scarce. Methods: To analyze the concerns and health status of SGD individuals, this cohort study leveraged 471,371,477 tweets from 251,455 SGD and 22,644,411 non-SGD users, spanning from 2020 February 1 to 2022 April 30. The outcome measures comprised the distribution and dynamics of COVID-related topics, attitudes toward vaccines, and the prevalence of symptoms. Results: Topic analysis revealed that SGD users engaged more frequently in discussions related to "friends and family" (20.5% vs. 13.1%, P < 0.001) and "wear masks" (10.1% vs. 8.3%, P < 0.001) compared to non-SGD users. Additionally, SGD users exhibited a markedly higher proportion of positive sentiment in tweets about vaccines, including Moderna, Pfizer, AstraZeneca, and Johnson & Johnson. Among 102,464 users who self-reported COVID-19 diagnoses, SGD users disclosed significantly higher frequencies of mentioning 61 out of 69 COVID-related symptoms than non-SGD users, encompassing both physical and mental health challenges. Conclusion: The results provide insights into an understanding of the unique needs and experiences of the SGD community during the pandemic, emphasizing the value of social media data in epidemiological and public health research.
Affiliation(s)
- Zhiyun Zhang
- Department of Big Data in Health Science School of Public Health, Zhejiang University School of Medicine, Hangzhou, China
- Yining Hua
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Peilin Zhou
- Thrust of Data Science and Analytics, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
- Shixu Lin
- Department of Big Data in Health Science School of Public Health, Zhejiang University School of Medicine, Hangzhou, China
- Minghui Li
- Department of Big Data in Health Science School of Public Health, Zhejiang University School of Medicine, Hangzhou, China
- Yujie Zhang
- Department of Big Data in Health Science School of Public Health, Zhejiang University School of Medicine, Hangzhou, China
- Li Zhou
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Yanhui Liao
- Department of Psychiatry, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Jie Yang
- Department of Big Data in Health Science School of Public Health, Zhejiang University School of Medicine, Hangzhou, China
14
Dhiman A, Yom-Tov E, Pellis L, Edelstein M, Pebody R, Hayward A, House T, Finnie T, Guzman D, Lampos V, Cox IJ. Estimating the household secondary attack rate and serial interval of COVID-19 using social media. NPJ Digit Med 2024; 7:194. [PMID: 39033238 PMCID: PMC11271293 DOI: 10.1038/s41746-024-01160-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 06/10/2024] [Indexed: 07/23/2024] Open
Abstract
We propose a method to estimate the household secondary attack rate (hSAR) of COVID-19 in the United Kingdom based on activity on the social media platform X, formerly known as Twitter. Conventional methods of hSAR estimation are resource intensive, requiring regular contact tracing of COVID-19 cases. Our proposed framework provides a complementary method that does not rely on conventional contact tracing or laboratory involvement, including the collection, processing, and analysis of biological samples. We use a text classifier to identify reports of people tweeting about themselves and/or members of their household having COVID-19 infections. A probabilistic analysis is then performed to estimate the hSAR based on the number of self or household, and self and household tweets of COVID-19 infection. The analysis includes adjustments for a reluctance of Twitter users to tweet about household members, and the possibility that the secondary infection was not acquired within the household. Experimental results for the UK, both monthly and weekly, are reported for the period from January 2020 to February 2022. Our results agree with previously reported hSAR estimates, varying with the primary variants of concern, e.g. delta and omicron. The serial interval (SI) is based on the time between the two tweets that indicate a primary and secondary infection. Experimental results, though larger than the consensus, are qualitatively similar. The estimation of hSAR and SI using social media data constitutes a new tool that may help in characterizing, forecasting and managing outbreaks and pandemics in a faster, affordable, and more efficient manner.
Affiliation(s)
- Aarzoo Dhiman
- Department of Computer Science, University College London, London, UK.
- Centre of Excellence for Data Science, AI and Modelling, University of Hull, Hull, UK.
- Elad Yom-Tov
- Microsoft Research, Herzliya, Israel
- Department of Computer Science, Bar Ilan University, Ramat Gan, Israel
- Lorenzo Pellis
- Department of Mathematics, University of Manchester, Manchester, UK
- Richard Pebody
- UK Health Security Agency, 61 Collingdate Avenue, NW9 5EQ, London, UK
- Andrew Hayward
- UCL Collaborative Centre for Inclusion Health, UCL, London, UK
- Thomas House
- Department of Mathematics, University of Manchester, Manchester, UK
- Thomas Finnie
- UK Health Security Agency, 61 Collingdate Avenue, NW9 5EQ, London, UK
- David Guzman
- Department of Computer Science, University College London, London, UK
- Vasileios Lampos
- Department of Computer Science, University College London, London, UK.
- Ingemar J Cox
- Department of Computer Science, University College London, London, UK.
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark.
15
Hua Y, Wu J, Lin S, Li M, Zhang Y, Foer D, Wang S, Zhou P, Yang J, Zhou L. Streamlining social media information retrieval for public health research with deep learning. J Am Med Inform Assoc 2024; 31:1569-1577. [PMID: 38718216 PMCID: PMC11187427 DOI: 10.1093/jamia/ocae118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 03/28/2024] [Accepted: 05/07/2024] [Indexed: 06/21/2024] Open
Abstract
OBJECTIVE Social media-based public health research is crucial for epidemic surveillance, but most studies identify relevant corpora with keyword-matching. This study develops a system to streamline the process of curating colloquial medical dictionaries. We demonstrate the pipeline by curating a Unified Medical Language System (UMLS)-colloquial symptom dictionary from COVID-19-related tweets as proof of concept. METHODS COVID-19-related tweets from February 1, 2020, to April 30, 2022 were used. The pipeline includes three modules: a named entity recognition module to detect symptoms in tweets; an entity normalization module to aggregate detected entities; and a mapping module that iteratively maps entities to Unified Medical Language System concepts. A random sample of 500 entities was drawn from the final dictionary for accuracy validation. Additionally, we conducted a symptom frequency distribution analysis to compare our dictionary to a pre-defined lexicon from previous research. RESULTS We identified 498 480 unique symptom entity expressions from the tweets. Pre-processing reduces the number to 18 226. The final dictionary contains 38 175 unique expressions of symptoms that can be mapped to 966 UMLS concepts (accuracy = 95%). Symptom distribution analysis found that our dictionary detects more symptoms and is effective at identifying psychiatric disorders like anxiety and depression, often missed by pre-defined lexicons. CONCLUSIONS This study advances public health research by implementing a novel, systematic pipeline for curating symptom lexicons from social media data. The final lexicon's high accuracy, validated by medical professionals, underscores the potential of this methodology to reliably interpret and categorize vast amounts of unstructured social media data into actionable medical insights across diverse linguistic and regional landscapes.
Affiliation(s)
- Yining Hua
- Department of Epidemiology, Harvard Chan School of Public Health, Boston, MA 02115, United States
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02145, United States
- Jiageng Wu
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310058, China
- Shixu Lin
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310058, China
- Minghui Li
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310058, China
- Yujie Zhang
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310058, China
- Dinah Foer
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02145, United States
- Siwen Wang
- Department of Epidemiology, Harvard Chan School of Public Health, Boston, MA 02115, United States
- Peilin Zhou
- Thrust of Data Science and Analytics, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 511458, China
- Jie Yang
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02120, United States
- Li Zhou
- Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02145, United States
16
Allen J, Watts DJ, Rand DG. Quantifying the impact of misinformation and vaccine-skeptical content on Facebook. Science 2024; 384:eadk3451. [PMID: 38815040 DOI: 10.1126/science.adk3451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 04/17/2024] [Indexed: 06/01/2024]
Abstract
Low uptake of the COVID-19 vaccine in the US has been widely attributed to social media misinformation. To evaluate this claim, we introduce a framework combining lab experiments (total N = 18,725), crowdsourcing, and machine learning to estimate the causal effect of 13,206 vaccine-related URLs on the vaccination intentions of US Facebook users (N ≈ 233 million). We estimate that the impact of unflagged content that nonetheless encouraged vaccine skepticism was 46-fold greater than that of misinformation flagged by fact-checkers. Although misinformation reduced predicted vaccination intentions significantly more than unflagged vaccine content when viewed, Facebook users' exposure to flagged content was limited. In contrast, unflagged stories highlighting rare deaths after vaccination were among Facebook's most-viewed stories. Our work emphasizes the need to scrutinize factually accurate but potentially misleading content in addition to outright falsehoods.
Affiliation(s)
- Jennifer Allen
- Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA
- Duncan J Watts
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
- Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA, USA
- Operations, Information, and Decisions Department, University of Pennsylvania, Philadelphia, PA, USA
- David G Rand
- Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA
- Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
17
Niu Y, Li Z, Chen Z, Huang W, Tan J, Tian F, Yang T, Fan Y, Wei J, Mu J. Efficient screening of pharmacological broad-spectrum anti-cancer peptides utilizing advanced bidirectional Encoder representation from Transformers strategy. Heliyon 2024; 10:e30373. [PMID: 38765108 PMCID: PMC11101728 DOI: 10.1016/j.heliyon.2024.e30373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 04/24/2024] [Accepted: 04/24/2024] [Indexed: 05/21/2024] Open
Abstract
In the vanguard of oncological advancement, this investigation delineates the integration of deep learning paradigms to refine the screening process for Anticancer Peptides (ACPs), epitomizing a new frontier in broad-spectrum oncolytic therapeutics renowned for their targeted antitumor efficacy and specificity. Conventional methodologies for ACP identification are marred by prohibitive time and financial exigencies, representing a formidable impediment to the evolution of precision oncology. In response, our research heralds the development of a groundbreaking screening apparatus that marries Natural Language Processing (NLP) with the Pseudo Amino Acid Composition (PseAAC) technique, thereby inaugurating a comprehensive ACP compendium for the extraction of quintessential primary and secondary structural attributes. This innovative methodological approach is augmented by an optimized BERT model, meticulously calibrated for ACP detection, which conspicuously surpasses existing BERT variants and traditional machine learning algorithms in both accuracy and selectivity. Subjected to rigorous validation via five-fold cross-validation and external assessment, our model exhibited exemplary performance, boasting an average Area Under the Curve (AUC) of 0.9726 and an F1 score of 0.9385, with external validation further affirming its prowess (AUC of 0.9848 and F1 of 0.9371). These findings vividly underscore the method's unparalleled efficacy and prospective utility in the precise identification and prognostication of ACPs, significantly ameliorating the financial and temporal burdens traditionally associated with ACP research and development. Ergo, this pioneering screening paradigm promises to catalyze the discovery and clinical application of ACPs, constituting a seminal stride towards the realization of more efficacious and economically viable precision oncology interventions.
Affiliation(s)
- Yupeng Niu
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
- Zhenghao Li
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
- Ziao Chen
- College of Law, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
- Wenyuan Huang
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
- Jingxuan Tan
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
- Fa Tian
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Tao Yang
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
- Yamin Fan
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
- Jiangshu Wei
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Jiong Mu
- College of Information Engineering, Sichuan Agricultural University, Ya'an 625000, China
- Artificial intelligence laboratory, Sichuan Agricultural University, Ya'an 625000, China
18
He W, Zhang W, Jin Y, Zhou Q, Zhang H, Xia Q. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J Med Internet Res 2024; 26:e54706. [PMID: 38687566 PMCID: PMC11094593 DOI: 10.2196/54706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 03/20/2024] [Accepted: 04/02/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND There is a dearth of feasibility assessments regarding using large language models (LLMs) for responding to inquiries from autistic patients within a Chinese-language context. Despite Chinese being one of the most widely spoken languages globally, the predominant research focus on applying these models in the medical field has been on English-speaking populations. OBJECTIVE This study aims to assess the effectiveness of LLM chatbots, specifically ChatGPT-4 (OpenAI) and ERNIE Bot (version 2.2.3; Baidu, Inc), one of the most advanced LLMs in China, in addressing inquiries from autistic individuals in a Chinese setting. METHODS For this study, we gathered data from DXY, a widely acknowledged, web-based medical consultation platform in China with a user base of over 100 million individuals. A total of 100 patient consultation samples were rigorously selected from January 2018 to August 2023, amounting to 239 questions extracted from publicly available autism-related documents on the platform. To maintain objectivity, both the original questions and responses were anonymized and randomized. An evaluation team of 3 chief physicians assessed the responses across 4 dimensions: relevance, accuracy, usefulness, and empathy. The team completed 717 evaluations. The team initially identified the best response and then used a Likert scale with 5 response categories to gauge the responses, each representing a distinct level of quality. Finally, we compared the responses collected from different sources. RESULTS Among the 717 evaluations conducted, 46.86% (95% CI 43.21%-50.51%) of assessors displayed varying preferences for responses from physicians, with 34.87% (95% CI 31.38%-38.36%) of assessors favoring ChatGPT and 18.27% (95% CI 15.44%-21.10%) of assessors favoring ERNIE Bot. The average relevance scores for physicians, ChatGPT, and ERNIE Bot were 3.75 (95% CI 3.69-3.82), 3.69 (95% CI 3.63-3.74), and 3.41 (95% CI 3.35-3.46), respectively.
Physicians (3.66, 95% CI 3.60-3.73) and ChatGPT (3.73, 95% CI 3.69-3.77) demonstrated higher accuracy ratings compared to ERNIE Bot (3.52, 95% CI 3.47-3.57). In terms of usefulness scores, physicians (3.54, 95% CI 3.47-3.62) received higher ratings than ChatGPT (3.40, 95% CI 3.34-3.47) and ERNIE Bot (3.05, 95% CI 2.99-3.12). Finally, concerning the empathy dimension, ChatGPT (3.64, 95% CI 3.57-3.71) outperformed physicians (3.13, 95% CI 3.04-3.21) and ERNIE Bot (3.11, 95% CI 3.04-3.18). CONCLUSIONS In this cross-sectional study, physicians' responses exhibited superiority in the present Chinese-language context. Nonetheless, LLMs can provide valuable medical guidance to autistic patients and may even surpass physicians in demonstrating empathy. However, it is crucial to acknowledge that further optimization and research are imperative prerequisites before the effective integration of LLMs in clinical settings across diverse linguistic environments can be realized. TRIAL REGISTRATION Chinese Clinical Trial Registry ChiCTR2300074655; https://www.chictr.org.cn/bin/project/edit?pid=199432.
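The preference percentages in the abstract above are reported with 95% confidence intervals over 717 evaluations. The abstract does not say how the intervals were computed; the sketch below assumes a normal-approximation (Wald) interval. It also assumes counts of 336, 250, and 131, which are inferred here from the reported percentages and are not stated in the source. Under those assumptions the interval for each group matches the reported figures to two decimals.

```python
import math
from typing import Tuple


def wald_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Proportion with a normal-approximation (Wald) confidence interval.

    Returns (p, lower, upper) where the half-width is z * sqrt(p * (1 - p) / n).
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


def as_percent(successes: int, n: int) -> str:
    """Format the estimate and interval the way the abstract reports them."""
    p, lo, hi = wald_ci(successes, n)
    return f"{100 * p:.2f}% (95% CI {100 * lo:.2f}%-{100 * hi:.2f}%)"


# Counts are an assumption: inferred from the reported percentages of 717 evaluations.
N_EVALUATIONS = 717
inferred_counts = {"physicians": 336, "ChatGPT": 250, "ERNIE Bot": 131}

for source, count in inferred_counts.items():
    print(f"{source}: {as_percent(count, N_EVALUATIONS)}")
```

The Wald interval is the simplest choice and works reasonably here because n is large and no proportion is near 0 or 1. A Wilson interval would be a common alternative for smaller samples.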
Affiliation(s)
- Wenjie He
- Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Dongguan Rehabilitation Experimental School, Dongguan, China
| | - Wenyan Zhang
- Lanzhou University Second Hospital, Lanzhou University, Lanzhou, China
| | - Ya Jin
- Dongguan Songshan Lake Central Hospital, Guangdong Medical University, Dongguan, China
| | - Qiang Zhou
- Dongguan Rehabilitation Experimental School, Dongguan, China
| | - Huadan Zhang
- Dongguan Rehabilitation Experimental School, Dongguan, China
| | - Qing Xia
- Tianjin University of Traditional Chinese Medicine, Tianjin, China
| |
Collapse
|
19
|
Dong F, Guo W, Liu J, Patterson TA, Hong H. BERT-based language model for accurate drug adverse event extraction from social media: implementation, evaluation, and contributions to pharmacovigilance practices. Front Public Health 2024; 12:1392180. [PMID: 38716250 PMCID: PMC11074401 DOI: 10.3389/fpubh.2024.1392180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Accepted: 04/11/2024] [Indexed: 05/18/2024] Open
Abstract
Introduction Social media platforms serve as a valuable resource for users to share health-related information, aiding in the monitoring of adverse events linked to medications and treatments in drug safety surveillance. However, extracting drug-related adverse events accurately and efficiently from social media poses challenges in both natural language processing research and the pharmacovigilance domain. Method Recognizing the lack of detailed implementation and evaluation of Bidirectional Encoder Representations from Transformers (BERT)-based models for drug adverse event extraction on social media, we developed a BERT-based language model tailored to identifying drug adverse events in this context. Our model utilized publicly available labeled adverse event data from the ADE-Corpus-V2. Constructing the BERT-based model involved optimizing key hyperparameters, such as the number of training epochs, batch size, and learning rate. Through ten hold-out evaluations on ADE-Corpus-V2 data and external social media datasets, our model consistently demonstrated high accuracy in drug adverse event detection. Result The hold-out evaluations resulted in average F1 scores of 0.8575, 0.9049, and 0.9813 for detecting words of adverse events, words in adverse events, and words not in adverse events, respectively. External validation using human-labeled adverse event tweet data from SMM4H further substantiated the effectiveness of our model, yielding F1 scores of 0.8127, 0.8068, and 0.9790 for detecting words of adverse events, words in adverse events, and words not in adverse events, respectively. Discussion This study not only showcases the effectiveness of BERT-based language models in accurately identifying drug-related adverse events in the dynamic landscape of social media data, but also addresses the need for the implementation of a comprehensive study design and evaluation.
By doing so, we contribute to the advancement of pharmacovigilance practices and methodologies in the context of emerging information sources like social media.
Affiliation(s)
- Huixiao Hong
  - National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, United States
20
Lyu D, Wang X, Chen Y, Wang F. Language model and its interpretability in biomedicine: A scoping review. iScience 2024; 27:109334. [PMID: 38495823 PMCID: PMC10940999 DOI: 10.1016/j.isci.2024.109334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024] Open
Abstract
With advancements in large language models, artificial intelligence (AI) is undergoing a paradigm shift where AI models can be repurposed with minimal effort across various downstream tasks. This provides great promise in learning generally useful representations from biomedical corpora, at scale, which would empower AI solutions in healthcare and biomedical research. Nonetheless, our understanding of how they work, when they fail, and what they are capable of remains underexplored due to their emergent properties. Consequently, there is a need to comprehensively examine the use of language models in biomedicine. This review aims to summarize existing studies of language models in biomedicine and identify topics ripe for future research, along with the technical and analytical challenges w.r.t. interpretability. We expect this review to help researchers and practitioners better understand the landscape of language models in biomedicine and what methods are available to enhance the interpretability of their models.
Affiliation(s)
- Daoming Lyu
  - Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Xingbo Wang
  - Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Yong Chen
  - Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Fei Wang
  - Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
21
Klein AZ, Banda JM, Guo Y, Schmidt AL, Xu D, Flores Amaro I, Rodriguez-Esteban R, Sarker A, Gonzalez-Hernandez G. Overview of the 8th Social Media Mining for Health Applications (#SMM4H) shared tasks at the AMIA 2023 Annual Symposium. J Am Med Inform Assoc 2024; 31:991-996. [PMID: 38218723 PMCID: PMC10990511 DOI: 10.1093/jamia/ocae010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 01/05/2024] [Accepted: 01/11/2024] [Indexed: 01/15/2024] Open
Abstract
OBJECTIVE The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. In this paper, we present the annotated corpora, a technical summary of participants' systems, and the performance results. METHODS The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of 5 tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events). RESULTS In total, 29 teams registered, representing 17 countries. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora. CONCLUSION To facilitate future work, the datasets, a total of 61,353 posts, will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.
Affiliation(s)
- Ari Z Klein
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Juan M Banda
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, United States
- Yuting Guo
  - Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, United States
- Dongfang Xu
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Ivan Flores Amaro
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Abeed Sarker
  - Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, United States
22
JIANG Y, KAVULURU R. COVID-19 Event Extraction from Twitter via Extractive Question Answering with Continuous Prompts. Stud Health Technol Inform 2024; 310:674-678. [PMID: 38269894 PMCID: PMC10843254 DOI: 10.3233/shti231050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]
Abstract
As COVID-19 ravages the world, social media analytics could augment traditional surveys in assessing how the pandemic evolves and capturing consumer chatter that could help healthcare agencies in addressing it. This typically involves mining disclosure events that mention testing positive for the disease or discussions surrounding perceptions and beliefs in preventative or treatment options. The 2020 shared task on COVID-19 event extraction (conducted as part of the W-NUT workshop during the EMNLP conference) introduced a new Twitter dataset for benchmarking event extraction from COVID-19 tweets. In this paper, we cast the problem of event extraction as extractive question answering using recent advances in continuous prompting in language models. On the shared task test dataset, our approach leads to over 5% absolute micro-averaged F1-score improvement over prior best results, across all COVID-19 event slots. Our ablation study shows that continuous prompts have a major impact on the eventual performance.
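The micro-averaged F1-score mentioned in this abstract pools true positives, false positives, and false negatives across all event slots before computing precision and recall, so frequent slots weigh more than rare ones. A minimal sketch follows; the slot names and counts are hypothetical illustrations, not results from the shared task.

```python
def micro_f1(counts):
    """counts: iterable of (true_pos, false_pos, false_neg), one tuple per slot."""
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-slot counts, for illustration only.
slot_counts = {
    "name": (80, 20, 10),
    "where": (30, 10, 30),
    "when": (10, 0, 20),
}
print(f"micro-F1 = {micro_f1(slot_counts.values()):.4f}")
```

A macro-averaged score would instead compute F1 per slot and average the results, which gives each slot equal weight regardless of its frequency.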
Affiliation(s)
- Yuhang JIANG
  - Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Ramakanth KAVULURU
  - Division of Biomedical Informatics, University of Kentucky, Lexington, KY, USA
23
Klein AZ, Banda JM, Guo Y, Schmidt AL, Xu D, Amaro JIF, Rodriguez-Esteban R, Sarker A, Gonzalez-Hernandez G. Overview of the 8th Social Media Mining for Health Applications (#SMM4H) Shared Tasks at the AMIA 2023 Annual Symposium. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.11.06.23298168. [PMID: 37986776 PMCID: PMC10659479 DOI: 10.1101/2023.11.06.23298168] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/22/2023]
Abstract
The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of five tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events). In total, 29 teams registered, representing 18 countries. In this paper, we present the annotated corpora, a technical summary of the systems, and the performance results. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora. To facilitate future work, the datasets, a total of 61,353 posts, will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.
Affiliation(s)
- Ari Z. Klein
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Juan M. Banda
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- Yuting Guo
  - Department of Biomedical Informatics, Emory University, Atlanta, GA, USA
- Dongfang Xu
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Abeed Sarker
  - Department of Biomedical Informatics, Emory University, Atlanta, GA, USA
24
Zhou X, Song S, Zhang Y, Hou Z. Deep Learning Analysis of COVID-19 Vaccine Hesitancy and Confidence Expressed on Twitter in 6 High-Income Countries: Longitudinal Observational Study. J Med Internet Res 2023; 25:e49753. [PMID: 37930788 PMCID: PMC10629504 DOI: 10.2196/49753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 09/17/2023] [Accepted: 10/03/2023] [Indexed: 11/07/2023] Open
Abstract
BACKGROUND Ongoing monitoring of national and subnational trajectories of COVID-19 vaccine hesitancy could support the design of tailored policies for improving vaccine uptake. OBJECTIVE We aim to track the temporal and spatial distribution of COVID-19 vaccine hesitancy and confidence expressed on Twitter during the entire pandemic period in major English-speaking countries. METHODS We collected 5,257,385 English-language tweets regarding COVID-19 vaccination between January 1, 2020, and June 30, 2022, in 6 countries: the United States, the United Kingdom, Australia, New Zealand, Canada, and Ireland. Transformer-based deep learning models were developed to classify each tweet as intent to accept or reject COVID-19 vaccination and the belief that COVID-19 vaccine is effective or unsafe. Sociodemographic factors associated with COVID-19 vaccine hesitancy and confidence in the United States were analyzed using bivariate and multivariable linear regressions. RESULTS The 6 countries experienced similar evolving trends of COVID-19 vaccine hesitancy and confidence. On average, the prevalence of intent to accept COVID-19 vaccination decreased, with fluctuations, from 71.38% of 44,944 tweets in March 2020 to 34.85% of 48,167 tweets in June 2022. The prevalence of believing COVID-19 vaccines to be unsafe rose continuously, by a factor of 7.49, from March 2020 (2.84% of 44,944 tweets) to June 2022 (21.27% of 48,167 tweets). COVID-19 vaccine hesitancy and confidence varied by country, vaccine manufacturer, and states within a country. The Democratic Party and higher vaccine confidence were significantly associated with lower vaccine hesitancy across US states. CONCLUSIONS COVID-19 vaccine hesitancy and confidence evolved and were influenced by the development of vaccines and viruses during the pandemic. Large-scale self-generated discourses on social media and deep learning models provide a cost-efficient approach to monitoring routine vaccine hesitancy.
Affiliation(s)
- Xinyu Zhou
  - School of Public Health, Fudan University, Shanghai, China
  - Global Health Institute, Fudan University, Shanghai, China
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, United States
- Suhang Song
  - Department of Health Policy and Management, College of Public Health, University of Georgia, Athens, GA, United States
- Ying Zhang
  - School of Public Health, Fudan University, Shanghai, China
  - Global Health Institute, Fudan University, Shanghai, China
- Zhiyuan Hou
  - School of Public Health, Fudan University, Shanghai, China
  - Global Health Institute, Fudan University, Shanghai, China
25
Wu L, Gray M, Dang O, Xu J, Fang H, Tong W. RxBERT: Enhancing drug labeling text mining and analysis with AI language modeling. Exp Biol Med (Maywood) 2023; 248:1937-1943. [PMID: 38166420 PMCID: PMC10798181 DOI: 10.1177/15353702231220669] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/30/2023] [Accepted: 11/02/2023] [Indexed: 01/04/2024] Open
Abstract
The US drug labeling document contains essential information on drug efficacy and safety, making it a crucial regulatory resource for Food and Drug Administration (FDA) drug reviewers. Due to its extensive volume and the presence of free text, conventional text mining analyses have encountered challenges in processing these data. Recent advances in artificial intelligence (AI) for natural language processing (NLP) have provided an unprecedented opportunity to identify key information from drug labeling, thereby enhancing safety reviews and support for regulatory decisions. We developed RxBERT, a Bidirectional Encoder Representations from Transformers (BERT) model pretrained on FDA human prescription drug labeling documents for an enhanced application of drug labeling documents in both research and drug review. RxBERT was derived from BioBERT with further training on human prescription drug labeling documents. RxBERT was demonstrated in several tasks using regulatory datasets, including those involved in the National Institute of Standards and Technology Text Analysis Conference dataset (NIST TAC dataset), the FDA Adverse Drug Event Evaluation Dataset (ADE Eval dataset), and the classification of texts from submission packages into labeling sections (US Drug Labeling dataset). RxBERT reached an F1-score of 86.5 in both TAC and ADE Eval classification, and a prediction accuracy of 87% for the US Drug Labeling dataset. Overall, RxBERT was shown to be competitive with or better than other NLP approaches such as BERT and BioBERT. In summary, we developed RxBERT, a transformer-based model specific for drug labeling that outperformed the original BERT model. RxBERT has the potential to be used to assist research scientists and FDA reviewers to better process and utilize drug labeling information toward the advancement of drug effectiveness and safety for public health.
This proof-of-concept study also demonstrated a potential pathway to customized large language models (LLMs) tailored to the sensitive regulatory documents for internal application.
Affiliation(s)
- Leihong Wu
  - Division of Bioinformatics and Biostatistics, FDA National Center for Toxicological Research, Jefferson, AR 72079, USA
- Magnus Gray
  - Division of Bioinformatics and Biostatistics, FDA National Center for Toxicological Research, Jefferson, AR 72079, USA
- Oanh Dang
  - Office of Surveillance and Epidemiology, FDA Center for Drug Evaluation and Research, Silver Spring, MD 20993, USA
- Joshua Xu
  - Division of Bioinformatics and Biostatistics, FDA National Center for Toxicological Research, Jefferson, AR 72079, USA
- Hong Fang
  - Office of Scientific Coordination, FDA National Center for Toxicological Research, Jefferson, AR 72079, USA
- Weida Tong
  - Division of Bioinformatics and Biostatistics, FDA National Center for Toxicological Research, Jefferson, AR 72079, USA
26
Dolatabadi E, Moyano D, Bales M, Spasojevic S, Bhambhoria R, Bhatti J, Debnath S, Hoell N, Li X, Leng C, Nanda S, Saab J, Sahak E, Sie F, Uppal S, Vadlamudi NK, Vladimirova A, Yakimovich A, Yang X, Kocak SA, Cheung AM. Using Social Media to Help Understand Patient-Reported Health Outcomes of Post-COVID-19 Condition: Natural Language Processing Approach. J Med Internet Res 2023; 25:e45767. [PMID: 37725432 PMCID: PMC10510753 DOI: 10.2196/45767] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/16/2023] [Revised: 05/18/2023] [Accepted: 06/05/2023] [Indexed: 09/21/2023] Open
Abstract
BACKGROUND While scientific knowledge of post-COVID-19 condition (PCC) is growing, there remains significant uncertainty in the definition of the disease, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes as the content is produced at high resolution by patients and caregivers, representing experiences that may be unavailable to most clinicians. OBJECTIVE In this study, we aimed to determine the validity and effectiveness of advanced natural language processing approaches built to derive insight into PCC-related patient-reported health outcomes from social media platforms Twitter and Reddit. We extracted PCC-related terms, including symptoms and conditions, and measured their occurrence frequency. We compared the outputs with human annotations and clinical outcomes and tracked symptom and condition term occurrences over time and locations to explore the pipeline's potential as a surveillance tool. METHODS We used bidirectional encoder representations from transformers (BERT) models to extract and normalize PCC symptom and condition terms from English posts on Twitter and Reddit. We compared 2 named entity recognition models and implemented a 2-step normalization task to map extracted terms to unique concepts in standardized terminology. The normalization steps were done using a semantic search approach with BERT biencoders. We evaluated the effectiveness of BERT models in extracting the terms using a human-annotated corpus and a proximity-based score. We also compared the validity and reliability of the extracted and normalized terms to a web-based survey with more than 3000 participants from several countries. RESULTS UmlsBERT-Clinical had the highest accuracy in predicting entities closest to those extracted by human annotators. 
Based on our findings, the top 3 most commonly occurring groups of PCC symptom and condition terms were systemic (such as fatigue), neuropsychiatric (such as anxiety and brain fog), and respiratory (such as shortness of breath). In addition, we also found novel symptom and condition terms that had not been categorized in previous studies, such as infection and pain. Regarding the co-occurring symptoms, the pair of fatigue and headaches was among the most co-occurring term pairs across both platforms. Based on the temporal analysis, the neuropsychiatric terms were the most prevalent, followed by the systemic category, on both social media platforms. Our spatial analysis concluded that 42% (10,938/26,247) of the analyzed terms included location information, with the majority coming from the United States, United Kingdom, and Canada. CONCLUSIONS The outcome of our social media-derived pipeline is comparable with the results of peer-reviewed articles relevant to PCC symptoms. Overall, this study provides unique insights into patient-reported health outcomes of PCC and valuable information about the patient's journey that can help health care providers anticipate future needs. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) RR2-10.1101/2022.12.14.22283419.
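The normalization step described in this abstract maps each extracted term to a concept by semantic search over embeddings. A minimal sketch of that idea is a nearest-neighbor lookup by cosine similarity. Everything concrete here is an assumption for illustration: the toy 3-dimensional vectors stand in for bi-encoder embeddings, the concept names and phrases are hypothetical, and the authors' 2-step procedure is collapsed into a single lookup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def normalize_term(term_vec, concept_vecs):
    """Return the concept whose embedding is most similar to term_vec."""
    return max(concept_vecs, key=lambda name: cosine(term_vec, concept_vecs[name]))

# Toy vectors standing in for bi-encoder embeddings (hypothetical values).
concept_vecs = {
    "Fatigue": [0.9, 0.1, 0.0],
    "Dyspnea": [0.1, 0.9, 0.1],
    "Anxiety": [0.0, 0.2, 0.9],
}
extracted_terms = {
    "so tired all the time": [0.8, 0.2, 0.1],
    "can't catch my breath": [0.2, 0.8, 0.0],
}
for phrase, vec in extracted_terms.items():
    print(phrase, "->", normalize_term(vec, concept_vecs))
```

In practice the vectors would come from an encoder model and the concept set from a standardized terminology; a similarity threshold is a common addition so that terms with no good match are left unmapped.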
Affiliation(s)
- Elham Dolatabadi
  - Faculty of Health, School of Health Policy and Management, York University, Toronto, ON, Canada
  - Vector Institute, Toronto, ON, Canada
  - Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- Rohan Bhambhoria
  - Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada
- Xin Li
  - Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- Jad Saab
  - TELUS Health, Montreal, QC, Canada
- Esmat Sahak
  - Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- Fanny Sie
  - Hoffmann-La Roche Ltd, Toronto, ON, Canada
- Nirma Khatri Vadlamudi
  - Department of Pediatrics, Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Angela M Cheung
  - Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
  - University Health Network, Toronto, ON, Canada
27
Tian Y, Zhang W, Duan L, McDonald W, Osgood N. Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada. Front Digit Health 2023; 5:1203874. [PMID: 37448834 PMCID: PMC10338115 DOI: 10.3389/fdgth.2023.1203874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Accepted: 06/02/2023] [Indexed: 07/15/2023] Open
Abstract
Background The use of social media data provides an opportunity to complement traditional influenza and COVID-19 surveillance methods for the detection and control of outbreaks and informing public health interventions. Objective The first aim of this study is to investigate the degree to which Twitter users disclose health experiences related to influenza and COVID-19 that could be indicative of recent plausible influenza cases or symptomatic COVID-19 infections. Second, we seek to use the Twitter datasets to train and evaluate the classification performance of Bidirectional Encoder Representations from Transformers (BERT) and variant language models in the context of influenza and COVID-19 infection detection. Methods We constructed two Twitter datasets using a keyword-based filtering approach on English-language tweets collected from December 2016 to December 2022 in Saskatchewan, Canada. The influenza-related dataset comprised tweets filtered with influenza-related keywords from December 13, 2016, to March 17, 2018, while the COVID-19 dataset comprised tweets filtered with COVID-19 symptom-related keywords from January 1, 2020, to June 22, 2021. The Twitter datasets were cleaned, and each tweet was annotated by at least two annotators as to whether it suggested recent plausible influenza cases or symptomatic COVID-19 cases. We then assessed the classification performance of pre-trained transformer-based language models, including BERT-base, BERT-large, RoBERTa-base, RoBERTa-large, BERTweet-base, BERTweet-covid-base, BERTweet-large, and COVID-Twitter-BERT (CT-BERT) models, on each dataset. To address the notable class imbalance, we experimented with both oversampling and undersampling methods. Results The influenza dataset had 1129 out of 6444 (17.5%) tweets annotated as suggesting recent plausible influenza cases. The COVID-19 dataset had 924 out of 11939 (7.7%) tweets annotated as inferring recent plausible COVID-19 cases.
When compared against other language models on the COVID-19 dataset, CT-BERT performed the best, achieving the highest scores for recall (94.8%), F1 (94.4%), and accuracy (94.6%). For the influenza dataset, BERTweet models exhibited better performance. Our results also showed that applying data balancing techniques such as oversampling or undersampling did not lead to improved model performance. Conclusions Utilizing domain-specific language models for monitoring users' health experiences related to influenza and COVID-19 on social media shows improved classification performance and has the potential to supplement real-time disease surveillance.
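The oversampling mentioned in this abstract can be illustrated with random oversampling, which duplicates minority-class examples until every class matches the largest one. This is a minimal sketch under assumptions: the abstract does not say which oversampling variant the authors used, and the function name and toy tweets below are hypothetical.

```python
import random

def random_oversample(examples, labels, seed=0):
    """Duplicate minority-class examples at random until all classes are the same size."""
    rng = random.Random(seed)
    by_label = {}
    for example, label in zip(examples, labels):
        by_label.setdefault(label, []).append(example)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the majority-class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((item, label) for item in items + extra)
    rng.shuffle(balanced)
    return balanced

# Toy data: 8 negative and 2 positive tweets (hypothetical).
tweets = [f"tweet {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced = random_oversample(tweets, labels)
print(len(balanced), sum(1 for _, label in balanced if label == 1))
```

Resampling of this kind should be applied to the training split only, so that duplicated examples cannot leak into validation or test data.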
Affiliation(s)
- Nathaniel Osgood
  - Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
28
Dupuy-Zini A, Audeh B, Gérardin C, Duclos C, Gagneux-Brunon A, Bousquet C. Users' Reactions to Announced Vaccines Against COVID-19 Before Marketing in France: Analysis of Twitter Posts. J Med Internet Res 2023; 25:e37237. [PMID: 36596215 PMCID: PMC10132828 DOI: 10.2196/37237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2022] [Revised: 07/17/2022] [Accepted: 08/09/2022] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Within a few months, the COVID-19 pandemic had spread to many countries and had been a real challenge for health systems all around the world. This unprecedented crisis has led to a surge of online discussions about potential cures for the disease. Among them, vaccines have been at the heart of the debates and have faced lack of confidence before marketing in France. OBJECTIVE This study aims to identify and investigate the opinions of French Twitter users on the announced vaccines against COVID-19 through sentiment analysis. METHODS This study was conducted in 2 phases. First, we filtered a collection of tweets related to COVID-19 available on Twitter from February 2020 to August 2020 with a set of keywords associated with vaccine mistrust using word embeddings. Second, we performed sentiment analysis using deep learning to identify the characteristics of vaccine mistrust. The model was trained on a hand-labeled subset of 4548 tweets. RESULTS A set of 69 relevant keywords were identified as the semantic concept of the word "vaccin" (vaccine in French) and focused mainly on conspiracies, pharmaceutical companies, and alternative treatments. Those keywords enabled us to extract nearly 350,000 tweets in French. The sentiment analysis model achieved 0.75 accuracy. The model then predicted 16% of positive tweets, 41% of negative tweets, and 43% of neutral tweets. This allowed us to explore the semantic concepts of positive and negative tweets and to plot the trends of each sentiment. The main negative rhetoric identified from users' tweets was that vaccines are perceived as having a political purpose and that COVID-19 is a commercial argument for the pharmaceutical companies. CONCLUSIONS Twitter might be a useful tool to investigate the arguments for vaccine mistrust because it unveils political criticism contrasting with the usual concerns on adverse drug reactions. 
As the opposition rhetoric is more consistent and more widely spread than the positive rhetoric, we believe that this research provides effective tools to help health authorities better characterize the risk of vaccine mistrust.
Affiliation(s)
- Alexandre Dupuy-Zini
  - Laboratoire d'Informatique Médicale et d'Ingénierie des connaissances en e-Santé, LIMICS, Sorbonne Université, Université Sorbonne Paris Nord, Institut national de la santé et de la recherche médicale, INSERM, Paris, France
- Bissan Audeh
  - Laboratoire d'Informatique Médicale et d'Ingénierie des connaissances en e-Santé, LIMICS, Sorbonne Université, Université Sorbonne Paris Nord, Institut national de la santé et de la recherche médicale, INSERM, Paris, France
- Christel Gérardin
  - Institut Pierre Louis d'Epidémiologie et de Santé Publique, Département de médecine interne, Sorbonne Université, Paris, France
- Catherine Duclos
  - Laboratoire d'Informatique Médicale et d'Ingénierie des connaissances en e-Santé, LIMICS, Sorbonne Université, Université Sorbonne Paris Nord, Institut national de la santé et de la recherche médicale, INSERM, Paris, France
- Amandine Gagneux-Brunon
  - Groupe sur l'Immunité des Muqueuses et Agents Pathogènes, Centre International de Recherche en Infectiologie, University of Lyon, Saint Etienne, France
  - Vaccinologie, Centre Hospitalier Universitaire de Saint-Etienne, Saint Etienne, France
- Cedric Bousquet
  - Laboratoire d'Informatique Médicale et d'Ingénierie des connaissances en e-Santé, LIMICS, Sorbonne Université, Université Sorbonne Paris Nord, Institut national de la santé et de la recherche médicale, INSERM, Paris, France
  - Service de santé publique et information médicale, Centre Hospitalier Universitaire de Saint Etienne, Saint Etienne, France
29
Didi Y, Walha A, Ben Halima M, Wali A. COVID-19 Outbreak Forecasting Based on Vaccine Rates and Tweets Classification. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:4535541. [PMID: 36337272 PMCID: PMC9633186 DOI: 10.1155/2022/4535541] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/01/2022] [Revised: 07/17/2022] [Accepted: 10/05/2022] [Indexed: 09/08/2024]
Abstract
The spread of COVID-19 has affected more than 200 countries and has caused serious public health concerns. Infected cases continue to increase despite the effectiveness of the vaccines. An efficient and quick surveillance system for COVID-19 can help healthcare decision-makers contain the spread of the virus. In this study, we developed a novel framework using machine learning (ML) models capable of detecting COVID-19 accurately at an early stage. To estimate risk, many models use social networking sites (SNSs) to track disease outbreaks. Twitter is one of the SNSs widely used as a resource for real-time disease analysis and can provide an early warning for health officials. We introduced a pipeline framework for outbreak prediction whose first step is a hybrid word-embedding method for tweet classification. In the second step, we passed the classified tweets, together with external features such as the vaccine rate associated with infected cases, to machine learning algorithms for daily predictions. We applied the SVM, RF, and LR models for classification and the LSTM, Prophet, and SVR models for prediction. For the hybrid word embedding, we applied TF-IDF, FastText, and GloVe, as well as a combination of the three feature sets, to enhance classification. Furthermore, to improve forecast performance, we incorporated vaccine data as input together with tweets and confirmed cases. The models achieve an accuracy of more than 80%, which supports the reliability of the proposed approach.
Affiliation(s)
- Y. Didi
  - Department of Computer Science, Umm Al-Qura University, Makkah 24243, Saudi Arabia
  - REsearch Groups in Intelligent Machines (REGIM-Lab), National Engineering School of Sfax, University of Sfax, Sfax 3038, Tunisia
- A. Walha
  - Department of Computer Science, Umm Al-Qura University, Makkah 24243, Saudi Arabia
  - REsearch Groups in Intelligent Machines (REGIM-Lab), National Engineering School of Sfax, University of Sfax, Sfax 3038, Tunisia
- M. Ben Halima
  - REsearch Groups in Intelligent Machines (REGIM-Lab), National Engineering School of Sfax, University of Sfax, Sfax 3038, Tunisia
- A. Wali
  - REsearch Groups in Intelligent Machines (REGIM-Lab), National Engineering School of Sfax, University of Sfax, Sfax 3038, Tunisia
|