1
Sharifpoor E, Okhovati M, Ghazizadeh-Ahsaee M, Avaz Beigi M. Classifying and fact-checking health-related information about COVID-19 on Twitter/X using machine learning and deep learning models. BMC Med Inform Decis Mak 2025; 25:73. [PMID: 39934858 PMCID: PMC11817542 DOI: 10.1186/s12911-025-02895-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2024] [Accepted: 01/28/2025] [Indexed: 02/13/2025]
Abstract
BACKGROUND Despite recent progress in misinformation detection methods, further investigation is required to develop more robust fact-checking models with particular consideration for the unique challenges of health information sharing. This study aimed to identify the most effective approach for detecting and classifying reliable information versus misinformation in COVID-19-related health content shared on Twitter/X. METHODS We used 7 different machine learning/deep learning models. Tweets were collected, processed, labeled, and analyzed using relevant keywords and hashtags, then classified through a labeling process into two distinct datasets: "Trustworthy information" versus "Misinformation". The cosine similarity metric was employed in oversampling the minority "Trustworthy information" class, ensuring a more balanced representation of both classes for training and testing purposes. Finally, the performance of the various fact-checking models was analyzed and compared using accuracy, precision, recall, F1-score, ROC curve, and AUC. RESULTS For accuracy, precision, F1 score, and recall, the average values of TextConvoNet were 90.28, 90.28, 90.29, and 0.9030, respectively. ROC AUC was 0.901. The "Trustworthy information" class achieved an accuracy of 85%, precision of 93%, recall of 86%, and F1 score of 89%. These values were higher than those of the other models. Moreover, its performance in the misinformation category was even more impressive, with an accuracy of 94%, precision of 88%, recall of 94%, and F1 score of 91%. CONCLUSION This study showed that TextConvoNet was the most effective model for detecting and classifying trustworthy information versus misinformation related to health issues shared on Twitter/X.
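The evaluation metrics named in this abstract (accuracy, precision, recall, F1 score) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `binary_metrics`, the class labels, and the toy predictions are hypothetical and chosen only to show how per-class figures are derived from predicted and true labels.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return overall accuracy plus precision, recall, and F1 for `positive`."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("at least one labelled example is required")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels for illustration only; these are not data from the study.
y_true = ["trust", "trust", "trust", "misinfo", "misinfo", "misinfo", "misinfo", "misinfo"]
y_pred = ["trust", "trust", "misinfo", "misinfo", "misinfo", "misinfo", "misinfo", "trust"]

trust_scores = binary_metrics(y_true, y_pred, positive="trust")
misinfo_scores = binary_metrics(y_true, y_pred, positive="misinfo")
```

Calling the function once per class, as above, yields the kind of per-class breakdown the abstract reports for the two categories.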
Affiliation(s)
- Elham Sharifpoor
- Medical Library and Information Sciences Department, Medical Informatics Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran
- Maryam Okhovati
- Medical Library and Information Sciences Department, Medical Informatics Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran
- Mina Avaz Beigi
- Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
2
Asaad C, Khaouja I, Ghogho M, Baïna K. When Infodemic Meets Epidemic: Systematic Literature Review. JMIR Public Health Surveill 2025; 11:e55642. [PMID: 39899850 PMCID: PMC11874463 DOI: 10.2196/55642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 03/25/2024] [Accepted: 05/22/2024] [Indexed: 02/05/2025]
Abstract
BACKGROUND Epidemics and outbreaks present arduous challenges, requiring both individual and communal efforts. The significant medical, emotional, and financial burden associated with epidemics creates feelings of distrust, fear, and loss of control, making vulnerable populations prone to exploitation and manipulation through misinformation, rumors, and conspiracies. The use of social media sites has increased in the last decade. As a result, significant amounts of public data can be leveraged for biosurveillance. Social media sites can also provide a platform to quickly and efficiently reach a sizable percentage of the population; therefore, they have a potential role in various aspects of epidemic mitigation. OBJECTIVE This systematic literature review aimed to provide a methodical overview of the integration of social media in 3 epidemic-related contexts: epidemic monitoring, misinformation detection, and the relationship with mental health. The aim is to understand how social media has been used efficiently in these contexts, and which gaps need further research efforts. METHODS Three research questions, related to epidemic monitoring, misinformation, and mental health, were conceptualized for this review. In the first PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) stage, 13,522 publications were collected from several digital libraries (PubMed, IEEE Xplore, ScienceDirect, SpringerLink, MDPI, ACM, and ACL) and gray literature sources (arXiv and ProQuest), spanning from 2010 to 2022. A total of 242 (1.79%) papers were selected for inclusion and were synthesized to identify themes, methods, epidemics studied, and social media sites used. RESULTS Five main themes were identified in the literature, as follows: epidemic forecasting and surveillance, public opinion understanding, fake news identification and characterization, mental health assessment, and association of social media use with psychological outcomes. Social media data were found to be an efficient tool to gauge public response, monitor discourse, identify misleading and fake news, and estimate the mental health toll of epidemics. Findings uncovered a need for more robust applications of lessons learned from epidemic "postmortem documentation." A vast gap exists between retrospective analysis of epidemic management and result integration in prospective studies. CONCLUSIONS Harnessing the full potential of social media in epidemic-related tasks requires streamlining the results of epidemic forecasting, public opinion understanding, and misinformation detection, all while keeping abreast of potential mental health implications. Proactive prevention has thus become vital for epidemic curtailment and containment.
Affiliation(s)
- Chaimae Asaad
- TICLab, College of Engineering and Architecture, International University of Rabat, Salé, Morocco
- ENSIAS, Alqualsadi, Rabat IT Center, Mohammed V University, Rabat, Morocco
- Imane Khaouja
- TICLab, College of Engineering and Architecture, International University of Rabat, Salé, Morocco
- Mounir Ghogho
- TICLab, College of Engineering and Architecture, International University of Rabat, Salé, Morocco
- University of Leeds, Leeds, United Kingdom
- Karim Baïna
- ENSIAS, Alqualsadi, Rabat IT Center, Mohammed V University, Rabat, Morocco
3
Lin S, Garay L, Hua Y, Guo Z, Li W, Li M, Zhang Y, Xu X, Yang J. Analysis of longitudinal social media for monitoring symptoms during a pandemic. J Biomed Inform 2025; 162:104778. [PMID: 39832606 DOI: 10.1016/j.jbi.2025.104778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Revised: 12/28/2024] [Accepted: 01/15/2025] [Indexed: 01/22/2025]
Abstract
OBJECTIVE Current studies leveraging social media data for disease monitoring face challenges such as noisy colloquial language and insufficient tracking of user disease progression in longitudinal data settings. This study aims to develop a pipeline for collecting, cleaning, and analyzing large-scale longitudinal social media data for disease monitoring, with a focus on the COVID-19 pandemic. MATERIALS AND METHODS The pipeline begins by screening COVID-19 cases from tweets spanning February 1, 2020, to April 30, 2022. Longitudinal data are collected for each patient, from two months before to three months after self-reporting. Symptoms are extracted using Named Entity Recognition (NER), followed by denoising with a combination of a Graph Convolutional Network (GCN) and a Bidirectional Encoder Representations from Transformers (BERT) model to retain only User-experienced Symptom Mentions (USM). Subsequently, symptoms are mapped to standardized medical concepts using the Unified Medical Language System (UMLS). Finally, this study conducts symptom pattern analysis and visualization to illustrate temporal changes in symptom prevalence and co-occurrence. RESULTS This study identified 191,096 self-reported COVID-19-positive cases from COVID-19-related tweets and retrospectively collected 811,398,280 historical tweets, of which 2,120,964 contained symptom information. After denoising, 39% (832,287) of symptom-sharing tweets reflected user-experienced mentions. The trained USM model achieved an average F1 score of 0.927. Further analysis revealed a higher prevalence of upper respiratory tract symptoms during the Omicron period compared to the Delta and Wild-type periods. Additionally, there was a pronounced co-occurrence of lower respiratory tract and nervous system symptoms in the Wild-type strain and Delta variant. CONCLUSION This study established a robust framework for analyzing longitudinal social media data to monitor symptoms during a pandemic. By integrating denoising of user-experienced symptom mentions, our findings reveal the duration of different symptoms over time and by variant within a cohort of nearly 200,000 patients, providing critical insights into symptom trends that are often difficult to capture through traditional data sources.
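The longitudinal collection step described in this abstract (tweets from two months before to three months after a self-report) can be sketched as a simple date-window filter. This is an illustrative sketch only, not the authors' pipeline: the function name `tweets_in_window`, the tweet record layout, and the approximation of "two months" and "three months" as 60 and 90 days are all assumptions.

```python
from datetime import date, timedelta

# Assumption: calendar months are approximated as 30 days each.
DAYS_BEFORE = 60  # "two months before" self-reporting
DAYS_AFTER = 90   # "three months after" self-reporting


def tweets_in_window(tweets, report_date, days_before=DAYS_BEFORE, days_after=DAYS_AFTER):
    """Keep tweets dated within the window around a user's self-report date."""
    start = report_date - timedelta(days=days_before)
    end = report_date + timedelta(days=days_after)
    return [t for t in tweets if start <= t["date"] <= end]


# Hypothetical timeline for one user; the texts are invented for illustration.
timeline = [
    {"date": date(2021, 3, 1), "text": "spring is coming"},
    {"date": date(2021, 4, 15), "text": "slight cough today"},
    {"date": date(2021, 6, 1), "text": "tested positive"},
    {"date": date(2021, 8, 15), "text": "still tired"},
    {"date": date(2021, 10, 1), "text": "back to running"},
]

kept = tweets_in_window(timeline, report_date=date(2021, 6, 1))
```

A production pipeline would more likely use calendar-aware month arithmetic and timezone-aware timestamps; the fixed-day window is kept here for brevity.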
Affiliation(s)
- Shixu Lin
- School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Lucas Garay
- School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Yining Hua
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA
- Zhijiang Guo
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
- Wanxin Li
- School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Minghui Li
- School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Yujie Zhang
- School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Xiaolin Xu
- School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China
- Jie Yang
- School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058 China; Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
4
Shah HA, Househ M. Purpose-oriented review of public health surveillance systems: use of surveillance systems and recent advances. BMJ Public Health 2024; 2:e000374. [PMID: 40018137 PMCID: PMC11812771 DOI: 10.1136/bmjph-2023-000374] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/04/2023] [Accepted: 05/10/2024] [Indexed: 03/01/2025]
Abstract
Public health surveillance systems are an important tool for assessing disease distribution and the burden of disease, and they enable efficient distribution of resources to fight a disease. Surveillance systems are used to detect, report, and track a disease, as well as to assess the response to the disease and people's attitudes. This paper provides a review framework for the purpose-oriented categorisation of public health surveillance systems. The framework divides the systems into distribution-oriented, monitoring-oriented, or prediction-oriented. While other categorisations based on the data sources and data types used are possible, the framework in this paper provides a cohesive system that can encompass such categories. The framework is purpose oriented: it categorises surveillance systems according to their stated objectives, which are the most important aspect of any public health surveillance system. This review and the framework of categorisation provide comprehensive details of the surveillance systems in terms of data types used, source of data and purpose of the surveillance system.
5
Peng Z, Li M, Wang Y, Ho GTS. Combating the COVID-19 infodemic using Prompt-Based curriculum learning. Expert Systems with Applications 2023; 229:120501. [PMID: 37274611 PMCID: PMC10193815 DOI: 10.1016/j.eswa.2023.120501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Revised: 05/14/2023] [Accepted: 05/15/2023] [Indexed: 06/06/2023]
Abstract
The COVID-19 pandemic has been accompanied by a proliferation of online misinformation and disinformation about the virus. Combating this 'infodemic' has been identified as one of the top priorities of the World Health Organization, because false and misleading information can lead to a range of negative consequences, including the spread of false remedies, conspiracy theories, and xenophobia. This paper aims to combat the COVID-19 infodemic on multiple fronts, including determining the credibility of information, identifying its potential harm to society, and assessing the necessity of intervention by relevant organizations. We present a prompt-based curriculum learning method to achieve this goal. The proposed method can overcome the challenges of data sparsity and class imbalance. Using online social media texts as input, the proposed model can verify content from multiple perspectives by answering a series of questions concerning the text's reliability. Experiments revealed the effectiveness of prompt tuning and curriculum learning in assessing the reliability of COVID-19-related text. The proposed method outperforms typical text classification methods, including fastText and BERT. In addition, the proposed method is robust to the hyperparameter settings, making it more applicable with limited infrastructure resources.
Affiliation(s)
- Zifan Peng
- Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Mingchen Li
- Khoury College of Computer Sciences, Northeastern University, Boston, USA
- Yue Wang
- Department of Supply Chain and Information Management, The Hang Seng University of Hong Kong, Hong Kong SAR, China
- George T S Ho
- Department of Supply Chain and Information Management, The Hang Seng University of Hong Kong, Hong Kong SAR, China
6
Swapnarekha H, Nayak J, Behera HS, Dash PB, Pelusi D. An optimistic firefly algorithm-based deep learning approach for sentiment analysis of COVID-19 tweets. Mathematical Biosciences and Engineering: MBE 2023; 20:2382-2407. [PMID: 36899539 DOI: 10.3934/mbe.2023112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
The unprecedented rise in the number of COVID-19 cases has drawn global attention, as it has caused an adverse impact on the lives of people all over the world. As of December 31, 2021, more than 286,901,222 people have been infected with COVID-19. The rise in the number of COVID-19 cases and deaths across the world has caused fear, anxiety and depression among individuals. Social media is the most dominant tool that disturbed human life during this pandemic. Among the social media platforms, Twitter is one of the most prominent and trusted. To control and monitor the COVID-19 infection, it is necessary to analyze the sentiments of people expressed on their social media platforms. In this study, we proposed a deep learning approach based on a long short-term memory (LSTM) model for classifying tweets related to COVID-19 as expressing positive or negative sentiment. In addition, the proposed approach makes use of the firefly algorithm to enhance the overall performance of the model. Further, the performance of the proposed model, along with other state-of-the-art ensemble and machine learning models, has been evaluated by using performance metrics such as accuracy, precision, recall, the AUC-ROC and the F1-score. The experimental results reveal that the proposed LSTM + Firefly approach obtained a better accuracy of 99.59% when compared with the other state-of-the-art models.
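The abstract does not say how the firefly algorithm is coupled to the LSTM, so the following is only a generic sketch of the standard firefly metaheuristic applied to a toy objective, not the authors' method. In a setting like the one described, the objective could plausibly be a validation loss over model hyperparameters, but that is an assumption. The names `firefly_minimize` and `sphere`, and all parameter defaults, are hypothetical.

```python
import math
import random


def firefly_minimize(objective, dim, bounds, n_fireflies=15, n_iter=50,
                     alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Minimise `objective` over a box using a basic firefly algorithm."""
    rng = random.Random(seed)
    lo, hi = bounds
    swarm = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_fireflies)]
    cost = [objective(x) for x in swarm]
    best_i = min(range(n_fireflies), key=cost.__getitem__)
    best_x, best_val = list(swarm[best_i]), cost[best_i]

    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # firefly j is "brighter" (lower cost)
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)  # attraction decays with distance
                    # Move i toward j with a small random perturbation, clipped to the box.
                    swarm[i] = [
                        min(hi, max(lo, a + beta * (b - a) + alpha * (rng.random() - 0.5)))
                        for a, b in zip(swarm[i], swarm[j])
                    ]
                    cost[i] = objective(swarm[i])
                    if cost[i] < best_val:  # remember the best point seen so far
                        best_x, best_val = list(swarm[i]), cost[i]
        alpha *= 0.97  # assumption: slowly damp the random step size
    return best_x, best_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_val = firefly_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

Tracking the best point explicitly means the returned value can never be worse than the best initial firefly, even though individual fireflies may move to worse positions.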
Affiliation(s)
- H Swapnarekha
- Department of Information Technology, Aditya Institute of Technology and Management (AITAM), Tekkali, Andhra Pradesh 532201, India
- Department of Information Technology, Veer Surendra Sai University of Technology, Burla 768018, India
- Janmenjoy Nayak
- Department of Computer Science, Maharaja Sriram Chandra Bhanja Deo University, Baripada, Odisha 757003, India
- H S Behera
- Department of Information Technology, Veer Surendra Sai University of Technology, Burla 768018, India
- Pandit Byomakesha Dash
- Department of Information Technology, Aditya Institute of Technology and Management (AITAM), Tekkali, Andhra Pradesh 532201, India
- Danilo Pelusi
- Communication Sciences, University of Teramo, Coste Sant'agostino Campus, Teramo 64100, Italy
7
Singh S, Dhir S, Sushil S. Developing an evidence-based TISM: an application for the success of COVID-19 Vaccination Drive. Annals of Operations Research 2022:1-19. [PMID: 36533277 PMCID: PMC9734704 DOI: 10.1007/s10479-022-05098-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/04/2022] [Revised: 10/10/2022] [Accepted: 11/18/2022] [Indexed: 06/17/2023]
Abstract
The study illustrates an application of evidence data for performing Total Interpretive Structural Modeling (TISM). TISM is widely used to analyze the critical success factors or inhibitors and their interlinkages. This study uses learning from evidence data, specifically social media analytics, to identify the relationship between the elements. Thus, it leads to the advancement of the TISM-P methodology with evidence-based TISM (TISM-E). This study uses Twitter as a source of evidence data. Further, 260,297 tweets were used to illustrate the process of TISM-E. The paper demonstrates the application of TISM-E for the success of the COVID-19 vaccination drive. The pandemic effects are long-term; therefore, the hierarchical model developed shows a sustainable approach for vaccinating the maximum population. Further, the framework developed will ensure the distribution efficacy of vaccines. In addition, it will aid practitioners in developing future vaccination policies. The enhanced model provides evidence for the polarity (positive/negative) of relationships and can help to build propositions for theory development. The study contributes to healthcare, modeling research, and graph-theoretic literature.
Affiliation(s)
- Shiwangi Singh
- Indian Institute of Management Ranchi, Ranchi, Jharkhand, India
- Sanjay Dhir
- Indian Institute of Technology Delhi, New Delhi, India
- Sushil Sushil
- Indian Institute of Technology Delhi, New Delhi, India
8
Joseph LP, Joseph EA, Prasad R. Explainable diabetes classification using hybrid Bayesian-optimized TabNet architecture. Comput Biol Med 2022; 151:106178. [PMID: 36306578 DOI: 10.1016/j.compbiomed.2022.106178] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 04/13/2022] [Revised: 09/23/2022] [Accepted: 10/01/2022] [Indexed: 12/27/2022]
Abstract
Diabetes is a deadly chronic disease that occurs when the pancreas is not able to produce ample insulin or when the body cannot use insulin effectively. If undetected, it may lead to a host of health complications. Hence, accurate and explainable early-stage detection of diabetes is essential for the proper administration of treatment options in leading a healthy and productive life. For this, we developed an interpretable TabNet model tuned via Bayesian optimization (BO). To achieve model-specific interpretability, the attention mechanism of the TabNet architecture was used, which offered local and global model explanations of the influence of the attributes on the outcomes. The model was further explained locally and globally using the more robust model-agnostic LIME and SHAP eXplainable Artificial Intelligence (XAI) tools. The proposed model outperformed all benchmarked models by obtaining high accuracy of 92.2% and 99.4% using the Pima Indians diabetes dataset (PIDD) and the early-stage diabetes risk prediction dataset (ESDRPD), respectively. Based on the XAI results, the most influential attributes for diabetes classification using PIDD and ESDRPD were insulin and polyuria, respectively. The registered feature importance values were 0.301 for insulin (PIDD) and 0.206 for polyuria (ESDRPD). The high accuracy and ancillary interpretability of our objective model are expected to increase end-users' trust and confidence in early-stage detection of diabetes.
Affiliation(s)
- Lionel P Joseph
- School of Mathematics, Physics, and Computing, University of Southern Queensland, Springfield, QLD, 4300, Australia
- Erica A Joseph
- Umanand Prasad School of Medicine and Health Sciences, The University of Fiji, Saweni, Lautoka, Fiji
- Ramendra Prasad
- Department of Science, School of Science and Technology, The University of Fiji, Saweni, Lautoka, Fiji