1. García-Baena D, García-Cumbreras MÁ, Jiménez-Zafra SM, García-Díaz JA, Valencia-García R. Hope speech detection in Spanish: The LGBT case. Lang Resour Eval 2023:1-28. [PMID: 37360265] [PMCID: PMC10022560] [DOI: 10.1007/s10579-023-09638-3]
Abstract
In recent years, systems have been developed to monitor online content and remove abusive, offensive or hateful material. Comments on online social media have been analyzed to find and stop the spread of negativity using methods such as hate speech detection, identification of offensive language, and detection of abusive language. We define hope speech as the type of speech that is able to defuse a hostile environment and that helps, gives suggestions, and inspires people for good in times of illness, stress, loneliness or depression. Detecting it automatically, in order to give greater diffusion to positive comments, can have a very significant effect in fighting sexual or racial discrimination and in fostering less hostile environments. In this article we perform a complete study on hope speech, analyzing existing solutions and available resources. In addition, we have generated a quality resource, SpanishHopeEDI, a new Spanish Twitter dataset on the LGBT community, and we have conducted some experiments that can serve as a baseline for further research.
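The binary classification task this abstract describes can be approximated with a minimal bag-of-words baseline. The sketch below is purely illustrative and is not the authors' SpanishHopeEDI baseline: a multinomial Naive Bayes classifier with Laplace smoothing, trained on toy English examples invented for the demonstration.

```python
from collections import Counter
import math

def train_nb(docs):
    """Train a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of documents
    for text, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counter in word_counts.values() for w in counter}
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the most probable label, using Laplace (add-one) smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, ndocs in label_counts.items():
        score = math.log(ndocs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("you are not alone we support you", "hope"),
    ("stay strong things will get better", "hope"),
    ("nobody wants you here go away", "not_hope"),
    ("this group is a disgrace", "not_hope"),
]
model = train_nb(docs)
print(classify(model, "we support you stay strong"))  # → hope
```

In practice, a baseline like this would be trained on the annotated dataset and compared against transformer-based models, as the experiments in the paper do at much larger scale.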
2. Ong SQ, Pauzi MBM, Gan KH. Text mining in mosquito-borne disease: A systematic review. Acta Trop 2022; 231:106447. [PMID: 35430265] [PMCID: PMC9663275] [DOI: 10.1016/j.actatropica.2022.106447]
Abstract
Mosquito-borne diseases are emerging and re-emerging across the globe, especially after the COVID-19 pandemic. Recent advances in text mining for infectious diseases hold the potential of providing timely access to explicit and implicit associations among information in text. In the past few years, the availability of online text data in the form of unstructured or semi-structured text with rich domain content has enabled many studies to provide solutions in this area, e.g., disease-related knowledge discovery, disease surveillance, and early detection systems. However, to the best of our knowledge, no recent review of text mining in the domain of mosquito-borne disease was available. In this review, we survey recent work on text mining techniques used in combating mosquito-borne diseases. We highlight the corpus sources, technologies, applications, and challenges faced by the studies, followed by possible future directions for this domain. We present a bibliometric analysis of the 294 scientific articles on text mining in mosquito-borne diseases published in Scopus and PubMed from 2016 to 2021. The papers were further filtered and reviewed based on the techniques used to analyze text related to mosquito-borne diseases. Based on the corpus of 158 selected articles, we found that 27 articles were relevant and used text mining in mosquito-borne diseases. These articles mostly covered Zika (38.70%), Dengue (32.26%), and Malaria (29.03%), with extremely few or none addressing other crucial mosquito-borne diseases such as chikungunya, yellow fever, and West Nile fever. Twitter was the dominant corpus resource for text mining in mosquito-borne diseases, followed by the PubMed and LexisNexis databases. Sentiment analysis was the most popular text mining technique for understanding the discourse around a disease, followed by information extraction, which used dependency relations and co-occurrence-based approaches to extract relations and events. Surveillance was the main use case of most of the reviewed studies, followed by treatment, which focused on drug-disease or symptom-disease associations. Advances in text mining could improve the management of mosquito-borne diseases. However, the techniques and applications pose many limitations and challenges, including biases such as user authentication and language, and obstacles to real-world implementation. We discuss future directions that could be useful for expanding this area. This review contributes mainly as a library for text mining in mosquito-borne diseases, and the approach could be further explored for other neglected diseases.
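The co-occurrence-based relation extraction mentioned in the review can be sketched as counting disease-drug term pairs that appear within the same sentence. The term lists and corpus below are invented for illustration and are not from any study the review covers.

```python
from collections import Counter
from itertools import product

# Toy gazetteers; a real system would use curated biomedical vocabularies.
DISEASES = {"dengue", "malaria", "zika"}
DRUGS = {"chloroquine", "paracetamol", "artemisinin"}

def cooccurrences(sentences):
    """Count disease-drug pairs that co-occur within the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        for disease, drug in product(tokens & DISEASES, tokens & DRUGS):
            pairs[(disease, drug)] += 1
    return pairs

corpus = [
    "Artemisinin remains the first-line treatment for malaria",
    "Paracetamol is used for fever in dengue patients",
    "Malaria patients responded to artemisinin combination therapy",
]
print(cooccurrences(corpus).most_common(1))  # → [(('malaria', 'artemisinin'), 2)]
```

Pairs with high counts are then treated as candidate relations; dependency-based approaches refine this by requiring a syntactic path between the two terms rather than mere co-presence.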
Affiliation(s)
- Song-Quan Ong
- Institute for Tropical Biology and Conservation, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, Sabah 88400, Malaysia (corresponding author)
- Keng Hoon Gan
- School of Computer Sciences, Universiti Sains Malaysia, Penang 11800, Malaysia
3. García-Díaz JA, Jiménez-Zafra SM, García-Cumbreras MA, Valencia-García R. Evaluating feature combination strategies for hate-speech detection in Spanish using linguistic features and transformers. Complex Intell Syst 2022. [DOI: 10.1007/s40747-022-00693-x]
Abstract
The rise of social networks has allowed misogynistic, xenophobic, and homophobic people to spread their hate speech to intimidate individuals or groups because of their gender, ethnicity or sexual orientation. The consequences of hate speech are devastating, causing severe depression and even leading people to commit suicide. Hate-speech identification is challenging, as the large volume of daily publications makes it impossible to review every comment by hand. Moreover, hate speech is also spread through hoaxes, which require language and context understanding. With the aim of reducing the number of comments that must be reviewed by experts, or even enabling the development of autonomous systems, the automatic identification of hate speech has gained academic relevance. However, the reliability of automatic approaches is still limited, especially in languages other than English, in which some state-of-the-art techniques have not been analyzed in detail. In this work, we examine which features are most effective in identifying hate speech in Spanish and how these features can be combined to develop more accurate systems. In addition, we characterize the language present in each type of hate speech by means of explainable linguistic features and compare our results with state-of-the-art approaches. Our research indicates that combining linguistic features and transformers by means of knowledge integration outperforms current solutions for hate-speech identification in Spanish.
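The knowledge-integration idea described above, combining explainable linguistic features with transformer outputs, is commonly realized by concatenating both feature vectors before a final classifier (early fusion). The sketch below is a hypothetical illustration, not the authors' system; the three linguistic features and the stand-in embedding are invented.

```python
def linguistic_features(text):
    """Hand-crafted, explainable features: word count, uppercase ratio,
    exclamation-mark count."""
    words = text.split()
    n_chars = max(len(text), 1)
    return [
        len(words),
        sum(c.isupper() for c in text) / n_chars,
        text.count("!"),
    ]

def combine(linguistic, embedding):
    """Knowledge integration by early fusion: concatenate both vectors so a
    downstream classifier can weigh explainable and contextual signals."""
    return list(linguistic) + list(embedding)

# A stand-in for the sentence embedding a transformer would produce.
fake_embedding = [0.12, -0.45, 0.88]
features = combine(linguistic_features("THIS IS OUTRAGEOUS!!"), fake_embedding)
print(len(features))  # → 6
```

The combined vector would then feed a conventional classifier; the paper's point is that neither feature family alone matches the fused representation.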
4. García-Díaz JA, Valencia-García R. Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers. Complex Intell Syst 2022. [DOI: 10.1007/s40747-021-00625-1]
Abstract
Satirical content on social media is hard to distinguish from real news, misinformation, hoaxes or propaganda when there are no clues as to the medium in which a news item was originally written. It is important, therefore, to provide Information Retrieval systems with mechanisms to identify which results are legitimate and which are misleading. Our contribution to satire identification is twofold. On the one hand, we release the Spanish SatiCorpus 2021, a balanced dataset that contains satirical and non-satirical documents. On the other hand, we conduct an extensive evaluation of this dataset with linguistic features and embedding-based features. All feature sets are evaluated separately and combined using different strategies. Our best result is achieved with a combination of the linguistic features and BERT, with an accuracy of 97.405%. In addition, we compare our proposal with existing Spanish datasets on satire and irony.
5. Kakulapati V, Reddy SM, Kumar N. Lexical modeling and weighted matrices for analyses of COVID-19 outbreak. Lessons from COVID-19 2022. [PMCID: PMC9347367] [DOI: 10.1016/b978-0-323-99878-9.00005-4]
Abstract
COVID-19, a highly dangerous and infectious disease affecting millions of people, is caused by an enveloped RNA virus known as SARS-CoV-2, or coronavirus; the disease was unknown before the epidemic commenced in Wuhan, China, in December 2019. Many researchers are working to find a vaccine for the pandemic. Here, we analyze diagnostic methods using mathematical modeling. In this chapter, the SVM's optimal diagnostic model is characterized by the most probable category of coronavirus patients with an enhanced AUC. Experimental and computational analyses demonstrate that the diagnosis of potential COVID-19 cases can be supported by adopting ML algorithms that learn linguistic diagnostics from the interpretations of elderly persons. We highlight the collection of significant semantic, lexical, and top n-gram properties together with the better-performing ML method to estimate disease. However, diagnostic methods must be trained on massive datasets, leading to improved AUC and medical diagnoses of COVID-19 probability. A significant benefit of mathematical modeling is that it provides transparency and accuracy about our model. These techniques can help in decision-making by making useful predictions about substantial issues such as treatment protocols, and can help to interfere with and minimize the spread of COVID-19.
6. Singh A, Jenamani M, Thakkar J, Dwivedi YK. A text analytics framework for performance assessment and weakness detection from online reviews. Journal of Global Information Management 2021. [DOI: 10.4018/jgim.304069]
Abstract
The present research proposes a framework that integrates aspect-level sentiment analysis with multi-criteria decision making (TOPSIS) and control charts to uncover hidden quality patterns. While sentiment analysis quantifies consumer opinions corresponding to various product features, TOPSIS uses the sentiment scores to rank manufacturers based on their relative performance. Finally, U and P control charts assist in discovering the weak aspects and corresponding attributes. To extract aspect-level sentiments from reviews, we developed an ontology of passenger cars and designed a heuristic that connects the opinion-bearing text to the exact automobile attribute. The proposed framework was applied to a review dataset collected from a well-known car portal in India. Considering five manufacturers from the mid-size car segment, we identified the weakest one and discovered the aspects and attributes responsible for its perceived weakness.
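TOPSIS, the multi-criteria ranking step in the framework, scores each alternative by its relative closeness to an ideal solution. The sketch below is a generic implementation under the usual vector-normalization convention, not the paper's code; the sentiment matrix and weights are invented.

```python
import math

def topsis(matrix, weights, benefit):
    """Rank alternatives with TOPSIS. matrix[i][j] is alternative i's score on
    criterion j; benefit[j] is True when larger values of criterion j are better."""
    # Vector-normalize each column, then apply criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix))
             for j in range(len(weights))]
    v = [[w * row[j] / norms[j] for j, w in enumerate(weights)] for row in matrix]
    # Ideal (best) and anti-ideal (worst) points per criterion.
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(zip(*v))]
    anti = [min(col) if benefit[j] else max(col) for j, col in enumerate(zip(*v))]
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        scores.append(d_neg / (d_pos + d_neg))  # closeness coefficient in [0, 1]
    return scores

# Three manufacturers scored on two aspects (mean sentiment; higher is better).
sentiment = [[0.8, 0.6], [0.4, 0.9], [0.2, 0.3]]
scores = topsis(sentiment, weights=[0.5, 0.5], benefit=[True, True])
print(max(range(3), key=scores.__getitem__))  # index of the best manufacturer
```

The lowest-scoring row corresponds to the "weakest manufacturer" step in the abstract; the control-chart stage then drills into which aspects drive that score.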
Affiliation(s)
- Amit Singh
- Indian Institute of Technology Jodhpur, India
- Jitesh Thakkar
- National Rail and Transportation Institute, Vadodara, India
7. Alexandridis G, Aliprantis J, Michalakis K, Korovesis K, Tsantilas P, Caridakis G. A knowledge-based deep learning architecture for aspect-based sentiment analysis. Int J Neural Syst 2021; 31:2150046. [PMID: 34435942] [DOI: 10.1142/s0129065721500465]
Abstract
The task of sentiment analysis tries to predict the affective state of a document by examining its content and metadata through the application of machine learning techniques. Recent advances in the field consider sentiment to be a multi-dimensional quantity that pertains to different interpretations (or aspects), rather than a single one. Building on earlier research, the current work examines this task in the framework of a larger architecture that crawls documents from various online sources. Subsequently, the collected data are pre-processed in order to extract useful features that assist the machine learning algorithms in the sentiment analysis task. More specifically, the words that comprise each text are mapped to a neural embedding space and provided to a hybrid, bi-directional long short-term memory network, coupled with convolutional layers and an attention mechanism, that outputs the final textual features. Additionally, a number of document metadata are extracted, including the number of a document's repetitions in the collected corpus (i.e., the number of reposts/retweets), the frequency and type of emoji ideograms, and the presence of keywords, either extracted automatically or assigned manually in the form of hashtags. The novelty of the proposed approach lies in the semantic annotation of the retrieved keywords: an ontology-based knowledge management system is queried to retrieve the classes the aforementioned keywords belong to. Finally, all features are provided to a fully connected, multi-layered, feed-forward artificial neural network that performs the analysis task. The overall architecture is compared, on a manually collected corpus of documents, with two other state-of-the-art approaches, achieving optimal results in identifying negative sentiment, which is of particular interest to parties (for example, companies) that wish to measure their online reputation.
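The metadata-extraction step described above (repetition counts, emoji frequency, hashtags) can be sketched in a few lines. This is an illustrative approximation, not the authors' pipeline; the feature set, regular expressions, and example posts are assumptions.

```python
import re
from collections import Counter

# Rough emoji ranges; a production system would use a proper emoji database.
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def metadata_features(documents):
    """Document-level metadata features to accompany the textual ones:
    repetition count (reposts of identical text), emoji count, hashtags."""
    repetitions = Counter(documents)
    features = []
    for doc in documents:
        features.append({
            "reposts": repetitions[doc] - 1,       # copies beyond the original
            "emoji": len(EMOJI.findall(doc)),      # emoji ideogram count
            "hashtags": re.findall(r"#\w+", doc),  # manually assigned keywords
        })
    return features

docs = [
    "Great service! #happy 😀",
    "Great service! #happy 😀",
    "Terrible support, never again #fail",
]
feats = metadata_features(docs)
print(feats[0]["reposts"], feats[0]["emoji"], feats[2]["hashtags"])
```

In the full architecture, the extracted hashtags would additionally be looked up in the ontology-based knowledge management system to obtain their semantic classes before being fed to the final network.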
Affiliation(s)
- Georgios Alexandridis
- Cultural Technology Department, University of the Aegean, University Hill, Mytilene 81100, Greece
- John Aliprantis
- Cultural Technology Department, University of the Aegean, University Hill, Mytilene 81100, Greece
- Konstantinos Michalakis
- Cultural Technology Department, University of the Aegean, University Hill, Mytilene 81100, Greece
- George Caridakis
- Cultural Technology Department, University of the Aegean, University Hill, Mytilene 81100, Greece
8. Automatic correction of real-word errors in Spanish clinical texts. Sensors 2021; 21:s21092893. [PMID: 33919018] [PMCID: PMC8122440] [DOI: 10.3390/s21092893]
Abstract
Real-word errors are characterized by being actual terms in the dictionary, so they can only be detected by considering context. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus; the probability of a word being a real-word error is then computed from these counts. State-of-the-art approaches, on the other hand, use deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model was implemented to correct real-word errors in clinical text. Specifically, a Seq2seq neural machine translation model mapped erroneous sentences to corrected ones. For this, different types of errors were generated in correct sentences using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medical corpus was much smaller than the Wikicorpus due to privacy issues when dealing with patient information. Moreover, GloVe and Word2Vec pretrained word embeddings were used to study their effect on performance. Despite the medical corpus being much smaller than the Wikicorpus, Seq2seq models trained on it performed better than those trained on the Wikicorpus. Nevertheless, a larger amount of clinical text is required to improve the results.
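The traditional frequency-based approach the abstract contrasts with Seq2seq can be sketched as follows: flag a word as a candidate real-word error when the bigrams linking it to its neighbors are rare in a training corpus. The corpus, threshold, and examples below are invented for illustration and are not the paper's data.

```python
from collections import Counter

def train_bigrams(corpus):
    """Count word bigrams in a training corpus."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def suspicious_words(sentence, bigrams, threshold=1):
    """Flag dictionary words whose surrounding bigrams are all rare:
    candidate real-word errors."""
    words = sentence.lower().split()
    flagged = []
    for i, word in enumerate(words):
        left = bigrams[(words[i - 1], word)] if i > 0 else threshold
        right = bigrams[(word, words[i + 1])] if i < len(words) - 1 else threshold
        if left < threshold and right < threshold:
            flagged.append(word)
    return flagged

corpus = [
    "the patient was given two tablets",
    "the patient was discharged yesterday",
    "two tablets were given to the patient",
]
model = train_bigrams(corpus)
# "patience" is a valid dictionary word but unlikely in this context.
print(suspicious_words("the patience was given two tablets", model))  # → ['patience']
```

A Seq2seq model replaces this counting with learned semantic context, which is why it can generalize beyond the exact n-grams seen in training.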
9. Gbashi S, Adebo OA, Doorsamy W, Njobeh PB. Systematic delineation of media polarity on COVID-19 vaccines in Africa: Computational linguistic modeling study. JMIR Med Inform 2021; 9:e22916. [PMID: 33667172] [PMCID: PMC7968413] [DOI: 10.2196/22916]
Abstract
BACKGROUND: The global onset of COVID-19 has resulted in substantial public health and socioeconomic impacts. An immediate medical breakthrough is needed. However, parallel to the emergence of the COVID-19 pandemic is the proliferation of information regarding it, which, if uncontrolled, can not only mislead the public but also hinder the concerted efforts of relevant stakeholders in mitigating the effect of the pandemic. It is known that media communications can affect public perception of, and attitude toward, a medical treatment, vaccination, or subject matter, particularly when the population has limited knowledge on the subject.
OBJECTIVE: This study attempts to systematically scrutinize media communications (Google News headlines or snippets and Twitter posts) to understand the prevailing sentiments regarding COVID-19 vaccines in Africa.
METHODS: A total of 637 Twitter posts and 569 Google News headlines or descriptions, retrieved between February 2 and May 5, 2020, were analyzed using three standard computational linguistics models (i.e., TextBlob, Valence Aware Dictionary and Sentiment Reasoner, and Word2Vec combined with a bidirectional long short-term memory neural network).
RESULTS: Our findings revealed that, contrary to general perceptions, Google News headlines or snippets and Twitter posts within the stated period were generally passive or positive toward COVID-19 vaccines in Africa. These patterns can be understood in light of increasingly sustained efforts by various media and health actors to ensure the availability of factual information about the pandemic.
CONCLUSIONS: This type of analysis could contribute to understanding predominant polarities and associated potential attitudinal inclinations. Such knowledge could be critical in informing relevant public health and media engagement policies.
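A lexicon-based model such as TextBlob or VADER essentially sums per-word polarity scores with simple negation handling. The toy lexicon and headlines below are invented; this is a sketch of the general technique, not the study's actual models or data.

```python
# Toy polarity lexicon; real lexicons contain thousands of scored entries.
LEXICON = {
    "effective": 1.5, "hope": 1.0, "safe": 1.0, "progress": 1.0,
    "risk": -1.0, "fear": -1.5, "hoax": -2.0, "death": -2.0,
}
NEGATIONS = {"not", "no", "never"}

def polarity(text):
    """Sum lexicon scores, flipping the sign after a negation word,
    and normalize by text length."""
    words = text.lower().split()
    score, sign = 0.0, 1.0
    for word in words:
        if word in NEGATIONS:
            sign = -1.0
            continue
        score += sign * LEXICON.get(word, 0.0)
        sign = 1.0  # negation only scopes over the next word
    return score / max(len(words), 1)

headlines = [
    "new vaccine shows progress and hope",
    "vaccine hoax spreads fear",
]
print([polarity(h) > 0 for h in headlines])  # → [True, False]
```

Aggregating such scores over a corpus of headlines yields the overall polarity profile the study reports; the Word2Vec-plus-BiLSTM model replaces the fixed lexicon with learned representations.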
Affiliation(s)
- Sefater Gbashi
- Faculty of Science, University of Johannesburg, Johannesburg, South Africa
- Wesley Doorsamy
- Institute for Intelligent Systems, University of Johannesburg, Johannesburg, South Africa
10. Comparing deep-learning architectures and traditional machine-learning approaches for satire identification in Spanish tweets. Mathematics 2020. [DOI: 10.3390/math8112075]
Abstract
Automatic satire identification can help to identify texts in which the intended meaning differs from the literal meaning, improving tasks such as sentiment analysis, fake news detection or natural-language user interfaces. Typically, satire identification is performed by training a supervised classifier to find linguistic clues that can determine whether a text is satirical or not. For this, the state of the art relies on neural networks fed with word embeddings that are capable of learning interesting characteristics regarding the way humans communicate. However, to the best of our knowledge, there are no comprehensive studies that evaluate these techniques for satire identification in Spanish. Consequently, in this work we evaluate several deep-learning architectures with Spanish pre-trained word embeddings and compare the results with strong baselines based on term-counting features. This evaluation is performed on two datasets that contain satirical and non-satirical tweets written in two Spanish variants: European Spanish and Mexican Spanish. Our experiments revealed that term-counting features achieved results similar to deep-learning approaches based on word embeddings, both outperforming previous results based on linguistic features. Our results suggest that term-counting features and traditional machine-learning models provide competitive results for automatic satire identification, slightly outperforming state-of-the-art models.
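The term-counting baseline the study compares against embedding-based models can be sketched as plain term-frequency vectors over a shared vocabulary. The example tweets are invented for illustration; this is not the authors' feature set or data.

```python
from collections import Counter

def term_count_vectors(texts):
    """Turn raw texts into term-frequency vectors over a shared vocabulary,
    the classic bag-of-words baseline compared against embedding models."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

tweets = [
    "gobierno anuncia que el lunes sera martes",  # satirical-style tweet
    "el gobierno anuncia nuevas medidas",         # non-satirical-style tweet
]
vocab, vectors = term_count_vectors(tweets)
print(len(vocab), vectors[0][vocab.index("gobierno")])  # → 9 1
```

Each vector would then feed a traditional classifier such as logistic regression or an SVM, the kind of model the study found competitive with deep-learning architectures.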