1
Mnassri K, Farahbakhsh R, Chalehchaleh R, Rajapaksha P, Jafari AR, Li G, Crespi N. A survey on multi-lingual offensive language detection. PeerJ Comput Sci 2024; 10:e1934. [PMID: 38660178 PMCID: PMC11042037 DOI: 10.7717/peerj-cs.1934] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/27/2023] [Accepted: 02/18/2024] [Indexed: 04/26/2024]
Abstract
Offensive content is increasingly common on online communication and social media platforms, which makes its detection difficult, especially in multilingual settings. The term "Offensive Language" encompasses a wide range of expressions, including various forms of hate speech and aggressive content. Exploring multilingual offensive content, rather than focusing on a single language, therefore captures greater linguistic diversity and more cultural factors. By exploring multilingual offensive content, we can broaden our understanding and more effectively combat the global impact of offensive language. This survey examines the current state of multilingual offensive language detection, including a comprehensive analysis of previous multilingual approaches and existing datasets, and provides resources in the field. We also explore the related community challenges of this task, which are technical, cultural, and linguistic, as well as their limitations. Furthermore, we propose several potential future directions toward more efficient solutions for multilingual offensive language detection, enabling a safer digital communication environment worldwide.
Affiliation(s)
- Khouloud Mnassri
  - Samovar, Telecom SudParis, Institut Polytechnique de Paris, Palaiseau, France
- Reza Farahbakhsh
  - Samovar, Telecom SudParis, Institut Polytechnique de Paris, Palaiseau, France
- Razieh Chalehchaleh
  - Samovar, Telecom SudParis, Institut Polytechnique de Paris, Palaiseau, France
- Praboda Rajapaksha
  - Samovar, Telecom SudParis, Institut Polytechnique de Paris, Palaiseau, France
- Amir Reza Jafari
  - Samovar, Telecom SudParis, Institut Polytechnique de Paris, Palaiseau, France
- Guanlin Li
  - Samovar, Telecom SudParis, Institut Polytechnique de Paris, Palaiseau, France
- Noel Crespi
  - Samovar, Telecom SudParis, Institut Polytechnique de Paris, Palaiseau, France
2
Mekacher A, Falkenberg M, Baronchelli A. The systemic impact of deplatforming on social media. PNAS Nexus 2023; 2:pgad346. [PMID: 37954163 PMCID: PMC10638500 DOI: 10.1093/pnasnexus/pgad346] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/05/2023] [Accepted: 10/06/2023] [Indexed: 11/14/2023]
Abstract
Deplatforming, or banning malicious accounts from social media, is a key tool for moderating online harms. However, the consequences of deplatforming for the wider social media ecosystem have been largely overlooked so far, due to the difficulty of tracking banned users. Here, we address this gap by studying the ban-induced platform migration from Twitter to Gettr. With a matched dataset of 15M Gettr posts and 12M Twitter tweets, we show that users active on both platforms post content similar to that of users active on Gettr but banned from Twitter, although the latter have higher retention and are 5 times more active. Our results suggest that increased Gettr use is not associated with a substantial increase in user toxicity over time. In fact, we reveal that matched users are more toxic on Twitter, where they can engage in abusive cross-ideological interactions, than on Gettr. Our analysis shows that the matched cohort are ideologically aligned with the far-right, and that the ability to interact with political opponents may be part of Twitter's appeal to these users. Finally, we identify structural changes in the Gettr network preceding the 2023 Brasília insurrections, highlighting the risks that poorly regulated social media platforms may pose to democratic life.
Affiliation(s)
- Amin Mekacher
  - Department of Mathematics, City University of London, London EC1V 0HB, UK
- Max Falkenberg
  - Department of Mathematics, City University of London, London EC1V 0HB, UK
- Andrea Baronchelli
  - Department of Mathematics, City University of London, London EC1V 0HB, UK
  - The Alan Turing Institute, British Library, London NW1 2DB, UK
3
Mattei M, Pratelli M, Caldarelli G, Petrocchi M, Saracco F. Bow-tie structures of Twitter discursive communities. Sci Rep 2022; 12:12944. [PMID: 35902625 PMCID: PMC9332050 DOI: 10.1038/s41598-022-16603-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/07/2022] [Accepted: 07/12/2022] [Indexed: 11/23/2022]
Abstract
Bow-tie structures were introduced to describe the World Wide Web (WWW): in the directed network in which the nodes are the websites and the edges are the hyperlinks connecting them, most nodes take part in a bow-tie, i.e. a Weakly Connected Component (WCC) composed of 3 main sectors: IN, OUT and SCC. SCC is the main Strongly Connected Component of the WCC, i.e. the largest subgraph in which each node is reachable from every other one. The IN and OUT sectors are the sets of nodes not included in the SCC that, respectively, can reach and can be reached from nodes in the SCC. In the WWW, most websites are found in the SCC, while search engines belong to IN and authorities, such as Wikipedia, are in OUT. In the analysis of Twitter debate, the recent literature has focused on discursive communities, i.e. clusters of accounts interacting among themselves via retweets. In the present work, we studied discursive communities in 8 different thematic Twitter datasets in various languages. Surprisingly, we observed that almost all discursive communities therein display a bow-tie structure during political or societal debates. By contrast, bow-tie structures are absent when the topic of discussion is different, such as sports events, as in the case of the Euro2020 Turkish and Italian datasets. We furthermore analysed the quality of the content created in the various sectors of the different discursive communities, using the domain annotation from the fact-checking website Newsguard: we observe that, when the discursive community is affected by m/disinformation, the content with the lowest quality is the one produced and shared in the SCC and, in particular, a strong incidence of low- or non-reputable messages is present in the flow of retweets between the SCC and the OUT sectors.
In this sense, in discursive communities affected by m/disinformation, most accounts have access to a great variety of content whose quality is, in general, quite low; such a situation aptly describes the phenomenon of an infodemic, i.e. access to "an excessive amount of information about a problem, which makes it difficult to identify a solution", according to WHO.
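The bow-tie decomposition described in this abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names `reachable` and `bow_tie` and the toy edge list are hypothetical, the graph is given as a list of directed (source, target) pairs, and the main SCC is found by a simple reachability intersection, which is quadratic in the worst case. A real retweet network would call for a linear-time SCC algorithm such as Tarjan's or Kosaraju's.

```python
from collections import defaultdict, deque


def reachable(start_nodes, adjacency):
    """Return every node reachable from start_nodes (inclusive) via BFS."""
    seen = set(start_nodes)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges):
    """Split a directed graph into the SCC, IN and OUT sectors of its main bow-tie."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
        nodes.update((src, dst))

    # The SCC containing a node is the intersection of the nodes it can reach
    # and the nodes that can reach it; keep the largest such component.
    scc, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        component = reachable([node], succ) & reachable([node], pred)
        assigned |= component
        if len(component) > len(scc):
            scc = component

    in_sector = reachable(scc, pred) - scc   # nodes that can reach the SCC
    out_sector = reachable(scc, succ) - scc  # nodes reachable from the SCC
    return scc, in_sector, out_sector


# Hypothetical toy graph: "a" feeds the cycle b -> c -> d -> b, which feeds "e".
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("d", "e")]
scc, in_sector, out_sector = bow_tie(toy_edges)
print(sorted(scc), sorted(in_sector), sorted(out_sector))
# prints ['b', 'c', 'd'] ['a'] ['e']
```

Because IN and OUT are defined by reachability to and from the SCC, they necessarily lie in the same weakly connected component as the SCC, so the sketch does not need to compute the WCC separately.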
Affiliation(s)
- Mattia Mattei
  - IMT School For Advanced Studies Lucca, p.zza San Francesco 19, 55100, Lucca, Italy
  - Alephsys Lab, Universitat Rovira i Virgili, Av. Paisos Catalans 26, 43007, Tarragona, Catalonia, Spain
- Manuel Pratelli
  - IMT School For Advanced Studies Lucca, p.zza San Francesco 19, 55100, Lucca, Italy
  - Institute of Informatics and Telematics, National Research Council, via Moruzzi 1, 56124, Pisa, Italy
- Guido Caldarelli
  - IMT School For Advanced Studies Lucca, p.zza San Francesco 19, 55100, Lucca, Italy
  - Department of Molecular Sciences and Nanosystems, Ca' Foscari University of Venice, Ed. Alfa, Via Torino 155, 30170, Venezia Mestre, Italy
  - European Centre for Living Technology (ECLT), Ca' Bottacin, 3911 Dorsoduro Calle Crosera, 30123, Venice, Italy
- Marinella Petrocchi
  - IMT School For Advanced Studies Lucca, p.zza San Francesco 19, 55100, Lucca, Italy
  - Institute of Informatics and Telematics, National Research Council, via Moruzzi 1, 56124, Pisa, Italy
- Fabio Saracco
  - IMT School For Advanced Studies Lucca, p.zza San Francesco 19, 55100, Lucca, Italy
  - Institute for Applied Mathematics "Mauro Picone", National Research Council, via dei Taurini 19, 00185, Rome, Italy
  - "Enrico Fermi" Research Center, via Panisperna 89 A, 00184, Rome, Italy
4
Alhayan F, Pennington D, Ayouni S. Twitter use by the dementia community during COVID-19: a user classification and social network analysis. Online Information Review 2022. [DOI: 10.1108/oir-04-2021-0208] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/10/2023]
Abstract
Purpose: The study aimed to examine how different communities concerned with dementia engage and interact on Twitter.
Design/methodology/approach: A dataset was sampled from 8,400 user profile descriptions, labelled into five categories, and subjected to multiple machine learning (ML) classification experiments based on text features to classify user categories. Social network analysis (SNA) was used to identify influential communities via graph-based metrics on user categories. The relationship between bot score and network metrics in these groups was also explored.
Findings: Classification accuracy reached 82% using a support vector machine (SVM). The SNA revealed influential behaviour at both the category and node levels. Suspected social bots (about 2.19%) contributed to the coronavirus disease 2019 (COVID-19) dementia discussions in different communities.
Originality/value: The study is a unique attempt to apply SNA to examine the most influential groups of Twitter users in the dementia community. The findings also highlight the capability of ML methods for efficient multi-category classification in a crisis, considering the fast-paced generation of data.
Peer review: The peer review history for this article is available at: https://publons.com/publon/10.1108/OIR-04-2021-0208.
5
Evkoski B, Ljubešić N, Pelicon A, Mozetič I, Kralj Novak P. Evolution of topics and hate speech in retweet network communities. Applied Network Science 2021; 6:96. [PMID: 34957317 PMCID: PMC8686097 DOI: 10.1007/s41109-021-00439-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/09/2021] [Accepted: 12/10/2021] [Indexed: 06/14/2023]
Abstract
Twitter data exhibits several dimensions worth exploring: a network dimension in the form of links between the users, the textual content of the tweets posted, and a temporal dimension as the time-stamped sequence of tweets and their retweets. In this paper, we combine analyses along all three dimensions: temporal evolution of retweet networks and communities, contents in terms of hate speech, and discussion topics. We apply the methods to a comprehensive set of all Slovenian tweets collected in the years 2018-2020. We find that politics and ideology are the prevailing topics despite the emergence of the Covid-19 pandemic. These two topics also attract the highest proportion of unacceptable tweets. Over time, the membership of retweet communities changes, but their topic distribution remains remarkably stable. Some retweet communities are strongly linked by external retweet influence and form super-communities. The super-community membership closely corresponds to the topic distribution: communities from the same super-community have very similar topic distributions, and communities from different super-communities are quite different in terms of discussion topics. However, we also find that even communities from the same super-community differ considerably in the proportion of unacceptable tweets they post.
Affiliation(s)
- Bojan Evkoski
  - Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
  - Jozef Stefan International Postgraduate School, Ljubljana, Slovenia
- Nikola Ljubešić
  - Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
  - Faculty of Information and Communication Sciences, University of Ljubljana, Ljubljana, Slovenia
- Andraž Pelicon
  - Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
  - Jozef Stefan International Postgraduate School, Ljubljana, Slovenia
- Igor Mozetič
  - Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
- Petra Kralj Novak
  - Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
  - Central European University, Vienna, Austria