MUDES: Multilingual Detection of Offensive Spans

The interest in offensive content identification in social media has grown substantially in recent years. Previous work has dealt mostly with post-level annotations; however, identifying offensive spans is useful in many ways. To help cope with this challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for developers, and a user-friendly web-based interface. A detailed description of MUDES' components is presented in this paper.


Introduction
Offensive and impolite language is widespread in social media posts, motivating a number of studies on automatically detecting its various types, e.g. aggression (Kumar et al., 2018, 2020), cyber-bullying (Rosa et al., 2019), and hate speech. Most previous work has focused on classifying full instances (e.g. posts, comments, documents) as, for example, offensive vs. not offensive, while the identification of the particular spans that make a text offensive has been mostly neglected.
Identifying offensive spans in texts is the goal of the SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021). The organisers of this task argue that highlighting toxic spans in texts assists human moderators (e.g. moderators of news portals) and that this can be a first step towards semi-automated content moderation. Finally, as we demonstrate in this paper, addressing offensive spans in texts makes the output of offensive language detection systems more interpretable, allowing a more detailed linguistic analysis of predictions and improving the quality of such systems.
With these important points in mind, we developed MUDES: Multilingual Detection of Offensive Spans. (WARNING: This paper contains text excerpts and words that are offensive in nature.) MUDES is a multilingual framework for offensive language detection focusing on text spans. The main contributions of this paper are the following: 1. We introduce MUDES, a new Python-based framework to identify offensive spans with state-of-the-art performance.
2. We release four pre-trained offensive language identification models: en-base and en-large, which are capable of identifying offensive spans in English text, and multilingual-base and multilingual-large, which are able to recognise offensive spans in languages other than English.
3. We release a Python Application Programming Interface (API) for developers who are interested in training more models and performing inference at the code level.
4. For general users and non-programmers, we release a user-friendly web-based User Interface (UI), which provides the functionality to input a text in multiple languages and to identify the offensive spans in that text.

Related Work
Early approaches to offensive language identification relied on traditional machine learning classifiers (Dadvar et al., 2013) and, later, on neural networks combined with word embeddings (Majumder et al., 2018; Hettiarachchi and Ranasinghe, 2019). Transformer-based models like BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018) have recently been applied to offensive language detection, achieving competitive scores (Wang et al., 2020; Ranasinghe and Hettiarachchi, 2020) in SemEval competitions such as HatEval (Basile et al., 2019) and OffensEval. In terms of languages, the majority of studies on this topic deal with English (Malmasi and Zampieri, 2017; Yao et al., 2019; Ridenhour et al., 2020; Rosenthal et al., 2020) due to the wide availability of language resources such as corpora and pre-trained models. In recent years, several studies have been published on identifying offensive content in other languages such as Arabic (Mubarak et al., 2020), Dutch (Tulkens et al., 2016), French (Chiril et al., 2019), Greek (Pitenis et al., 2020), Italian (Poletto et al., 2017), Portuguese (Fortuna et al., 2019), and Turkish (Çöltekin, 2020). Most of these studies have created new datasets and resources for these languages, opening avenues for multilingual models such as those presented in Ranasinghe and Zampieri (2020). However, all studies presented in this section focused on classifying full texts, as discussed in the Introduction. MUDES' objective is to fill this gap and perform span-level offensive language identification.

Data
The main dataset used to train the machine learning models presented in this paper is the dataset released within the scope of the aforementioned SemEval-2021 Task 5: Toxic Spans Detection for English. The dataset contains posts (comments) from the publicly available Civil Comments dataset (Borkan et al., 2019). The organisers randomly selected 10,000 posts out of a total of 1.2 million posts in the original dataset. The offensive spans were annotated using a crowd-annotation platform, employing three crowd-raters per post. At the time of writing, only the trial set and the training set have been released, and the gold labels for the test set are not yet available. Therefore, the machine learning models presented in MUDES were trained on the training set, which we refer to as TSDTrain, and evaluated on the trial set, which we refer to as the TSDTrial set. In Table 1 we show four randomly selected examples from the TSDTrain dataset with their annotations.
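The span annotations in TSDTrain are given as lists of character offsets into the post. As a minimal illustration, the sketch below groups such offsets into contiguous spans; the helper name and the stringified-list input format are our assumptions, not part of the official data loader:

```python
import ast

def offsets_to_spans(text, offsets):
    """Group sorted character offsets into contiguous (start, end, substring) spans."""
    spans = []
    for off in sorted(offsets):
        if spans and off == spans[-1][1]:   # offset extends the current span
            spans[-1] = (spans[-1][0], off + 1)
        else:                               # offset starts a new span
            spans.append((off, off + 1))
    return [(s, e, text[s:e]) for s, e in spans]

# Example in the style of Table 1: offsets stored as a stringified list.
text = "You're just silly."
offsets = ast.literal_eval("[12, 13, 14, 15, 16]")
print(offsets_to_spans(text, offsets))  # → [(12, 17, 'silly')]
```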
The general idea is to learn a robust model from this dataset and generalise to other English datasets which do not contain span annotations. Another goal is to investigate the feasibility of annotation projection to other languages.
Other Datasets In order to evaluate our framework in different domains and languages, we used three publicly available offensive language identification datasets. As an off-domain English dataset, we chose the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a), used in OffensEval 2019 (SemEval-2019 Task 6) (Zampieri et al., 2019b), containing over 14,000 posts from Twitter. To evaluate our framework in different languages, we selected a Danish (Sigurbergsson and Derczynski, 2020) and a Greek (Pitenis et al., 2020) dataset. These two datasets were provided by the organisers of OffensEval 2020 (SemEval-2020 Task 12) and were annotated using OLID's annotation guidelines. The Danish dataset contains over 3,000 posts from Facebook and Reddit, while the Greek dataset contains over 10,000 Twitter posts, allowing us to evaluate our framework in an off-domain, multilingual setting. As these three datasets are annotated at the instance level, we followed the evaluation process explained in Section 5.

Methodology
The main motivation behind this methodology is the recent success that transformer models have had in various NLP tasks (Devlin et al., 2019), including offensive language identification (Ranasinghe et al., 2019; Wiedemann et al., 2020). Most of these transformer-based approaches take the final hidden state of the first token ([CLS]) as the representation of the whole sequence, and a simple softmax classifier is added on top of the transformer model to predict the probability of a class label (Sun et al., 2019). However, as previously mentioned, these models classify whole comments or documents and do not identify the spans that make a text offensive. Since the objective of this task is to identify offensive spans rather than classifying the whole comment, we followed a different architecture.
As shown in Figure 1, the complete architecture contains two main parts: Language Modeling (LM) and Token Classification (TC). In the LM part, we used a pre-trained transformer model and retrained it on the TSDTrain dataset using Masked Language Modeling (MLM). In the second part of the architecture, we used the saved model from the LM part and performed token classification, adding a token-level classifier on top of the transformer model as shown in Figure 1. The token-level classifier is a linear layer that takes the last hidden state of the sequence as input and produces a label for each token as output. Each token can have one of two labels: offensive and not offensive. We have listed the training configurations in the Appendix.
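The token classification step described above can be sketched as a single linear layer applied to every position of the transformer's last hidden state, producing one offensive/not-offensive logit pair per token. A minimal numpy sketch with toy dimensions follows; the sizes and random weights are illustrative assumptions, not MUDES' actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size, num_labels = 6, 8, 2   # toy dimensions

# Stand-in for the transformer's last hidden state (one vector per token).
hidden_states = rng.normal(size=(seq_len, hidden_size))

# Linear token-classification head: W x + b applied at every token position.
W = rng.normal(size=(hidden_size, num_labels))
b = np.zeros(num_labels)
logits = hidden_states @ W + b               # shape: (seq_len, num_labels)

# Label 1 = offensive, label 0 = not offensive.
predictions = logits.argmax(axis=-1)
print(logits.shape, predictions.shape)       # (6, 2) (6,)
```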
We experimented with several popular transformer models such as BERT (Devlin et al., 2019), XLNet, ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2019). From the pre-trained transformer models we selected, we grouped the large models and the base models separately in order to release two English models: en-large, which is more accurate but less efficient in terms of space and time, and en-base, which is efficient but comparatively less accurate than en-large. All experiments were executed five times with different random seeds, and we took the mode of the classes predicted by each random seed as the final result (Hettiarachchi and Ranasinghe, 2020).
Multilingual models - The motivation behind the use of multilingual models comes from recent work (Ranasinghe and Zampieri, 2020, 2021) which used transfer learning and cross-lingual embeddings. These studies show that cross-lingual transformers like XLM-R (Conneau et al., 2019) can be trained on an English dataset, with the model weights saved, to detect offensive language in other languages, outperforming monolingual models trained on the target language dataset. We used a similar methodology, but for the token classification architecture instead. We trained the XLM-R cross-lingual transformer model (Conneau et al., 2019), used as the Transformer in Figure 1, on TSDTrain and carried out evaluations on the Danish and Greek datasets. We release two multilingual models: multilingual-base, based on the XLM-R base model, and multilingual-large, based on the XLM-R large model.

Evaluation and Results
We followed two different evaluation methods. In Section 5.1 we present the methods used to evaluate offensive spans on the TSDTrial set. In Section 5.2 we present the methods used to evaluate on the other three datasets, which only contain post-level annotations.

Offensive Spans Evaluation
For the Toxic Spans Detection dataset, we followed the evaluation procedure of the SemEval Toxic Spans Detection competition. The organisers used the F1 score of Da San Martino et al. (2019) to evaluate the systems. Let system A_i return a set S^t_{A_i} of character offsets for the parts of post t found to be toxic, and let G^t be the character offsets of the ground truth annotations of t. We compute the F1 score of system A_i with respect to the ground truth G for post t as in Equation 1, where |·| denotes set cardinality:

F1^t(A_i, G) = 2 · P^t(A_i, G) · R^t(A_i, G) / (P^t(A_i, G) + R^t(A_i, G)),   (1)

where P^t(A_i, G) = |S^t_{A_i} ∩ G^t| / |S^t_{A_i}| and R^t(A_i, G) = |S^t_{A_i} ∩ G^t| / |G^t|.
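This per-post score reduces to simple set operations over character offsets. A minimal sketch follows, assuming predictions and ground truth are given as sets of toxic character offsets; the convention for empty sets is our assumption, matching the usual task convention:

```python
def span_f1(predicted, gold):
    """Character-offset F1 for one post.

    predicted, gold: iterables of character offsets marked toxic.
    By convention (an assumption here), an empty prediction scores 1.0
    if the gold set is also empty, else 0.0.
    """
    pred, gold = set(predicted), set(gold)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Gold span covers offsets 12..16; the system predicts 12..17.
print(span_f1({12, 13, 14, 15, 16, 17}, {12, 13, 14, 15, 16}))  # ≈ 0.909
```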
We present the results, along with the baseline provided by the organisers, in Table 2. The baseline is implemented using a spaCy NER pipeline. The spaCy NER system contains a word embedding strategy using subword features and Bloom embeddings (Serrà and Karatzoglou, 2017), and a deep convolutional neural network with residual connections. Additionally, we compare our results to the lexicon-based word match approach mentioned in Ranasinghe et al. (2021), where the lexicon is based on profanity words from online resources 1,2 . The results show that all MUDES models outperform the spaCy baseline and the lexicon-based word match. Of all the large transformer models we experimented with, roberta-large performed best; therefore, we released it as the en-large model in MUDES. Of the base models we experimented with, xlnet-base-cased outperformed all the others, so we released it as the en-base model. We also released two multilingual models, multilingual-base and multilingual-large, based on XLM-R-base and XLM-R-large respectively. All pre-trained MUDES models are available for download from the HuggingFace model hub 3 (Wolf et al., 2020).

Off-Domain and Multilingual Evaluation
For the English off-domain and multilingual datasets we followed a different evaluation process. We used a pre-trained MUDES model trained on TSDTrain to predict the offensive spans for all texts in the test sets of the two non-English datasets (Danish and Greek) and of the English off-domain dataset, OLID, which is annotated at the document level. If a certain text contains at least one offensive span, we marked the whole text as offensive, following the OLID annotation guidelines described in Zampieri et al. (2019a). We compared our results to the best systems submitted to OffensEval 2020 in terms of macro F1, as reported by the task organisers. We present the results, along with the majority class baseline for each dataset, in Table 3. For the English off-domain dataset (OLID) we only used the MUDES en models, while for the Danish and Greek datasets we used the MUDES multilingual models.
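The span-to-post mapping used here is easy to make concrete: a post is labelled offensive if and only if at least one offensive span was predicted for it. A minimal sketch with toy data follows; the span offsets, labels, and macro-F1 helper below are illustrative only, not the task organisers' scorer:

```python
def post_level_labels(predicted_spans):
    """OLID-style label per post: OFF if any offensive span was predicted."""
    return ["OFF" if spans else "NOT" for spans in predicted_spans]

def macro_f1(gold, pred, labels=("OFF", "NOT")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy span predictions for four posts (offsets are illustrative).
spans = [[(0, 4)], [], [(3, 9), (15, 20)], []]
pred = post_level_labels(spans)   # ['OFF', 'NOT', 'OFF', 'NOT']
gold = ["OFF", "NOT", "OFF", "OFF"]
print(pred, macro_f1(gold, pred))
```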

Python Library

MUDES can be installed as a Python package with pip:

$ pip install mudes

The Python package contains the following functionalities.

Get offensive spans with a pretrained model
The library provides the functionality to load a pre-trained model and use it to identify offensive spans. The following code segment downloads and loads MUDES' en-base model in a CPU-only environment and identifies offensive spans in the text "This is fucking crazy!!". If users prefer a GPU, the argument use_cuda should be set to True.
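A sketch of such a code segment is shown below; the `MUDESApp` class and `predict_toxic_spans` method names reflect our understanding of the MUDES Python API and should be treated as assumptions if your installed version differs:

```python
from mudes.app.mudes_app import MUDESApp

# Download and load the en-base model; set use_cuda=True for GPU inference.
app = MUDESApp("en-base", use_cuda=False)

# Returns the offensive spans detected in the input text.
print(app.predict_toxic_spans("This is fucking crazy!!", spans=True))
```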

User Interface
We developed a prototype of the User Interface (UI) to demonstrate the capabilities of the system. The UI is based on Streamlit 7 , which provides functionalities to easily develop dashboards for machine learning projects. The code base for the UI is available on GitHub 8 and the UI is hosted on a Linux server 9 . We also release a Docker container image of the UI on Docker Hub 10 for those interested in self-hosting the UI. Docker enables developers to easily deploy and run any application as a lightweight, portable, self-sufficient container, which can run virtually anywhere. The released Docker container image follows Continuous Integration/Continuous Deployment (CI/CD) from the GitHub repository, which allows sharing and deploying the code quickly and efficiently.
Once Docker is installed, one can easily run our UI with this command.

$ docker run tharindudr/mudes
This command will automatically install all the required packages, download and load the pre-trained models and open the system in the default browser. We provide the following functionalities from the user interface.
Switch through pre-trained models - Users can switch between the pre-trained models using the radio buttons available on the left side of the UI, under the Available Models section. They can select one of en-base, en-large, multilingual-base, and multilingual-large. These models have already been downloaded from the HuggingFace model hub and are loaded into the random-access memory of the hosting machine.
Switch through available datasets - We have made the four datasets used in this paper available from the UI for users to experiment with (Borkan et al., 2019; Zampieri et al., 2019a; Pitenis et al., 2020; Sigurbergsson and Derczynski, 2020).
Once the user selects a particular option, the system automatically loads the test set of the selected dataset. Once it is loaded, the user can iterate through the dataset using the scrollbar, and for each text the UI displays the offensive spans in red.
Get offensive spans for a custom text - Users can also enter a custom text in the text box, hit ctrl+enter, and see the offensive spans in the input text. Once processed by the system, any offensive spans in the text are displayed in red. Figure 2 shows several screenshots of the UI. It illustrates an English example, with texts taken from the Civil Comments dataset (Borkan et al., 2019), conducted with the en-large model. To show that the MUDES framework also works on low-resource languages, Figure 2 additionally displays an example in Tamil.

System Efficiency
The time taken to predict the offensive spans of a text is critical in an online system developed for real-time use. Therefore, we evaluated the time MUDES takes to predict the offensive spans in 100 texts, for all released models, in a CPU and a GPU environment. The results show that, on average, the large models take around 3 seconds per sentence on a CPU and around 1 second per sentence on a GPU, while the base models take approximately one third of that time in both environments. From these results it is clear that MUDES is capable of predicting toxic spans efficiently in any environment. We used a batch size of one in order to mimic the real-world scenario. The full set of results and the full specifications of the CPU and GPU environments are listed in the Appendix.

Appendix

i Training Configurations

We used an Nvidia Tesla K80 GPU to train the models. We divided the dataset into a training set and a validation set using a 0.8:0.2 split. We fine-tuned the learning rate and number of epochs manually to obtain the best results on the validation set, obtaining 1e-5 as the best learning rate and 3 as the best number of epochs for all languages. We performed early stopping if the validation loss did not improve over 10 evaluation steps. Training the large models took around 30 minutes, while training the base models took around 10 minutes. In addition to the learning rate and number of epochs, we used the parameter values listed in Table 4, which we kept constant.

Conclusion

In this paper we presented MUDES, a multilingual framework for detecting offensive spans in texts. MUDES ships with four pre-trained models, a Python API for developers, and a user-friendly web-based UI; its models outperform the baselines on the SemEval-2021 Toxic Spans Detection dataset and generalise to off-domain English, Danish, and Greek data.

Table 4: Parameter values kept constant during training.

Parameter                     Value
adam epsilon                  1e-8
warmup ratio                  0.1
warmup steps                  0
max grad norm                 1.0
max seq. length               140
gradient accumulation steps   1

ii Hardware Specifications

In Table 5 and Table 6 we list the specifications of the GPU and CPU used for the experiments in this paper. For the training of the MUDES models we mainly used the GPU. For the efficiency experiments mentioned in Section 6.3 we used both GPU and CPU environments.

iii Run time

As expected, the base models perform more efficiently than the large models in both environments. The large models take around 3 seconds per sentence on a CPU and around 1 second per sentence on a GPU.
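The run-time measurement itself is straightforward to reproduce with a batch size of one. The following is a minimal timing sketch; the predictor is a stand-in stub, not an actual MUDES model:

```python
import time

def predict_stub(text):
    """Stand-in for a MUDES model's span prediction (illustrative only)."""
    return [(i, i + 1) for i, ch in enumerate(text) if ch == "!"]

texts = ["This is fucking crazy!!"] * 100

# Batch size of one, mimicking the real-time usage scenario described above.
start = time.perf_counter()
for text in texts:
    predict_stub(text)
elapsed = time.perf_counter() - start

per_text = elapsed / len(texts)
print(f"{per_text * 1000:.3f} ms per text")
```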

Ethics Statement
MUDES is essentially a web-based visualization tool with predictive models trained on multiple publicly available datasets. The authors of this paper used datasets referenced in this paper which were previously collected and annotated; no new data collection has been carried out as part of this work. We have not collected or processed writers'/users' information, nor have we carried out any form of user profiling, thereby protecting users' privacy and identity. We understand that every dataset is subject to intrinsic bias and that computational models will inevitably learn biased information from any dataset. We believe that MUDES will help cope with biases in datasets and models as it features: (1) a freely available Python library that other researchers can use to train new models on other datasets; and (2) a web-based visualization tool that can support efforts to reduce biases in offensive language identification, as it can be used to process and visualize potentially offensive spans in new data. Finally, unlike models trained at the post level, the projected annotation of spans allows users to understand which parts of an instance are considered offensive by the models.