TUVD team at SemEval-2019 Task 6: Offense Target Identification

This article presents our approach to detecting the target of offensive messages on Twitter, with the classes Individual, Group and Other. Our model is an ensemble of simpler models: Logistic Regression, Naive Bayes, Support Vector Machine, and an interpolation between Logistic Regression and Naive Bayes with an interpolation coefficient of 0.25. The model achieves a macro F1-score of 0.547.


Introduction
Nowadays aggressive language on social media occurs more and more often. Categories of hate speech can be very diverse and can deal with a wide range of issues such as misogyny, sexual orientation, religion and immigration.
Such types of speech can be found in posts in social networks, in Internet discussions, in comments on various articles and in responses to posts of famous persons. This problem is receiving increasing amounts of attention and researchers are making attempts to build systems capable of recognizing such kinds of aggressive speech, offenses and insults in social networks.
This article presents our approach to hate speech detection, which we used for the challenge SemEval-2019 Task 6: OffensEval - Identifying and Categorizing Offensive Language in Social Media (Zampieri et al., 2019a; Zampieri et al., 2019b).
The task consisted of three sub-tasks and proposed to investigate data extracted from Twitter in order to build a classification system. Sub-task A aimed to identify offensive language, with 860 unlabeled English tweets for testing. A post was considered non-offensive if it did not contain any offense or profanity.
The main goal of Sub-task B was to categorize the offensive posts from Sub-task A (there were 240 English tweets for testing) into two offense types:
-Targeted Insults and Threats, when a post insults or threatens an individual, a group or an organization;
-Untargeted, when a post contains non-targeted profanity and swearing.
Sub-task C focused on offense target identification. For testing, there were 213 English tweets which were marked as offensive in Sub-task A and as Targeted Insult and Threats in Sub-task B. The classification was into three groups:
-Individual, when the target of the offensive post was a person;
-Group, when the target of the offensive message was a group of people considered as a unit;
-Other, when the target of the offensive tweet did not belong to any of the previous categories (e.g., a situation, an event, or another issue).
Datasets were available in English and in Spanish, and our team worked with English only. The training dataset included 13200 tweets, of which 4400 were offensive: 3876 messages were labeled as 'Targeted Insult and Threats' and 524 as 'Untargeted'. We focused our efforts on Sub-task C only; its training dataset consisted of 2407 'individual' offensive posts, 1074 'group' posts and 395 tweets marked as 'other'.
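The class imbalance in the Sub-task C training data is easier to see as proportions. The following is a small illustrative calculation using only the counts reported above; the variable names are our own and are not part of the task materials.

```python
# Sub-task C training counts as reported in the text.
counts = {"individual": 2407, "group": 1074, "other": 395}

# The three classes together make up the targeted offensive tweets.
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<10} {n:>5}  {shares[label]:.1%}")
print(f"{'total':<10} {total:>5}")
```

The total equals the 3876 tweets labeled as targeted, and roughly six out of ten training examples belong to the 'individual' class, while 'other' accounts for about one in ten.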
The paper is organized as follows. Relevant related work in the area is described in Section 2. Section 3 presents the preprocessing we applied to the dataset and the methodology we used to create the model. In Section 4 the results are described and analyzed. In Section 5 we summarize our work and outline some steps for future research.

Related Work
Today there is a lot of promising work in the area of hate speech recognition. As was shown in (Fasoli et al., 2015), offensive language can be very diverse, and the level of offensiveness of a message can depend on the context and on the relationships between the users who take part in the conversation.
For example, insults delivered in a sexual context are less offensive when the conversation is between partners. Some slurs are more offensive in conversations between a superior and a subordinate than in conversations between friends, and some groups of slurs are more acceptable than others.
Expanding on the point that offensive speech is heterogeneous, the work (Clarke and Grieve, 2017) presented results which showed that there is a difference between racist and sexist posts: the sexist messages were more interactive (more personal) and more attitudinal (expressing the author's opinion) than the racist ones. From this article we can conclude that the most frequent linguistic features in offensive language are question marks and question DO (when a sentence starts with the word do).
The work (Saleem et al., 2017) demonstrated that messages may not include slurs, but still be offensive.
As a training dataset, the authors took messages from potentially vulnerable communities (such as groups of Afro-American and plus-size users) and messages from haters of these communities (only messages that did not include slurs), and showed that a hate speech recognition system based on traditional methods like Logistic Regression could detect insulting meaning in posts without slurs.
In addition, this work shows that it is possible to test on a dataset from one source using a training set from another. The authors checked this by using a training dataset from one source and a testing dataset from another source. The results were quite good, which allows us to say that it could be useful to add comments from other social media to our training dataset to improve the predictions.
At the Automated Misogyny Detection (AMI) Shared Tasks IBEREVAL-2018 (Fersini et al., 2018) and EVALITA-2018 (Caselli et al., 2018), some interesting approaches to offensive language detection were presented. The main goal in these challenges was to detect misogynistic tweets and to classify tweets into different groups depending on the misogyny type (stereotypes and objectification, dominance, derailing, sexual harassment and threats of violence, and discredit) and on the target of the insult (the idea of this type of classification was to distinguish misogynous tweets which offend a specific person from tweets which insult a group of people).
In (Pamungkas et al., 2018) it was shown that the results of a model based on Support Vector Machine were quite good, and in (Frenda et al., 2018) an ensemble of models achieved a high level of accuracy. The work (Shushkevich and Cardiff, 2018) presented an ensemble of Logistic Regression, Support Vector Machine and Naive Bayes models which showed quite good results.
It should be added that models based on neural networks show good results in offensive language recognition, as was shown in (Badjatiya et al., 2017), where the authors created a model based on Long Short-Term Memory networks (LSTMs), which use internal memory to capture long-range dependencies in sentences; this could be important for hate speech detection. This approach allowed them to achieve very high results in the detection of sexist and racist tweets in comparison with classifiers such as Logistic Regression, Random Forest, SVMs and Gradient Boosted Decision Trees (GBDTs).

Methodology and Data
As preprocessing steps, we:
-converted the words to lower case;
-used TF-IDF (Term Frequency-Inverse Document Frequency) for the vectorization;
-marked emojis with the word 'EMOJI';
-labeled some combinations of symbols such as '!!!' and '???', because they look like emotional expressions and could be treated as emojis too, and replaced them with the word 'EMOJI'.
Our model is an ensemble of several classic machine learning models:
-The model based on Logistic Regression (LR) (Wright, 1995; Genkin et al., 2007); classifiers of this type apply a logistic function to a linear combination of features extracted from the data.
-The model based on Naive Bayes (NB), whose advantages are that it does not require a large training dataset and that it is fast to compute (Hi and Li, 2007).
-The model that interpolates between LR and NB with an interpolation coefficient of 0.25 as a form of regularization: trust NB unless the LR is very confident. This type of interpolation was shown in (Wang and Manning, 2012), where NB was combined with a Support Vector Machine, but in our case the LR+NB combination worked better.
-The model based on Support Vector Machine (SVM), whose effectiveness for working with texts was described in (Joachims, 2002).
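The text normalization described at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not our exact pipeline: the function name normalize_tweet, the Unicode ranges used to detect emojis, the rule that a "combination of symbols" is a run of three or more '!' or '?', and the order of the steps are all assumptions made for the example. The TF-IDF vectorization is not included.

```python
import re

# Assumption: these Unicode ranges cover most common emojis; the actual
# emoji detection used for the submission is not specified in the text.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Assumption: an emotional "combination of symbols" is a run of three or
# more exclamation or question marks.
PUNCT_RUN_RE = re.compile(r"[!?]{3,}")


def normalize_tweet(text: str) -> str:
    """Lowercase a tweet and mark emojis and emphatic punctuation as 'EMOJI'."""
    text = text.lower()  # lowercase first so the inserted marker stays uppercase
    text = EMOJI_RE.sub(" EMOJI ", text)
    text = PUNCT_RUN_RE.sub(" EMOJI ", text)
    return " ".join(text.split())  # collapse repeated whitespace


print(normalize_tweet("You are SO annoying!!! 😡"))
# prints: you are so annoying EMOJI EMOJI
```

The normalized strings would then be passed to a TF-IDF vectorizer. Note that a vectorizer which lowercases its input by default would turn the marker into 'emoji', which does not change the resulting features as long as it is applied consistently.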
We blended all of the above-described models into one, which assigned a tweet to a class according to the following rule: for each of the three classes, we summed the probabilities predicted by the four models and divided the sum by 4. A post was assigned the class with the highest average probability.
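The blending rule can be illustrated with a short sketch. The function name blend, the class labels and the example probabilities are hypothetical and chosen only for illustration; the sketch also assumes that every model, including the SVM, outputs a probability for each class, and how those probabilities were obtained is not covered here.

```python
from typing import Dict, List

CLASSES = ["IND", "GRP", "OTH"]


def blend(per_model_probs: List[Dict[str, float]]) -> str:
    """Average the class probabilities over all models and return the top class."""
    n_models = len(per_model_probs)
    averaged = {
        label: sum(probs[label] for probs in per_model_probs) / n_models
        for label in CLASSES
    }
    return max(averaged, key=averaged.get)


# Hypothetical outputs of the four models (LR, NB, LR+NB, SVM) for one tweet.
example = [
    {"IND": 0.50, "GRP": 0.40, "OTH": 0.10},
    {"IND": 0.30, "GRP": 0.60, "OTH": 0.10},
    {"IND": 0.45, "GRP": 0.45, "OTH": 0.10},
    {"IND": 0.35, "GRP": 0.55, "OTH": 0.10},
]
print(blend(example))  # prints: GRP
```

In this example only one model prefers IND outright, and the averaged probability of GRP (0.5) exceeds that of IND (0.4), so the blended prediction is GRP.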

Results
The macro F1 results for all five models are presented in Table 1.
As shown, the Blended model achieves the highest score (0.68), so we can conclude that our hypothesis was correct and that an ensemble of models gives the best results for the task of offense target identification. The model which combines Logistic Regression and Naive Bayes also achieves a good result (0.65), and the worst model for this type of classification was the Logistic Regression one.
The results of the challenge are presented in Table 2. Overall Accuracy for the test set was equal to 0.6478 and Macro-F1 was 0.547.
As we can see, the macro F1-score on the test set is 0.133 lower than the macro F1-score obtained on the training dataset (0.547 versus 0.68), and this difference could be connected with the small number of tweets for training. It should also be noted that the classification results correlate closely with the number of testing examples: the IND classifier works better than the GRP one and much better than the OTH classifier, because in the testing dataset there were more data about individual targets of offenses than about group and other targets.
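Macro F1 is the unweighted mean of the per-class F1 scores, so a weak minority class lowers it much more than it lowers accuracy. The sketch below illustrates this effect on hypothetical toy labels, not on the task data; the function name macro_f1 is our own.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy example: the rare OTH class is never predicted.
y_true = ["IND"] * 6 + ["GRP"] * 3 + ["OTH"]
y_pred = ["IND"] * 6 + ["GRP"] * 3 + ["IND"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case a single missed OTH example leaves accuracy at 0.9 but sets the OTH F1 to zero, which pulls the macro average down to about 0.64. The same mechanism is consistent with the gap between the accuracy and the Macro-F1 reported for the test set.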

Conclusion
To sum up, we created an ensemble of models which allowed us to achieve quite good results, being placed 25th out of a total of 65 participants. We showed that the idea of blending simple models based on Logistic Regression, Naive Bayes and Support Vector Machine is promising in the area of hate speech recognition for identifying the target of offensive messages.
As the next steps in our research, we are planning to expand the preprocessing step and to use dictionaries and lists of offensive language, which could help us to achieve better results. We also intend to add additional data to the training datasets.
It is interesting to add that in these datasets all links were replaced with URL and all usernames in tweets were replaced with USER. It could be useful to investigate, for example, the links which were mentioned in offensive messages. It could be possible to expand our dataset in cases when a link was a response to another offensive post, or we could label tweets which have links to blocked content.
In this challenge we faced the problem of an insufficient quantity of tweets to make our classifier work better: for example, for the class Other there were only 395 posts for training. We believe that an increase in the volume of data could make our modeling more effective, and external data sources could be helpful. Also, we intend to experiment with the use of LSTMs.