Leveraging Intra-User and Inter-User Representation Learning for Automated Hate Speech Detection

Hate speech detection is a critical yet challenging problem in Natural Language Processing (NLP). Despite numerous studies dedicated to developing NLP approaches to hate speech detection, accuracy remains low. The central problem is that social media posts are short and noisy, and most existing hate speech detection solutions treat each post as an isolated input instance, which is likely to yield high false-positive and false-negative rates. In this paper, we substantially improve automated hate speech detection by presenting a novel model that leverages intra-user and inter-user representation learning for robust hate speech detection on Twitter. In addition to the target Tweet, we collect and analyze the user's historical posts to model intra-user Tweet representations. To suppress the noise in a single Tweet, we also model similar Tweets posted by all other users with reinforced inter-user representation learning techniques. Experimentally, we show that leveraging these two representations can significantly improve the F1 score of a strong bidirectional LSTM baseline model by 10.1%.


Introduction
The rapid rise in user-generated web content has not only yielded a vast increase in information accessibility, but has also given individuals an easy platform on which to share their beliefs and to communicate publicly with others. Unfortunately, this has also led to nefarious uses of online spaces, for instance, the propagation of hate speech.
An extensive body of work has focused on the development of automatic hate speech classifiers. A recent survey outlined eight categories of features used in hate speech detection (Schmidt and Wiegand, 2017): simple surface features (Warner and Hirschberg, 2012; Waseem and Hovy, 2016), word generalization (Warner and Hirschberg, 2012; Zhong et al., 2016), sentiment analysis (Van Hee et al., 2015), lexical resources and linguistic features (Burnap and Williams, 2016), knowledge-based features (Dinakar et al., 2012), meta-information (Waseem and Hovy, 2016), and multi-modal information (Zhong et al., 2016). Closely related to our work is research that leverages user attributes in the classification process, such as a history of participation in hate speech and the usage of profanity (Xiang et al., 2012; Dadvar et al., 2013). Both Xiang et al. (2012) and Dadvar et al. (2013) collect user history to enhance detection accuracy. The former requires the user history to consist of labeled instances, and labeling user history requires significant human effort. The latter models the user with manually selected features. In contrast, our approach leverages unlabeled user history to model the user automatically.

Figure 1: Our hate speech classifier. In contrast to existing methods that focus on a single target Tweet as input (center), we incorporate intra-user (right) and inter-user (left) representations to enhance performance.
In this paper, we focus on augmenting hate speech classification models by first performing representation learning to model user history without supervision. The hypothesis is that, by analyzing a corpus of the user's past Tweets, our system will better understand the language and behavior of the user, leading to better hate speech detection accuracy. Another issue is that a single Tweet is often a noisy input for any machine learning classifier. For example, the Tweet "I'm not sexist but I can not stand women commentators" is actually an instance of hate speech, even though its first half is misleading. To reduce this noise, we also consider semantically similar Tweets posted by other users. To do so, we propose a reinforced bidirectional long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) that interactively leverages similar Tweets from a large Twitter dataset to enhance the performance of the hate speech classifier. An overview of our approach is shown in Figure 1.

Figure 2: Overview of our proposed model. $t$ is the input target Tweet, $z$ denotes intra-user Tweets, and $x_a$ is the selected inter-user Tweet. $r_{ie}$ is the inter-user representation, $r_{ia}$ is the intra-user representation, and $r_{ta}$ is the representation of the target Tweet. These three branches correspond, respectively, to the three branches illustrated in Figure 1. $y_i$ is the prediction at time step $i$, and $s_i$ is the state input for the agent at time step $i$. The computing process is detailed in Section 2.3.

The main contributions of our work are:
• We provide a novel perspective on hate speech detection by modeling intra-user Tweet representations.
• To improve robustness, we leverage similar Tweets from a large unlabeled corpus with reinforced inter-user representations.
• We integrate target Tweets, intra-user representations, and inter-user representations in a unified framework, outperforming strong baselines.

Figure 2 illustrates the architecture of our model. It includes three branches, whose details are described in the following subsections.

Bidirectional LSTM
Given a target Tweet, the baseline approach feeds the embeddings of the Tweet into a bidirectional LSTM network (Hochreiter and Schmidhuber, 1997; Zhou et al., 2016; Liu et al., 2016) to obtain the prediction. This is shown in the middle branch in Figure 1. However, this method is likely to fail when the target Tweet is noisy or when the words critical for the prediction are out of vocabulary.
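To make the baseline branch concrete, the following is a minimal forward-pass sketch of a bi-LSTM classifier in plain Python. It assumes that the Tweet representation is the concatenation of the final forward and backward hidden states, followed by one linear layer and a sigmoid; the text does not specify this pooling choice. The weights and word embeddings are random stand-ins rather than trained parameters, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LSTMCell:
    """One LSTM cell; input, forget, candidate and output gates share one stacked weight matrix."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.W = init_matrix(4 * hidden_dim, input_dim + hidden_dim, rng)
        self.b = [0.0] * (4 * hidden_dim)

    def step(self, x, h, c):
        # x + h concatenates the input vector and the previous hidden state
        z = [u + bias for u, bias in zip(matvec(self.W, x + h), self.b)]
        H = self.hidden_dim
        i = [sigmoid(v) for v in z[:H]]
        f = [sigmoid(v) for v in z[H:2 * H]]
        g = [math.tanh(v) for v in z[2 * H:3 * H]]
        o = [sigmoid(v) for v in z[3 * H:]]
        c_new = [fg * cc + ig * gg for fg, cc, ig, gg in zip(f, c, i, g)]
        h_new = [og * math.tanh(cc) for og, cc in zip(o, c_new)]
        return h_new, c_new


def bilstm_encode(embeddings, fwd, bwd):
    """Read the Tweet left-to-right and right-to-left; concatenate the final hidden states."""
    H = fwd.hidden_dim
    h, c = [0.0] * H, [0.0] * H
    for x in embeddings:
        h, c = fwd.step(x, h, c)
    hb, cb = [0.0] * H, [0.0] * H
    for x in reversed(embeddings):
        hb, cb = bwd.step(x, hb, cb)
    return h + hb


def predict(embeddings, fwd, bwd, w_out, b_out):
    """Baseline branch: bi-LSTM output, then a linear layer and a sigmoid."""
    o_ta = bilstm_encode(embeddings, fwd, bwd)
    return sigmoid(sum(w * v for w, v in zip(w_out, o_ta)) + b_out)


# Usage with random weights and random stand-in word embeddings.
rng = random.Random(0)
emb_dim, hid = 8, 4
fwd, bwd = LSTMCell(emb_dim, hid, rng), LSTMCell(emb_dim, hid, rng)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * hid)]
tweet = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(5)]  # 5 tokens
p_hate = predict(tweet, fwd, bwd, w_out, 0.0)
```

A deep learning framework would be used for actual training; this version only traces the data flow from word embeddings to a hate-speech probability.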

Intra-User Representation
The baseline approach does not fully utilize the available information, such as the user's historical Tweets. In our approach, we collect the user's historical posts through the Twitter API. For a target Tweet $t$, suppose we collect $m$ Tweets posted by this user: $Z_t = \{z_1, z_2, \ldots, z_m\}$. These intra-user Tweets are fed into a pre-trained model to obtain an intra-user representation. The pre-trained model has the same structure as the baseline model. This is shown in the right branch in Figures 1 and 2. The intra-user representation is then combined with the baseline branch for the final prediction. The computation process is:
where $f_{ta}$ is the bi-LSTM of the baseline branch, $o_{ta}$ is the output of that bi-LSTM, and $l_{ta}$ and $l_{ia}$ are linear functions. Similarly, $f_{ia}$ is the bi-LSTM of the intra-user branch and $o_{ia}$ is its output. $r_{ta}$ is the output prediction of the baseline branch, $r_{ia}$ is the intra-user representation, and $\sigma$ is the nonlinear activation function.
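The intra-user branch and its combination with the baseline can be sketched as follows, under stated assumptions: the $m$ historical Tweets are encoded independently by the frozen pre-trained encoder and mean-pooled into $o_{ia}$ (the text does not say how the Tweets in $Z_t$ are aggregated), and $r_{ta}$ and $r_{ia}$ are combined by concatenation, a linear layer $l_c$, and a sigmoid as in Equation (7). `toy_encode` and its word vectors are hypothetical placeholders for the pre-trained bi-LSTM $f_{ia}$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(W, b, v):
    """Affine map W v + b; stands in for the linear functions l_ia and l_c."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def mean_pool(vectors):
    """Element-wise average of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def intra_user_representation(history, encode_tweet, W_ia, b_ia):
    # o_ia: encode each of the m historical Tweets with the frozen pre-trained
    # encoder, then average them (the pooling step is an assumption);
    # r_ia = l_ia(o_ia)
    o_ia = mean_pool([encode_tweet(z) for z in history])
    return linear(W_ia, b_ia, o_ia)


def raw_prediction(r_ta, r_ia, w_c, b_c):
    # Equation (7): y'(t) = sigma(l_c(r_ta(t) ⊕ r_ia(t))), with ⊕ as list concatenation
    return sigmoid(linear([w_c], [b_c], r_ta + r_ia)[0])


# Hypothetical toy encoder standing in for the pre-trained bi-LSTM f_ia.
TOY_VECTORS = {"women": [0.9, 0.1], "game": [0.1, 0.8]}


def toy_encode(tweet):
    return mean_pool([TOY_VECTORS.get(w, [0.0, 0.0]) for w in tweet.lower().split()])


history = ["women commentators", "great game tonight"]  # Z_t with m = 2
r_ia = intra_user_representation(history, toy_encode, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
y_raw = raw_prediction([0.5], r_ia, w_c=[1.0, 1.0, 1.0], b_c=0.0)
```

Because the intra-user encoder is pre-trained and its outputs can be computed once per target Tweet, this branch adds little cost at training time, which matches step 3 of Algorithm 1.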

Inter-User Representation
In addition to the user history, Tweets that are semantically similar to the target Tweet can also be used to suppress noise in the target Tweet. We collect similar Tweets from a large unlabeled Tweet set $U$ using Locality Sensitive Hashing (LSH) (Indyk and Motwani, 1998; Gionis et al., 1999). Since the space of all Tweets is enormous, we use LSH to reduce the search space efficiently. For each target Tweet $t$, we use LSH to collect the $n$ nearest neighbors of $t$ in $U$: $x_1, x_2, \ldots, x_n$. These $n$ Tweets form the inter-user Tweet set for $t$: $X_t = \{x_1, x_2, \ldots, x_n\}$.
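A sketch of how the candidate set $X_t$ could be retrieved. The text does not say which LSH family or Tweet features are used; this example assumes random-hyperplane LSH (which approximates cosine similarity) over bag-of-words counts, with several hash tables, and re-ranks the bucket collisions by exact cosine similarity. The class name and the `n_bits`/`n_tables` parameters are illustrative, not taken from the paper.

```python
import math
import random
from collections import Counter, defaultdict


def bag_of_words(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HyperplaneLSH:
    """Random-hyperplane LSH for cosine similarity over sparse bag-of-words vectors."""

    def __init__(self, n_bits=8, n_tables=4, seed=0):
        self.rng = random.Random(seed)
        self.n_bits = n_bits
        # planes[table][bit] maps a word to its lazily sampled Gaussian coordinate
        self.planes = [[{} for _ in range(n_bits)] for _ in range(n_tables)]
        self.buckets = [defaultdict(list) for _ in range(n_tables)]
        self.corpus = []

    def _coord(self, table, bit, word):
        plane = self.planes[table][bit]
        if word not in plane:
            plane[word] = self.rng.gauss(0.0, 1.0)
        return plane[word]

    def _signature(self, table, vec):
        # one bit per hyperplane: which side of the plane the vector falls on
        sig = 0
        for bit in range(self.n_bits):
            side = sum(c * self._coord(table, bit, w) for w, c in vec.items())
            sig = (sig << 1) | int(side >= 0)
        return sig

    def index(self, texts):
        for text in texts:
            doc_id = len(self.corpus)
            vec = bag_of_words(text)
            self.corpus.append((text, vec))
            for table in range(len(self.buckets)):
                self.buckets[table][self._signature(table, vec)].append(doc_id)

    def query(self, text, n):
        """Return up to n indexed Tweets sharing a bucket with `text`, ranked by cosine."""
        vec = bag_of_words(text)
        candidates = set()
        for table in range(len(self.buckets)):
            candidates.update(self.buckets[table].get(self._signature(table, vec), []))
        ranked = sorted(candidates, key=lambda i: -cosine(vec, self.corpus[i][1]))
        return [self.corpus[i][0] for i in ranked[:n]]


U = ["i can not stand women commentators", "women commentators ruin every match",
     "great game tonight", "what a goal in the final minute"]
lsh = HyperplaneLSH(seed=42)
lsh.index(U)
X_t = lsh.query("great game tonight", n=2)
```

More bits per table make buckets purer but smaller, and more tables raise recall; a query can return fewer than $n$ Tweets when few collisions occur.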
Because of the size of this set, a deep reinforcement learning agent based on policy gradients is trained to interactively fetch inter-user Tweets from $X_t$. The policy network consists of two layers, as shown in the middle part of Figure 2, and is trained with the REINFORCE algorithm (Williams, 1992). At each time step $i$, the action of the agent is to select one Tweet $x_a$ from $X_t$. $x_a$ is then fed into a bi-LSTM followed by a linear layer. The result is combined with the intra-user representation and the baseline prediction (the right and middle branches in Figures 1 and 2) to obtain the prediction at time step $i$. At each time step, the bi-LSTM layer that encodes the selected inter-user Tweet is initialized with its output hidden state from the previous time step. The number of time steps for each target Tweet is set to a fixed number $T$, so that the agent terminates after $T$ fetches. The final prediction occurs at the last time step.

Algorithm 1 Training Algorithm
1: for $t$ in training set do
2:   collect $X_t$ and $Z_t$;
3:   compute intra-user representation $r_{ia}(t)$;
4: end for
5: initialize parameters $\theta_p$ of the policy network;
6: initialize parameters $\theta_e$ of the other nets;
7: for epoch = 1, E do
8:   for $t$ in training set do
9:     compute $o_{ta}(t)$, $r_{ta}(t)$;
10:    compute the raw prediction $y'(t)$;
11:    compute $b(X_t)$;
12:    $x_a = t$;
13:    compute $o_{ie}(t)$;
14:    initialize the state $s(t)_0$;
15:    for …
         …
       update $\theta_e$ on the loss $L(\theta_e) = e(y(t)_T, y^*)$;
24:  end for
25: end for

The computation is shown by the following equations.
$$y'(t) = \sigma\big(l_c(r_{ta}(t) \oplus r_{ia}(t))\big) \tag{7}$$

where $x_b$ is the selected inter-user Tweet at time step $i-1$, $f_{ie}$ is the bi-LSTM of the inter-user branch, and $o_{ie}$ and $h_{ie}$ are its output and hidden state. $l_c$ is a linear function, and $r_{ie}$ is the inter-user representation. $y'$ is the prediction made without the inter-user branch, and $y$ is the prediction made with the inter-user branch. The symbol $\oplus$ denotes concatenation, and the subscript $i$ denotes time step $i$. The state of the agent at each time step is the concatenation of the encoded inter-user Tweets, the outputs of the bi-LSTMs in the inter-user and baseline branches, and the intra-user representation from the intra-user branch (the dotted arrows in Figure 2). Each inter-user Tweet $x_j$ in $X_t$ is encoded by the bi-LSTM of the inter-user branch (the dotted arrow through the bi-LSTM of the inter-user branch).
$$b(x_j) = f_{ie}(x_j, 0)$$

Here $b$ is the output of the bi-LSTM of the inter-user branch; we use $b$ to distinguish it from $o_{ie}$ above. $s(t)_i[j]$ is the $j$th row of the state at time step $i$. With reinforcement learning, the state of the agent is updated after each fetch of an inter-user Tweet, so the agent can interactively make selections and update the inter-user representation step by step. The reward $v_i$ for the agent at time step $i$ is based on the original prediction without the agent and the prediction at the last time step with the agent. The computation is shown as:
where $e$ is the loss function, $q(t)$ is the basic reward, and $v(t)_i$ is the modified reward at time step $i$. $\alpha$ is a positive number used to amplify the reward when the original classification is incorrect. The intuition behind this reward is to train the agent to correct misclassified Tweets. When the original prediction and the last prediction are both correct, the reward is set to 0 so that the agent focuses on the misclassified instances. The complete training process is shown in Algorithm 1. Before training, the intra-user and inter-user Tweets are collected for each target Tweet. Next, the intra-user representations are computed, followed by the computations that initialize the environment and the state of the agent. Then the agent's actions, state updates, predictions, and rewards are computed. Finally, the parameters are updated.
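The reward and policy-gradient update can be sketched as below. Several pieces are assumptions, since the reward equations are not reproduced above: the basic reward $q(t)$ is taken to be the reduction in binary cross-entropy from $y'(t)$ to $y(t)_T$, it is multiplied by $\alpha$ when $y'(t)$ is wrong, it is 0 when both predictions are correct, and the same $v$ is applied at every time step. The policy is simplified to a single linear scorer per state row followed by a softmax, rather than the two-layer network described earlier, and the state is held fixed during the toy episode instead of being rebuilt after each fetch. All function names are hypothetical.

```python
import math
import random


def bce(p, y, eps=1e-12):
    """Binary cross-entropy; plays the role of the loss function e."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def shaped_reward(y_raw, y_last, y_true, alpha=2.0, threshold=0.5):
    """Assumed reward: loss reduction q, scaled by alpha if y' was wrong, 0 if both correct."""
    raw_ok = (y_raw >= threshold) == (y_true == 1)
    last_ok = (y_last >= threshold) == (y_true == 1)
    if raw_ok and last_ok:
        return 0.0
    q = bce(y_raw, y_true) - bce(y_last, y_true)
    return alpha * q if not raw_ok else q


def build_state(encoded_candidates, o_ie, o_ta, r_ia):
    """Row j of the state: b(x_j) ⊕ o_ie ⊕ o_ta ⊕ r_ia (the row layout is an assumption)."""
    return [b + o_ie + o_ta + r_ia for b in encoded_candidates]


def policy_probs(theta, state):
    """Linear score per row, then a numerically stable softmax over candidate Tweets."""
    scores = [sum(w * x for w, x in zip(theta, row)) for row in state]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(theta, state, rng):
    return rng.choices(range(len(state)), weights=policy_probs(theta, state))[0]


def reinforce_update(theta, trajectory, reward, lr=0.1):
    """REINFORCE: theta += lr * reward * sum_i grad log pi(a_i | s_i).

    For a linear-softmax policy, grad log pi(a|s) = s[a] - sum_j pi_j * s[j].
    """
    grad = [0.0] * len(theta)
    for state, action in trajectory:
        probs = policy_probs(theta, state)
        for k in range(len(theta)):
            expected = sum(p * row[k] for p, row in zip(probs, state))
            grad[k] += state[action][k] - expected
    return [w + lr * reward * g for w, g in zip(theta, grad)]


# One toy episode with T = 3 fetches over |X_t| = 3 encoded candidates.
rng = random.Random(0)
state = build_state([[0.2, 0.1], [0.9, 0.4], [0.1, 0.7]], o_ie=[0.3], o_ta=[0.5], r_ia=[0.2])
theta = [0.0] * len(state[0])
trajectory = [(state, sample_action(theta, state, rng)) for _ in range(3)]
v = shaped_reward(y_raw=0.4, y_last=0.8, y_true=1)  # y' wrong, final prediction right
theta = reinforce_update(theta, trajectory, v)
```

Because the shared parts of each row ($o_{ie}$, $o_{ta}$, $r_{ia}$) are identical across candidates, a purely linear scorer cannot exploit them, whereas a nonlinear two-layer policy can. Replacing `sample_action` with a uniform draw over the candidates corresponds to the random-selection setting used in the experiments.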

Experimental Settings
Dataset: We use the dataset published by Waseem and Hovy (2016), which contains 16,907 Tweets. The original dataset contains only the Tweet ID and the label of each Tweet; we expand it with the user ID and the Tweet text. After deleting the Tweets that are no longer accessible, the dataset we use contains 15,781 Tweets from 1,808 users. The published dataset has three labels: racism, sexism, and none. Since we consider a binary classification setting, we take the union of the first two sets. In the final dataset, 67% of the Tweets are labeled as non-hate speech and 33% as hate speech. 1000 Tweets are randomly selected for …

The inter-user Tweet set is collected from the dataset via Locality Sensitive Hashing (LSH). In our experiments, we use a set size of 50, 100, or 200 Tweets. At each time step, one Tweet is selected from the inter-user Tweet set by the policy agent. We also experimented with a second setting, in which we replace the agent with random selection: at each time step, an inter-user Tweet is randomly selected from $X_t$ and fed into the inter-user branch.
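The label preprocessing described under Dataset can be sketched as follows, assuming each record is a (Tweet ID, label, text) triple whose text is None when the Tweet could not be re-fetched. The record format and toy records are illustrative only and are not drawn from the dataset.

```python
from collections import Counter

HATE_LABELS = {"racism", "sexism"}  # unioned into a single hate-speech class


def binarize(rows):
    """Drop inaccessible Tweets (text is None) and map labels to 1 = hate, 0 = none."""
    return [(tweet_id, 1 if label in HATE_LABELS else 0, text)
            for tweet_id, label, text in rows if text is not None]


def class_proportions(rows):
    """Fraction of records in each binary class."""
    counts = Counter(label for _, label, _ in rows)
    return {cls: counts[cls] / len(rows) for cls in (0, 1)}


toy = [(1, "none", "text a"), (2, "racism", "text b"),
       (3, "sexism", None), (4, "none", "text c")]
binary = binarize(toy)
proportions = class_proportions(binary)
```

Applied to the full expanded dataset, `class_proportions` would report the 67%/33% split stated above.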

Results
We compare the above settings with six classification models: Support Vector Machine (SVM) (Suykens and Vandewalle, 1999), logistic regression, attentional bi-LSTM, two CNN models by Kim (2014), and an n-gram model (Waseem and Hovy, 2016). We evaluate these models on three metrics: precision, recall, and F1 score. The results are shown in Table 1. We report results for an inter-user set size of $n = |X_t| = 100$ in Table 1, as the results with sizes 50 and 200 are similar. We find that leveraging the intra-user information helps reduce false positives. The performance is further improved when our model is integrated with inter-user similarity learning. Our results show that selection by the policy gradient agent is slightly better than random selection, and we hypothesize that the effect would be more salient with a larger unlabeled dataset. McNemar's test shows that our model's predictions are significantly better (p < 0.01) than those of the baseline bi-LSTM and the attentional bi-LSTM.
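For reference, the three metrics and McNemar's test can be computed from paired predictions as sketched below, with hate speech as the positive class. The text does not state which McNemar variant was used; this sketch assumes the chi-square version with continuity correction, whose p-value for one degree of freedom equals erfc(sqrt(stat / 2)). The toy predictions are illustrative, not results from the paper.

```python
import math


def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mcnemar(y_true, pred_a, pred_b):
    """Chi-square McNemar test with continuity correction on paired predictions."""
    # b: model A right and model B wrong; c: model A wrong and model B right
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # survival function of chi-square, 1 dof
    return stat, p_value


prf = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
stat, p_value = mcnemar([1] * 10, [1] * 10, [0] * 10)
```

Only the discordant pairs ($b$ and $c$) enter the statistic, which is why McNemar's test suits comparing two classifiers evaluated on the same test Tweets.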

Error Analysis
We observe two types of hate speech that are misclassified. The first type contains rare words and abbreviations, e.g., "FK YOU KAT AND ANDRE! #mkr". Such intentional misspellings and abbreviations are highly varied, making it difficult for the model to learn their meaning. The second type is satire or metaphor, e.g., "Congratulations Kat. Reckon you may have the whole viewer population against you now #mkr". Satire and metaphor are extremely difficult to recognize. In both cases, the baseline branch and the inter-user branch can be unreliable.

Conclusion
In this work, we propose a novel method for hate speech detection. We use a bi-LSTM as the baseline method; however, our framework can easily augment other baseline methods with intra-user and reinforced inter-user representations. In addition to detecting potential hate speech, our method can be applied to help detect suspicious social media accounts. Considering the relationship between online hate speech and real-life hate actions, our solution has the potential to help analyze real-life extremists and hate groups. Furthermore, intra-user and inter-user representation learning can be generalized to other text classification tasks in which either user history or a large collection of unlabeled data is available.