YNU_DYX at SemEval-2019 Task 5: A Stacked BiGRU Model Based on Capsule Network in Detection of Hate

This paper describes our system designed for SemEval 2019 Task 5, "Shared Task on Multilingual Detection of Hate". We participate only in subtask A in English. To address this task, we present a stacked BiGRU model based on a capsule network. In order to convert the tweets into vector representations that can be fed into the neural network, we use the fastText tool to obtain word representations. The sentence representation is then enriched by stacked Bidirectional Gated Recurrent Units (BiGRUs) and used as the input of the capsule network. Our system achieves an average F1-score of 0.546 and ranks 3rd in subtask A in English.


Introduction
... attitudes toward life and discuss current issues (Pak and Paroubek, 2010). Much of the content is related to people's feelings, so many researchers have begun to conduct sentiment analysis on tweets. SemEval 2019 Task 5 is to detect hate speech in tweets. Task A is a binary classification task that predicts whether English or Spanish tweets aimed at specific targets (women or immigrants) are hateful or not hateful (Basile et al., 2019). Many current studies use tweets as a corpus for natural language processing (NLP). Traditional machine learning methods for text classification mainly include Support Vector Machines (SVMs) (Gunn et al., 1998), Naive Bayes (McCallum et al., 1998) and Random Forests (Cutler et al., 2007). In recent years, deep neural networks have become mainstream in NLP, such as Convolutional Neural Networks (CNNs) for sentence classification (Kim, 2014) and Recurrent Neural Networks (RNNs) (Graves et al., 2013).
This task aims to predict whether the tweet for each ID is hate speech about women or immigrants. Our system implements stacked Bidirectional Gated Recurrent Units (BiGRUs) (Cho et al., 2014) combined with a capsule network. The vector representations of words are obtained with fastText. The classification result is produced by the output of a fully connected layer. The rest of this paper is organized as follows: data processing and analysis are discussed in Section 2. Section 3 provides the details of the proposed model. Experiments and results are described in Section 4. Finally, we draw conclusions in Section 5.

Data Processing
This section describes the experimental data and the data processing for SemEval 2019 Task 5 subtask A in English.

Experimental Data
This is a binary classification task on hate speech about immigrants or women. The task organizers provide a training set, a development set and a test set.

Processing Data
We apply a series of standard preprocessing steps to the datasets.
• All punctuation marks are removed.
• All characters are converted to lowercase.
• All hyperlinks are replaced by "url".
• All numbers are replaced by "number".
• All contractions are normalized, for example "shouldn't" is replaced with "should not" and "doesn't" with "does not".
• All @-mentions of specific user names are replaced with "username", for example "@PdxPatriot1" is replaced with "username".
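The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in our system: the contraction table is a hypothetical two-entry example, the regular expressions are our own choice, and the order of operations (URLs and user names are replaced before punctuation is stripped, so that "@" and "://" are still visible to the patterns) is an assumption, since the list does not fix an order.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny contraction table (the paper does not list its own).
CONTRACTIONS = {
    "shouldn't": "should not",
    "doesn't": "does not",
}

# Assumed patterns; real tweets may need more careful rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(tweet: str) -> List[str]:
    """Apply the listed normalization steps and return a list of tokens."""
    text = tweet.lower()                      # lowercase everything
    text = URL_RE.sub(" url ", text)          # hyperlinks -> "url"
    text = USER_RE.sub(" username ", text)    # @mentions -> "username"
    for short, full in CONTRACTIONS.items():  # expand contractions
        text = text.replace(short, full)
    text = NUMBER_RE.sub(" number ", text)    # numbers -> "number"
    text = text.translate(PUNCT_TABLE)        # remaining punctuation -> space
    return text.split()


tokens = preprocess("@PdxPatriot1 You shouldn't pay 100 dollars! https://t.co/xYz")
print(tokens)
```

Replacing user names before numbers keeps the digit in "@PdxPatriot1" from being turned into a separate "number" token.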
We also fix the length of the sentences fed into the model. If the length is too large, the training time increases. If it is too small, information is lost. We therefore choose twice the average sentence length, which is 45, as the final input length, so that little information is lost and the computation time stays moderate. The training set, the development set and the test set contain 473, 122, and 102 sentences longer than 45, respectively, and the maximum sentence length is 65.
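A minimal sketch of the length normalization described above follows. The value 45 comes from the text; the padding token name and the choice to pad and truncate at the end of the sentence are assumptions made for illustration (some toolkits pad or truncate at the beginning by default).

```python
MAX_LEN = 45     # twice the average sentence length, as stated in the text
PAD = "<pad>"    # hypothetical padding token


def pad_or_truncate(tokens, max_len=MAX_LEN, pad=PAD):
    """Cut sentences longer than max_len and right-pad shorter ones."""
    clipped = tokens[:max_len]
    return clipped + [pad] * (max_len - len(clipped))


short = pad_or_truncate(["username", "you", "should", "not"])
long_ = pad_or_truncate(["word"] * 65)   # 65 is the maximum length reported above
print(len(short), len(long_))
```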

System Description
Our system can be roughly divided into two parts: the vector-space representation of the words and the learning of the tweet content by the capsule network. We first map the words into low-dimensional vectors, then feed the sentence matrices composed of these word vectors into a capsule network to learn the sentence features, and finally classify the text of the test set with a softmax function.
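To make the final classification step concrete, the following sketch shows a numerically stable softmax over two output scores and the resulting label. The scores are made-up numbers, and the convention that index 1 means "hateful" is an assumption for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the classes (not hateful, hateful).
probs = softmax([0.3, 1.2])
label = max(range(len(probs)), key=probs.__getitem__)
print(probs, label)
```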

Word Representation
Representing a word with a low-dimensional vector is currently the most common approach in natural language processing. The fastText (Joulin et al., 2017) tool is used in our system to obtain the word representations of the sentences. In fastText, a low-dimensional vector is associated with each word, and the hidden representations are shared among the classifiers for different classes, so that textual information learned for one class can be reused for the others. fastText is therefore an efficient word-based vectorization model for text classification. Our system uses the pre-trained fastText embeddings (see footnote 2).
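As an illustration of how pre-trained vectors can be turned into model input, the sketch below parses the plain-text `.vec` layout (a header line with the vocabulary size and dimension, followed by one word and its values per line) and maps tokens to vectors. The three-dimensional sample data is invented, the function names are hypothetical, and mapping unknown words to a zero vector is an assumption; the paper does not say how out-of-vocabulary words are handled.

```python
import io


def load_vec(lines, vocab):
    """Read word vectors in .vec text format, keeping only words in vocab."""
    it = iter(lines)
    _n_words, dim = map(int, next(it).split())   # header: "<count> <dim>"
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if parts[0] in vocab:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def embed(tokens, vectors, dim):
    """Map tokens to vectors; unknown tokens become zero vectors (assumption)."""
    zero = [0.0] * dim
    return [vectors.get(tok, zero) for tok in tokens]


# Tiny invented sample standing in for the real file.
sample = io.StringIO("3 3\nhate 0.1 0.2 0.3\nlove -0.1 0.0 0.4\nurl 0.5 0.5 0.5\n")
dim, vectors = load_vec(sample, vocab={"hate", "url"})
matrix = embed(["hate", "unknownword", "url"], vectors, dim)
print(matrix)
```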

Model Description
In order to enrich the word vector representation of the text, we use stacked Bidirectional Gated Recurrent Units (BiGRUs) (Cho et al., 2014). The output of the BiGRU is then used as the input to the capsule network (Sabour et al., 2017). The final result is obtained by the softmax activation function in the fully connected layer. The model architecture is shown in Figure 1.
Targeted Dropout Layer: Dropout activates only a subset of neurons in each forward pass, which introduces sparsity during training. This encourages the neural network to learn a representation that is robust to sparsification, that is, to the random deletion of a set of neurons. Targeted Dropout (Gomez et al., 2018) ranks weights or neurons by a fast approximate measure of their importance and applies Dropout only to the elements of lower importance. This encourages the network to concentrate on the more important weights or neurons. In other words, the network learns to be robust to our choice of post hoc pruning strategy. It is also easy to implement with Keras (see footnote 3).
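The idea behind Targeted Dropout can be sketched on a flat list of weights. This is a simplified illustration and not the Keras layer used in our system: it assumes weight magnitude as the importance measure, works on a plain Python list rather than on a weight matrix, and the function name is hypothetical. The target rate selects the fraction of least important weights, and the drop rate is the probability with which each selected weight is zeroed.

```python
import random


def targeted_dropout(weights, target_rate, drop_rate, rng=random):
    """Zero low-magnitude weights with probability drop_rate.

    Only the target_rate fraction of weights with the smallest
    absolute value are candidates; all others are always kept.
    """
    n_target = int(len(weights) * target_rate)
    by_importance = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    out = list(weights)
    for i in by_importance[:n_target]:        # least important weights only
        if rng.random() < drop_rate:
            out[i] = 0.0
    return out


weights = [0.9, -0.1, 0.5, 0.05]
# With drop_rate=1.0 every candidate is dropped, so the result is deterministic.
print(targeted_dropout(weights, target_rate=0.5, drop_rate=1.0))
```

With both rates set to 0.55, as in the hyperparameter settings of this paper, roughly the lowest 55% of the weights would be candidates and each would be dropped with probability 0.55 in a given training step.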
Stacked BiGRU: To obtain more fine-grained sentence information, we use stacked Bidirectional Gated Recurrent Units (BiGRUs) to encode the sentence. "Stacked" here means that two BiGRU layers are placed on top of each other. The information in a sentence is directional. A forward GRU can only pass information from the beginning of the sentence to the end and cannot encode information from the end to the beginning. A BiGRU captures semantic dependencies in both directions.
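For reference, one common formulation of the GRU update and of the bidirectional output is given below. The symbols $W$, $U$, $b$, $z_t$, $r_t$ are standard notation introduced here for illustration and do not appear elsewhere in this paper; some references swap the roles of $z_t$ and $1 - z_t$ in the last update, and we assume that the two directions are concatenated.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
h_t^{\mathrm{bi}} &= \bigl[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\bigr]
\end{aligned}
```

In the stacked setting, the sequence $h_1^{\mathrm{bi}}, \dots, h_T^{\mathrm{bi}}$ produced by the first BiGRU layer serves as the input sequence $x_1, \dots, x_T$ of the second layer.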
Capsule Layer: The capsule network (Sabour et al., 2017) replaces the single neuron node of a traditional neural network with a vector of neurons and is trained with dynamic routing, which mitigates the low efficiency and spatial insensitivity of CNN models. The capsule network is connected in the same way as a fully connected network: each capsule in the previous layer is connected to each capsule in the next layer, and each connection is weighted. The difference is that each connection in the capsule network also carries a coupling coefficient. The coupling coefficient is determined by the iterative dynamic routing process.
2 https://fasttext.cc/docs/en/english-vectors.html
3 https://pypi.org/project/keras-targeted-dropout/
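The dynamic routing mentioned in the Capsule Layer paragraph can be sketched in plain Python following the routing-by-agreement procedure of Sabour et al. (2017). This is a didactic sketch, not our Keras layer: it assumes the prediction vectors `u_hat` (input capsule outputs already multiplied by the connection weights) are given, and the toy numbers are invented.

```python
import math


def squash(s):
    """Shrink a vector to length in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, routings=5):
    """u_hat[i][j] is the prediction of input capsule i for output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]      # routing logits
    v = []
    for r in range(routings):
        c = [softmax(row) for row in b]           # coupling coefficients
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        if r < routings - 1:                      # agreement update
            for i in range(n_in):
                for j in range(n_out):
                    b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


# Toy example: 3 input capsules, 2 output capsules, dimension 2.
u_hat = [
    [[1.0, 0.0], [0.1, 0.1]],
    [[0.9, 0.1], [0.0, 0.2]],
    [[1.1, -0.1], [0.1, 0.0]],
]
out = dynamic_routing(u_hat, routings=5)
print(out)
```

In our setting the output layer would hold 10 capsules of dimension 32 with 5 routing iterations, as listed in the hyperparameter section.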

Evaluation
To evaluate the performance of the classification system, we use standard evaluation metrics: accuracy, precision, recall, and F1-score. In this task we use the F1-score to measure the performance of the proposed method. Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total number of observations. Precision is the ratio of correctly predicted positive observations to the total number of predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The F1-score is the harmonic mean of precision and recall, so both contribute equally. The formula for the F1-score is defined as:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
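The metrics defined above can be computed as in the following sketch. The label vectors are invented, the function names are hypothetical, and we assume that the "average" F1-score reported in this paper is the macro average over the two classes, which the text does not state explicitly.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1-scores (assumed averaging)."""
    scores = [precision_recall_f1(y_true, y_pred, c)[2] for c in labels]
    return sum(scores) / len(scores)


# Invented gold labels and predictions (1 = hateful, 0 = not hateful).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```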

Hyperparameter
The Targeted Dropout layer has two parameters, the drop rate and the target rate. In this system, both are set to 0.55. For the stacked BiGRU, both the first and the second BiGRU layer have 64 units.
The parameters of the capsule layer are set as follows: the number of routing iterations is 5, the number of capsules is 10 and the capsule dimension is 32.
Finally, at the output of the fully connected layer, we add two regularizers, a kernel regularizer and an activity regularizer. The kernel regularizer uses L2 regularization with a coefficient of 0.001, and the activity regularizer uses L1 regularization, also with a coefficient of 0.001.
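The effect of these two regularizers on the training loss can be illustrated as follows. The coefficients come from the text; the weight matrix and activations are invented, and the function is a hypothetical stand-in for what the framework computes internally (some framework versions additionally normalize the activity term by the batch size).

```python
L2_KERNEL = 0.001     # coefficient of the kernel regularizer, from the text
L1_ACTIVITY = 0.001   # coefficient of the activity regularizer, from the text


def regularization_penalty(kernel, activations, l2=L2_KERNEL, l1=L1_ACTIVITY):
    """Extra loss term: L2 on the layer weights plus L1 on the layer output."""
    kernel_term = l2 * sum(w * w for row in kernel for w in row)
    activity_term = l1 * sum(abs(a) for a in activations)
    return kernel_term + activity_term


kernel = [[1.0, -2.0], [0.5, 0.0]]   # invented 2x2 weight matrix
activations = [0.2, 0.8]             # invented layer output
print(regularization_penalty(kernel, activations))
```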
Multi-class classification problems usually use categorical cross-entropy as the loss function. Since this task is a binary classification, our system uses binary cross-entropy instead.
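For clarity, the binary cross-entropy loss can be written out as below. This is a plain-Python sketch of the standard definition, not the framework implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)    # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Invented labels and predicted probabilities of the hateful class.
print(binary_crossentropy([1, 0, 1], [0.9, 0.2, 0.6]))
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalized heavily.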

Experiments and Result Analysis
We conduct several experiments to gain insight into the performance of the proposed model. First, we compare standard Dropout with Targeted Dropout. Table 2 shows that Targeted Dropout performs better than standard Dropout: the average F1-score rises from 0.52 to 0.55, an improvement of roughly 5%.

Table 2: Comparison of Dropout and Targeted Dropout.

Method             Acc    P      R      F1
Dropout            0.53   0.58   0.61   0.52
Targeted Dropout   0.56   0.64   0.60   0.55

To determine the specific parameters of Targeted Dropout, we run a series of comparison experiments. As can be seen from Table 3, the best parameter value is 0.55. This is also the value used in the system we submitted to the competition.
We compare four network architectures combined with a capsule network: LSTM, GRU, BiLSTM and BiGRU. We observe that BiGRU performs better than the other three in this task. Compared to the MFC baseline and the SVC baseline, our method increases the average F1-score by 0.18 and 0.10, respectively, as shown in Table 4.
The values of the MFC baseline and the SVC baseline come from the data published by the organizers (footnote 4). To ensure the fairness of the experiment, the parameters of the capsule network remain unchanged, using the values given in Section 4.2.

Conclusion and Future Work
In this paper, we present a stacked BiGRU model based on a capsule network for the task "Shared Task on Multilingual Detection of Hate". Replacing Dropout with Targeted Dropout improves the results, indicating that Targeted Dropout is effective in this system. We have also conducted several experiments to find the optimal parameters of Targeted Dropout.
The comparative experiments show that BiGRU is the best of the tested models combined with a capsule network. Due to time limits, we did not tune the parameters of the capsule network. In the future, we will first adjust the parameters of the capsule network to optimize the performance of the model. Secondly, we are going to try ensemble methods such as hard voting, soft voting and stacking to find the one that works best for our task. Finally, we would like to explore transfer learning.