pkudblab at SemEval-2016 Task 6: A Specific Convolutional Neural Network System for Effective Stance Detection

In this paper, we develop a convolutional neural network for stance detection in tweets. According to the official results, our system ranks 1st on subtask B (among 9 teams) and 2nd on subtask A (among 19 teams) on the Twitter test set of SemEval-2016 Task 6. The main contributions of our work are as follows. We design a "vote scheme" for prediction instead of predicting when the accuracy on the validation set reaches its maximum. Besides, we make some improvements on the specific subtasks. For subtask A, we separate the datasets into five sub-datasets according to their targets, and train and test five separate models. For subtask B, we establish a two-class training dataset from the official domain corpus, and then modify the softmax layer to perform three-class classification. Our system can be easily re-implemented and optimized for other related tasks.


Introduction
There is a growing need for stance detection applications on the internet, but it is impractical for humans to classify massive amounts of tweets. Twitter stance detection aims to automatically determine the emotional tendency of tweets. To classify tweet polarity, mainstream approaches follow Pang et al. (2002), treating the task as a classification problem and using machine learning algorithms to build classifiers from tweets with manually annotated polarity (Jiang et al., 2011; Hu et al., 2013; Dong et al., 2014). In this direction, most studies focus on designing effective features to obtain better classification performance (Pang and Lee, 2008; Liu, 2012; Murakami and Raymond, 2010). For example, Mohammad and Turney (2013) use sentiment lexicons and several manually selected features. To leverage massive numbers of tweets containing positive and negative emoticons for automatic feature learning, Tang proposes to learn sentiment-specific word embeddings. We transfer this method to detect tweet stance.
In this paper, we develop a specific convolutional neural network learning model for stance detection. Firstly, we use word embeddings trained on the Google News dataset as the input of our system. Afterwards, we train the CNN model with the SemEval-2016 Task 6 dataset. Finally, we design a "vote scheme" that uses the softmax results to predict the labels of the test set. We also make some task-specific improvements. For subtask A, we separate the datasets into five sub-datasets, and train and test five separate models. For subtask B, we establish a two-class training dataset from the official domain corpus based on several special expressions. We evaluate our deep learning system on the test set of SemEval-2016 Task 6. Our system ranks 1st on subtask B and 2nd on subtask A. The good performance in the Task 6 evaluation verifies the effectiveness of our model and schemes.

Architecture overview
The architecture of our convolutional neural network is mainly inspired by the architecture proposed by Kim, which performs well and efficiently in sentence classification tasks (Kim, 2014). We base our system on Kim's model because stance detection and sentence classification have much in common when the amount and distribution of the dataset are reasonable. Our architecture is shown in Fig. 1.
In the following, we give a brief introduction to the main components of our network architecture in their connecting order: look-up table, input matrix, convolutional layer, activation function, pooling layer and softmax layer. We also describe the approach used to train this model.

Look-up table
The look-up table is a large word embedding matrix. Each column of the table, which is d-dimensional, corresponds to a word. The word embeddings in the look-up table are pre-trained vectors published by the word2vec team (Mikolov et al., 2013) 1 . These vectors are trained on part of the Google News dataset (about 100 billion words).

Input matrix
An input matrix S, S ∈ R d×|s| , is the representation of an input sentence [w 1 , w 2 , ..., w |s| ], where |s| is the length of the sentence and w i is the corresponding d-dimensional vector found in the look-up table. If a word does not exist in the look-up table, we represent it as a zero vector or as a vector whose components are randomly generated within a given range.

1 https://code.google.com/archive/p/word2vec/
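A minimal NumPy sketch of this construction. The `lookup` dict below is a toy stand-in for the pre-trained word2vec table, and the sampling range ±0.25 for out-of-vocabulary words is an assumed illustration, not a value reported in this paper:

```python
import numpy as np

def sentence_to_matrix(tokens, lookup, d=300, rand_range=0.25, rng=None):
    """Build the d x |s| input matrix S column by column."""
    rng = rng or np.random.default_rng(0)
    cols = []
    for w in tokens:
        if w in lookup:
            cols.append(lookup[w])
        else:
            # out-of-vocabulary word: random vector in [-rand_range, rand_range]
            cols.append(rng.uniform(-rand_range, rand_range, d))
    return np.stack(cols, axis=1)  # shape (d, |s|)

# toy look-up table standing in for the pre-trained word2vec vectors
lookup = {"stance": np.ones(300), "detection": -np.ones(300)}
S = sentence_to_matrix(["stance", "detection", "rocks"], lookup)  # (300, 3)
```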

Convolutional layer
The goal of the convolutional layer is to extract patterns, so that common abstract representations can be found across the dataset. A pattern here means a specific sequence of words in a sentence. Patterns can be extracted by different filter matrices F, which are discriminatively sensitive to different patterns.
More formally, the convolution operation between an input sentence matrix S ∈ R d×|s| and a filter F ∈ R d×m , where m is an assigned width, is defined as follows:

c i = Σ (S [:,i:i+m−1] ⊙ F), 1 ≤ i ≤ |s| − m + 1,

where S [:,i:i+m−1] is a matrix slice of size m along the columns, ⊙ is the element-wise multiplication, and the sum runs over all entries of the resulting d × m matrix. Both S and F have the same number of rows d. As shown in Fig. 1, filter F slides along the column dimension of S, generating the vector c: [c 1 , c 2 , ..., c |s|−m+1 ], named the feature map.
So far we have introduced how to compute a convolution between the input sentence matrix and a single filter. To get a richer representation of the dataset, we apply n filters on every input sentence matrix to compute feature maps matrix C, C ∈ R (|s|−m+1)×n . Note that every input sentence matrix has a corresponding C matrix and every column of matrix C corresponds to a convolution result between a filter and this input sentence matrix.
In practice, we also add a bias vector b ∈ R n to every row of matrix C element-wise to train a more appropriate model.
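The narrow convolution and bias described above can be sketched in NumPy as follows; the shapes and random inputs are illustrative, since the real system trains the filters rather than sampling them:

```python
import numpy as np

def feature_maps(S, filters, b):
    """S: (d, |s|) sentence matrix; filters: (n, d, m); b: (n,) bias.
    Returns C of shape (|s|-m+1, n); column j is filter j's feature map."""
    d, s_len = S.shape
    n, _, m = filters.shape
    C = np.empty((s_len - m + 1, n))
    for j in range(n):
        for i in range(s_len - m + 1):
            # element-wise product of the slice and the filter, then summed
            C[i, j] = np.sum(S[:, i:i + m] * filters[j])
    return C + b  # bias added element-wise to every row

rng = np.random.default_rng(0)
S = rng.standard_normal((300, 20))       # a 20-word sentence, d = 300
F = rng.standard_normal((100, 300, 3))   # 100 filters of width m = 3
C = feature_maps(S, F, np.zeros(100))    # shape (18, 100)
```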

Activation function
To fit non-linear boundaries better, the convolutional layer is in practice always followed by a non-linear activation function f (), applied element-wise on the feature maps matrix C. Among the most popular choices of activation functions, sigmoid, tanh (hyperbolic tangent) and ReLU (rectified linear), we choose ReLU, since it is rather simple and often more efficient 2 .

Pooling layer
The pooling layer simplifies the information in the output of the convolutional layer (passed through the activation function). We adopt the max-pooling method, which selects the maximum value from every column of f (C) (f () being the ReLU operation), to form a condensed representation vector. More formally, after the max-pooling operation, f (C) ∈ R (|s|−m+1)×n → pool(f (C)) ∈ R 1×n , as also shown in Fig. 1.
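A small sketch of the activation and pooling steps on a toy feature-maps matrix (the 2×2 shape is purely illustrative):

```python
import numpy as np

def relu(C):
    """Element-wise rectified linear activation."""
    return np.maximum(C, 0.0)

def max_pool(fC):
    """Column-wise max: (|s|-m+1, n) -> (1, n)."""
    return fC.max(axis=0, keepdims=True)

C = np.array([[-1.0,  2.0],
              [ 3.0, -4.0]])
p = max_pool(relu(C))  # -> [[3.0, 2.0]]
```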

Output layer: softmax
The fully connected softmax layer performs the classification. For a K-class dataset, the probability of the k-th class is:

P(y = k | x) = exp(w k · x + b k ) / Σ j=1..K exp(w j · x + b j ),

where x is the input vector (the vector produced by the pooling layer in our network), and w k and b k are the weight vector (of the same dimensionality as x) and the bias of the k-th class respectively. The softmax layer calculates the probability of each class and then chooses the class with the maximum probability as the predicted label.
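The softmax prediction above can be sketched as follows; the tiny x, W and three-class setup are invented for illustration:

```python
import numpy as np

def softmax_predict(x, W, b):
    """x: pooled vector (n,); W: (K, n) class weight vectors; b: (K,) biases.
    Returns the class probabilities and the predicted class index."""
    scores = W @ x + b
    scores = scores - scores.max()             # shift for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return p, int(np.argmax(p))

x = np.array([1.0, 0.0])                       # pooled representation (n = 2)
W = np.array([[2.0, 0.0],                      # K = 3 classes
              [0.0, 2.0],
              [1.0, 1.0]])
p, label = softmax_predict(x, W, np.zeros(3))  # class 0 has the largest score
```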

Approach to train the network
The parameters trained by our network are as follows:

θ = (W, F, b, w k , b k ),

where W is the word embedding of all words in the dataset, including those found in the look-up table and those randomly assigned; F is the set of all the filters; b is the bias vector in the convolutional layer; and w k and b k are the weight vector and the bias of the k-th class in the softmax layer. We use the backpropagation algorithm to optimize these parameters and adopt the Adadelta (Zeiler, 2012) update rule to automatically tune the learning rate. We also optimize our network with two other methods: l2-norm regularization terms on the parameters to mitigate overfitting, and the dropout scheme (Srivastava et al., 2014), which randomly sets chosen values to zero to prevent feature co-adaptation.

Improvement for stance detection
In this section, we briefly introduce our improvements on the CNN architecture described above. The improvements are task specific.
Vote scheme. We validate our model by cross validation. For the models of subtask A and subtask B, we design ten parallel epochs, whose validation sets are randomly selected from the training set and non-overlapping.
Unlike a conventional network, we design a "vote scheme" for prediction instead of predicting when the accuracy on the validation set reaches its maximum. In each epoch, we deliberately choose some iterations at which to predict the test set. Then, when the epoch ends, for every sentence in the test set we appoint the label that appears most frequently in these predictions as the result of this epoch. Finally, when the ten epochs end, we vote among the results of these ten epochs by the same method to determine the final labels.
By performing multiple independent runs and voting twice, we obtain a rather robust prediction mechanism.
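The two-level vote can be sketched as follows; the label lists are toy data, and only three epochs are shown instead of the paper's ten:

```python
from collections import Counter

def majority_vote(label_lists):
    """Per-example majority label across several prediction lists."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*label_lists)]

# first-level vote: predictions at chosen iterations within one epoch
epoch_preds = majority_vote([["FAVOR", "AGAINST"],
                             ["FAVOR", "AGAINST"],
                             ["AGAINST", "AGAINST"]])
# second-level vote: across epochs (three shown instead of ten, for brevity)
final = majority_vote([epoch_preds,
                       ["FAVOR", "NONE"],
                       ["FAVOR", "AGAINST"]])
```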
"divide and conquer" scheme. For subtask A, we separate both training and test datasets respectively into five sub-datasets according to their targets, and then train and test five separate models with these divided datasets. The contrast experiment between this "divide and conquer" model and the model trained by the integral dataset is shown in Session 4.
"2-step" scheme. For subtask B, in the condition that the official corpus is unlabeled whereas training set is necessary for our supervised model, we come up with a solution having two steps: 1. Build a twoclass training dataset; 2. Modify the softmax layer to perform three-class classification on the two-class training dataset. According to some expressions and hashtags revealing a distinct tendency, for example, "go trump" and "#MakeAmericaGreatAgain" reveal favor tendency whereas "idiot" and "fired" reveal against tendency, we finally establish a two-class training dataset, which has about 2000 favor tweets and about 3000 against tweets, from the domain corpus for subtask B (Mohammad et al., 2016). Then, we modify the softmax layer. For a test sentence, if the absolute value of the subtraction between the probability values of the two classes is less than a randomly selected real number α(α ∈ [0.05, 0.1]), predict this sentence as "None" stance. Otherwise, predict it the class having the greater probability value.

Experiments and evaluation
Dataset. For subtask A, the training set is the official training data for Task A (Mohammad et al., 2016). For subtask B, the training set is described in Section 3. Details about the datasets are shown in Table 1.
Parameters setup. The word embedding matrix is described in Section 2.1; the dimensionality d is 300. We use filters of three different widths: 100 filters of width 3, 100 of width 4 and 100 of width 5, i.e., 300 filters in total. We choose ReLU as the activation function and use max-pooling. The l2-norm regularization term is set to 1e-6 and the dropout probability is set to 0.5. The bias vector b, as well as w k and b k in the softmax layer, are all initialized to zero.
Test result. We perform contrast experiments on subtask A. The results of the "divide and conquer" model and its contrast integral model, as well as the five separate models, are shown in Table 2. These models are described in Section 3. We can see from Table 2 that the "divide and conquer" model does not always perform better. However, since the words used in sentences belonging to the same target are expected to be more similar, the "divide" model still performs much better on some datasets (e.g. Atheism). The "divide and conquer" model is the one we submitted for evaluation.
Official ranking. Part of the official rankings for both subtasks A and B is summarized in Table 3. As we can see, our model performs well on both subtasks. Our model ranks 2nd on subtask A, with an official metric only 0.5% lower than that of the first-ranked team. On subtask B our model ranks 1st, with an official metric of 56.28%, about 10% higher than that of the second-ranked team.

Conclusions
We develop a specific convolutional neural network system for detecting Twitter stance in this paper. We give a detailed description of our model and its specific adaptations for the different subtasks. Among the 28 submitted systems, our system obtains good ranks on both subtask A and subtask B on the test set of SemEval-2016 Task 6. Our system has good scalability to other related tasks.

Future work
Due to the tight schedule, there are still many aspects to explore. For example, why does the Google News word2vec embedding perform well in this context? How much does this word embedding improve the score compared with randomly initialized word embeddings? Does a more suitable word embedding exist? Moreover, the vote scheme is somewhat crude, and we should conduct more experiments to validate its robustness. Our code is available on GitHub for anyone who has an interest in further exploration 3 .