IKM at SemEval-2017 Task 8: Convolutional Neural Networks for stance detection and rumor verification

This paper describes our approach for SemEval-2017 Task 8. We aim at detecting the stance of tweets and determining the veracity of a given rumor. We utilize a convolutional neural network with multiple filter sizes for short text categorization. Our approach beats the baseline classifiers in F1 score on data from different events. The best of our submitted runs ranks first among the official scores on subtask B.


Introduction
Rumors in social networks are widely noticed due to the broad success of online social media. Unconfirmed rumors usually spark discussion before being verified, and they have created costs for society and panic among people. Rather than relying on human observers to identify trending rumors, it would be helpful to detect them automatically and limit the damage immediately. However, identifying false rumors early is a hard task without sufficient evidence such as responses, retweets and fact-checking sites. At this stage, context-level patterns are more obvious and more useful for identifying rumors than propagation structure, in particular the different patterns of stances amongst participants (Qazvinian et al., 2011).
Recent research has proposed a 4-way classification task to encompass all the different kinds of reactions to rumors. A classification schema of supporting, denying, querying and commenting (SDQC) is applied in SemEval-2017 Task 8.
In this paper, we describe a system for stance classification and rumor verification in tweets. For the first task, we are given tree-structured conversations, where replies are triggered by a source tweet. We need to categorize each reply into one of the SDQC categories based on reply-source pairs. The second task is rumor verification. Our system targets the closed variant, which means the veracity of a rumor has to be predicted from the tweet alone, without external data.
This is a challenging NLP task. Statements containing sarcasm, irony and metaphor often require personal experience to infer their broader context (Kreuz and Caucci, 2007). Furthermore, substantial background knowledge is required for fact checking (Reichel and Lendvai, 2016).
In this paper, we develop convolutional neural network models for both tasks. Our system relies on a supervised classifier that uses text features from different word representation methods, such as word embeddings learned during training and pre-trained word embedding models like GloVe (Pennington et al., 2014). The experiment section presents our results and discusses the performance of our work.

Related Work
Rumor verification on online social media has developed into a popular subject in recent years. The most common features were proposed by Castillo et al. (2011), who classified useful features into four categories: message-based features, user-based features, topic-based features, and propagation-based features. However, this approach is limited by the data skew problem when false rumors are less common. Thus, most existing approaches attempt to classify truthfulness by utilizing information beyond the content of the posts, for example propagation structure. Wu et al. (2015) proposed a novel message propagation pattern based on the users who transmit the message. But most of these features are available only once many users have responded to the rumor. Our task, on the other hand, is to do the initial classification on content features, which are available much earlier.

System Overview
Our system employs a convolutional neural network mainly inspired by Kim (2014). We chose models by their LOO (Leave One Out) validation performance. In LOO, each conversation thread is held out in turn for testing while the models are retrained on the remaining threads. In the following section, our CNN Tweet Model is briefly explained.
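The LOO procedure can be sketched as follows. This is a minimal illustration rather than our actual pipeline: the function name leave_one_out, the dictionary layout and the toy thread ids are assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_out(
    threads: Dict[str, List[str]]
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_id, train_tweets, test_tweets) once per thread."""
    for held_out in threads:
        # The held-out thread is the test set for this fold.
        test = list(threads[held_out])
        # Every other thread contributes its tweets to the training set.
        train = [
            tweet
            for thread_id, tweets in threads.items()
            if thread_id != held_out
            for tweet in tweets
        ]
        yield held_out, train, test


# Toy conversation threads (hypothetical data).
threads = {"t1": ["a", "b"], "t2": ["c"], "t3": ["d", "e"]}
splits = list(leave_one_out(threads))
```

Each fold would retrain a fresh model on train and evaluate it on test; the scores are then aggregated over all folds.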

Data Preprocessing
Before applying the models, we need to transform the irregular input text. First, we remove URLs and usernames with '@' tags, which do not contribute to sentiment analysis; without external data, URLs and usernames are considered noise. Furthermore, we convert all letters to lower case. Apart from these removals, we keep important clues such as hashtags and some special characters. Question marks and exclamation marks, for example, have proven to be helpful (Zhao, 2015).
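A minimal sketch of these preprocessing steps is given below. The regular expressions and the function name preprocess are assumptions for illustration; the exact patterns used in our system may differ.

```python
import re

# Assumed patterns: anything starting with http(s):// or www. is a URL,
# and '@' followed by word characters is a username mention.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess(tweet: str) -> str:
    """Remove URLs and @usernames, lowercase, keep hashtags and ?/! marks."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = text.lower()
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(preprocess("@BBC Is this TRUE?! #Ottawa http://t.co/abc"))
# is this true?! #ottawa
```

Hashtags, question marks and exclamation marks pass through untouched because neither pattern matches them.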

Convolutional Model
Encoding a tweet into the matrix that is passed to the input layer takes two steps. The model is illustrated in Figure 1. First, we use word embedding to convert each word in the tweet into a vector. We randomly initialize the word embedding matrix, in which each row is a vector that represents a word in the vocabulary, and then learn the embedding weights during the training process. Second, we concatenate these word vectors to produce a matrix representing the sentence, in which each row represents one word in the tweet:

$$\mathit{tm} = [w_1; w_2; \ldots; w_n], \quad \mathit{tm} \in \mathbb{R}^{n \times d}$$

where $w_i \in \mathbb{R}^{d}$ is the vector of the $i$-th word, $n$ is the number of words in the tweet, and $\mathit{tm}$ is the word matrix formed by the concatenation of the word vectors.
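The two encoding steps can be illustrated with the short sketch below. It only shows the lookup and stacking; the names build_embeddings and tweet_matrix, the uniform initialization range and the toy vocabulary are assumptions, and the learning of the embedding weights is left out.

```python
import random
from typing import Dict, List


def build_embeddings(vocab: List[str], d: int, seed: int = 0) -> Dict[str, List[float]]:
    """Randomly initialize one d-dimensional vector per vocabulary word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(d)] for word in vocab}


def tweet_matrix(tokens: List[str], embeddings: Dict[str, List[float]]) -> List[List[float]]:
    """Stack the word vectors row by row; row i represents the i-th word."""
    return [embeddings[token] for token in tokens]


vocab = ["is", "this", "true", "?"]
embeddings = build_embeddings(vocab, d=128)  # d = 128 as in our settings
tm = tweet_matrix(["is", "this", "true", "?"], embeddings)
print(len(tm), len(tm[0]))  # 4 128
```

In this sketch a token outside the vocabulary raises a KeyError; a real system would need a policy for unknown words.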
In the convolutional layer, we use $\mathit{tm}$ as input and select a window size $y$ to slide over the matrix. To extract local features in the region of the window, a filter matrix $f \in \mathbb{R}^{y \times d}$ is applied at every position through element-wise multiplication with the matrix values in the window, followed by a nonlinear operation. For the window starting at row $i$, this operation is:

$$l_i = g(f \cdot \mathit{tm}_{i:i+y-1} + b)$$

where $f$ is the filter matrix, whose values are learned by the CNN during the training process, $b$ is the bias term, $g$ is the nonlinear function, and $l_i$ is an element of a local feature vector. After we slide the window through the whole matrix, we get a local feature vector of the input tweet:

$$l = [l_1, l_2, \ldots, l_{n-y+1}]$$

where $l \in \mathbb{R}^{n-y+1}$ is a local feature vector with $n-y+1$ elements.
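The sliding-window operation can be sketched in a few lines. This is an illustration under stated assumptions: the nonlinear function is taken to be ReLU, which the text above does not specify, and the name conv_feature and the toy values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv_feature(tm: Matrix, f: Matrix, b: float = 0.0) -> List[float]:
    """Slide a y-by-d filter over an n-by-d tweet matrix.

    Returns a local feature vector with n - y + 1 elements.
    """
    n, y, d = len(tm), len(f), len(f[0])
    features = []
    for i in range(n - y + 1):
        # Element-wise product of the filter with rows i .. i+y-1, summed up.
        total = sum(f[r][k] * tm[i + r][k] for r in range(y) for k in range(d))
        # Assumed nonlinearity: ReLU.
        features.append(max(0.0, total + b))
    return features


# Toy example with n = 3 words, d = 2 dimensions and window size y = 2.
tm = [[1, 0], [0, 1], [1, 1]]
f = [[1, 1], [1, 1]]
print(conv_feature(tm, f))  # [2.0, 3.0]
```

The output has 3 - 2 + 1 = 2 elements, one per window position.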
To handle sequences of n consecutive words that may carry a special meaning (e.g. "Boston Globe"), we use multiple window sizes to produce different feature vectors. Applying different window sizes to capture features is thus similar to using n-grams. Meanwhile, we use different filter matrices to extract different local features.
A pooling layer is used to simplify the output of the convolutional layer. We extract the maximum value from each local feature vector to form a condensed representation vector, so that for every local feature vector only the most important feature is kept and noise is ignored. After the max-pooling operation, we concatenate the maximum values of all local feature vectors:

$$\mathit{gf} = [\max(l^{(1)}), \max(l^{(2)}), \ldots, \max(l^{(m)})]$$

where $l^{(j)}$ is the local feature vector produced by the $j$-th filter, $m$ is the total number of filters, and $\mathit{gf}$ is the global feature vector representing the tweet.
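The combination of several window sizes with max pooling can be sketched as follows. The helper names local_features and global_feature, the ReLU nonlinearity and the all-ones toy filters are assumptions for illustration; in the real model the filter values are learned.

```python
from typing import List

Matrix = List[List[float]]


def local_features(tm: Matrix, f: Matrix, b: float = 0.0) -> List[float]:
    """One local feature per window position (ReLU assumed as nonlinearity)."""
    y, d = len(f), len(f[0])
    return [
        max(0.0, b + sum(f[r][k] * tm[i + r][k] for r in range(y) for k in range(d)))
        for i in range(len(tm) - y + 1)
    ]


def global_feature(tm: Matrix, filters: List[Matrix]) -> List[float]:
    """Max-pool each local feature vector and concatenate the maxima.

    Assumes the tweet has at least as many words as the largest window.
    """
    return [max(local_features(tm, f)) for f in filters]


def ones(y: int, d: int) -> Matrix:
    """Toy filter of all ones with window size y."""
    return [[1] * d for _ in range(y)]


filters = [ones(2, 2), ones(3, 2), ones(4, 2)]  # window sizes 2, 3 and 4
tm_short = [[1, 0], [0, 1], [1, 1], [0, 0]]          # 4 words
tm_long = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 1]]   # 5 words

print(global_feature(tm_short, filters))  # [3.0, 4.0, 4.0]
print(len(global_feature(tm_long, filters)))  # 3
```

Both tweets yield a vector with one entry per filter, regardless of their length.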
Because the pooling layer keeps one value per filter, applying the same window sizes and filter matrices to different tweets ensures that the global feature size is fixed.
For classification, we feed the global feature vector of the tweet into a fully connected layer to calculate the probability distribution. A softmax activation function is applied:

$$\hat{p}_j = \frac{\exp(x^{\top} w_j)}{\sum_{j'=1}^{4} \exp(x^{\top} w_{j'})}$$

where $x$ is the input vector and $w_{j'}$ is the $j'$-th column of the weight matrix $W$. Given the probabilities over the four classes, we take the class with the maximum value as the label for the given input tweet.
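The classification step can be sketched as below. The label order in LABELS, the function names and the toy weights are assumptions for illustration; subtracting the maximum score before exponentiation is a standard numerical safeguard and does not change the result.

```python
import math
from typing import List, Tuple

# Assumed label order for the four SDQC classes.
LABELS = ["support", "deny", "query", "comment"]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x: List[float], W: List[List[float]]) -> Tuple[str, List[float]]:
    """W has len(x) rows and one column per class; column j scores class j."""
    scores = [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))], probs


# Toy global feature vector and weight matrix (hypothetical values).
x = [1.0, 2.0]
W = [[0.1, 0.0, 0.5, 0.2],
     [0.0, 0.3, 0.1, 0.4]]
label, probs = classify(x, W)
print(label)  # comment
```

Here the class scores are 0.1, 0.6, 0.7 and 1.0, so the last class receives the highest probability.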

Tasks and Model Training
During the training phase, our CNN model automatically learns the values of its filters based on the task.
In subtask A, the tweets are classified into four categories: supporting, denying, querying and commenting. We define the ground truth vector p as a one-hot vector. The parameter d used in the word embedding is 128. The number of filters in the convolutional layers is 128. The dropout probability is set to 0.5. The Adam optimization algorithm is used to optimize our network's loss function. Moreover, there are three filter region sizes in our system: 2, 3 and 4, each of which has 2 filters.
To deal with the class imbalance in the data, balanced mini-batching was applied. According to the data statistics, more than 64% of the instances belong to the commenting class. We randomly chose 16 instances of each class from the training set, so that each batch contains 64 instances.
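Balanced mini-batching can be sketched as follows. The function name balanced_batch and the toy data are hypothetical, and sampling with replacement when a class has fewer than 16 instances is an assumption of this sketch, not a detail stated above.

```python
import random
from collections import defaultdict
from typing import List, Optional, Tuple

Example = Tuple[str, str]  # (tweet text, label)


def balanced_batch(
    examples: List[Example],
    per_class: int = 16,
    rng: Optional[random.Random] = None,
) -> List[Example]:
    """Draw the same number of instances from every class."""
    if rng is None:
        rng = random.Random()
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    batch = []
    for items in by_label.values():
        if len(items) >= per_class:
            batch.extend(rng.sample(items, per_class))
        else:
            # Assumption: small classes are sampled with replacement.
            batch.extend(rng.choices(items, k=per_class))
    rng.shuffle(batch)
    return batch


# Skewed toy training set (hypothetical counts).
train = (
    [(f"c{i}", "comment") for i in range(200)]
    + [(f"s{i}", "support") for i in range(40)]
    + [(f"d{i}", "deny") for i in range(10)]
    + [(f"q{i}", "query") for i in range(20)]
)
batch = balanced_batch(train, per_class=16, rng=random.Random(0))
print(len(batch))  # 64
```

With four classes and 16 instances per class, every batch has 64 instances and the commenting class no longer dominates the gradient updates.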
A voting scheme is applied to decrease the uncertainty of training on randomly selected samples. We trained 5 models to predict the same testing data and took a vote for the final prediction. By training multiple times independently, we achieved more robust results.
In subtask B, most of the parameter settings were the same as in subtask A. Because the output classes are rumor and non-rumor, we discard the label "unverified". In addition, we use the probability in section 3.2 to define the credibility c of our answer. The credibility is normalized to the interval [0, 1].
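Returning to the voting scheme over the 5 independently trained models, a minimal sketch is given below. The function name majority_vote and the toy predictions are hypothetical, and the tie-breaking rule (the label seen first among the tied ones wins) is an assumption of this sketch.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[List[str]]) -> List[str]:
    """predictions[m][i] is the label model m assigns to test instance i."""
    voted = []
    for labels in zip(*predictions):
        # most_common lists tied labels in first-seen order (Python 3.7+).
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Predictions of 5 models on 3 test tweets (hypothetical values).
predictions = [
    ["comment", "deny", "query"],
    ["comment", "support", "query"],
    ["support", "deny", "comment"],
    ["comment", "deny", "query"],
    ["query", "comment", "query"],
]
print(majority_vote(predictions))  # ['comment', 'deny', 'query']
```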

Evaluation
We conduct experiments using the rumor datasets annotated for stance. The statistics of the datasets are shown in Table 1. For subtask B, conversation threads are not available to the participants and the use of external data is forbidden in the closed variant.

Baselines
We compare our results with those of Lukasik et al. (2016) in Table 2. We follow their LOO settings and test on the same dataset. We report accuracy (Acc) and the macro-average of F1 scores across all labels (F1) alongside Lukasik's baseline.
Table 3 lists the results of using different window sizes for the filters in the tweet encoding process. We set different window sizes to observe their impact. The experiment was performed with the same settings as in section 5.1 for the Ottawa event. We obtain the best performance when the window size combination is (2, 3, 4). The window sizes 2, 3 and 4 correspond to encoding the bigrams, trigrams and four-grams of the tweets respectively. We can see that performance decreases slightly as the window size increases. That is, too few grams can lose some features, while too many grams can introduce noise.
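For reference, the two measures used in the comparison with the baseline, accuracy and macro-averaged F1, can be computed as in the sketch below. The function names and the toy labels are hypothetical, and assigning an F1 of 0 to a class with no gold and no predicted instances is a convention assumed here.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


labels = ["support", "deny", "query", "comment"]
gold = ["comment", "comment", "deny", "query"]
pred = ["comment", "deny", "deny", "query"]
print(accuracy(gold, pred))  # 0.75
print(round(macro_f1(gold, pred, labels), 4))  # 0.5833
```

In this toy case accuracy is 0.75 while macro F1 is only about 0.58, because the support class, which is never predicted correctly, counts as much as the frequent classes. This is why macro F1 is informative on skewed stance data.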

Official Results
Our submission to subtask A achieves an accuracy of 0.701 (results and task details can be found at http://alt.qcri.org/semeval2017/task8/). The statistical details of each class are given in Table 4. We notice that the comment stance is the easiest to detect, since comments make up a large part of the data. The number of query instances is similar to that of support and deny, yet queries have much better precision and recall because their features are more obvious. Likewise, the deny stance has some negative words as features. However, it is challenging to extract features of the supporting stance, which results in poorer performance.
The ranking for subtask B is summarized in Table 5. As we can see, our model performs best among the official scores. Our code is available on GitHub for anyone interested in further exploration.

Conclusion
In this paper, we develop a convolutional neural network system for stance detection and rumor veracity determination on Twitter. Compared with the baseline approach, our system obtains good results on stance detection. In addition, on the test set of SemEval-2017 Task 8 subtask B, we ranked 1st in the official evaluation run.