Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM

This paper describes team Turing’s submission to SemEval 2017 RumourEval: Determining rumour veracity and support for rumours (SemEval 2017 Task 8, Subtask A). Subtask A addresses the challenge of rumour stance classification, which involves identifying the attitude of Twitter users towards the truthfulness of the rumour they are discussing. Stance classification is considered to be an important step towards rumour verification; performing well in this task is therefore expected to be useful in debunking false rumours. In this work we classify a set of Twitter posts discussing rumours as either supporting, denying, questioning or commenting on the underlying rumours. We propose an LSTM-based sequential model that, through modelling the conversational structure of tweets, achieves an accuracy of 0.784 on the RumourEval test set, outperforming all other systems in Subtask A.


Introduction
In stance classification one is concerned with determining the attitude of the author of a text towards a target (Mohammad et al., 2016). Targets can range from abstract ideas to concrete entities and events. Stance classification is an active research area that has been studied in different domains (Ranade et al., 2013; Chuang and Hsieh, 2015). Here we focus on stance classification of tweets towards the truthfulness of rumours circulating in Twitter conversations in the context of breaking news. Each conversation is defined by a tweet that initiates the conversation and a set of nested replies to it that form a conversation thread. The goal is to classify each of the tweets in the conversation thread as either supporting, denying, querying or commenting (SDQC) on the rumour initiated by the source tweet. Being able to detect stance automatically is very useful in the context of events provoking public resonance and associated rumours, as a first step towards verification of early reports (Zhao et al., 2015). For instance, it has been shown that rumours that are later proven to be false tend to spark significantly larger numbers of denying tweets than rumours that are later confirmed to be true (Mendoza et al., 2010; Procter et al., 2013; Derczynski et al., 2014; Zubiaga et al., 2016b).
Here we focus on exploiting the conversational structure of social media threads for stance classification and introduce a novel LSTM-based approach to harness conversations.

Related Work
Single Tweet Stance Classification
Stance classification for rumours was pioneered by Qazvinian et al. (2011) as a binary classification task (support/denial). Zeng et al. (2016) perform stance classification for rumours emerging during crises. Both works use tweets related to the same rumour during training and testing.
A model based on bidirectional LSTM encoding of tweets conditioned on targets has been shown to achieve state-of-the-art results on the SemEval-2016 Task 6 dataset (Augenstein et al., 2016). However, the RumourEval task is different as it addresses conversation threads.

Sequential Stance Classification
Lukasik et al. (2016) and Zubiaga et al. (2016a) consider the sequential nature of tweet threads in their works. Lukasik et al. (2016) employ Hawkes processes to classify temporal sequences of tweets. They show the importance of using both the textual content and temporal information about the tweets, disregarding the discourse structure. Zubiaga et al. (2016a) model the conversational structure of source tweets and subsequent replies in two ways: as a linear chain and as a tree. They use linear- and tree-versions of a CRF classifier, outperforming the approach by Lukasik et al. (2016).

Figure 1: Example of a conversation thread from the dataset with three branches, two of which are highlighted. The conversation has a tree structure, which can be split into individual branches by taking each leaf node with all its direct parents.

Dataset
The dataset provided for this task contains Twitter conversation threads associated with rumours around ten different events in breaking news, including the Charlie Hebdo shootings in Paris, the Ferguson unrest and the crash of a Germanwings plane. These events include 325 conversation threads consisting of 5568 underlying tweets annotated for stance at the tweet level (breakdown between training, testing and development sets is shown in Table 1) (Derczynski et al., 2017). Each thread includes a source tweet that initiates a conversation and nested tweets responding to either the source tweet or earlier replies. The thread can be split into linear branches of tweets, where a branch is defined as a chain of tweets that starts with a leaf tweet and includes its direct parent tweets, all the way up to the source tweet. Figure 1 shows an example of a conversation along with its annotations represented as a tree structure with highlighted branches. The depth of a tweet is the number of its parents starting from the root node. Branches 1 and 2 in Figure 1 have depth one whereas branch 3 has depth three. There is a clear class imbalance in favour of commenting tweets (66%) and supporting tweets (18%), whereas the denying (8%) and querying (8%) classes are under-represented (see Table 2). While this imbalance poses a challenge, it is also indicative of the realistic scenario where only a few users question the veracity of a statement.
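The branch decomposition described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `parent_of` mapping, the function names and the tweet ids are hypothetical, and the toy thread merely mirrors the shape described for Figure 1 (two branches of depth one and one of depth three).

```python
from typing import Dict, List, Optional


def split_into_branches(parent_of: Dict[str, Optional[str]]) -> List[List[str]]:
    """Return one source-to-leaf chain of tweet ids for every leaf tweet.

    `parent_of` maps each tweet id to the id of the tweet it replies to;
    the source tweet maps to None. (Hypothetical input format.)
    """
    has_child = {p for p in parent_of.values() if p is not None}
    leaves = [t for t in parent_of if t not in has_child]
    branches = []
    for leaf in leaves:
        chain = []
        node: Optional[str] = leaf
        while node is not None:  # walk up through the direct parents
            chain.append(node)
            node = parent_of[node]
        branches.append(chain[::-1])  # reorder so the source tweet comes first
    return branches


def depth(tweet_id: str, parent_of: Dict[str, Optional[str]]) -> int:
    """Number of parents between a tweet and the source tweet (source = 0)."""
    d = 0
    node = parent_of[tweet_id]
    while node is not None:
        d += 1
        node = parent_of[node]
    return d


# Toy thread with made-up ids: three branches, as in the Figure 1 description.
thread = {
    "source": None,
    "reply_a": "source",
    "reply_b": "source",
    "reply_c": "source",
    "reply_c1": "reply_c",
    "reply_c2": "reply_c1",
}
branches = split_into_branches(thread)
for branch in branches:
    print(" -> ".join(branch))
```

Note that the source tweet appears in every branch, and `reply_c` would be shared by any further branches below it; this overlap is why repeated tweets need special handling when branches are used as training sequences.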

Features
Prior to generating features for the tweets, we perform a pre-processing step where we remove non-alphabetic characters, convert all words to lower case and tokenise texts. 1 Once tweet texts are pre-processed, we extract the following features:
• Word vectors: we use a word2vec (Mikolov et al., 2013) model pre-trained on the Google News dataset (300d) 2 using the gensim package (Řehůřek and Sojka, 2010).
• Tweet lexicon: (1) count of negation words 3 and (2) count of swear words. 4
• Punctuation: (1) presence of a period, (2) presence of an exclamation mark, (3) presence of a question mark, (4) ratio of capital letters.
• Attachments: (1) presence of a URL and (2) presence of images.
• Relation to other tweets: (1) word2vec cosine similarity with respect to the source tweet, (2) word2vec cosine similarity with respect to the preceding tweet, and (3) word2vec cosine similarity with respect to the thread.
• Content length: (1) word count and (2) character count.
• Tweet role: whether the tweet is a source tweet of a conversation.
Tweet representations are obtained by averaging word vectors in a tweet and then concatenating with the additional features into a single vector, at the pre-processing step. This set of features proved to be the best compared to using word2vec features on their own or any of the reduced combinations of these features.

1 For implementation of all pre-processing routines we use Python 2.7 with the NLTK package.
2 We have also tried using GloVe word embeddings trained on a Twitter dataset, but it led to a decrease in performance on both development and testing sets compared to the Google News word vectors.
3 A presence of any of the following words would be considered as a presence of negation: not, no, nobody, nothing, none, never, neither, nor, nowhere, hardly, scarcely, barely, don't, isn't, wasn't, shouldn't, wouldn't, couldn't, doesn't.
4 A list of 458 bad words was taken from http://urbanoalvarez.es/blog/2008/04/04/bad-words-list/

[As querying] @username Weren't you the one who abused her?
[As supporting] "Go online & put down 'Hillary Clinton illness,'" Rudy says. Yes - but look up the truth - not health smears https://t.co/EprqiZhAxM
[As supporting] @username I demand you retract the lie that people in #Ferguson were shouting "kill the police", local reporting has refuted your ugly racism
[As commenting] @FoxNews six years ago... real good evidence. Not!
Figure 2: Examples of misclassified denying tweets.
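To make the tweet representation concrete, the following is a rough, standard-library-only sketch of how a subset of the features listed in this section could be concatenated with an averaged word vector. It is not the authors' pipeline: the two-dimensional `toy_embeddings` stand in for the 300-dimensional word2vec model, the regular-expression tokeniser replaces NLTK and keeps apostrophes so that contractions can match the negation list, the capital-letter ratio is assumed to be taken over alphabetic characters, and the URL check is deliberately crude. All function names are hypothetical.

```python
import math
import re
from typing import Dict, List

NEGATION_WORDS = {
    "not", "no", "nobody", "nothing", "none", "never", "neither", "nor",
    "nowhere", "hardly", "scarcely", "barely", "don't", "isn't", "wasn't",
    "shouldn't", "wouldn't", "couldn't", "doesn't",
}


def tokenise(text: str) -> List[str]:
    # Assumption: lower-case and keep alphabetic runs plus apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def average_vector(tokens: List[str], emb: Dict[str, List[float]], dim: int) -> List[float]:
    vectors = [emb[t] for t in tokens if t in emb]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def tweet_features(text: str, source_text: str, emb: Dict[str, List[float]],
                   dim: int, is_source: bool) -> List[float]:
    tokens = tokenise(text)
    letters = [c for c in text if c.isalpha()]
    vec = average_vector(tokens, emb, dim)
    source_vec = average_vector(tokenise(source_text), emb, dim)
    extra = [
        float(sum(t in NEGATION_WORDS for t in tokens)),  # negation count
        float("." in text),                               # period present
        float("!" in text),                               # exclamation mark present
        float("?" in text),                               # question mark present
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        float("http" in text),                            # crude URL check
        cosine(vec, source_vec),                          # similarity to source tweet
        float(len(tokens)),                               # word count
        float(len(text)),                                 # character count
        float(is_source),                                 # tweet role
    ]
    return vec + extra  # averaged word vector followed by the extra features


# Made-up two-dimensional embeddings, for illustration only.
toy_embeddings = {
    "police": [1.0, 0.0], "shooting": [0.0, 1.0],
    "not": [1.0, 1.0], "true": [0.5, 0.5],
}
feats = tweet_features("That is NOT true!", "Police confirm shooting",
                       toy_embeddings, dim=2, is_source=False)
print(len(feats), feats)
```

The swear-word count, image presence, and the similarities to the preceding tweet and to the thread are omitted here for brevity; they would be appended to `extra` in the same way.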

Branch-LSTM Model
To tackle the task of rumour stance classification, we propose branch-LSTM, a neural network architecture that uses layers of LSTM units (Hochreiter and Schmidhuber, 1997) to process the whole branch of tweets, thus incorporating structural information of the conversation (see the illustration of the branch-LSTM in Figure 3). The input at each time step i of the LSTM layer is the representation of the tweet as a vector. We record the output of each time step so as to attach a label to each tweet in a branch. 5 This output is fed through several dense ReLU layers, a 50% dropout layer, and then through a softmax layer to obtain class probabilities. We use zero-padding and masks to account for the varying lengths of tweet branches. The model is trained using the categorical cross-entropy loss function. Since there is overlap between branches originating from the same source tweet, we exclude the repeating tweets from the loss function using a mask at the training stage. The model uses as the tweet representation the average of word vectors concatenated with the extra features described above. Due to the short length of tweets, using more complex models for learning tweet representations, such as an LSTM that takes each word as input at each time step and returns the representation at the final time step, does not lead to a noticeable difference in performance based on cross-validation experiments on the training and development sets, while taking significantly longer to train. We experimented with replacing the unidirectional LSTMs with bidirectional LSTMs but observed no improvements in accuracy (using cross-validation results on the training and development sets).
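The zero-padding and the two kinds of mask mentioned above can be illustrated without any deep learning library. The sketch below is an assumption about how such masks might be built, not the authors' implementation: `pad_and_mask` and `masked_cross_entropy` are hypothetical names, a tweet is counted in the loss only the first time it appears across branches, and the loss is averaged over the counted tweets (the normalisation is not stated in the text).

```python
import math
from typing import Dict, List, Tuple


def pad_and_mask(
    branches: List[List[str]], features: Dict[str, List[float]], dim: int
) -> Tuple[List[List[List[float]]], List[List[int]], List[List[int]]]:
    """Zero-pad branches to equal length and build padding and loss masks."""
    max_len = max(len(branch) for branch in branches)
    seen = set()
    inputs, pad_mask, loss_mask = [], [], []
    for branch in branches:
        n_pad = max_len - len(branch)
        inputs.append([features[t] for t in branch]
                      + [[0.0] * dim for _ in range(n_pad)])
        pad_mask.append([1] * len(branch) + [0] * n_pad)  # 1 = real tweet
        row = []
        for tweet_id in branch:
            row.append(0 if tweet_id in seen else 1)  # 0 = already counted
            seen.add(tweet_id)
        loss_mask.append(row + [0] * n_pad)
    return inputs, pad_mask, loss_mask


def masked_cross_entropy(
    probs: List[List[List[float]]], labels: List[List[int]],
    loss_mask: List[List[int]]
) -> float:
    """Mean categorical cross-entropy over positions where loss_mask is 1."""
    total, count = 0.0, 0
    for b, branch_probs in enumerate(probs):
        for t, class_probs in enumerate(branch_probs):
            if loss_mask[b][t]:
                total -= math.log(class_probs[labels[b][t]])
                count += 1
    return total / count


# Toy example with made-up ids and two-dimensional tweet vectors.
branches = [["s", "a"], ["s", "b"], ["s", "c", "c1", "c2"]]
features = {t: [1.0, 0.0] for t in ["s", "a", "b", "c", "c1", "c2"]}
inputs, pad_mask, loss_mask = pad_and_mask(branches, features, dim=2)

# Stand-in for softmax outputs: a uniform distribution over the four classes.
probs = [[[0.25] * 4 for _ in range(4)] for _ in branches]
labels = [[3] * 4 for _ in branches]  # placeholder labels at every position
loss = masked_cross_entropy(probs, labels, loss_mask)
print(loss_mask, loss)
```

In this toy thread the source tweet `s` occurs in all three branches but contributes to the loss once, so the number of counted positions equals the number of distinct tweets.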

Experimental Setup
The dataset is split into training, development and test sets by the task organisers. We determined the optimal set of hyperparameters by testing the performance of our model on the development set for different parameter combinations. We used the Tree of Parzen Estimators (TPE) algorithm 6 to search the hyperparameter space.

Results
The performance of our model on the testing and development sets is shown in Table 3. Together with the accuracy we show the macro-averaged F-score and per-class F-scores, as these metrics account for the class imbalance. The difference in accuracy between the testing and development sets is minimal; however, we see a significant difference in Macro-F score due to the different class balance in these sets. The Macro-F score could be improved if we used it as the metric for optimising hyperparameters. The branch-LSTM model predicts commenting, the majority class, well; however, it is unable to pick out any denying tweets, the most challenging under-represented class. Most denying instances get misclassified as commenting (see Table 5), with only one tweet misclassified as querying and two as supporting (Figure 2). An increased amount of labelled data would be helpful to improve the performance of this model. As we were considering conversation branches, it is interesting to analyse the performance distribution across different tweet depths (see Table 4). The maximum depth/branch length in the testing set is 13, with most tweets concentrated at depths from 0 to 3. Source tweets (depth zero) are usually supporting and the model predicts these very well, but performance on supporting tweets at other depths decreases. The model does not show a noticeable difference in performance on tweets of varying lengths.

6 We used the implementation of the TPE algorithm in the hyperopt package (Bergstra et al., 2013).
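The gap between accuracy and macro-averaged F-score discussed above follows directly from the definitions. The sketch below uses the standard per-class F1 and its unweighted mean; the ten gold and predicted labels are made up purely to illustrate the effect and are not taken from the task data or from the tables.

```python
from typing import Dict, List

LABELS = ["support", "deny", "query", "comment"]


def f1_per_class(gold: List[str], pred: List[str], labels: List[str]) -> Dict[str, float]:
    """Standard per-class F1; a class with no correct predictions scores 0."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


def accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy labels: the single denying tweet is predicted as commenting.
gold = ["comment"] * 7 + ["support", "deny", "query"]
pred = ["comment"] * 7 + ["support", "comment", "query"]

scores = f1_per_class(gold, pred, LABELS)
acc = accuracy(gold, pred)
macro = macro_f1(scores)
print(acc, scores, macro)
```

Missing every instance of one small class barely moves accuracy, yet it removes that class's entire contribution to the macro average, which is why the two metrics can diverge on imbalanced sets.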

Conclusions
This paper describes the Turing system entered in SemEval-2017 Task 8 Subtask A. Our method decomposes the tree structure of conversations into linear sequences and achieves an accuracy of 0.784 on the testing set, setting the state of the art for rumour stance classification. In future work we plan to explore different methods for modelling tree-structured conversations.