USFD at SemEval-2016 Task 6: Any-Target Stance Detection on Twitter with Autoencoders

This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as “favor”, “against”, or “none”. In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic. To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.


Introduction
Stance detection is the task of assigning stance labels to a piece of text with respect to a topic, i.e. whether a piece of text is in favour of "abortion", neutral, or against it. Previous work considered target-specific stance predictors in debates (Walker et al., 2012; Hasan and Ng, 2013) or news (Ferreira and Vlachos, 2016).
The variety of topics discussed on Twitter calls for developing methods that can generalise to any target, including targets not seen in the training data, which is the focus of Subtask B in Task 6 of SemEval 2016 (Mohammad et al., 2016). A further challenge is that the targets are not always mentioned in the tweets, which distinguishes this task from target-dependent sentiment analysis (Zhang et al., 2016) and open-domain target-dependent sentiment analysis (Mitchell et al., 2013). The SemEval Stance Detection task is further related to that of textual entailment (Dagan et al., 2005; Bowman et al., 2015; Lendvai et al., 2016), i.e. we judge whether a hypothesis (the tweet in our task) entails, contradicts or is neutral towards a textual premise (the target in our task). However, the premises in typical RTE datasets offer a richer context than the stance detection targets, i.e. they are full sentences instead of topic labels such as "atheism". Simple baselines such as textual overlap can achieve an F1 of >0.5 for textual entailment (Bowman et al., 2015), whereas for stance detection such baselines would not perform well, as the target is only mentioned in about half the tweets.
In our approach we learn a 3-way logistic regression classifier to perform stance detection. Apart from the standard bag-of-words features commonly used in sentiment analysis, we also use features from a trained bag-of-words autoencoder similar to the one used by Glorot et al. (2011). In our experiments we show that the bag-of-words autoencoder, trained on a large amount of unlabelled tweets about the targets, can help generalise better to unseen targets; on our development set it achieves an 8% increase over our best baseline. Further, tweets which contain the target are easier to classify correctly than tweets which do not. Such information can be useful for stance detection, and we experiment with different ways of integrating it, finding that including a binary feature "targetContainedInTweet" outperforms including features extracted by applying the autoencoder to the target.

Method Description
At the core of our stance detection approach is a classifier trained on tweets stance-labelled with respect to a target. For this purpose we used the logistic regression classifier from scikit-learn with L2 regularisation (Pedregosa et al., 2011) 1 . In what follows we describe the various feature representations we used and the data pre-processing. Resources to reproduce our experiments are available on Github 2 .
The stages of our approach are: a) collect unlabelled tweets about the targets; b) preprocess the data; c) train a bag-of-words autoencoder on all task data and the collected unlabelled tweets; d) apply the autoencoder to all labelled training tweets to obtain a fixed-length feature vector and add a binary "does the target appear in the tweet" feature; and e) train a logistic regression model and apply it to the test tweets.
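The classification stage (step e) can be sketched with scikit-learn's logistic regression; the random 100-dimensional vectors below are a toy stand-in for the autoencoder features, and the exact feature layout (encoder output plus a trailing targetInTweet flag) is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for the real features: 100-d autoencoder output
# plus one binary targetInTweet column (layout is illustrative).
X = np.hstack([rng.normal(size=(60, 100)),
               rng.integers(0, 2, size=(60, 1)).astype(float)])
y = rng.choice(["FAVOR", "AGAINST", "NONE"], size=60)

# L2-regularised logistic regression, as in the paper
clf = LogisticRegression(penalty="l2", max_iter=1000)
clf.fit(X, y)
preds = clf.predict(X)
```

At application time the same feature extraction is run on the test tweets and `clf.predict` yields one of the three stance labels per tweet.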

Autoencoder Training
After tweets are tokenised, a bag-of-words autoencoder is trained on them. To do so, a vocabulary of the 50000 most frequent words is constructed. The input to the autoencoder is, for each training example, a binary vector of size input_dim = 50000. Each index i corresponds to a word in the vocabulary: input[i] is 1 if the tweet contains that word and 0 otherwise. During autoencoder training, an encoder, i.e. an embedding function, is learned which maps an input of size input_dim to an embedding of size output_dim, as well as a decoder which reconstructs the input. We apply the encoder to the training and test data to obtain features of size output_dim for supervised learning and disregard the decoder. While it would be possible to train an encoder which preserves word order, e.g. an LSTM (Li et al., 2015), we opt for a simpler bag-of-words autoencoder here, following Glorot et al. (2011).
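The binary bag-of-words input can be sketched as follows (`build_vocab` and `bow_vector` are illustrative helper names, not from the paper):

```python
from collections import Counter

def build_vocab(token_lists, size=50000):
    # Keep the `size` most frequent words across all tweets.
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {w: i for i, (w, _) in enumerate(counts.most_common(size))}

def bow_vector(tokens, vocab):
    # Binary indicator vector: vec[i] is 1 iff vocabulary word i occurs.
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

tweets = [["hillary", "rally", "tonight"], ["trump", "rally"], ["climate", "talk"]]
vocab = build_vocab(tweets)
vec = bow_vector(["trump", "rally", "unknownword"], vocab)  # out-of-vocab words are dropped
```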
The architecture of the autoencoder is as follows: input_dim is 50000, there is one hidden layer of dimensionality 100, and output_dim is 100. Dropout of 0.1 is applied to the hidden layer (Srivastava et al., 2014). The autoencoder is trained with Adam (Kingma and Ba, 2014) with a learning rate of 0.1 for 2600 iterations. In each iteration, 500 training examples are selected randomly.
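A minimal numpy sketch of such a one-hidden-layer autoencoder follows; to stay self-contained it uses toy dimensions, plain gradient descent instead of Adam, and omits dropout:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 50, 10   # toy sizes; the paper uses 50000 and 100

# Toy binary bag-of-words data (roughly 10% of entries set to 1)
X = (rng.random((200, input_dim)) < 0.1).astype(float)

W_enc = rng.normal(0.0, 0.1, (input_dim, hidden_dim)); b_enc = np.zeros(hidden_dim)
W_dec = rng.normal(0.0, 0.1, (hidden_dim, input_dim)); b_dec = np.zeros(input_dim)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    return sigmoid(x @ W_enc + b_enc)

def reconstruct(x):
    return sigmoid(encode(x) @ W_dec + b_dec)

def bce(x, x_hat, eps=1e-9):
    # binary cross-entropy reconstruction loss
    return -np.mean(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))

loss_before = bce(X, reconstruct(X))
lr = 0.5
for step in range(500):
    batch = X[rng.integers(0, len(X), 32)]
    h = sigmoid(batch @ W_enc + b_enc)        # encoder forward pass
    x_hat = sigmoid(h @ W_dec + b_dec)        # decoder forward pass
    d_out = (x_hat - batch) / len(batch)      # BCE gradient at the sigmoid output
    d_h = d_out @ W_dec.T * h * (1 - h)       # backprop through the hidden layer
    W_dec -= lr * (h.T @ d_out); b_dec -= lr * d_out.sum(0)
    W_enc -= lr * (batch.T @ d_h); b_enc -= lr * d_h.sum(0)

loss_after = bce(X, reconstruct(X))
features = encode(X)   # the decoder is discarded; the encoding feeds the classifier
```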
Additional tweets are collected: 395212 tweets, posted between 18 November and 13 January, collected with the Twitter Keyword Search API 3 using up to two keywords per target (hillary, clinton, trump, climate, femini, aborti). Note that Twitter does not allow regular expression search, so this is a free-text search that disregards possible word boundaries.

Feature Extraction
The autoencoder is applied to the labelled data to obtain a 100-dimensional feature vector. For the final run, it was only applied to the tweets, but we also experiment with applying it to the target (see Section 3.2).
One additional binary feature is used for the final run, targetInTweet, which indicates whether the name of the target is contained in the tweet. The following mapping was used for this purpose: 'Hillary Clinton' → 'hillary', 'clinton'; 'Donald Trump' → 'trump'; 'Climate Change is a Real Concern' → 'climate'; 'Feminist Movement' → 'feminist', 'feminism'; 'Legalization of Abortion' → 'abortion', 'aborting'. Further features, which are not used for the final run, are discussed in Section 3.2.
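The mapping above can be sketched as a keyword lookup; the exact matching rule is not specified in the paper, so the substring match on lower-cased tokens below (which also catches hashtags like "#hillary2016") is an assumption:

```python
TARGET_KEYWORDS = {
    "Hillary Clinton": ("hillary", "clinton"),
    "Donald Trump": ("trump",),
    "Climate Change is a Real Concern": ("climate",),
    "Feminist Movement": ("feminist", "feminism"),
    "Legalization of Abortion": ("abortion", "aborting"),
}

def target_in_tweet(tokens, target):
    # 1 if any keyword for the target occurs inside a (lower-cased) token;
    # substring matching is an illustrative assumption, not the paper's rule.
    keywords = TARGET_KEYWORDS[target]
    return int(any(kw in tok.lower() for tok in tokens for kw in keywords))

flag = target_in_tweet(["#Hillary2016", "rally"], "Hillary Clinton")  # -> 1
```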

Preprocessing
Twitter-based tokenisation is performed with twokenize 4 . Afterwards, tokens are normalised to lower case and stopwords are filtered, using the nltk 5 English stopword list, punctuation characters, plus Twitter-specific stopwords. The latter list is manually created and consists of: "rt", "#semst", "thats", "im", "'s", "...", "via", "http". The first seven have to be exact token matches; the last one has to match the beginning of a token. Finally, phrases are detected, using an unsupervised method that creates 2-grams of commonly occurring expressions such as "hillary clinton", "donald trump", "hate muslims" (Mikolov et al., 2013) 6 . The phrase detection model is trained on all tweets except the test tweets. At application time, if two subsequent tokens are identified as a phrase, those tokens are merged into one token (i.e. "donald", "trump" → "donald trump").
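The Twitter-specific stopword filtering step can be sketched as follows; the nltk English stopword list is assumed to be passed in from outside, and `filter_tokens` is an illustrative name:

```python
import string

TWITTER_STOPWORDS = {"rt", "#semst", "thats", "im", "'s", "...", "via"}  # exact match
PREFIX_STOPWORDS = ("http",)                                             # prefix match

def filter_tokens(tokens, language_stopwords=frozenset()):
    # Lower-case, then drop Twitter stopwords, language stopwords,
    # "http"-prefixed tokens, and pure-punctuation tokens.
    out = []
    for tok in tokens:
        t = tok.lower()
        if t in TWITTER_STOPWORDS or t in language_stopwords:
            continue
        if t.startswith(PREFIX_STOPWORDS):
            continue
        if t and all(c in string.punctuation for c in t):
            continue
        out.append(t)
    return out

kept = filter_tokens(["RT", "Hillary", "http://t.co/x", "!", "wins"], {"the"})
# kept == ["hillary", "wins"]
```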

Experimental Setup
Our development setup is to train on all labelled tweets for the targets "Climate Change is a Real Concern", "Feminist Movement" and "Legalization of Abortion", then evaluate on "Hillary Clinton" tweets. The motivation for this is that Hillary Clinton is the most semantically related target to the Task B test target Donald Trump, since both entities are persons and politicians. For final submission we tuned all settings with this setup, then retrained on all data and applied the model to the test data.

Methods
Our goal is to determine whether including the target is beneficial and, if so, how best to include it. To this end, the following feature sets are evaluated:
• Aut-twe: the autoencoder is applied to the tweet only
• Aut-twe tar: the autoencoder is applied to the tweet and the target, and the target features are concatenated with the tweet features
• Aut-twe*tar: the autoencoder is applied to the tweet and the target, and the outer product of the tweet and target features is used
• InTwe: a boolean "targetInTweet" feature
We also evaluate the impact of traditional sentiment analysis gazetteer features, extracted by checking each word of the tweet for appearance in the gazetteers; these are the Emo and Aff features reported in the Results section. In addition, we experiment with substituting the bag-of-words autoencoder with a word2vec model trained on the same data. We trained a skip-gram model with a dimensionality of 300, a minimum word count of 10 and a context window of 10, with the gensim implementation of word2vec 9 . Word vectors are combined by multiplication to get a fixed-length sentence-level vector. We also report a bag-of-words baseline, which disregards the unlabelled data and extracts unigram and bigram bag-of-words features from the training data. For the word2vec models as well as the bag-of-words baseline, the same preprocessing as for the autoencoder approach is used.
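The four target-inclusion variants listed above can be sketched with numpy (`combine_features` is an illustrative name; the 100-dimensional vectors stand in for the encoder output):

```python
import numpy as np

def combine_features(tweet_vec, target_vec, in_tweet):
    # The four ways of (not) including the target evaluated in the paper.
    return {
        "Aut-twe": tweet_vec,                                       # tweet only
        "Aut-twe tar": np.concatenate([tweet_vec, target_vec]),     # concatenation
        "Aut-twe*tar": np.outer(tweet_vec, target_vec).ravel(),     # outer product
        "Aut-twe+InTwe": np.concatenate([tweet_vec,
                                         [float(in_tweet)]]),       # binary flag
    }

tweet_vec = np.random.default_rng(1).normal(size=100)   # stand-in encoder output
target_vec = np.random.default_rng(2).normal(size=100)
variants = combine_features(tweet_vec, target_vec, 1)
```

The outer-product variant grows quadratically in the feature dimensionality (100 × 100 = 10000 features here), which is one practical reason to prefer concatenation or the single binary flag.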

Results
Results are reported in Tables 1 and 2 for all the experiments above, using the dev setup (Hillary Clinton) and the test setup (Donald Trump). Overall results for dev are significantly better than for test, and F1 for AGAINST is consistently higher than for FAVOR. Performance increases over the baselines are much smaller for test than for dev. The best results on the dev set are achieved with Aut-twe+inTwe, which was therefore chosen for the final run on the test set. However, the best results on the test set are achieved with Aut-twe+inTwe+Emo, which is almost on par with the BoW baseline. The feature that contributes positively to both dev and test performance is inTwe. It was introduced because almost all tweets in the training data that contain the target are either in FAVOR of or AGAINST the target, and are rarely neutral towards it. 363 out of 656 Hillary Clinton dev tweets contain the target, and 309 out of 689 Donald Trump test tweets contain the target. We observed a significant difference in performance between tweets that contain the target and tweets that do not (see Tables 3 and 4). Adding autoencoder features for the target did not improve results for dev. For test, tweet features aggregated with target features slightly outperform target features on their own. As for traditional sentiment analysis features, Emo improves macro F1 for test, but not for dev, and Aff does not improve macro F1 for either.

Conclusions and Future Work
To conclude, we showed that it is important to detect whether the target is mentioned in the tweet, and that a bag-of-words autoencoder can help to detect stance towards unseen targets. Further, developing a stance detection method for new targets without any labelled training data is challenging: we found that there are some discrepancies between which features perform well on a development versus a test set. In future work we will investigate how to better incorporate the target for stance detection, as this target dependence is crucial for capturing that the same tweet can have different stance with respect to different targets that are not mentioned in the tweet.