tWT–WT: A Dataset to Assert the Role of Target Entities for Detecting Stance of Tweets

The stance detection task aims at detecting the stance of a tweet or a text for a target. These targets can be named entities or free-form sentences (claims). Though the task involves reasoning over the tweet with respect to a target, we find that it is possible to achieve high accuracy on several publicly available Twitter stance detection datasets without looking at the target sentence. Specifically, a simple tweet classification model achieved human-level performance on the WT-WT dataset and more than two-thirds accuracy on various other datasets. We investigate the existence of biases in such datasets, finding potential spurious correlations of sentiment with stance and of lexical choice with stance category. Furthermore, we propose a new large dataset free of such biases and demonstrate its aptness by re-evaluating existing stance detection systems on it. Our empirical findings show much scope for research on the stance detection task and propose several considerations for creating future stance detection datasets.


Introduction
Stance detection is a vital sub-task for fake news detection (Pomerleau and Rao, 2017), automated fact checking (Vlachos and Riedel, 2014; Ferreira and Vlachos, 2016), social media analysis (Zhang et al., 2017), analyzing online debates (Bar-Haim et al., 2017) and rumour verification (Derczynski et al., 2017; Gorrell et al., 2019). Furthermore, it is also an essential measure of progress in Natural Language Understanding, especially in the noisy-text domain.
In recent years, several stance detection datasets have been proposed. These datasets, in turn, facilitated progress in stance detection research, with some systems achieving up to 93.7% accuracy (Dulhanty et al., 2019). However, most of these state-of-the-art systems are complex deep neural networks, making them difficult to interpret. This lack of explainability raises concern, since previous works (Gururangan et al., 2018; Goyal et al., 2017; Cirik et al., 2018; Geva et al., 2019) on other tasks demonstrated that superficial dataset biases can result in inflated test-set performance. With this motivation, we carry out the first study analyzing several publicly available Twitter stance detection datasets. Our experiments reveal rampant biases in these datasets, through which even target-oblivious models can achieve impressive performance.
Various existing works have hinted at the presence of such dataset biases. For example, the TAN model (Du et al., 2017) is a very competitive stance detection model; however, Ghosh et al. (2019) recently showed that TAN does not take advantage of target information at all. In RumourEval-2017 (Derczynski et al., 2017), models delivered up to 0.74 accuracy without any knowledge of the target, falling short of the best context-aware model by only 0.004. Similarly, in RumourEval-2019 (Gorrell et al., 2019), the runner-up model (Fajcik et al., 2019) observed a 0.43 decrease in accuracy when considering the target information. Schiller et al. (2020) discovered that stance detection models are prone to adversarial attacks such as paraphrasing, spelling errors and negation, much like models on other NLP tasks (Ribeiro et al., 2020). However, ours is the first work providing a detailed insight into the alarmingly impressive performance of target-oblivious models.
The target plays a crucial role in deciding stance. Consider the example in Figure 1, where the tweet's stance varies across the two targets. The existing datasets have very few examples whose labels differ across targets. Models can therefore pick up spurious signals in the tweet content and shortcut the task without looking at the targets. These signals, or biases, arise from inherent biases in our language and human nature; for example, certain lexical choices can correlate with their respective stance classes. Upon discovering and studying such correlations, we augment the WT-WT dataset to address these issues and re-evaluate stance detection systems.
We make the following contributions. We empirically demonstrate biases across a variety of Twitter stance detection datasets and carry out a detailed analysis of these datasets. Consequently, we propose a new large scale dataset free of such spurious cues and re-evaluate the stance detection systems to show the usefulness of this dataset.

Biases in Stance Detection Datasets
We first discuss the datasets considered (§2.1), followed by our experiments (§2.2) and analysis (§2.3).

Datasets Considered
We consider a wide variety of publicly available Twitter stance detection datasets, including cross-target, multi-target and rumour-claim variants of stance detection. These datasets have a diverse set of targets, ranging from free-form sentences to fixed target entities.
Here, however, we only study the English Twitter stance detection tasks in fully supervised learning settings. Specifically, we consider 6 datasets: WT-WT (Conforti et al., 2020), SE16 (task-A) (Mohammad et al., 2016b,a), M-T (Sobhani et al., 2017), RE17 (Derczynski et al., 2017), RE19 (Gorrell et al., 2019) and Encryption (Addawood et al., 2017), with their statistics reported in Table 7. The last column of this table reports the percentage of tweets in each dataset labelled for different targets (DT), given by DT/T. We can see that these datasets have very few tweets annotated for multiple targets. The M-T dataset's targets are pairs of politicians, and each of its tweet-target pairs is labelled with a pair of stances; we formulate detecting these two stances as separate tasks for the experiments in the following section.
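For concreteness, the DT/T statistic can be computed directly from the raw annotations. The sketch below assumes the data is available as (tweet, target, stance) tuples; the field names are ours, not any dataset's actual schema.

```python
from collections import defaultdict

def dt_over_t(triplets):
    """Fraction of tweets annotated for more than one distinct target.

    `triplets` is an iterable of (tweet, target, stance) tuples.
    """
    targets_per_tweet = defaultdict(set)
    for tweet, target, _stance in triplets:
        targets_per_tweet[tweet].add(target)
    # DT = tweets labelled for multiple targets; T = total tweets.
    n_multi = sum(1 for tgts in targets_per_tweet.values() if len(tgts) > 1)
    return n_multi / len(targets_per_tweet)
```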

Performance of Target-Oblivious Models
Method: Given a tuple (tweet, target, stance), a target-oblivious classifier f (tweet) → stance is trained in a supervised setting. It is expected that such a classifier would generalize poorly on an unbiased dataset. We set this target-oblivious classifier to be the standard Bert classifier (Devlin et al., 2019). Results on the other datasets are shown in Table 3. We compare these results against random guessing, predicting the majority class, and the target-aware Bert. Additionally, the RE17, RE19 and Encryption datasets are heavily skewed, so Macro-F1 is the proposed metric (Gorrell et al., 2019).
The target-oblivious Bert consistently delivers more than two-thirds classification accuracy across all these datasets. It achieves impressive performance on all metrics on the SE16 and M-T datasets, while also performing significantly above the majority class on the Macro-F1 metric for the datasets with skewed distributions. Its performance is also very close to that of the target-aware Bert model on every metric. These surprising numbers across all the datasets indicate the presence of spurious cues that encourage models to bypass the need to look at the target.
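For concreteness, a minimal sketch of such a target-oblivious classifier using Huggingface transformers follows; the checkpoint name and the four-way label set are illustrative placeholders, not our exact training configuration (which is described in Appendix C):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# A sketch of the target-oblivious classifier f(tweet) -> stance:
# the model only ever sees the tweet text, never the target.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=4)  # e.g. Support/Refute/Comment/Unrelated

def predict_stance(tweet: str) -> int:
    # Note: the target is deliberately absent from the input.
    inputs = tokenizer(tweet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))
```

A target-aware variant would instead encode the pair, e.g. `tokenizer(tweet, target, return_tensors="pt")`, letting Bert attend across both segments.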

Dataset Analysis
After the findings of the previous section, we sought to discover the form in which the spurious cues exist and to use those findings to create a new dataset. We mainly consider the largest and most recent dataset, WT-WT, for this analysis. We first discuss target-independent lexical choices associated with stance, followed by target-independent sentiment-stance correlations.

Stance and tweet lexicons: We compute the pointwise mutual information (PMI) between tweet words and stance classes, following the exact procedure of Gururangan et al. (2018), after removing stopwords. Table 4 shows the top 5 words per stance along with the fraction of tweets containing them. We observe that certain groups of target-independent lexicons are highly correlated with stances, in some cases occurring in more than 29% of the tweets. For the Support and Refute classes respectively, we find words indicative of the merger status, such as 'approves' or 'blocks'. Comments on these health companies' mergers often discuss their impact, leading to lexical choices such as 'healthcare' and 'mean' for this stance. Similarly, Unrelated tweets often discuss things related to the companies but not to the merger operation itself, such as 'stocks' or 'bids'.
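A simplified sketch of this computation follows; it uses plain whitespace tokenization and omits the smoothing details of Gururangan et al. (2018):

```python
import math
from collections import Counter

def pmi_by_stance(tweets, stances, stopwords=frozenset()):
    """PMI(word, stance) over a corpus of tweets and their stance labels."""
    n = len(tweets)
    word_c, joint_c, stance_c = Counter(), Counter(), Counter(stances)
    # Count in how many tweets each word occurs, overall and per stance.
    for tweet, stance in zip(tweets, stances):
        for w in set(tweet.lower().split()) - stopwords:
            word_c[w] += 1
            joint_c[(w, stance)] += 1
    # PMI(word, stance) = log [ p(word, stance) / (p(word) p(stance)) ]
    return {
        (w, s): math.log((joint_c[(w, s)] / n) /
                         ((word_c[w] / n) * (stance_c[s] / n)))
        for (w, s) in joint_c
    }
```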
Sentiment-stance correlation: Stance detection differs from the sentiment analysis task (Mohammad et al., 2016b). However, we observe a strong correlation of sentiment with stance. Formally, we obtain a sentiment score between 0 (negative) and 1 (positive) for each tweet using an XLNet model (Yang et al., 2019) trained on SST (Socher et al., 2013; Pang and Lee, 2005) and Imdb (Maas et al., 2011). The average sentiment scores of the tweets across the Support, Refute, Comment and Unrelated stances were 0.237, 0.657, 0.492 and 0.485 respectively, with variances of 0.087, 0.056, 0.110 and 0.108. Tweets with the Support and Refute stances thus have strongly negative or positive sentiment on average, while the other two classes are neutral on average but with high variance. This serves as strong evidence of stance-sentiment correlation. Thus sentiment and lexicons together constitute some of the spurious cues in the WT-WT dataset. We found such cues in the remaining datasets as well, varying with their domains. For example, in RE19 a question mark appears in more than 75% of 'query' stance tweets, while it is present in only 11% of the rest of the dataset. Similarly, 75% of tweets with the 'deny' stance have a highly negative sentiment score of less than 0.1. In the SE16 dataset, 91.4% of tweets without any opinion had the 'None' stance, despite stance detection being a different task from opinion mining (Mohammad et al., 2016b).
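The measurement itself can be sketched as below; the default sentiment pipeline checkpoint is only a stand-in for the XLNet model used above:

```python
import statistics
from collections import defaultdict
from transformers import pipeline

# Stand-in scorer; the paper uses an XLNet model trained on SST and
# Imdb, whereas the default pipeline checkpoint is only illustrative.
sentiment = pipeline("sentiment-analysis")

def stancewise_sentiment(tweets, stances):
    """Mean and variance of sentiment scores, grouped by stance label."""
    scores = defaultdict(list)
    for tweet, stance in zip(tweets, stances):
        out = sentiment(tweet)[0]
        # Map to [0, 1]: 0 = fully negative, 1 = fully positive.
        p = out["score"] if out["label"] == "POSITIVE" else 1 - out["score"]
        scores[stance].append(p)
    return {s: (statistics.mean(v), statistics.variance(v))
            for s, v in scores.items()}
```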

The Targeted WT-WT (tWT-WT) dataset
With the understanding from the previous section, we propose a new stance detection dataset on which target-unaware models will not perform well. Our reasoning is as follows: if tweets in the dataset have different stances for different targets, then simple tweet classification models will not be able to perform well. We thus attempt to increase the DT/T ratio from Table 7. Formally, we take the WT-WT dataset, the largest dataset of its kind with high-quality expert labels (0.88 Cohen-κ; Cohen, 1960), and generate new (tweet, target, stance) triplets in three ways. First, we attempt to remove the sentiment-stance correlation by making the stance-wise average sentiment neutral. The WT-WT dataset has 5 targets, one for each merger. We introduce 5 additional targets which are negations of the original ones. Formally, if a tweet has a Support (Refute) stance towards the target CVS_AET, then its stance towards the negated target NEG_CVS_AET is inverted to Refute (Support). This is done only for the two stance classes with non-neutral average sentiment scores; introducing such negated targets brings their average sentiment to near neutral. Second, we remove lexicon-stance correlations by creating multiple targets with different stances for each tweet. Formally, for each tweet t with only one labelled target tgt, if the tweet-target pair (t, tgt) has stance ≠ 'Unrelated', we pick a target tgt′ with tgt′ ≠ tgt and add the tuple (t, tgt′, Unrelated) to the dataset. Due to the WT-WT data collection and annotation procedure, this does not generate any wrong labels. This augmentation reduces the lexicon-stance correlations by introducing similar sets of lexicons for different stances. Hence, it ensures that target-oblivious shortcuts result in poor performance.
Last, we balance the target-wise class distributions: for the tuples with 'Comment' and 'Unrelated' stances, we create a new tuple with an inverted target (as in the first step) for 50% and 75% of such examples respectively, chosen randomly.
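A minimal sketch of the first two steps, assuming (tweet, target, stance) tuples with lowercase stance labels (an assumption on our part) and omitting the random 50%/75% sampling of the third step:

```python
import random

INVERT = {"support": "refute", "refute": "support"}

def augment(triplets, targets):
    """Sketch of the first two augmentation steps on (tweet, target,
    stance) triplets; the balancing step is omitted."""
    new = []
    for tweet, tgt, stance in triplets:
        # Step 1: a negated target (e.g. NEG_CVS_AET for CVS_AET)
        # inverts Support/Refute, neutralising stance-wise sentiment.
        if stance in INVERT:
            new.append((tweet, "NEG_" + tgt, INVERT[stance]))
        # Step 2: pair each non-Unrelated tweet with a different target
        # as Unrelated; WT-WT's per-merger collection makes this safe.
        if stance != "unrelated":
            other = random.choice([t for t in targets if t != tgt])
            new.append((tweet, other, "unrelated"))
    return triplets + new
```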
The resulting dataset contains 111,596 tweet-target pairs, each belonging to a stance class. Each merger has at least 10,000 data points. The class distribution is also somewhat balanced, with more than 10k examples for the least frequent class. Among the tweet-target pairs, those labelled Support, Refute, Comment and Unrelated are distributed in the approximate ratio 1:1:3:5, similar to the distribution of the WT-WT dataset.

Re-evaluating stance detection systems
We propose a cross-target evaluation setting for tWT-WT similar to that of WT-WT. For the in-domain (health) mergers, we train on three health mergers (six targets in total, including the negated target for each merger) and test on the fourth health merger. For the out-of-domain evaluation, we train on the eight targets corresponding to the 4 health mergers and test on the two targets of the entertainment merger.
We re-evaluate the existing stance detection models on the tWT-WT dataset. We consider Bert (with target) and the target-oblivious Bert from §2.2, along with the two strongest baselines from the WT-WT paper: SiamNet (Santosh et al., 2019) and TAN (Du et al., 2017). For the SiamNet and TAN models, we replace the Glove (Pennington et al., 2014) and LSTM (Hochreiter and Schmidhuber, 1997) features with better features from Bert. Table 5 shows the performance of these models. Bert (no-target) gives very low performance, showing that target-oblivious models perform poorly on this dataset. Similarly, TAN, which has been shown not to take advantage of target information (Ghosh et al., 2019), also performs very poorly. The target-aware Bert offers the most competitive performance, yet reaches only a 0.51 F1 score; SiamNet follows at 0.31 F1. Both models see their performance reduced significantly relative to the WT-WT dataset.

Conclusion and Future Work
In this paper, we demonstrated the presence of biases across several Twitter stance detection datasets, which aid simple tweet classifiers in achieving impressive performance. We investigated the presence of bias in the WT-WT dataset and found correlations of stance class with sentiment and lexical choice. Consequently, we proposed a new bias-free stance detection dataset, tWT-WT, the largest of its kind. Evaluation of our baselines on this new dataset demonstrates scope for future research on stance detection. These observations are also crucial for the creation of new stance detection datasets. Our future work includes analysing multilingual datasets and exploring explainable target-aware stance detection models.

A Appendix
We release our code and pre-trained models for section §2 at https://github.com/Ayushk4/bias-stance. Our dataset and baselines for section §3 are released at https://github.com/Ayushk4/stance-dataset. The Readme of each repository contains instructions for setting up the environment, replicating the results, and links to the pre-trained models.
In this appendix, we first discuss the baselines (§B), followed by our experimental setup (§C) and the datasets considered (§D).

B Baselines

• SiamNet (Santosh et al., 2019) is shown in Figure 4a. It uses siamese networks (Bromley et al., 1993) to learn one representation each for the tweet and the target, and classifies through the bottleneck of a single scalar produced by a similarity function. Like the WT-WT paper (Conforti et al., 2020), we find that this scalar alone is not a strong enough feature for a classifier. Hence, we concatenate the tweet and target representation vectors with the output of the similarity function (the inverse exponential of the Manhattan distance), following Mueller and Thyagarajan (2016); see the sketch after this list. We replaced the Glove and BiLstm features with Bert embeddings, obtaining the tweet and target representations from the '[CLS]' vectors of Bert for the respective sentences.
• TAN (Du et al., 2017) is shown in Figure 4b. It uses target-specific attention over the tweet features obtained from a BiLstm, similar to Dey et al. (2018). We use the same replacement of Glove and BiLstm features with Bert features as for SiamNet.
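A minimal sketch of the modified SiamNet head described above, operating on batches of '[CLS]' vectors from the shared Bert encoder (the classification layer on top is omitted):

```python
import torch

def siamnet_features(h_tweet: torch.Tensor, h_target: torch.Tensor):
    """Build SiamNet features from tweet/target '[CLS]' vectors of
    shape (batch, hidden)."""
    # sim = exp(-||h_tweet - h_target||_1), a scalar in (0, 1] per pair,
    # following Mueller and Thyagarajan (2016).
    sim = torch.exp(-torch.norm(h_tweet - h_target, p=1, dim=-1,
                                keepdim=True))
    # Concatenating the representations avoids the single-scalar
    # bottleneck; output shape is (batch, 2 * hidden + 1).
    return torch.cat([h_tweet, h_target, sim], dim=-1)
```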

C Experimental Setup
All our experiments were performed using Pytorch (Paszke et al., 2019), wandb (Biewald, 2020) and Huggingface (Wolf et al., 2019). The optimization algorithm used was Adam (Kingma and Ba, 2014). We keep the Bert layers and embeddings trainable. For SiamNet and TAN, the Bert parameters are hard-shared. Experiments take less than 10 minutes per epoch and less than 5 GB of GPU memory on a Tesla P100 GPU. The total number of model parameters is approximately the same as Bert for all models (including SiamNet and TAN). Following previous work demonstrating the benefit of domain-specific weights, we use BERTweet (Nguyen et al., 2020), with the exception of the target-aware Bert on tWT-WT, where it was found to be unstable; there we used bert-base-cased instead.
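Hard sharing here means a single Bert instance encodes both inputs, as in this minimal sketch (checkpoint name illustrative):

```python
import torch.nn as nn
from transformers import BertModel

class SharedBertPair(nn.Module):
    """Hard parameter sharing: one Bert encodes both tweet and target."""

    def __init__(self, name: str = "bert-base-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)  # single shared copy

    def forward(self, tweet_inputs, target_inputs):
        # Both encodings reuse the same weights, so SiamNet/TAN add
        # almost no parameters beyond Bert itself.
        h_tweet = self.bert(**tweet_inputs).pooler_output
        h_target = self.bert(**target_inputs).pooler_output
        return h_tweet, h_target
```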

C.1 Hyperparameters
We use Huggingface's default Bert config for our experiments. We tuned the learning rate over 5 values, {1e-6, 3e-6, 1e-5, 3e-5, 1e-4}, and the number of epochs over 3 values, {2, 5, 10}, on the development set. The batch size was fixed at 16. For the datasets with no development split, we use 5-fold cross-validation. We trained and evaluated our models in the same settings as proposed for their respective datasets.
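The search amounts to 15 runs per dataset, as sketched below; `train_and_eval` is a hypothetical entry point standing in for our actual training loop:

```python
import itertools

# The searched grid: 5 learning rates x 3 epoch counts = 15 runs per
# dataset, selected on the development set; batch size fixed at 16.
LEARNING_RATES = [1e-6, 3e-6, 1e-5, 3e-5, 1e-4]
NUM_EPOCHS = [2, 5, 10]

def search(train_and_eval):
    """`train_and_eval(lr, epochs, batch_size)` -> dev score is a
    caller-supplied (hypothetical) training function."""
    best = max(itertools.product(LEARNING_RATES, NUM_EPOCHS),
               key=lambda cfg: train_and_eval(lr=cfg[0], epochs=cfg[1],
                                              batch_size=16))
    return best  # (learning rate, epochs) with the highest dev score
```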

C.2 Preprocessing
We use the ekphrasis library (Baziotis et al., 2017) for preprocessing. We perform word tokenization and spelling correction. We also remove URLs, emoji and non-ascii characters, and normalize to limit the input length. For the RumourEval 2019 dataset, we trim the input to 99 tokens, since Reddit posts can exceed 500 characters in length.
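An illustrative ekphrasis configuration consistent with the steps above; the exact flags are an assumption on our part, not verified settings:

```python
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

# Illustrative configuration: flag choices are assumptions.
text_processor = TextPreProcessor(
    normalize=['url', 'email', 'user', 'number'],  # limits input length
    fix_html=True,
    segmenter="twitter",        # word statistics from Twitter
    corrector="twitter",        # spelling-correction statistics
    spell_correct_elong=True,   # fix elongated words, e.g. "soooo"
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

tokens = text_processor.pre_process_doc(
    "RT @user: merger approved!! soooo good")
# Emoji/non-ascii removal and the 99-token trimming for RumourEval
# 2019 are applied as separate steps.
```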

D Datasets
For the datasets that released only the tweet ids, we obtain the tweet text using the Twitter API. However, some tweets become inaccessible over time as accounts or tweets get banned/blocked/deleted.

• The SemEval 2016 Task-A (SE16) dataset (Mohammad et al., 2016b,a) is a Twitter stance detection dataset with fixed targets. Each tweet-target pair is labelled with one of 3 stances from {Against, None, Favor}. The dataset has a somewhat balanced class distribution, and the metrics considered were Accuracy and F1. In this task, models were evaluated on the same targets they were trained on.
• The Multi-target (M-T) stance dataset (Sobhani et al., 2017) contains 4455 tweets from the political domain. Each target is a fixed pair of political entities: Hillary-Sanders, Hillary-Trump or Cruz-Trump. Each tweet-target pair has two stances, one per entity, each from {Against, None, Favor}; in pairs, this leads to 9 possible combinations of the 3 labels. We treat the pair as two separate problems and train two separate models. The dataset has a somewhat balanced class distribution, and models were evaluated on the same target pairs they were trained on, using the Accuracy and F1 metrics.
• RumourEval 2017 (Derczynski et al., 2017) was a rumour-stance detection task that proposed a new dataset. The dataset consists of 285 rumoured tweet threads with a total of 4519 tweets. The root node of each thread is the rumour target, to which users replied, creating a response thread with a tree structure. The tweet-target pairs are labelled with one of four stance classes: {Support, Query, Comment, Deny}. The dataset has a very skewed distribution, with the majority class (Comment) covering about 80% of examples, so Macro-Averaged F1 is a suitable metric. Here, models were evaluated on different threads (and hence different targets) than they were trained on.
• RumourEval 2019 (Gorrell et al., 2019) was similar to the RumourEval 2017 task. It extended the dataset to include Reddit threads from selected subreddits, resulting in a total of 8574 datapoints. The tweet-target pairs are labelled with the same four stance classes: {Support, Query, Comment, Deny}. This dataset also has a very skewed distribution, with the majority class (Comment) covering about 80% of examples, so Macro-Averaged F1 is a suitable metric.
• The Encryption Debate dataset (Addawood et al., 2017) consists of 2999 tweets labelled with three stances, {For, Against, Neutral}, on one encryption debate topic. We observed repeated entries in the dataset, including some with conflicting labels for the same tweet-target pair; we excluded such tweets from our experiments. Additionally, only 5 tweets in the dataset belong to the 'against' class. Since 5 examples is too few for most machine learning models to learn from, we exclude this class from our analysis. The dataset has a very skewed distribution, with the majority class (neutral) covering about 86% of examples, so Macro-Averaged F1 is a suitable metric. The dataset has only one target for both training and evaluating the models.