Will-They-Won’t-They: A Very Large Dataset for Stance Detection on Twitter

We present a new challenging stance detection dataset, called Will-They-Won't-They (WT-WT), which contains 51,284 tweets in English, making it by far the largest available dataset of the type. All the annotations are carried out by experts; therefore, the dataset constitutes a high-quality and reliable benchmark for future research in stance detection. Our experiments with a wide range of recent state-of-the-art stance detection systems show that the dataset poses a strong challenge to existing models in this domain.


Introduction
Apart from constituting an interesting task on its own, stance detection has been identified as a crucial sub-step towards many other NLP tasks (Mohammad et al., 2017). In fact, stance detection is the core component of fake news detection (Pomerleau and Rao, 2017), fact-checking (Vlachos and Riedel, 2014; Baly et al., 2018), and rumor verification (Zubiaga et al., 2018b).
Despite its importance, stance detection suffers from the lack of a large dataset which would allow for reliable comparison between models. We aim at filling this gap by presenting Will-They-Won't-They (WT-WT), a large dataset of English tweets targeted at stance detection for the rumor verification task. We constructed the dataset based on tweets, since Twitter is a highly relevant platform for rumour verification, which is popular with the public as well as politicians and enterprises (Gorrell et al., 2019).
To make the dataset representative of a realistic scenario, we opted for a real-world application of the rumor verification task in finance. Specifically, we constructed the dataset based on tweets that discuss merger and acquisition (M&A) operations between companies. M&A is a general term that refers to various types of financial transactions in which the ownership of companies is transferred. An M&A process has many stages that range from informal talks to the closing of the deal. The discussions between companies are usually not publicly disclosed during the early stages of the process (Bruner and Perella, 2004; Piesse et al., 2013). In this sense, the analysis of the evolution of opinions and concerns expressed by users about a possible M&A deal, from its early stage to its closing (or its rejection) stage, is a process similar to rumor verification (Zubiaga et al., 2018a).
1 https://en.wiktionary.org/wiki/will-they-won%27t-they 2 https://github.com/cambridge-wtwt/acl2020-wtwt-tweets
Moreover, despite the wide interest, most research in the intersection of NLP and finance has so far focused on sentiment analysis, text mining and thesauri/taxonomy generation (Fisher et al., 2016; Hahn et al., 2018; El-Haj et al., 2018). While sentiment (Chan and Chong, 2017) and targeted-sentiment analysis (Chen et al., 2017) have an undisputed importance for analyzing financial markets, research in stance detection takes on a crucial role: in fact, being able to model the market's perception of the merger might ultimately contribute to explaining stock price re-valuation.
We make the following three contributions. Firstly, we construct and release WT-WT, a large, expert-annotated Twitter stance detection dataset. With its 51,284 tweets, the dataset is an order of magnitude larger than any other stance detection dataset of user-generated data, and could be used to train and robustly compare neural models. To our knowledge, this is the first resource for stance in the financial domain. Secondly, we demonstrate the utility of the WT-WT dataset by evaluating 11 competitive and state-of-the-art stance detection models on our benchmark. Thirdly, we annotate a further M&A operation in the entertainment domain; we investigate the robustness of best-performing models on this operation, and show that such systems struggle even over small domain shifts. The entire dataset is released to enable research in stance detection and domain adaptation.

Building the WT-WT Dataset
We consider five recent operations, 4 in the healthcare and 1 in the entertainment industry (Table 1).

Data Retrieval
For each operation, we used Selenium 3 to retrieve IDs of tweets matching one of the following sets of keywords: mentions of both companies' names or acronyms, and mentions of one of the two companies together with a set of merger-specific terms (refer to Appendix A.1 for further details). Based on historically available information about M&As, we sampled messages from one year before the proposed merger's date up to six months after the merger took place. Finally, we obtained the text of each tweet by crawling for its ID using Tweepy 4.
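The retrieval criteria above can be illustrated with a minimal sketch. It only builds the keyword combinations and the sampling window; it does not call Selenium or Tweepy. The function names, the example company aliases, the merger terms and the dates are all hypothetical placeholders (the actual keyword lists are those of Appendix A.1).

```python
import calendar
from datetime import date
from typing import List, Tuple


def shift_months(d: date, months: int) -> date:
    """Shift a date by a (possibly negative) number of months, clamping the day."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def sampling_window(proposed: date, completed: date) -> Tuple[date, date]:
    """One year before the proposed merger date up to six months after completion."""
    return shift_months(proposed, -12), shift_months(completed, 6)


def keyword_queries(names_a: List[str], names_b: List[str],
                    merger_terms: List[str]) -> List[str]:
    """Two keyword sets: both companies mentioned, or one company plus a merger term."""
    both = [f"{a} {b}" for a in names_a for b in names_b]
    one_plus_term = [f"{n} {t}" for n in names_a + names_b for t in merger_terms]
    return both + one_plus_term


# Hypothetical inputs, for illustration only.
queries = keyword_queries(["Cigna", "CI"], ["Express Scripts", "ESRX"], ["merger"])
window = sampling_window(date(2018, 3, 8), date(2018, 12, 20))
```

Each query string would then be paired with the window bounds when searching for tweet IDs.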

Task Definition and Annotation Guidelines
The annotation process was preceded by a pilot annotation, after which the final annotation guidelines were written in close collaboration with three domain experts. We followed the convention in Twitter stance detection (Mohammad et al., 2017) and considered three stance labels: support, refute and comment. We also added an unrelated tag, obtaining the following label set:
1. Support: the tweet is stating that the two companies will merge.
[CI_ESRX] Cigna to acquire Express Scripts for $52B in health care shakeup via usatoday
2. Refute: the tweet is voicing doubts that the two companies will merge.
[AET_HUM] Federal judge rejects Aetna's bid to buy Louisville-based Humana for $34 billion
3. Comment: the tweet is commenting on the merger, neither directly supporting nor refuting it.
[CI_ESRX] Cigna-Express Scripts deal unlikely to benefit consumers
4. Unrelated: the tweet is unrelated to the merger.
[CVS_AET] Aetna Announces Accountable Care Agreement with Weill Cornell Physicians
The obtained four-class annotation schema is similar to those in other corpora for news stance detection (Hanselowski et al., 2018; Baly et al., 2018). Note that, depending on the given target, the same sample can receive a different stance label:
• Merger hopes for Aetna-Humana remain, Anthem-Cigna not so much.
[AET_HUM] → support
[ANTM_CI] → refute
As observed in Mohammad et al. (2017), stance detection is different from, but closely related to, targeted sentiment analysis, which considers the emotions conveyed in a text (Alhothali and Hoey, 2015). To highlight this subtle difference, consider the following sample:
• [CVS_AET] #Cancer patients will suffer if @CVSHealth buys @Aetna CVS #PBM has resulted in delfays in therapy, switches, etc -all documented. Terrible!
While its sentiment towards the target operation is negative (the user believes that the merger will be harmful for patients), following the guidelines, its stance should be labeled as comment: the user is talking about the implications of the operation, without expressing a view on whether the merger will happen or not. Refer to Appendix A.2 for a detailed description of the four considered labels.
3 www.seleniumhq.org 4 www.tweepy.org/

Data Annotation
During the annotation process, each tweet was independently labeled by 2 to 6 annotators. Ten experts in the financial domain were employed as annotators 5. Annotators received tweets in batches of 2,000 samples at a time, and were asked to annotate no more than one batch per week. The entire annotation process lasted 4 months. In case of disagreement, the gold label was obtained through majority vote, discarding samples where this was not possible (0.2% of the total).
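The aggregation step can be sketched as follows. This is a minimal illustration, assuming that the most frequent label wins and that a tie for first place counts as a sample for which no vote is possible; the exact tie-handling rule is not detailed here, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Optional


def gold_label(annotations: List[str]) -> Optional[str]:
    """Return the most frequent label, or None when the vote cannot be resolved."""
    if not annotations:
        return None
    ranked = Counter(annotations).most_common()
    # Assumption: a tie for first place means no gold label can be assigned.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


votes = [
    ["support", "support", "comment"],  # resolved by the vote
    ["support", "comment"],             # tie: the sample would be discarded
]
gold = [gold_label(v) for v in votes]
```

Samples for which the function returns `None` correspond to the discarded portion of the data.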
To estimate the quality of the obtained corpus, a further domain expert labeled a random sample of 3,000 tweets, which was used as the human upper bound for evaluation (Table 4). Cohen's κ between those labels and the gold is 0.88. This is well above the agreement obtained in previously released datasets where crowd-sourcing was used (the agreement scores reported, in terms of percentage, range from 63.7% (Derczynski et al., 2017) to 79.7% (Inkpen et al., 2017)). Support-comment samples constitute the most common source of disagreement between annotators: this might indicate that such samples are the most subjective to discriminate, and might also contribute to explaining the high number of misclassifications between those classes which have been observed in other research efforts on stance detection (Hanselowski et al., 2018). Moreover, compared with stance datasets where unrelated samples were randomly generated (Pomerleau and Rao, 2017; Hanselowski et al., 2018), we report a slightly higher disagreement between unrelated and comment samples, indicating that our task setting is more challenging.
6 The average κ was weighted by the number of samples annotated by each pair. The standard deviation of the κ scores between single annotator pairs is 0.074.
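For reference, Cohen's κ between two label sequences can be computed as in the sketch below. It follows the standard definition, κ = (p_o − p_e) / (1 − p_e), with p_o the observed agreement and p_e the agreement expected by chance; the function name and the toy labels are illustrative and are not taken from the dataset.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa between two annotators over the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of samples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy example with made-up labels.
kappa = cohen_kappa(["support", "support", "comment", "comment"],
                    ["support", "support", "comment", "support"])
```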

Label Distribution
The distribution of obtained labels for each operation is reported in Table 2. Differences in label distribution between events are usual, and have been observed in other stance corpora (Mohammad et al., 2016a; Kochkina et al., 2018). For most operations, there is a clear correlation between the relative proportion of refuting and supporting samples and the merger being approved or blocked by the US Department of Justice. Commenting tweets are more frequent than supporting ones across all operations: this is in line with previous findings in financial microblogging (Žnidaršič et al., 2018).

Comparison with Existing Corpora
The first dataset for Twitter stance detection collected 4,870 tweets on 6 political events (Mohammad et al., 2016a) and was later used in SemEval-2016 (Mohammad et al., 2016b). Using the same annotation schema, Inkpen et al. (2017) released a corpus on the 2016 US election annotated for multi-target stance. In the scope of PHEME, a large project on rumor resolution (Derczynski and Bontcheva, 2014), Zubiaga et al. (2015) stance-annotated 325 conversational trees discussing 9 breaking news events. The dataset was used in RumourEval 2017 (Derczynski et al., 2017) and was later extended with 1,066 tweets for RumourEval 2019 (Gorrell et al., 2019). Following the same procedure, Aker et al. (2017) annotated 401 tweets on mental disorders (Table 3).
This makes the proposed dataset by far the largest publicly available dataset for stance detection on user-generated data. In contrast with Mohammad et al. (2016a), Inkpen et al. (2017) and PHEME, where crowd-sourcing was used, only highly skilled domain experts were involved in the annotation process of our dataset. Moreover, previous work on stance detection focused on a relatively narrow range of mainly political topics: in this work, we widen the spectrum of considered domains in the stance detection research with a new financial dataset. For these reasons, the WT-WT dataset constitutes a high-quality and robust benchmark for the research community to train and compare performance of models and their scalability, as well as for research on domain adaptation. Its large size also allows for pre-training of models, before moving to domains with data scarcity.

Experiments and Results
We re-implement 11 architectures recently proposed for stance detection. Each system takes as input a tweet and the related target, represented as a string with the two considered companies. A detailed description of the models, with references to the original papers, can be found in Appendix B.1. Each architecture produces a single vector representation h for each input sample. Given h, we predict ŷ with a softmax operation over the 4 considered labels.
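The prediction step can be sketched as below. The text only specifies a softmax over the 4 labels given h; the linear projection (`weights`, `bias`) that maps h to four logits, the label order and the toy values are assumptions made for illustration.

```python
import math
from typing import List, Tuple

LABELS = ["support", "refute", "comment", "unrelated"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h: List[float], weights: List[List[float]],
            bias: List[float]) -> Tuple[str, List[float]]:
    """Project h to one logit per label (assumed linear layer), then apply softmax."""
    logits = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs


# Toy 2-dimensional representation and hand-picked parameters.
label, probs = predict([1.0, 0.0],
                       [[2.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],
                       [0.0, 0.0, 0.0, 0.0])
```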

Experimental Setup
We perform common preprocessing steps, such as URL and username normalization (see Appendix B.2). All hyper-parameters are listed in Appendix B.1 for replication. In order to allow for a fair comparison between models, they are all initialized with GloVe embeddings pretrained on Twitter 7 (Pennington et al., 2014), which are shared between tweets and targets and kept fixed during training.
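As an illustration of URL and username normalization, a minimal sketch is given below. The regular expressions and the placeholder tokens `<url>` and `<user>` are assumptions; the exact steps used in our experiments are those of Appendix B.2.

```python
import re

# Assumed patterns: http(s) links or www-prefixed links, and @-mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Replace URLs and usernames with placeholder tokens and collapse whitespace."""
    text = URL_RE.sub("<url>", text)
    text = USER_RE.sub("<user>", text)
    return re.sub(r"\s+", " ", text).strip()


example = normalize_tweet("@CVSHealth buys @Aetna  https://t.co/abc")
```

Mapping every link and mention to a single token keeps the vocabulary small and lets such tokens share one embedding.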

Results and Discussion
Results of experiments are reported in Table 4. Despite its simple architecture, SiamNet obtains the best performance in terms of both averaged and weighted averaged F1 scores. In line with previous findings (Mohammad et al., 2017), the SVM model constitutes a very strong and robust baseline. The relative gains in performance of CrossNet w.r.t. BiCE, and of HAN w.r.t. TAN, consistently reflect results obtained by such models on the SemEval 2016-Task 6 corpus (Xu et al., 2018; Sun et al., 2018).
Moving to single-label classification, analysis of the confusion matrices shows a considerable number of misclassifications between the support and comment classes. Those classes have been found difficult to discriminate in other datasets as well (Hanselowski et al., 2018). The presence of linguistic features, as in the HAN model, may help in spotting the nuances in the tweet's argumentative structure which allow for its correct classification. This may hold true also for the refute class, the least common and most difficult to discriminate. Unrelated samples in WT-WT could be about the involved companies, but not about their merger: this makes classification more challenging than in datasets containing randomly generated unrelated samples (Pomerleau and Rao, 2017). SVM and CharCNN obtain the best performance on unrelated samples: this suggests the importance of character-level information, which could be better integrated into future architectures.
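For clarity on the two aggregate metrics, the sketch below computes a macro-averaged F1 (unweighted mean of per-class F1) and a weighted-averaged F1 (per-class F1 weighted by class support). These are the standard definitions and are assumed to match the averaged scores reported in Table 4; the function names and the toy labels are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def per_class_f1(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(gold: Sequence[str], pred: Sequence[str],
                          labels: List[str]) -> Tuple[float, float]:
    """Return (macro-averaged F1, support-weighted F1) over the given labels."""
    support = Counter(gold)
    scores = {label: per_class_f1(gold, pred, label) for label in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[label] * support[label] for label in labels) / len(gold)
    return macro, weighted


# Toy example with made-up predictions.
macro, weighted = macro_and_weighted_f1(
    ["support", "support", "comment", "unrelated"],
    ["support", "comment", "comment", "unrelated"],
    ["support", "comment", "unrelated"],
)
```

The two values diverge when classes are imbalanced, which is the case for WT-WT, where comment and unrelated samples dominate.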
Concerning single operations, CVS_AET and CI_ESRX have the lowest average performance across models. This is consistent with higher disagreement among annotators for the two mergers.

Robustness over Domain Shifts
We investigate the robustness of SiamNet, the best model in our first set of experiments, and BiCE, which constitutes a simpler neural baseline (Section 3.2), over domain shifts with a cross-domain experiment on an M&A event in the entertainment business.
Data. We collected data for the Disney-Fox (DIS_FOXA) merger and annotated them with the same procedure as in Section 2, resulting in a total of 18,428 tweets. The obtained distribution is highly skewed towards the unrelated and comment classes (Table 2). This could be due to the fact that users are more prone to digress and joke when talking about the companies behind their favorite shows than when considering their health insurance providers (see Appendix A.2).
Table 5: Domain generalization experiments across entertainment (ent) and healthcare datasets. Note that the data partitions used are different than in Table 4.

Results.
We train on all healthcare operations and test on DIS_FOXA (and the contrary), considering a 70-15-15 split between train, development and test sets for both sub-domains. Results show SiamNet consistently outperforming BiCE. The consistent drop in performance according to both accuracy and macro-avg F1 score, which is observed in all classes but particularly evident for commenting samples, indicates strong domain dependency and room for future research.
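A 70-15-15 partition of one sub-domain can be sketched as follows. Whether the split is random, stratified by label or chronological is not stated here, so the seeded random shuffle below is an assumption, as are the function name and the default seed.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_15_15(samples: Sequence[T],
                   seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed and cut into train/development/test partitions."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(items) * 70 // 100
    n_dev = len(items) * 15 // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_70_15_15(range(100))
```

In the cross-domain setting, a model would be trained and tuned on the train and development partitions of one sub-domain and evaluated on the test partition of the other.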

Conclusions
We presented WT-WT, a large expert-annotated dataset for stance detection with over 50K labeled tweets. Our experiments with 11 strong models indicated a consistent (>10%) performance gap between the state of the art and the human upper bound, which proves that WT-WT constitutes a strong challenge for current models. Future research directions might explore the usage of transformer-based models, as well as of models which exploit not only linguistic but also network features, which have been proven to work well for existing stance detection datasets (Aldayel and Magdy, 2019). Also, the multi-domain nature of the dataset enables future research in cross-target and cross-domain adaptation, a clear weak point of current models according to our evaluations.