STANDER: An Expert-Annotated Dataset for News Stance Detection and Evidence Retrieval

We present a new and challenging news dataset that targets both stance detection (SD) and fine-grained evidence retrieval (ER). With its 3,291 expert-annotated articles, the dataset constitutes a high-quality benchmark for future research in SD and multi-task learning. We provide a detailed description of the corpus collection methodology and carry out an extensive analysis on the sources of disagreement between annotators, observing a correlation between their disagreement and the diffusion of uncertainty around a target in the real world. Our experiments show that the dataset poses a strong challenge to recent state-of-the-art models. Notably, our dataset aligns with an existing Twitter SD dataset: their union thus addresses a key shortcoming of previous works, by providing the first dedicated resource to study multi-genre SD as well as the interplay of signals from social media and news sources in rumour verification.


Introduction
Starting from early work by Agrawal et al. (2003), Stance Detection (SD) has gained increasing interest from the research community (Zubiaga et al., 2018a). Recent work in SD has mostly focused on modeling user-generated data (Mohammad et al., 2017; Küçük and Can, 2020). However, SD on complex and articulated texts, such as news articles, has been considerably less studied, mainly due to the scarcity of published datasets (Pomerleau and Rao, 2017; Hanselowski et al., 2019). Moreover, research on user-generated SD and news SD has proceeded on parallel and independent tracks, neglecting the deep mutual influence that exists between social media and news sources (Canter, 2015; Kostkova et al., 2017).
In this paper, we seek to fill this gap, introducing STANDER (STANce Detection & Evidence Retrieval), a new expert-annotated dataset which is labeled for both news SD and fine-grained ER. STANDER collects news articles in English from high-reputation sources which discuss four recent mergers and acquisitions (M&A) operations between major healthcare companies in the US (Table 1). The term M&A refers to the process by which the ownership of a company (the target) is transferred to another company (the buyer). An M&A process (merger) ranges from informal talks between the companies to the closing of the deal; high secrecy is involved and discussions are usually not publicly disclosed during its early stages (Bruner and Perella, 2004). Thus, the analysis of the evolution of opinions and concerns about a potential merger is a process similar to rumour verification (Zubiaga et al., 2018b).
Notably, the news articles in STANDER discuss the same targets as in WT-WT (Conforti et al., 2020), a large Twitter SD dataset: thus, their union provides aligned signals from both authoritative (articles) and user-generated (tweets) sources, constituting the first resource of this kind for SD.
In this paper, we make the following contributions: (1) We construct STANDER, a large expert-annotated news dataset labeled for SD and fine-grained ER. To our knowledge, it is the first news SD dataset to provide evidence snippets, along with their exact location in the corresponding article.
(2) We provide detailed statistics of our data, as well as the first diachronic analysis of the sources of disagreement among annotators in an SD paper, shedding light on the potential correlation between uncertainty in the world and increased ambiguity in journalistic prose. This suggests that considering SD in a controlled domain, such as mergers, could allow model builders to develop deeper insights into the factors influencing model performance.
(3) We report results obtained by several state-of-the-art models on our dataset, and show that STANDER constitutes a challenging benchmark for future research in SD, ER and multi-task learning.
(4) We provide a correlation analysis of the articles from STANDER and the tweets from WT-WT, observing a moderately strong correlation. While the interplay between social media and news sources has been widely studied in other research fields, such as journalism studies (Johnson et al., 2018; Orellana-Rodriguez and Keane, 2018), very little work exists in computer science (Dredze et al., 2016), and notably, none considering SD.

Background
The Task. SD is the task of automatically identifying the opinion expressed in a text with respect to a target (Mohammad et al., 2017). Note that SD constitutes a related but different task from both sentiment analysis and textual entailment. The first considers the emotions conveyed in a text (Alhothali and Hoey, 2015; Tang et al., 2016), while in the second the goal is to predict whether a logical implication exists between two sentences (Bowman et al., 2015). Consider the following example:
• Target: Aetna will merge with Humana
• Text: Aetna & Humana CEOs met again to talk about deal, can't stand those bla-bla people!!!
The text's sentiment is negative, as the author is complaining about the meeting; concerning entailment, it is positive: the target entails the text because, in order to merge, two companies need to discuss the deal; finally, its stance is commenting, as the text is just talking about the merger, without expressing an orientation as to whether it will happen (or not).

SD as a Sub-Task. SD is often integrated into rumour verification (Zubiaga et al., 2018b), as testified by popular shared tasks (Derczynski et al., 2017; Gorrell et al., 2018). Starting from Vlachos and Riedel (2014), SD has been identified as a key step in fake news detection (Lillie and Middelboe, 2019) and automated fact-checking (Popat et al., 2017; Thorne and Vlachos, 2018; Baly et al., 2018): in this context, textual entailment is sometimes preferred to SD as the penultimate sub-step before verification (Thorne et al., 2018).
News SD. At the time of writing, a very small number of SD datasets collecting news have been released, usually building on platforms originally developed by professional journalists, like Emergent (Ferreira and Vlachos, 2016) or Snopes (Hanselowski et al., 2019). Note that in Twitter SD, the task consists of defining the stance of a tweet with respect to a short target (usually a named entity like Hillary Clinton (Inkpen et al., 2017), or a concept like feminism (Mohammad et al., 2016)); in news SD, on the contrary, the input article is much longer than a tweet, and the target is a complete sentence (Hanselowski et al., 2018a).
Comparison with corpora for News SD. EMERGENT (Ferreira and Vlachos, 2016) constitutes the first released corpus for news SD: it collects 300 targets and 2,595 articles (with an average of 8.6 articles/target), labeled using a 3-class classification schema. For the first edition of the Fake News Challenge (Pomerleau and Rao, 2017), it was enriched with randomly generated unrelated samples. Neither of the two corpora is annotated with evidence.
To our knowledge, the only other news dataset to be annotated for both SD and ER is that of Hanselowski et al. (2019), which annotates fact-checking instances from the debunking website Snopes. Our work differs in a number of aspects:
• Statistics. While Snopes is larger in size, it provides relatively few samples per target (14,296 samples and 6,422 targets, with an average of 2.22 articles/target); STANDER, in contrast, collects 3,291 articles on 4 targets, with an average of more than 800 articles/target.
• Annotators. Snopes is annotated via crowdsourcing, whereas we employ domain-expert annotators.
• Evidence Annotations. Snopes provides entire sentences as evidence; importantly, STANDER is the first to provide the exact start and end indices of evidence snippets inside the sentences (Figure 4): this will enable future research on more fine-grained evidence extraction.
• Multi-Genre. At the time of writing, almost all released SD corpora collect data from one genre, with a prevalence of user-generated content. Snopes constitutes the only exception, as some of the collected documents (11%) come from Facebook or Twitter. Note, however, that it does not provide aligned signals from news and user-generated sources for all considered targets, but only for a limited portion of them. In contrast, our news dataset is the first to fully align with an existing resource for Twitter SD, providing a substantial amount of samples from two genres for all considered targets (Section 6). This opens a number of interesting research directions: while adversarial domain adaptation - using data of the same news genre, but from another domain - has proven useful for news SD (Xu et al., 2019), the impact of considering data of the same domain but from another genre has never been studied in SD.

Building the Dataset
In this section, we report on data collection and annotation, and provide a detailed analysis of the findings from the annotation process.

Data Retrieval
We consider four recent mergers involving US companies in the healthcare industry (Table 1). To retrieve news articles related to the mergers, we used Factiva (Johal, 2009), a database by Dow Jones which collects more than 32,000 general and finance-specific sources, including newspapers, journals and magazines. For each merger, we searched for the involved companies and selected articles in English tagged as Acquisitions/Mergers/Shareholdings. We retrieved articles from one month before the first contact of the firms up to one month after any final decision on the merger. Refer to Appendix A for details on the crawl settings and the crawling timeline.

Annotation Guidelines
The annotation process was initiated by a pilot, after which the annotation guidelines were written in close collaboration with three domain experts. Extracts from the annotation guidelines are reported in Appendix A.

Stance Annotation. Following Pomerleau and Rao (2017), we consider four stance labels:
1. Support: the article is voicing confidence that the two companies will merge.
2. Refute: the article is voicing doubts that the two companies will merge.
3. Comment: the article is talking about the merger, neither directly supporting nor refuting it.
4. Unrelated: the article is unrelated to the merger.
Note that an unrelated article might still talk about one or both of the considered companies, but without discussing their merger.

Evidence Annotation. In addition to the stance label, annotators were asked to select the text snippets or sentences from the article which were decisive for classifying its stance; we refer to these as evidence (Thorne and Vlachos, 2018).

Data Annotation
In line with previous work on news SD (Vlachos and Riedel, 2014; Ferreira and Vlachos, 2016), in which data was labeled by professional journalists, we rely on domain experts for annotation. Specifically, we provided articles to eight economists in batches and asked them to annotate no more than 100 articles per day; the annotation process lasted 4 months. Each article was independently labeled by 2 to 4 annotators.
To aggregate stance labels, we used majority voting. For evidence snippets, we merged the provided snippets to obtain a list of selected evidence spans; a further annotator, who did not take part in the first phase, manually checked the overlapping snippets.
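A minimal sketch of how such an aggregation could look (the helper names and the tie handling are our assumptions, not the authors' actual tooling):

```python
from collections import Counter

def aggregate_stance(labels):
    """Majority vote over per-annotator stance labels.

    Ties are flagged with None, mirroring the manual adjudication
    step described above.
    """
    counts = Counter(labels)
    top, n = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == n) > 1:
        return None  # tie: needs manual adjudication
    return top

def merge_snippets(spans):
    """Merge overlapping (start, end) evidence spans within a paragraph."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Example: three annotators, two of them marked overlapping snippets
print(aggregate_stance(["support", "support", "comment"]))  # -> "support"
print(merge_snippets([(10, 42), (35, 60), (100, 120)]))     # -> [(10, 60), (100, 120)]
```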

Analysis of Annotators' Disagreement
The most common source of disagreement between annotators is support/comment (Figure 1): note that the stance sometimes depends on subtle nuances in the article's argumentative structure and is therefore somewhat subjective; such samples are difficult for ML systems to discriminate as well (Riedel et al., 2017). Compared with datasets with randomly generated unrelated samples (Pomerleau and Rao, 2017; Hanselowski et al., 2018b), we report a slightly higher unrelated/comment disagreement between annotators, which reflects the higher complexity of the task in our setting.
To further understand the sources of disagreement between annotators, we perform a diachronic analysis of the samples which received different labels and their time of publication. As shown in Figure 2, a correlation exists between some relevant events (such as the first joint press release) and the number of articles published. However, a higher volume of articles does not always correlate with higher disagreement rates between annotators: interestingly, it seems that some events (such as the merger agreement) spread more uncertainty around the merger than others (such as the start of the antitrust trial). This uncertainty is transmitted to the press, resulting in journalists writing speculative articles: such articles seem to be more prone to the reader's subjective biases, eventually producing a higher inter-annotator disagreement.
The interplay of different layers of uncertainty until the resolution of the event (i.e. confirmation of merger talks or the complaint before the DOJ) makes our domain choice particularly insightful for model builders.
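For readers who want to reproduce this kind of diachronic view, the per-day aggregation is straightforward; a sketch, assuming a DataFrame with one row per annotated article and an illustrative 'disagreed' flag (column names are ours):

```python
import pandas as pd

def daily_disagreement(df):
    """df: one row per article, with a datetime 'date' column and a
    boolean 'disagreed' column (True if the annotators assigned
    different labels). Returns articles/day and the share of
    articles with disagreement."""
    return df.groupby(pd.Grouper(key="date", freq="D")).agg(
        n_articles=("disagreed", "size"),
        disagreement_rate=("disagreed", "mean"),
    )
```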

Quality Assessment
To assess the quality of our dataset, we asked a domain expert to annotate a random 10% of the samples, which is used as an upper bound for evaluation. First, she received the targets together with the gold evidence snippets selected in the first annotation round; in a second phase, the same annotator received the complete articles and was asked to re-annotate the samples. In the former setting, similarly to Hanselowski et al. (2019), we wanted to assess whether the selected evidence snippets alone are sufficient to determine the correct stance: the Cohen's κ between those labels and the gold is 75.2, which is substantial (Cohen, 1960) and reflects the good quality of the extracted snippets. The Cohen's κ obtained when considering the entire article texts is 59.5 (moderate).
This drop shows that: (1) SD on long, unstructured texts is complex and more prone to subjective biases than SD on evidence snippets; interestingly, a similarly low inter-annotator agreement (Fleiss' κ of 0.55; Fleiss, 1971) has also been observed for the related news articles in the Fake News Challenge dataset (Hanselowski et al., 2018a), which does not contain evidence annotations; unfortunately, Hanselowski et al. (2019) do not report the agreement obtained when considering the entire sample texts; and (2) consequently, providing evidence annotations is fundamental to building a reliable dataset that can be used to train supervised stance classifiers.
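For reference, agreement figures like those above can be computed with scikit-learn's implementation of Cohen's κ; a sketch with made-up label lists:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical re-annotation vs. gold labels for the 10% sample
gold = ["support", "comment", "refute", "comment", "support"]
reannotated = ["support", "comment", "comment", "comment", "support"]

kappa = cohen_kappa_score(gold, reannotated)
print(f"Cohen's kappa: {kappa:.3f}")
```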

Desiderata and Challenges
Notably, STANDER satisfies all four desired properties outlined in Mohammad et al. (2017):
1. Topics should be commonly understood by a wide number of people. We consider some of the major US healthcare providers, with which almost everyone has interacted at different levels (insurers, pharmacy chains, ...): thus, not only finance experts (example (a) in Table 3) and local sources (b), but also politicians (c), physicians (d), policymakers (e) and the general public are interested in their outcome, resulting in a dataset which collects different registers.
2. The topics convey different opinions, producing a significant amount of data for all stance labels. The considered mergers are controversial, because their outcome might change the US healthcare landscape; moreover, as they happened during the transition from the Obama to the Trump administration, with the introduction and partial rollback of Obamacare, there is considerable interference with politics (f).
3. The dataset contains indirect references to the targets, as when the involved companies are not explicitly mentioned: for example, given a merger between A and B, if a source states that A is in talks with C, this implicitly undermines the likelihood of the A-B merger happening (g).
4. The dataset contains samples where the target of opinion is different. This is the case of articles that discuss one or both companies without taking a stance on their merger (h). Moreover, as the mergers happened simultaneously, there is considerable interplay between companies; a successful classifier thus requires modeling of the deep relationship between the target merger and the article, not just simple keyword matching (i, j).

Figure 4: A supporting, a commenting and a refuting sample from STANDER (evidence snippets underlined).
In addition, the task is challenging because the underlying argumentative structure must be captured in order to correctly classify the article. Considering the support example in Figure 4, both the title and the body contain the same information. In the comment example, the evidence is in the title, while the body provides additional information. In the refute example, the evidence is in the body, while the title does not contain information regarding the stance.
These characteristics contribute to making STANDER a challenging benchmark for news SD.

Corpus Statistics
Dataset Statistics. The final dataset collects 3,291 labeled news articles from heterogeneous news outlets (Figure 5): while finance-specific publications constitute the majority of the most frequent sources, the corpus also contains many general newspapers (such as Reuters News or The New York Times) as well as local journals (such as the Louisville Business First). News articles present an asymmetric and hierarchical structure: they are formed by a concise and short title and a (usually) long body, which in turn is composed of ordered paragraphs (Table 2). Note that, while articles might be very long (Figure 8), evidence is usually located in the title or in the first few paragraphs (Figure 6). This is in line with the inverted pyramid (Scanlan, 2000; Pöttker, 2003) or summary news lead (Errico et al., 1997) style - widely adopted in modern journalistic prose - in which the most relevant information is concentrated at the beginning of the article.

Label Distribution.
A clear correlation can be observed between the merger's outcome (blocked/succeeded) and the relative proportion of supporting and refuting samples (Table 3). Contrary to many popular SD datasets (Derczynski et al., 2017; Pomerleau and Rao, 2017; Hanselowski et al., 2018b), the related labels present a relatively balanced distribution: this is in line with property (2) (Section 4.1); however, in contrast to Mohammad et al. (2017), who employed query keywords to "force" it, such a balanced distribution arose naturally from our data.

Baselines and Discussion
This section provides results for a number of recent techniques. While more complex models could possibly achieve better results, our aim is to set baselines for our dataset with a number of strong models. A detailed description of the experimental setting is provided in Appendices B and C for replication.

Experiments
Models. We consider two dummy baselines - a random and a majority-vote baseline - and, following Hanselowski et al. (2019), three neural baselines: BertEmb, an MLP leveraging Sentence-BERT embeddings (Reimers and Gurevych, 2019); UseEmb, an MLP leveraging the Universal Sentence Encoder's sentence embeddings (Cer et al., 2018); and a BiLSTM over GloVe embeddings (Pennington et al., 2014). As an upper bound, we consider the performance of a domain expert against the aggregated gold data (see Section 3, Quality Assessment, for further details).

Experimental Setting. We first test the models' ability to perform SD given the correct set of sentences which contain an evidence snippet (SD in isolation). Secondly, we consider both SD and ER: while the tasks could be approached with a pipelined strategy (as in Thorne et al. (2018)), we follow a multi-task training approach, which has proven to be more effective (Yin and Roth, 2018). When jointly training, we employ a simple ER strategy, taking the title and the first 4 paragraphs of each article as candidates (a sketch is given below).
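As a concrete illustration of this retrieval heuristic, a minimal sketch (the article representation, with a title string and a list of paragraph strings, is our assumption, not the released format):

```python
def evidence_candidates(article, n_paragraphs=4):
    """Return the title and the first n paragraphs as ER candidates,
    following the observation that evidence concentrates at the top
    of the article (inverted-pyramid style)."""
    return [article["title"]] + article["paragraphs"][:n_paragraphs]
```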
We train in a cross-target setting (training on three mergers and testing on the fourth; see the sketch below), and consider two training settings: first, we select only related samples, which present a balanced distribution (Table 3); then, we consider all stances: this is more difficult because unrelated samples are very infrequent, resulting in a skewed distribution, as in RumourEval (Derczynski et al., 2017; Gorrell et al., 2018). To account for performance fluctuations (Reimers and Gurevych, 2017), we run 5 simulations for each model and take the average of the results. We leave the identification of the evidence's exact indices within the sentences, as well as the use of more sophisticated ER methods, to future work.
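The cross-target protocol amounts to a leave-one-merger-out loop; a sketch, where train_fn and eval_fn are hypothetical training and scoring helpers:

```python
import numpy as np

MERGERS = ["CVS_AET", "CI_ESRX", "ANTM_CI", "AET_HUM"]

def cross_target_scores(samples, train_fn, eval_fn, n_runs=5):
    """Leave-one-merger-out evaluation, averaging over n_runs seeds
    to smooth out performance fluctuations."""
    scores = {}
    for held_out in MERGERS:
        train = [s for s in samples if s["merger"] != held_out]
        test = [s for s in samples if s["merger"] == held_out]
        runs = [eval_fn(train_fn(train, seed=i), test) for i in range(n_runs)]
        scores[held_out] = float(np.mean(runs))
    return scores
```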
Evaluation. We follow recent work (Thorne et al., 2018; Hanselowski et al., 2018b, 2019) and consider macro-averaged precision, recall and F1 for SD, and precision and recall on the 5 selected evidence candidates (P@5 and R@5) for ER.
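For reference, these metrics can be computed along the following lines (a sketch; sklearn handles the macro-averaged SD scores, while P@k/R@k is a few lines of set arithmetic):

```python
from sklearn.metrics import precision_recall_fscore_support

def stance_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1 for SD."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return p, r, f1

def precision_recall_at_k(retrieved, gold, k=5):
    """P@k / R@k over the k top-ranked evidence candidates."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(gold))
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```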

Results and Error Analysis
Results of the experiments are reported in Table 4. As expected, we observe a drop in performance when considering only related vs. all classes. While all considered models obtain significant gains over the two dummy baselines, the BertEmb model - as also observed in Hanselowski et al. (2019) - obtains the best results overall for SD. Note, however, the wide gap between BertEmb's performance and the upper bound, which confirms the difficulty of our dataset. Considering the ER results, we observe a smaller gap in performance between models, with UseEmb obtaining the best results overall.
Interestingly, we observe a gain in stance classification when BertEmb is jointly trained to perform both SD and ER: this seems to indicate that, by learning to classify whether an input sentence constitutes an evidence snippet or not, the system indirectly gathers knowledge which is also useful for solving the SD task. An error analysis of BertEmb's predictions shows that most misclassifications happen between the comment and support labels: this is in line with findings from both previous work (Riedel et al., 2017) and the analysis of the inter-annotator agreement (Section 3). A relatively high number of comment samples are also misclassified as refute: note that - while in news SD corpora refuting samples coming from popular newspapers can sometimes be easily spotted by the presence of words such as fake, hoax, or similar - STANDER contains articles from high-reputation sources, which usually do not use sensationalist language.

Integrating News and Twitter Signal
As outlined in the Introduction, STANDER covers the same targets as the Twitter SD WT-WT corpus (Conforti et al., 2020). The union of the two corpora thus provides a great opportunity for studying the interplay between authoritative and user-generated signals: the first refers to long and articulated texts written by professional journalists, while the second refers to a very abundant but potentially noisy stream of posts, which are published without any editorial review. While a detailed time series analysis (Lim and Tucker, 2019) is beyond the scope of this paper, we provide a first data description and a correlation analysis, which show the potential of the obtained aligned corpus and the challenges it may pose to future research.

Statistical Analysis.
The relative frequency of samples across mergers is similar in both the news and the Twitter signals (Figure 7), with CI ESRX being the least popular target (refer to Conforti et al. (2020) for a detailed analysis of the WT-WT corpus). The same holds true for the relative distribution of related labels, with refuting samples being more frequent in the case of blocked mergers.
However, there are a number of differences between the two signals: notably, the Twitter signal presents a high number of noisy unrelated samples, which is not surprising when dealing with user-generated data (Zubiaga et al., 2015); we also observe a higher proportion of commenting samples, as has often been observed in financial microblogging (Žnidaršič et al., 2018). On the contrary, the news signal is cleaner, but around one order of magnitude less abundant (Figure 7). Apart from this asymmetry in label distribution, a further asymmetry in length can be observed between the corpora: tweets tend to be short and compact, while news pieces are long and articulated (Figure 8), thus posing interesting challenges for future work on multi-genre SD.

Signal Correlation
A diachronic analysis of the volume of tweets and articles discussing CVS AET (Figure 9) shows a relatively similar distribution between the two signals, with some notable differences. While the Twitter signal presents some constant but minor activity from the very beginning of the process, the news signal remains completely silent until the companies' views are reported by a major news outlet. For some of the mergers, we even observe a notable spike in Twitter activity before, but close to, the merger's first mention in the press (see the analysis of the ANTM CI merger in Appendix D). This is in line with studies on the usage of social media, especially Twitter, as sources of information for journalists (Van Leuven and Deprez, 2017; Rony et al., 2018; Johnson, 2019).
As reported in Table 5, the two signals exhibit a moderate correlation, which further increases when considering only related tweets. This follows from the observation that large spikes in both the Twitter and the news signal cluster around the dates of milestones within the merger process (Bruner and Perella, 2004; Piesse et al., 2013), and that many of the unrelated tweets occur before the first news article is published, when no activity is present in the news signal.

Conclusions and Future Work
We presented STANDER, a new expert-annotated resource for news SD and ER. We provided a detailed description of the annotation process and corpus statistics, as well as of the findings from the annotation process. Our experiments with a set of strong models indicated a consistent (up to 30%) performance gap between state-of-the-art models and the human upper bound: this shows that our corpus constitutes a strong challenge and leaves plenty of room for future work on news SD, ER, domain adaptation and multi-task training. Moreover, our corpus enables future research in a number of new areas, including: fine-grained ER for news SD - where the goal is not only to retrieve evidence snippets, but also their exact location in the text - which goes in the direction of improving the interpretability of a model's predictions; and multi-genre SD - due to the fact that our corpus aligns with an existing resource for Twitter SD - which would open new interesting scenarios in the wider field of rumour verification.

A.1 Crawl Settings

Below, we report a screenshot from the Factiva interface (Johal, 2009), taken while crawling for the CVS AET merger.

A.2 Crawling Timelines

Table 6 gives an overview of the considered M&A operations, their respective crawling timelines and the total number of articles.

A.3 Metadata Included in the Corpus
We provide a sample of the data in the Supplementary Material. Each sample in the dataset is associated with the following fields:
• Target merger; one from {CVS AET, CI ESRX, ANTM CI, AET HUM}.
• Stance of the article with respect to the target merger; one from {support, refute, comment, unrelated}.
• Title of the article, followed by an ordered list of the article's Paragraphs.
• A list of Evidence Snippets, indicating 1) the index of the paragraph in the article where the evidence is located; and 2) the exact start and end indices of the snippet in the corresponding paragraph.
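To make the schema concrete, a sample could be represented along these lines (the field names here are illustrative and may differ from the released files):

```python
# Illustrative field names; the released format may differ.
sample = {
    "target": "CVS_AET",
    "stance": "support",
    "title": "CVS Health to Acquire Aetna",
    "paragraphs": [
        "CVS Health and Aetna announced today ...",
        "...",
    ],
    "evidence_snippets": [
        # paragraph index plus start/end character offsets;
        # index 0 could denote the title, 1..n the body paragraphs.
        {"paragraph": 0, "start": 0, "end": 27},
    ],
}
```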

A.4 Annotation Guidelines
The following is an extract from the annotation guidelines sent to the annotators. Each label description was accompanied by a number of examples, which we do not report due to space limitations.
You will be sent a number of news articles. The annotation process consists of choosing one of 4 possible labels for each article and marking which part of the article (e.g., the title or a specific sentence, phrase, paragraph) led you to your assessment. The four labels to choose from are Support, Comment, Refute, and Unrelated.
Label: Support This label should be chosen if the article is supporting the theory that the merger is happening. That is, after reading the article the reader feels more confident that the two companies will merge. Articles that mention the merger as a fact and then talk about e.g. the implications or consequences of the merger should not be labelled as supporting but as commenting.
Label: Refute This label should be chosen if the article is refuting the theory that the merger is happening. That is, after reading the article the reader feels less confident that the two companies will merge. Articles that are voicing doubts or mention potential roadblocks (such as antitrust issues) should be labelled refute as well.
Label: Comment This label should be chosen if the article is commenting on one of the mergers. The article should neither directly state that the merger is happening, nor refute that it will be completed successfully. Articles that mention the merger as a fact and then talk about e.g. the implications or consequences of the merger should also be labelled as commenting. For long articles presenting both positive and negative evidence, annotators should weigh the evidence and conclude whether the article is 'mostly' positive or negative. Only if the annotator's assessment is that the evidence is balanced should the article be labelled as commenting.
Label: Unrelated This label should be chosen if the article is unrelated to the merger in question. Since the articles have been collected from a news aggregation service, some of them may not in fact be about one of the mergers. This label will only have few articles and should be the easiest to identify. Note that for an article that is mainly about a different topic/merger but talks about the relevant merger in one paragraph or just a sentence, annotators should choose the label based on this paragraph or sentence.

B Baselines-Related Specifications
Below, we report the implementation details for the baselines presented in Section 5. SD stands for Stance Detection and ER for Evidence Retrieval.

B.1 Dummy Baselines
Two dummy baselines have been considered as lower bounds.
• Random Baseline. SD: outputs a random stance; ER: outputs two random sentences chosen from the title and the body's paragraphs.
• Majority Baseline. SD: always outputs support (the most frequent label in the corpus); ER: always outputs the title and the first paragraph (the most frequent locations of evidence in the dataset, Figure 6).

B.2 Neural Baselines
Three strong neural baselines, which obtained state-of-the-art results in previous work (Hanselowski et al., 2019), are considered for future reference.
Inputs. The models receive as input n + 1 sequences {t, s_1, ..., s_n}, where t is the target and {s_1, ..., s_n} is the list of n sentences from the article. When training for SD in isolation, these sentences are the gold evidences; when jointly training for SD+ER, they are the evidence candidates: as a simple sentence retrieval method, we always retrieve the title and the first four paragraphs of the article, where evidence snippets are most frequently located in the corpus (Figure 6). For a target merger between companies A and B (with acronyms a and b), we employ as target a string containing the text: "A (a) will merge with B (b)."

Encoders. We employ three neural encoders to obtain a target-aware representation h_i of each input sentence s_i:
• BiLSTM. We employ 300-dimensional word embeddings to encode each input token. The embedding matrix is initialized with GloVe embeddings (Pennington et al., 2014), pretrained on Wikipedia 2014 + Gigaword 5 (https://nlp.stanford.edu/projects/glove/), which are kept fixed during training to prevent overfitting. We concatenate each input evidence with the target, and obtain a hidden representation for each pair of inputs with a BiLSTM network with 128 hidden units.
• UseEmb. We obtain sentence embeddings for each input sequence with the Universal Sentence Encoder (Cer et al., 2018), using the large model for English (https://tfhub.dev/google/universal-sentence-encoder-large), and concatenate each input sentence with the input target. We then pass the obtained encoded representation through a position-specific dense layer with 128 hidden units.
• BERTEmb. We follow the same principle as above, but use Sentence-BERT (Reimers and Gurevych, 2019) to obtain sentence embeddings for each input sentence. We use the bert-base model trained on the SNLI and MultiNLI datasets (https://github.com/UKPLab/sentence-transformers).

Decoders. After encoding, we obtain n representations {h_1, ..., h_n}, where h_i is the target-aware representation of the sentence at position i. Inspired by Yin and Roth (2018), we obtain a probability α_i ∈ (0, 1) of the sentence s_i being an evidence as α_i = σ(v · h_i), where v is a learned parameter vector. To model the entire set of input sentences as a whole, we construct their joint representation e as the weighted sum e = Σ_i α_i h_i. We then consider two decoders, depending on the task(s) we are training for (only SD, or SD+ER):
• Only SD. We predict the stance label with a softmax operation over the stance tagset on e.
• ER and SD. When jointly performing ER and SD in a multi-task training setting, we binarize the probability vector α = [α_1, ..., α_n] by rounding at 0.5; we consider every input sentence s_i with α_i > 0.5 as an evidence snippet.
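As a concrete reading of the decoder just described, a minimal PyTorch sketch (module and variable names are ours, not the authors' code):

```python
import torch
import torch.nn as nn

class EvidenceStanceDecoder(nn.Module):
    """Sketch of the decoder described above:
    alpha_i = sigmoid(v . h_i), e = sum_i alpha_i * h_i,
    stance predicted by a softmax layer over e."""

    def __init__(self, hidden_dim=128, n_stances=4):
        super().__init__()
        self.v = nn.Linear(hidden_dim, 1, bias=False)   # learned vector v
        self.stance_out = nn.Linear(hidden_dim, n_stances)

    def forward(self, h):
        # h: (n_sentences, hidden_dim), one target-aware vector per sentence
        alpha = torch.sigmoid(self.v(h)).squeeze(-1)    # (n_sentences,)
        e = (alpha.unsqueeze(-1) * h).sum(dim=0)        # joint representation
        stance_logits = self.stance_out(e)              # softmax applied in the loss
        is_evidence = alpha > 0.5                       # binarized ER predictions
        return stance_logits, is_evidence
```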
C Experimental Setting Specifications

Data Preprocessing. We perform minimal data preprocessing. The following refers to the BiLSTM model: we include all types in the corpus, without selecting any minimum frequency; for tokenization, we use NLTK's word_tokenize (Loper and Bird, 2002); we pad/cut input sentences to 10 tokens (in the case of the article's title) or 25 tokens (in the case of the article's paragraphs).
(Hyper)-Parameters and Runtime Specifications. Refer to Appendix B for a description of the considered models' architectures (complete with embedding sizes and numbers of hidden units per layer). We train all models with Adagrad, setting the learning rate to 0.02. We train with batches of 32 samples for a maximum of 70 epochs, using early stopping with a patience of 10. To prevent overfitting, a dropout of 0.2 is applied during training on all layers of the models.
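These hyper-parameters map onto a standard training loop; a sketch in PyTorch, where train_loader and dev_eval are hypothetical data and validation helpers:

```python
import torch

def train(model, train_loader, dev_eval, max_epochs=70, patience=10):
    """Adagrad (lr 0.02), early stopping with patience 10;
    batching (32) is assumed to be set in train_loader, and
    dropout (0.2) lives inside the model."""
    opt = torch.optim.Adagrad(model.parameters(), lr=0.02)
    loss_fn = torch.nn.CrossEntropyLoss()
    best, stale = float("-inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        score = dev_eval(model)  # e.g. macro-F1 on held-out data
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model
```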
Note that, given that this is a resource paper, our goal is to provide a set of robust baselines for future research. For this reason, we do not perform extensive hyper-parameter tuning on the selected models. Table 7 reports the total number of (trainable) parameters for each considered model. This setup resulted in the average runtime/step reported in Table 8 (the average runtime is calculated over five different runs of the same model, trained on the ANTM CI, AET HUM and CVS AET mergers).
Training Setting. All models are trained using cross-validation, testing on one merger and training on the other three. To account for performance fluctuations (Reimers and Gurevych, 2017), we run 5 simulations for each model and take the average of the results, weighting according to the number of collected articles for each merger. Table 9 reports the standard deviation between different runs of the same model. Interestingly, UseEmb is the most stable model for SD, while BertEmb is the most stable for ER.

Table 9: Standard deviation between results obtained with the considered models over different runs. For each training setting (3 vs 4 classes) we first report σ on SD in isolation, then on jointly training SD+ER.
Computing Infrastructure. We run experiments on an NVIDIA GeForce GTX 1080 GPU.

D.1 Implementation Details
For the correlation analysis in Section 6, we used pandas' implementation of the Spearman correlation (Wes McKinney, 2010). We calculate the standard error as SE = √((1 − r_x²)/(n − 2)), where r_x is the correlation coefficient and n is the number of observations (i.e. the number of days of observations collected for each merger).
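A sketch of this computation with pandas and numpy (the input series are illustrative; the SE expression is the standard large-sample formula implied by the definitions above):

```python
import numpy as np
import pandas as pd

def spearman_with_se(tweets_per_day, articles_per_day):
    """Spearman correlation between daily tweet and article volumes,
    with standard error SE = sqrt((1 - r^2) / (n - 2))."""
    df = pd.DataFrame({"tweets": tweets_per_day, "articles": articles_per_day})
    r = df["tweets"].corr(df["articles"], method="spearman")
    n = len(df)
    se = np.sqrt((1 - r ** 2) / (n - 2))
    return r, se
```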

D.2 The Case of the Anthem/Cigna Merger

Figure 10 shows the distribution of tweets and articles over time for ANTM CI. Three distinct phases can be distinguished in the timeline of the merger. The first phase goes from the beginning of the data collection to the first report on the companies' talks in a major news outlet. During this phase, we observe minor movements in the Twitter signal and some sparse news articles. Considering only related samples, most of the tweets and articles in this phase disappear. However, at the end of this phase there are spikes in the Twitter signal. This suggests that during this period the ongoing talks between the companies were not publicly known, but that at the very end information may have been leaked. The tweet signal spikes during the first phase on 20.05.2015, around one month before the first news report.

The second phase begins with the first report by a major news outlet and lasts until the beginning of the antitrust process. It is characterised by large spikes in both the volume of tweets posted and the number of published articles. The first spike in the news articles occurs on 15.06.2015, when the Wall Street Journal - as happens for most of the considered mergers - reports on the ongoing talks between the two companies. The second spike occurs on 24.07.2015, when the companies publicly announce the merger with a joint press release. These two spikes in the news signal are mirrored in the tweets. After the initial reporting on the two companies' intentions, most news articles and tweets discuss the implications of the merger, displaying constant but not heavy activity.
The third phase begins with the antitrust process and lasts until the end of the merger's timeline. Spikes in the volume of tweets and articles can be observed around specific events, such as when the official antitrust complaint is presented to the Department of Justice (DOJ), at the start of the antitrust trial and around the date of the court decision. During this phase, spikes present a very similar distribution for both signals.