Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations

Stance detection is an important component of understanding hidden influences in everyday life. Since there are thousands of potential topics to take a stance on, most with little to no training data, we focus on zero-shot stance detection: classifying stance from no training examples. In this paper, we present a new dataset for zero-shot stance detection that captures a wider range of topics and lexical variation than in previous datasets. Additionally, we propose a new model for stance detection that implicitly captures relationships between topics using generalized topic representations and show that this model improves performance on a number of challenging linguistic phenomena.


Introduction
Stance detection, automatically identifying positions on a specific topic in text (Mohammad et al., 2017), is crucial for understanding how information is presented in everyday life. For example, a news article on crime may also implicitly take a position on immigration (see Table 1).
There are two typical approaches to stance detection: topic-specific stance (developing topic-specific classifiers, e.g., Hasan and Ng (2014)) and cross-target stance (adapting classifiers from a related topic to a single new topic, e.g., Augenstein et al. (2016)). Topic-specific stance requires numerous, well-labeled training examples in order to build a classifier for a new topic, an unrealistic expectation when there are thousands of possible topics for which data collection and annotation are both time-consuming and expensive. While cross-target stance does not require training examples for a new topic, it does require human knowledge about any new topic and how it is related to the training topics. As a result, models developed for this variation are still limited in their ability to generalize to a wide variety of topics.

[Table 1: A news comment that implicitly takes a stance. Topic: immigration. Stance: against. Text: "The jury's verdict will ensure that another violent criminal alien will be removed from our community for a very long period . . ."]
In this work, we propose two additional variations of stance detection: zero-shot stance detection (a classifier is evaluated on a large number of completely new topics) and few-shot stance detection (a classifier is evaluated on a large number of topics for which it has very few training examples). Neither variation requires any human knowledge about the new topics or their relation to training topics. Zero-shot stance detection, in particular, is a more accurate evaluation of a model's ability to generalize to the range of topics in the real world.
Existing stance datasets typically have a small number of topics (e.g., 6) that are described in only one way (e.g., 'gun control'). This is not ideal for zero-shot or few-shot stance detection because such datasets offer both a limited set of topics and little linguistic variation in how topics are expressed (e.g., 'anti second amendment'). Therefore, to facilitate evaluation of zero-shot and few-shot stance detection, we create a new dataset, VAried Stance Topics (VAST). VAST consists of a large range of topics covering broad themes, such as politics (e.g., 'a Palestinian state'), education (e.g., 'charter schools'), and public health (e.g., 'childhood vaccination'). In addition, the data includes a wide range of similar expressions (e.g., 'guns on campus' versus 'firearms on campus'). This variation captures how humans might realistically describe the same topic and contrasts with the lack of variation in existing datasets.
We also develop a model for zero-shot stance detection that exploits information about topic similarity through generalized topic representations obtained through contextualized clustering. These topic representations are unsupervised and therefore represent information about topic relationships without requiring explicit human knowledge.
Our contributions are as follows: (1) we develop a new dataset, VAST, for zero-shot and few-shot stance detection and (2) we propose a new model for stance detection that improves performance on a number of challenging linguistic phenomena (e.g., sarcasm) and relies less on sentiment cues (which often lead to errors in stance classification). We make our dataset and models available for use: https://github.com/emilyallaway/zero-shot-stance.

Related Work
Previous datasets for stance detection have centered on two definitions of the task (Küçük and Can, 2020). In the most common definition (topic-phrase stance), the stance (pro, con, neutral) of a text is detected towards a topic that is usually a noun-phrase (e.g., 'gun control'). In the second definition (topic-position stance), stance (agree, disagree, discuss, unrelated) is detected between a text and a topic that is an entire position statement (e.g., 'We should disband NATO').
A number of datasets exist using the topic-phrase definition with texts from online debate forums (Walker et al., 2012; Abbott et al., 2016; Hasan and Ng, 2014), information platforms (Lin et al., 2006; Murakami and Putra, 2010), student essays (Faulkner, 2014), news comments (Krejzl et al., 2017; Lozhnikov et al., 2018) and Twitter (Küçük, 2017; Tsakalidis et al., 2018; Taulé et al., 2017; Mohammad et al., 2016). These datasets generally have a very small number of topics (e.g., Abbott et al. (2016) has 16) and the few with larger numbers of topics (Bar-Haim et al., 2017; Gottipati et al., 2013; Vamvas and Sennrich, 2020) still have limited topic coverage (ranging from 55 to 194 topics). The data used by Gottipati et al. (2013), articles and comments from an online debate site, has the potential to cover the widest range of topics, relative to previous work. However, their dataset is not explicitly labeled for topics, does not have clear pro/con labels, and does not exhibit linguistic variation in the topic expressions. Furthermore, all of these stance datasets are not used for zero-shot stance detection due to the small number of topics, with the exception of the SemEval 2016 Task 6 (TwitterStance) data, which is used for cross-target stance detection with a single unseen topic (Mohammad et al., 2016). In contrast to the TwitterStance data, which has only one new topic in the test set, our dataset for zero-shot stance detection has a large number of new topics for both development and testing.
For topic-position stance, datasets primarily use text from news articles with headlines as topics (Thorne et al., 2018;Ferreira and Vlachos, 2016). In a similar vein, Habernal et al. (2018) use comments from news articles and manually construct position statements. These datasets, however, do not include clear, individuated topics and so we focus on the topic-phrase definition in our work.
Many previous models for stance detection trained an individual classifier for each topic (Lin et al., 2006;Beigman Klebanov et al., 2010;Sridhar et al., 2015;Somasundaran and Wiebe, 2010;Hasan and Ng, 2013;Li et al., 2018;Hasan and Ng, 2014) or for a small number of topics common to both the training and evaluation sets (Faulkner, 2014;Du et al., 2017). In addition, a handful of models for the TwitterStance dataset have been designed for cross-target stance detection (Augenstein et al., 2016;Xu et al., 2018), including a number of weakly supervised methods using unlabeled data related to the test topic (Zarrella and Marsh, 2016;Wei et al., 2016;Dias and Becker, 2016). In contrast, our models are trained jointly for all topics and are evaluated for zero-shot stance detection on a large number of new test topics (i.e., none of the zero-shot test topics occur in the training data).

VAST Dataset
We collect a new dataset, VAST, for zero-shot stance detection that includes a large number of specific topics. Our annotations are done on comments collected from The New York Times 'Room for Debate' section, part of the Argument Reasoning Comprehension (ARC) Corpus (Habernal et al., 2018). Although the ARC corpus provides stance annotations, they follow the topic-position definition of stance, as in §2. This format makes it difficult to determine stance in the typical topic-phrase (pro/con/neutral) setting with respect to a single topic, as opposed to a position statement (see Topic and ARC Stance columns respectively, Table 2). Therefore, we collect annotations on both topic and stance, using the ARC data as a starting point.

Topic Selection
To collect stance annotations, we first heuristically extract specific topics from the stance positions provided by the ARC corpus. We define a candidate topic as a noun-phrase in the constituency parse, generated using SpaCy (spacy.io), of the ARC stance position (as in (1) and (5) in Table 2). To reduce noisy topics, we filter candidates to include only noun-phrases in the subject and object position of the main verb in the sentence. If no candidates remain for a comment after filtering, we select topics from the categories assigned by The New York Times to the original article the comment is on (e.g., the categories assigned for (3) in Table 2 are 'Business', 'restaurants', and 'workplace'). From these categories, we remove proper nouns, as these are over-general topics (e.g., 'Caribbean', 'Business'). From these heuristics we extract 304 unique topics from 3365 unique comments (see examples in Table 2).
Although we can extract topics heuristically, they are sometimes noisy. For example, in (2) in Table 2, 'a problem' is extracted as a topic, despite being overly vague. Therefore, we use crowdsourcing to collect stance labels and additional topics from annotators.
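The subject/object filter above can be sketched as follows. Here a parse is represented as a toy list of (phrase, dependency-label) pairs standing in for a real SpaCy dependency parse, and `extract_candidate_topics` is an illustrative helper, not the authors' implementation.

```python
# Sketch: keep only candidate noun-phrases whose dependency label marks them
# as the subject or direct object of the main verb. The label set is an
# assumption based on common Universal Dependencies conventions.
SUBJ_OBJ_DEPS = {"nsubj", "nsubjpass", "dobj"}

def extract_candidate_topics(noun_phrases):
    """Filter (phrase, dep_label) pairs down to subject/object noun-phrases."""
    return [phrase for phrase, dep in noun_phrases if dep in SUBJ_OBJ_DEPS]

# Hypothetical parse of "We should disband NATO":
parsed = [("We", "nsubj"), ("NATO", "dobj")]
candidates = extract_candidate_topics(parsed)
```

In a real pipeline, the (phrase, label) pairs would come from SpaCy's noun chunks and their dependency relations rather than being hand-written.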

Crowdsourcing
We use Amazon Mechanical Turk to collect crowdsourced annotations. We present each worker with a comment and first ask them to list topics related to the comment, to avoid biasing workers toward finding a stance on a topic not relevant to the comment. We then provide the worker with the automatically generated topic for the comment and ask for the stance or, if the topic does not make sense, a correction. Workers are asked to provide stance on a 5-point scale (see task snapshot in Appendix A.0.1), which we map to 3-point pro/con/neutral. Each topic-comment pair is annotated by three workers. We remove work by poor-quality annotators, determined by manually examining the topics listed for a comment and using MACE (Hovy et al., 2013) on the stance labels. For all examples, we select the majority vote as the final label. When annotators correct the provided topic, we take the majority vote of stance labels on corrections to the same new topic.
Our resulting dataset includes annotations of three types (see Table 2): Heur, stance labels on the heuristically extracted topics provided to annotators (see (1) and (5)); Corr, labels on corrected topics provided by annotators (see (3)); and List, labels on the topics listed by annotators as related to the comment (see (2) and (4)). We include the noisy type List because we find that the stance provided by the annotator for the given topic also generally applies to the topics the annotator listed, and these provide additional learning signal (see A.0.2 for full examples). We clean the final topics by lemmatizing and removing stopwords using NLTK and running automatic spelling correction.

Neutral Examples
Not every comment conveys a stance on every topic. Therefore, it is important to be able to detect when the stance is, in fact, neutral or neither. Since the original ARC data does not include neutral stance, our crowdsourced annotations yield only 350 neutral examples. Therefore, we add additional examples to the neutral class that are neither pro nor con. These examples are constructed automatically by permuting existing topics and comments.
We convert each entry of type Heur or Corr in the dataset to a neutral example for a different topic with probability 0.5. We do not convert entries of the noisy type List into neutral examples. If a comment d_i and topic t_i pair is to be converted, we randomly sample a new topic t̂_i for the comment from topics in the dataset. To ensure t̂_i is semantically distinct from t_i, we check that t̂_i does not overlap lexically with t_i or any of the topics provided to or by annotators for d_i (see (6) in Table 2).
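The permutation step can be sketched as below. The helper names (`lexical_overlap`, `make_neutral`) and the token-overlap check are illustrative assumptions; the paper describes the procedure but not an implementation.

```python
import random

def lexical_overlap(a, b):
    """True if two topic strings share any token."""
    return bool(set(a.lower().split()) & set(b.lower().split()))

def make_neutral(comment_topics, all_topics, p=0.5, rng=random):
    """With probability p, sample a replacement topic that shares no tokens
    with any topic tied to the comment (provided to or listed by annotators).
    Returns None when the example is kept as-is."""
    if rng.random() >= p:
        return None  # keep the original pro/con example
    candidates = [t for t in all_topics
                  if not any(lexical_overlap(t, c) for c in comment_topics)]
    return rng.choice(candidates) if candidates else None

rng = random.Random(0)
topics = ["gun control", "charter schools", "childhood vaccination"]
new_topic = make_neutral(["gun control", "firearms on campus"], topics,
                         p=1.0, rng=rng)  # forced conversion for illustration
```

A production version would also lemmatize before the overlap check, mirroring the topic-cleaning step above.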

Data Analysis
The final statistics of our data are shown in Table 3. We use Krippendorff's α to compute inter-annotator agreement, yielding 0.427, and percentage agreement (75%), both of which indicate stronger than random agreement. We compute agreement only on the annotated stance labels for the provided topic, since few topic corrections result in identical new topics. We see that while the task is challenging, annotators agree the majority of the time. We observe that the most common cause of disagreement is annotator inference about stance relative to an overly general or semi-relevant topic. For example, annotators are inclined to select a stance for the provided topic (correcting the topic only 30% of the time), even when it does not make sense or is too general (e.g., 'everyone' is overly general).
The inferences and corrections by annotators provide a wide range of stance labels for each comment. For example, for a single comment our annotations may include multiple examples, each with different topic and potentially different stance labels, all correct (see (3) and (4) Table 2). That is, our annotations capture semantic and stance complexity in the comments and are not limited to a single topic per text. This increases the difficulty of predicting and annotating stance for this data.
In addition to stance complexity, the annotations provide great variety in how topics are expressed, with a median of 4 unique topics per comment. While many of these are slight variations on the same idea (e.g., 'prison privatization' vs. 'privatization'), this more accurately captures how humans might discuss a topic, compared to restricting themselves to a single phrase (e.g., 'gun control'). The variety of topics per comment makes our dataset challenging, and the large number of topics with few examples each (the median number of examples per topic is 1 and the mean is 2.4) makes our dataset well suited to developing models for zero-shot and few-shot stance detection.

Methods
We develop Topic-Grouped Attention (TGA) Net: a model to implicitly construct and use relationships between the training and evaluation topics without supervision. The model consists of a contextual conditional encoding layer ( §4.2), followed by topic-grouped attention ( §4.4) using generalized topic representations ( §4.3) and a feed-forward neural network (see Figure 1).

Definitions
The dataset consists of examples x_i, where each x_i has a document d_i, a topic t_i, and a stance label y_i. Recall that for each unique document d, the data may contain examples with different topics. For example, (1) and (2) (Table 2) have the same document but different topics. The task is to predict a stance label y ∈ {pro, con, neutral} for each x_i, based on the topic-phrase definition of stance (see §2).

Contextual Conditional Encoding
Since computing the stance of a document is dependent on the topic, prior methods for cross-target stance have found that bidirectional conditional encoding (conditioning the document representation on the topic) provides large improvements (Augenstein et al., 2016). However, prior work used static word embeddings, and we want to take advantage of contextual embeddings. Therefore, we embed a document and topic jointly using BERT (Devlin et al., 2019). That is, we treat the document and topic as a sentence pair, and obtain two sequences of token embeddings: t^(1), ..., t^(m) for the topic t and d^(1), ..., d^(n) for the document d. As a result, the text embeddings are implicitly conditioned on the topic, and vice versa.

Generalized Topic Representations (GTR)
For each example x = (d, t, y) in the data, we compute a generalized topic representation r_dt: the centroid of the nearest cluster to x in Euclidean space, after clustering the training data. We use hierarchical clustering on v_dt = [v_d; v_t], a joint representation of the document d and topic t, to obtain clusters. We use one v_d ∈ R^E and one v_t ∈ R^E (where E is the embedding dimension) for each unique document d and unique topic t.
To obtain v_d and v_t, we first embed the document and topic separately using BERT and average the token embeddings. The token embeddings are weighted in v_d by tf-idf, in order to downplay the impact of common content words (e.g., pronouns or adverbs) in the average. In v_t, the token embeddings are weighted uniformly.
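The representation step can be sketched with toy vectors. The embeddings, tf-idf weights, and cluster centroids below are synthetic stand-ins for the BERT embeddings and Ward clusters used in the paper, and the function names are illustrative.

```python
import numpy as np

def weighted_average(token_embs, weights):
    """Average token embeddings with given weights (tf-idf for v_d,
    uniform for v_t)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (np.asarray(token_embs) * w[:, None]).sum(axis=0)

def generalized_topic_rep(v_d, v_t, centroids):
    """r_dt: centroid of the nearest cluster to [v_d; v_t] in Euclidean space."""
    v_dt = np.concatenate([v_d, v_t])
    dists = np.linalg.norm(centroids - v_dt, axis=1)
    return centroids[np.argmin(dists)]

# Toy example with embedding dimension E = 2 (so v_dt is 4-dimensional):
doc_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
v_d = weighted_average(doc_tokens, [3.0, 1.0])          # tf-idf-style weights
v_t = weighted_average(np.array([[1.0, 1.0]]), [1.0])   # uniform for the topic
centroids = np.array([[0.0, 0.0, 0.0, 0.0],
                      [1.0, 0.0, 1.0, 1.0]])            # toy Ward centroids
r_dt = generalized_topic_rep(v_d, v_t, centroids)
```

At training time the centroids would come from Ward clustering of all training v_dt vectors, as described in the hyperparameter section.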

Topic-Grouped Attention
We use the generalized topic representation r_dt for example x to compute the similarity between t and other topics in the dataset. Using learned scaled dot-product attention (Vaswani et al., 2017), we compute similarity scores s_i and use these to weight the importance of the current topic tokens t^(i), obtaining a representation c_dt that captures the relationship between t and related topics and documents.
That is, we compute

s_i = softmax_i( λ · t^(i)ᵀ W_a r_dt ),    c_dt = Σ_i s_i t^(i),

where W_a ∈ R^{E×2E} are learned parameters and λ = 1/√E is the scaling value.
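The attention computation can be sketched in NumPy as below; the toy embeddings and the parameter matrix are random stand-ins for the learned quantities.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def topic_grouped_attention(topic_embs, r_dt, W_a):
    """Score each topic token t^(i) against the generalized topic
    representation r_dt, then form c_dt as the weighted sum of tokens.
    topic_embs: (m, E) token embeddings; r_dt: (2E,); W_a: (E, 2E)."""
    E = topic_embs.shape[1]
    lam = 1.0 / np.sqrt(E)                               # scaling λ = 1/√E
    scores = softmax(lam * topic_embs @ (W_a @ r_dt))    # s_i
    c_dt = scores @ topic_embs                           # Σ_i s_i t^(i)
    return c_dt, scores

# Toy sizes: E = 4, m = 3 topic tokens.
rng = np.random.default_rng(0)
t_embs = rng.normal(size=(3, 4))
r_dt = rng.normal(size=8)
W_a = rng.normal(size=(4, 8))
c_dt, s = topic_grouped_attention(t_embs, r_dt, W_a)
```

In the model, W_a is trained jointly with the rest of the network rather than sampled.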

Label Prediction
To predict the stance label, we combine the output of our topic-grouped attention with the document token embeddings and pass the result through a feed-forward neural network with learned parameters and hidden size h to compute the output probabilities p ∈ R^3. We minimize cross-entropy loss.
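A minimal sketch of the prediction head follows. The two-layer structure with a tanh nonlinearity is an assumption for illustration; the paper specifies only a feed-forward network with hidden size h.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_stance(c_dt, d_avg, W1, b1, W2, b2):
    """Concatenate the attention output c_dt with averaged document
    embeddings d_avg and apply a (hypothetical) two-layer FFNN."""
    h = np.tanh(W1 @ np.concatenate([c_dt, d_avg]) + b1)
    return softmax(W2 @ h + b2)   # p ∈ R^3 over {pro, con, neutral}

def cross_entropy(p, y):
    """Training loss for gold label index y."""
    return -np.log(p[y])

rng = np.random.default_rng(1)
E, H = 4, 8
p = predict_stance(rng.normal(size=E), rng.normal(size=E),
                   rng.normal(size=(H, 2 * E)), np.zeros(H),
                   rng.normal(size=(3, H)), np.zeros(3))
```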

Data
We split VAST such that all examples sharing a document fall in the same partition. We create separate zero-shot and few-shot development and test sets. The zero-shot development and test sets consist of topics (and documents) that are not in the training set. The few-shot development and test sets consist of topics in the training set (see Table 4). For example, there are 600 unique topics in the zero-shot test set (none of which are in the training set) and 159 unique topics in the few-shot test set (which are in the training set). This design ensures that there is no overlap of topics between the training set and the zero-shot development and test sets, for both pro/con and neutral examples. We preprocess the data by tokenizing and removing stopwords and punctuation using NLTK.
Due to the linguistic variation in the topic expressions (§3.2), we examine the prevalence of lexically similar topics, LexSimTopics (e.g., 'taxation policy' vs. 'tax policy'), between the training and zero-shot test sets. Specifically, we represent each topic in the zero-shot test set and the training set using pre-trained GloVe embeddings (Pennington et al., 2014) and measure the similarity between test and training topic representations. We manually examine a random sample of zero-shot dev topics to determine an appropriate threshold θ. Using the manually determined threshold θ = 0.9, we find that only 16% (96 unique topics) of the topics in the entire zero-shot test set are LexSimTopics.
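The check can be sketched as follows. Cosine similarity over averaged word vectors is an assumed implementation choice, and the tiny `emb` lookup stands in for pre-trained GloVe vectors.

```python
import numpy as np

def topic_vector(topic, emb):
    """Average the (toy) word vectors of a topic phrase."""
    return np.mean([emb[w] for w in topic.split()], axis=0)

def is_lexsim(test_topic, train_topics, emb, theta=0.9):
    """Flag a zero-shot test topic as LexSim if its vector is within cosine
    similarity theta of any training topic."""
    v = topic_vector(test_topic, emb)
    for t in train_topics:
        u = topic_vector(t, emb)
        cos = v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
        if cos >= theta:
            return True
    return False

# Toy 2-d "GloVe" vectors:
emb = {"tax": np.array([1.0, 0.1]), "taxation": np.array([0.9, 0.2]),
       "policy": np.array([0.1, 1.0]), "school": np.array([-1.0, 0.5])}
near_dup = is_lexsim("taxation policy", ["tax policy"], emb)
distinct = is_lexsim("school policy", ["tax policy"], emb)
```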

Baselines and Models
We experiment with the following models:
• CMaj: the majority class computed from each cluster in the training data.
• BoWV: we construct separate BoW vectors for the text and topic and pass their concatenation to a logistic regression classifier.
• C-FFNN: a feed-forward network trained on the generalized topic representations.
• BiCond: a model for cross-target stance that uses bidirectional encoding, whereby the topic is encoded using a BiLSTM as h_t and the text is then encoded using a second BiLSTM conditioned on h_t (Augenstein et al., 2016). This model uses fixed pre-trained word embeddings. A weakly supervised version of BiCond is currently state-of-the-art on cross-target TwitterStance.
• CrossNet: a model for cross-target stance that encodes the text and topic using the same bidirectional encoding as BiCond and adds an aspect-specific attention layer before classification (Xu et al., 2018). CrossNet improves over BiCond in many cross-target settings.
• BERT-sep: encodes the text and topic separately using BERT, then performs classification with a two-layer feed-forward neural network.
• BERT-joint: contextual conditional encoding followed by a two-layer feed-forward neural network.
• TGA Net: our model using contextual conditional encoding and topic-grouped attention.

Hyperparameters
We tune all models using uniform hyperparameter sampling on the development set. All models are optimized using Adam (Kingma and Ba, 2015), with a maximum text length of 200 tokens (since < 5% of documents are longer) and a maximum topic length of 5 tokens; excess tokens are discarded. For BoWV we use all topic words and a comment vocabulary of 10,000 words, optimizing with L-BFGS and an L2 penalty. For BiCond and CrossNet we use fixed pre-trained 100-dimensional GloVe (Pennington et al., 2014) embeddings and train for 50 epochs with early stopping on the development set. For BERT-based models, we fix BERT, train for 20 epochs with early stopping, and use a learning rate of 0.001. We include complete hyperparameter information in Appendix A.1.1.
We cluster generalized topic representations using Ward hierarchical clustering (Ward, 1963), which minimizes the sum of squared distances within a cluster while allowing for variable sized clusters. To select the optimal number of clusters k, we randomly sample 20 values for k in the range [50, 300] and minimize the sum of squared distances for cluster assignments in the development set. We select 197 as the optimal k.

Results
We evaluate our models using macro-averaged F1 calculated on three subsets of VAST (see Table 5): all topics, topics only in the test data (zero-shot), and topics in the train or development sets (few-shot). We do this because we want models that perform well on both zero-shot topics and training/development topics.
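For reference, the metric itself is straightforward; the paper uses the scikit-learn implementation, and the pure-Python version below simply makes the computation explicit.

```python
def macro_f1(gold, pred, labels=("pro", "con", "neutral")):
    """Macro-averaged F1: per-class F1, averaged with equal class weight."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["pro", "con", "neutral", "pro"]
pred = ["pro", "con", "con", "pro"]
score = macro_f1(gold, pred)
```

Because every class contributes equally, a model that ignores the rare neutral class is penalized even if overall accuracy is high.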

[Table 6: Example zero-shot test topics with training topics from their assigned clusters, e.g., 'drug addicts war' clustered with 'drug', 'cannabis', 'legalization', 'gateway drug', 'addiction'; 'oil drilling' with 'natural resource', 'renewable energy', 'offshore drilling', 'offshore exploration'; 'free college education tax' with 'public school system', 'education tax', 'homeschool tax credit'.]

We first observe that CMaj and BoWV are strong baselines for zero-shot topics. Next, we observe that BiCond and CrossNet both perform poorly on our data. Although these were designed for cross-target stance, a more limited version of zero-shot stance, they suffer in their ability to generalize across a large number of targets when few examples are available for each.
We see that while TGA Net and BERT-joint are statistically indistinguishable on all topics, the topic-grouped attention provides a statistically significant improvement for few-shot learning on 'pro' examples (with p < 0.05). Note that conditional encoding is a crucial part of the model, as this provides a large improvement over embedding the comment and topic separately (BERT-sep).
Additionally, we compare the performance of TGA Net and BERT-joint on both zero-shot LexSimTopics and non-LexSimTopics. We find that while both models exhibit higher performance on zero-shot LexSimTopics (.70 and .72 F1, respectively), these topics are such a small fraction of the zero-shot test topics that zero-shot evaluation primarily reflects model performance on the non-LexSimTopics. Additionally, the difference between performance on zero-shot LexSimTopics and non-LexSimTopics is smaller for TGA Net (only 0.04 F1) than for BERT-joint (0.06 F1), showing our model is better able to generalize to lexically distinct topics.
To better understand the effect of topic-grouped attention, we examine the clusters generated in §4.3 (see Table 6). The number of unique topics per cluster ranges from 6 to 166 (median 43). We see that the generalized representations are able to capture relationships between zero-shot test topics and training topics.
We also evaluate the percentage of times each of our best performing models (BERT-joint and TGA Net) is the best performing model on a cluster as a function of the number of unique topics (Figure 2) and cluster size (Figure 3). To smooth outliers, we first bin the cluster statistic and calculate each percentage for clusters with at least that value (e.g., clusters with at least 82 examples). We see that as the number of topics per cluster increases, TGA Net increasingly outperforms BERT-joint. This shows that the model is able to benefit from diverse topics being represented in the same manner. On the other hand, when the number of examples per cluster becomes too large (> 182), TGA Net's performance suffers. This suggests that when cluster size is very large, the stance signal within a cluster becomes too diverse for topic-grouped attention to use.

Challenging Phenomena
We examine the performance of TGA Net and BERT-joint on five challenging phenomena in the data: i) Imp: the topic phrase is not contained in the document and the label is not neutral (1231 cases), ii) mlT: a document appears in examples with multiple topics (1802 cases), iii) mlS: a document appears in examples with different, non-neutral, stance labels (as in (3) and (4) in Table 2) (952 cases), iv) Qte: a document with quotations, and v) Sarc: sarcasm, as annotated by Habernal et al. (2018).
We choose these phenomena to cover a range of challenges for the model. First, Imp examples require the model to recognize concepts related to the unmentioned topic in the document (e.g., recognizing that computers are related to the topic '3d printing'). Second, to do well on mlT and mlS examples, the model must learn more than global topic-to-stance or document-to-stance patterns (e.g., it cannot predict a single stance label for all examples with a particular document). Finally, quotes are challenging because they may repeat text with the opposite stance to what the author expresses themselves (see Appendix Table 20 for examples).
Overall, we find that TGA Net performs better on these difficult phenomena (see Table 7). These phenomena are challenging for both models, as indicated by the generally lower performance on examples with the phenomena compared to those without, with mlS being especially difficult. We observe that TGA Net has particularly large improvements on the rhetorical devices (Qte and Sarc), suggesting that topic-grouped attention allows the model to learn more complex semantic information in the documents.

Stance and Sentiment
Finally, we investigate the connection between stance and sentiment vocabulary. Specifically, we use the MPQA sentiment lexicon (Wilson et al., 2017) to identify positive and negative sentiment words in texts. We observe that in the test set, the majority (80%) of pro examples have more positive than negative sentiment words, while only 41% of con examples have more negative than positive sentiment words. That is, con stance is often expressed using positive sentiment words but pro stance is rarely expressed using negative sentiment words and therefore there is not a direct mapping between sentiment and stance.
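The polarity-counting analysis described above can be sketched as follows; the tiny lexicons stand in for the positive and negative entries of the MPQA lexicon.

```python
def majority_polarity(tokens, pos_lex, neg_lex):
    """Return '+', '-', or None depending on which sentiment polarity
    dominates a tokenized text."""
    pos = sum(t in pos_lex for t in tokens)
    neg = sum(t in neg_lex for t in tokens)
    if pos > neg:
        return "+"
    if neg > pos:
        return "-"
    return None  # tie: no majority polarity

# Toy stand-ins for the MPQA positive/negative word lists:
pos_lex = {"support", "free", "rationality"}
neg_lex = {"terrible", "criminal", "violent"}
m = majority_polarity("we support free parks".split(), pos_lex, neg_lex)
```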
We use M+ to denote majority positive sentiment polarity and, similarly, M− for majority negative. We find that on pro examples with M−, TGA Net outperforms BERT-joint, while the reverse is true for con examples with M+. For both stance labels and models, performance increases when the majority sentiment polarity agrees with the stance label (M+ for pro, M− for con). Therefore, we investigate how susceptible both models are to changes in sentiment.
To test model susceptibility to sentiment polarity, we generate swapped examples. For examples with majority polarity p, we randomly replace sentiment words with a WordNet synonym of opposite polarity until the majority polarity for the example is −p (see Table 8). We then evaluate our models on these swapped examples.

[Table 8: Examples with changed majority sentiment polarity, e.g., replacing "support(+)" with "patronize(-)" in a pro-stance comment on 'government spending on parks', and "terrible(-)" with "tremendous(+)" in a con-stance comment on 'guns'. Sentiment words are marked, with removed words struck out and positive (+) and negative (-) sentiment words labeled.]

When examples are changed from the opposite polarity toward agreement with the stance (− → + for pro, + → − for con), a model that relies too heavily on sentiment should increase in performance. Conversely, when converting to the opposite polarity (+ → − for pro, − → + for con), an overly reliant model's performance should decrease. Although the examples contain noise, we find that both models rely on sentiment cues, particularly when negative sentiment words are added to a pro-stance text. This suggests the models are learning strong associations between negative sentiment and con stance.
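The swap procedure can be sketched as below. The `opposite_syn` table is a toy stand-in for the WordNet synonym lookup, and the lexicons are tiny stand-ins for MPQA; the flip-check logic is an assumption about how "until the majority polarity is −p" is operationalized.

```python
import random

def swap_sentiment(tokens, polarity, opposite_syn, pos_lex, neg_lex, rng=random):
    """Replace sentiment words (in random order) with an opposite-polarity
    synonym until the text's majority sentiment polarity flips."""
    def counts(toks):
        return (sum(t in pos_lex for t in toks), sum(t in neg_lex for t in toks))

    tokens = list(tokens)
    idxs = [i for i, t in enumerate(tokens) if t in opposite_syn]
    rng.shuffle(idxs)
    for i in idxs:
        pos, neg = counts(tokens)
        flipped = (neg > pos) if polarity == "+" else (pos > neg)
        if flipped:
            break  # majority polarity is now -p
        tokens[i] = opposite_syn[tokens[i]]
    return tokens

# Toy lexicon and synonym table:
pos_lex = {"support", "tremendous"}
neg_lex = {"patronize", "terrible"}
opposite_syn = {"support": "patronize", "terrible": "tremendous"}
out = swap_sentiment(["they", "support", "tax", "breaks"], "+",
                     opposite_syn, pos_lex, neg_lex, rng=random.Random(0))
```

A full implementation would draw synonyms from WordNet and, as noted in Appendix A.1.4, ignore word sense and part of speech.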
Our results also show TGA Net is less susceptible to replacements than BERT-joint. On pro − → +, performance actually decreases by one point (BERT-joint increases by three points) and on con − → + performance only decreases by one point (compared to four for BERT-joint). TGA Net is better able to distinguish when positive sentiment words are actually indicative of a pro stance, which may contribute to its significantly higher performance on pro. Overall, TGA Net relies less on sentiment cues than other models.

Conclusion
We find that our model TGA Net, which uses generalized topic representations to implicitly capture relationships between topics, performs significantly better than BERT for stance detection on pro labels, and performs similarly on other labels. In addition, extensive analysis shows our model provides substantial improvement on a number of challenging phenomena (e.g., sarcasm) and is less reliant on sentiment cues that tend to mislead the models. Our models are evaluated on a new dataset, VAST, which we create and make available, containing a large number of topics with wide linguistic variation.
In future work we plan to investigate additional methods to represent and use generalized topic information, such as topic modeling. In addition, we will study more explicitly how to decouple stance models from sentiment, and how to improve performance further on difficult phenomena.

A.0.1 Crowdsourcing
We show a snapshot of one 'HIT' of the data annotation task in Figure 4. We paid annotators $0.13 per HIT. We had a total of 696, of which we removed 183 as a result of quality control.

A.0.2 Data
We show complete examples from the dataset in Table 10. These show the topics extracted from the original ARC stance position, potential annotations and corrections, and the topics listed by annotators as relevant to each comment.
In (a), (d), (i), (j), and (l) the topics make sense to take a position on (based on the comment), so annotators do not correct them and provide a stance label directly. In contrast, the annotators correct the provided topics in (b), (c), (e), (f), (g), (h), and (k). The corrections occur because the topic is not possible to take a position on (e.g., 'trouble') or not specific enough (e.g., 'california', 'a tax break'). In one instance, we can see that one annotator chose to correct the topic (k) whereas another annotator for the same topic and comment chose not to (j). This shows how complex the process of stance annotation is.
We also can see from the examples the variation in how similar topics are expressed (e.g., 'public education' vs. 'public schools') and the relationship between the stance label assigned for the extracted (or corrected) topic and the listed topics. In most instances, the same label applies to the listed topics. However, we show two instances where this is not the case: in (d) the comment actually supports 'public schools', and in (i) the comment is actually against 'airline'. This shows that this type of example (List, see §3.1.2), although somewhat noisy, is generally correctly labeled using the provided annotations.
We also show neutral examples from the dataset in Table 11. Examples 1 and 2 were constructed using the process described in §3.1.3. We can see that the new topics are distinct from the semantic content of the comment. Example 3 shows an annotator-provided neutral label, since the comment is neither in support of nor against the topic 'women's colleges'. This type of neutral example is less common than the constructed ones (1 and 2) and is harder, since the comment is semantically related to the topic.
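The constructed neutral examples pair a comment with a topic that is unrelated to its content. As a minimal sketch of one way such pairs could be generated (the paper does not give code; the word-overlap rejection below is our simplification of its semantic-distinctness requirement, and all names are illustrative):

```python
import random

def make_neutral_examples(corpus, seed=0):
    """Pair each comment with a topic drawn from a *different* example,
    skipping candidate topics that share words with the comment, so the
    new (comment, topic) pair can be labeled neutral.

    `corpus` is a list of (comment, topic) pairs; schema is illustrative.
    """
    rng = random.Random(seed)
    topics = [t for _, t in corpus]
    neutral = []
    for comment, own_topic in corpus:
        comment_words = set(comment.lower().split())
        # Reject the comment's own topic and any topic overlapping it lexically.
        candidates = [t for t in topics
                      if t != own_topic
                      and not (set(t.lower().split()) & comment_words)]
        if candidates:
            neutral.append((comment, rng.choice(candidates), "neutral"))
    return neutral
```

In practice a stricter semantic check (e.g., embedding similarity) would be needed, since lexical overlap alone can miss related topics.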

A.1.1 Hyperparameters
All neural models are implemented in PyTorch and tuned on the development set. Our logistic regression model is implemented with scikit-learn. The number of trials and training time are shown in Table 12. Hyperparameters are selected through uniform sampling. We also show the hyperparameter search space and best configuration for C-FFNN (Table 13), BiCond (Table 14), Cross-Net (Table 15), BERT-sep (Table 16), BERT-joint (Table 17), and TGA Net (Table 18). We use one TITAN Xp GPU.
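Uniform sampling over a search space amounts to drawing each hyperparameter independently from its range for every trial. A minimal sketch, assuming an illustrative search space (the ranges below are ours, not those of Tables 13-18) and a user-supplied `train_and_eval` function that returns dev F1:

```python
import math
import random

# Illustrative search space: continuous ranges and a discrete choice.
SEARCH_SPACE = {
    "lr": (1e-5, 1e-3),              # sampled log-uniformly
    "dropout": (0.0, 0.5),           # sampled uniformly
    "hidden_size": [64, 128, 256],   # sampled uniformly over a discrete set
}

def sample_config(rng):
    lo, hi = SEARCH_SPACE["lr"]
    return {
        "lr": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "dropout": rng.uniform(*SEARCH_SPACE["dropout"]),
        "hidden_size": rng.choice(SEARCH_SPACE["hidden_size"]),
    }

def tune(train_and_eval, n_trials=20, seed=0):
    """Run n_trials randomly sampled configs; keep the one with best dev F1."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = sample_config(rng)
        f1 = train_and_eval(cfg)  # trains a model, returns dev macro-F1
        if best is None or f1 > best[0]:
            best = (f1, cfg)
    return best
```

Fixing the seed makes the sequence of sampled configurations reproducible across runs.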
We calculate expected validation performance (Dodge et al., 2019) for F1 in all three cases and additionally show the performance of the best model on the development set (Table 19). Models are tuned on the development set, and we use macro-averaged F1 over all classes on zero-shot examples to select the best hyperparameter configuration for each model. We use the scikit-learn implementation of F1. We see that the improvement of TGA Net over BERT-joint is large on the development set.
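Expected validation performance (Dodge et al., 2019) estimates the best dev score one would expect after k random trials, using the empirical CDF of the observed per-trial scores. A self-contained sketch of that computation:

```python
def expected_max_performance(scores, k):
    """Expected maximum dev score over k random hyperparameter trials,
    estimated from the empirical CDF of the observed scores
    (Dodge et al., 2019)."""
    n = len(scores)
    xs = sorted(scores)
    expectation = 0.0
    prev_cdf = 0.0
    for i, v in enumerate(xs, start=1):
        cdf = i / n  # empirical P(score <= v)
        # P(max of k draws equals this order statistic).
        expectation += v * (cdf ** k - prev_cdf ** k)
        prev_cdf = cdf
    return expectation
```

For k = 1 this reduces to the mean of the observed scores, and as k grows it approaches the best observed score, which is why the curve is reported as a function of the trial budget.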

A.1.3 Error Analysis
We investigate the performance of our two best models (TGA Net and BERT-joint) on five challenging phenomena, as discussed in §5.4.1. The phenomena are: • Implicit (Imp): the topic is not contained in the document.
• Multiple Topics (mlT): the document has more than one topic.
• Multiple Stance (mlS): the document has examples with different, non-neutral, stance labels.
• Sarcasm (Sarc): the document contains sarcasm.
• Quotations (Qte): the document contains quotations.
We show examples of each of these phenomena in Table 20.
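Three of these phenomena can be flagged automatically from the annotations alone. A minimal sketch, assuming an illustrative example schema (dicts with `doc`, `topic`, `label` keys, which is not the dataset's actual format):

```python
from collections import defaultdict

def tag_phenomena(examples):
    """Flag Imp (no topic word appears in the document), mlT (the
    document is paired with more than one topic), and mlS (the
    document carries conflicting non-neutral labels)."""
    by_doc = defaultdict(list)
    for ex in examples:
        by_doc[ex["doc"]].append(ex)
    tags = []
    for ex in examples:
        t = set()
        doc_words = set(ex["doc"].lower().split())
        if not set(ex["topic"].lower().split()) & doc_words:
            t.add("Imp")
        group = by_doc[ex["doc"]]
        if len({e["topic"] for e in group}) > 1:
            t.add("mlT")
        non_neutral = {e["label"] for e in group if e["label"] != "neutral"}
        if len(non_neutral) > 1:
            t.add("mlS")
        tags.append(t)
    return tags
```

Sarcasm and quotation, by contrast, require inspecting the document text, so such labels would come from manual annotation or heuristics (e.g., quote characters).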

A.1.4 Stance and Sentiment
To construct examples with swapped sentiment words, we use the MPQA lexicon (Wilson et al., 2005) for sentiment words. We use WordNet to select synonyms with opposite polarity, ignoring word sense and part of speech. We show examples of each type of swap, + → − (Table 22) and − → + (Table 21). In total there are 1158 positive sentiment words and 1727 negative sentiment words from the lexicon in our data. Of these, 218 positive words have synonyms with negative sentiment, and 224 negative words have synonyms with positive sentiment.

Table 19: Best results on the development set and expected validation score (Dodge et al., 2019) for all tuned models. a is All, z is zero-shot, f is few-shot.
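The swap procedure can be sketched as follows. Since the real pipeline requires the MPQA lexicon and WordNet, the tiny `POLARITY` and `SYNONYMS` tables below are toy stand-ins for those resources (our invention, for illustration only):

```python
# Toy stand-ins for the MPQA polarity lexicon and WordNet synonym lookup.
POLARITY = {"great": "+", "benefit": "+", "awful": "-", "criminal": "-"}
SYNONYMS = {"great": ["awful"], "awful": ["great"], "benefit": ["profit"]}

def swap_sentiment(tokens, direction="+-"):
    """Replace each sentiment word that has a synonym of the opposite
    polarity; direction '+-' swaps positive -> negative, '-+' the
    reverse. Words without an opposite-polarity synonym are kept."""
    src, tgt = direction[0], direction[1]
    out = []
    for tok in tokens:
        if POLARITY.get(tok) == src:
            opposite = [s for s in SYNONYMS.get(tok, [])
                        if POLARITY.get(s) == tgt]
            if opposite:
                out.append(opposite[0])
                continue
        out.append(tok)
    return out
```

This mirrors the counts reported above: only the subset of sentiment words that actually have an opposite-polarity synonym (218 of 1158 positive, 224 of 1727 negative) get swapped; the rest pass through unchanged.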
Table 20: Examples of each of the challenging phenomena.

Type: Imp
Comment: No, it's not just that the corporations will have larger printers. It is that most of us will have various sizes of printers. IT's just what happened with computers. I was sold when some students from Equador showed me their easy to make, working, prosthetic arm. Cost to make, less than one hundred dollars.
Topic: 3d printing; Stance: Pro

Type: Sarc
Comment: yes, let's hate cyclists: people who get off their ass and ride, staying fit as they get around the city. they don't pollute the air, they don't create noise, they don't create street after street clogged with cars dripping oil... I think the people who hate cyclists are the same ones who hate dogs: they have tiny little shards of coal where their heart once was. they can't move fast or laugh, and want no one else to, either. According to the DMV, in 2009 there were 75,539 automobile crashes in New York City, less than 4 percent of those crashes involved a bicycle. cyclists are clearly the problem here.
Topic: cyclists; Stance: Pro

Type: Qte
Comment: "cunning, baffling and powerful disease of addiction" -LOL no. This is called 'demon possession'. Let people do drugs. They'll go through a phase and then they'll get tired of it and then they'll be fine. UNLESS they end up in treatment and must confess a disease of free will, in which case all bets are off.
Topic: disease of addiction; Stance: Con

Type: mlS
Comment: That this is even being debated is evidence of the descent of American society into madness. The appalling number of gun deaths in America is evidence that more guns would make society safer? Only in the US does this kind of logic translate into political or legal policy. I guess that's what exceptionalism means.
Stances: Con, Pro

Type: mlT
Comment: The focus on tenure is just another simplistic approach to changing our educational system. The judge also overlooked that tenure can help attract teachers. Living in West Virginia, a state with many small and isolated communities, why would any teacher without personal ties to our state come here, if she can fired at will? I know that I and my wife would not.
Topics: tenure, stability; Stances: Pro, Pro