Recognising Agreement and Disagreement between Stances with Reason Comparing Networks

We identify agreement and disagreement between utterances that express stances towards a topic of discussion. Existing methods focus mainly on conversational settings, where dialogic features are used for (dis)agreement inference. We extend this scope and seek to detect stance (dis)agreement in a broader setting, where independent stance-bearing utterances, which prevail in many stance corpora and real-world scenarios, are compared. To cope with such non-dialogic utterances, we find that the reasons uttered to back up a specific stance can help predict stance (dis)agreements. We propose a reason comparing network (RCN) to leverage reason information for stance comparison. Empirical results on a well-known stance corpus show that our method can discover useful reason information, enabling it to outperform several baselines in stance (dis)agreement detection.


Introduction
Agreement and disagreement naturally arise when people's views, or "stances", on the same topics are exchanged. Being able to identify the convergence and divergence of stances is valuable to various downstream applications, such as discovering subgroups in a discussion (Hassan et al., 2012; Abu-Jbara et al., 2012), improving recognition of argumentative structure (Lippi and Torroni, 2016), and bootstrapping stance classification with (dis)agreement side information (Sridhar et al., 2014; Ebrahimi et al., 2016).
Previous efforts on (dis)agreement detection are confined to the scenario of natural dialogues (Misra and Walker, 2013; Wang and Cardie, 2014; Sridhar et al., 2015; Rosenthal and McKeown, 2015), where dialogic structures are used to create a conversational context for (dis)agreement inference. However, non-dialogic stance-bearing utterances are also very common in real-world scenarios. For example, in social media, people can express stances autonomously, without the intention of initiating a discussion (Mohammad et al., 2016). There are also corpora built with articles containing many self-contained stance-bearing utterances (Ferreira and Vlachos, 2016; Bar-Haim et al., 2017).
Studying how to detect (dis)agreement between such independent stance-bearing utterances has several benefits: 1) pairing these utterances can lead to a larger (dis)agreement corpus for training a potentially richer model for (dis)agreement detection; 2) the obtained pairs enable training a distance-based model for opinion clustering and subgroup mining; 3) it is applicable to the aforementioned non-dialogic stance corpora; and 4) it encourages discovering useful signals for (dis)agreement detection beyond dialogic features (e.g., the reason information studied in this work).
In this work, we investigate how to detect (dis)agreement between a given pair of (presumably unrelated) stance-bearing utterances. Table 1 shows an example where a decision is made on whether two utterances agree or disagree on a discussed topic. This task, however, is more challenging than its dialogic counterpart, as the inference has to be made without using any contextual information (e.g., dialogic structures). To address this issue, one may need to seek clues within each of the compared utterances to construct appropriate contexts. When expressing stances, people usually back up their stances with specific explanations or reasons (Hasan and Ng, 2014; Boltuzic and Snajder, 2014). These reasons are informative about which stance is taken, because they give more details on how a stance is developed. However, simply comparing the reasons may not be sufficient to predict stance (dis)agreement, as sometimes people can take the same stance but give different reasons (e.g., the points "outlaws having guns" and "freedom of speech" mentioned in Table 1). One way to address this problem is to make the reasons stance-comparable, so that the reason comparison results can be stance-predictive.
In this paper, in order to leverage reason information for detecting stance (dis)agreement, we propose a reason embedding approach, where the reasons are made stance-comparable by projecting them into a shared, embedded space. In this space, "stance-agreed" reasons are close while "stance-disagreed" ones are distant. For instance, the reason points "outlaws having guns" and "freedom of speech" in Table 1 would be near to each other in that space, as they are "agreed" on the same stance. We learn such reason embedding by comparing the reasons using utterance-level (dis)agreement supervision, so that reasons supporting agreed (disagreed) stances would have similar (different) representations. A reason comparing network (RCN) is designed to learn the reason embedding and predict stance (dis)agreement based on the embedded reasons. Our method complements existing dialogic-based approaches by providing the embedded reasons as extra features. We evaluate our method on a well-known stance corpus and show that it successfully aligns reasons with (dis)agreement signals and achieves state-of-the-art results in stance (dis)agreement detection.

RCN: The Proposed Model
Figure 1 illustrates the architecture of RCN. At a high level, RCN is a classification model that takes as input an utterance pair (P, Q) and a topic T, and outputs a probability distribution y over three classes {Agree, Disagree, Neither} for stance comparison. To embed reasons, RCN uses two identical sub-networks (each containing an RNN encoder and a reason encoder) with shared weights to extract reason information from the paired utterances and predict their stance (dis)agreement based on the reasons.
RNN Encoder: In this module, we use RNNs to encode the input utterances. We first use word embedding to vectorise each word in the input utterance pair (P, Q) and topic T, obtaining three sequences of word vectors P, Q, and T. Then we use two BiLSTMs to encode the utterance and topic sequences, respectively. Moreover, following the work of Augenstein et al. (2016), we use conditional encoding to capture the utterances' dependencies on the topic. The outputs are two topic-encoded sequences produced by the utterance BiLSTM for P (Q), denoted by $H^P = \{h^P_i\}_{i=1}^{|P|} \in \mathbb{R}^{|P| \times 2h}$ ($H^Q = \{h^Q_j\}_{j=1}^{|Q|} \in \mathbb{R}^{|Q| \times 2h}$), where h is the hidden size of a unidirectional LSTM.
Reason Encoder: Then we extract reasons from the utterances, which is the main contribution of this work.In particular, we focus on the major reasons that most people are concerned with, which possess two properties: 1) they are focal points mentioned to support a specific stance; 2) they recur in multiple utterances.With such properties, the extraction of these reasons can then be reduced to finding the recurring focal points in all the input utterances.
To act on this insight, we take a weighting-based approach by learning a weighting matrix A that captures the relatedness between each position in an utterance and each implied reason. For example, on utterance P where we hypothesise κ possible reasons, the weighting matrix is $A \in \mathbb{R}^{|P| \times \kappa}$, with each cell $A_{i,k}$ representing the relatedness between the ith position of P and the kth reason.
To learn the weighting matrix A, we use self-attention (Cheng et al., 2016) and develop a particular self-attention layer for implementing the above weighting scheme. Meanwhile, the recurrence of a reason is also perceivable, as all utterances mentioning that reason are used to learn the self-attention layer.
Our particular self-attention applied on an utterance is designed as follows. First, a pairwise relatedness score is computed between each pair of positions $(h_i, h_j)$ with a bilinear transformation, $c_{i,j} = \tanh(h_i^\top W^{(1)} h_j)$, where $W^{(1)} \in \mathbb{R}^{2h \times 2h}$ is a trainable parameter. Next, for each position $h_i$, we convert its relatedness scores with all other positions into its overall relatedness scores with κ possible reasons using a linear transformation f,
$$e_{i,k} = f(\{c_{i,j}\}_j) = \sum_{j} c_{i,j} W^{(2)}_{j,k} + b_k, \quad (1)$$
where $W^{(2)} \in \mathbb{R}^{|P| \times \kappa}$ ($\mathbb{R}^{|Q| \times \kappa}$) and $b \in \mathbb{R}^{\kappa}$ are trainable parameters. The philosophy behind Eq. 1 is that the relatedness distribution $\{c_{i,j}\}$ of an utterance implies segments in it that are internally compatible, which may correspond to different focal points (reasons). The transformation f then learns the mapping between those two. Finally, we obtain the attention weight $A_{i,k}$ for each position i on each reason k by applying softmax over all $e_{*,k}$,
$$A_{i,k} = \frac{\exp(e_{i,k})}{\sum_{i'} \exp(e_{i',k})}. \quad (2)$$
With A obtained, we can compute an utterance's reason encoding as the sum of its RNN encoding $\{h_i\}$ weighted by A: $r_k = \sum_{i=1}^{|P|} A_{i,k} h_i$, where $r_k \in \mathbb{R}^{2h}$ is the encoding for the kth reason. We use $R^P = [r^P_1, \ldots, r^P_\kappa] \in \mathbb{R}^{2h \times \kappa}$ ($R^Q$) to denote the reason matrix for the utterance P (Q).
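The reason encoder described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the bilinear score, the linear map to κ reason scores, a softmax taken over positions for each reason, and a weighted sum of the position encodings. The function name `reason_encoder`, the list-of-lists representation and the random initialisation are all hypothetical.

```python
import math
import random

def reason_encoder(H, W1, W2, b):
    """Sketch of the reason encoder (names and data layout are assumptions).

    H  : n position vectors, each of length d (d = 2h in the text)
    W1 : d x d matrix for the bilinear relatedness score
    W2 : n x kappa matrix mapping relatedness scores to reason scores
    b  : bias of length kappa
    Returns (A, R): A is n x kappa attention weights, R holds kappa reason vectors.
    """
    n, d, kappa = len(H), len(H[0]), len(b)
    # W1 h_j for every position j
    WH = [[sum(W1[p][q] * H[j][q] for q in range(d)) for p in range(d)]
          for j in range(n)]
    # c[i][j] = tanh(h_i^T W1 h_j)
    C = [[math.tanh(sum(H[i][p] * WH[j][p] for p in range(d)))
          for j in range(n)] for i in range(n)]
    # e[i][k] = sum_j c[i][j] * W2[j][k] + b[k]
    E = [[sum(C[i][j] * W2[j][k] for j in range(n)) + b[k]
          for k in range(kappa)] for i in range(n)]
    # Softmax over positions, separately for each reason k
    A = [[0.0] * kappa for _ in range(n)]
    for k in range(kappa):
        m = max(E[i][k] for i in range(n))  # subtract max for numerical stability
        exps = [math.exp(E[i][k] - m) for i in range(n)]
        z = sum(exps)
        for i in range(n):
            A[i][k] = exps[i] / z
    # r_k = sum_i A[i][k] * h_i
    R = [[sum(A[i][k] * H[i][p] for i in range(n)) for p in range(d)]
         for k in range(kappa)]
    return A, R

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, kappa = 5, 4, 2  # toy sizes: 5 positions, 2h = 4, two reasons
H = random_matrix(n, d, rng)
A, R = reason_encoder(H, random_matrix(d, d, rng),
                      random_matrix(n, kappa, rng), [0.0] * kappa)
```

Because the second weight matrix has one row per position, a practical implementation would presumably pad or truncate utterances to a fixed length; that detail is not stated in the text and is left out of the sketch.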
It is worth noting that the above self-attention mechanism in our reason encoding can also be seen as a variant of multi-dimensional selfattention, as we simultaneously learn multiple attention vectors for the different reasons implied in an utterance.
Stance Comparator: Now we compare the stances of P and Q based on their reason matrices. Since we have captured multiple reasons in each utterance, all the differences between their reasons must be considered. We thus take a reason-wise comparing approach, where every possible pair of reasons between P and Q is compared. We employ two widely used operations for the comparison, i.e., multiplication: $s^{mul}_{i,j} = r^P_i \odot r^Q_j$ and subtraction: $s^{sub}_{i,j} = (r^P_i - r^Q_j) \odot (r^P_i - r^Q_j)$, where $\odot$ denotes element-wise multiplication. We then aggregate all the differences resulting from each operation into a single vector, by using a global max-pooling to signal the largest difference with respect to an operation,
$$s^{mul} = \mathrm{global\_max\_pooling}([s^{mul}_{1,1}, \ldots, s^{mul}_{\kappa,\kappa}]), \quad s^{sub} = \mathrm{global\_max\_pooling}([s^{sub}_{1,1}, \ldots, s^{sub}_{\kappa,\kappa}]). \quad (3)$$
The concatenation of the two difference vectors $s = [s^{mul}; s^{sub}]$ forms the output of this module.
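A small sketch may help make the reason-wise comparison concrete. It rests on two assumptions that are not spelled out in the text: the subtraction operation is taken to be the element-wise squared difference (a common choice in compare-aggregate models), and global max-pooling is applied per dimension across all reason pairs. The function name `stance_comparator` is hypothetical.

```python
def stance_comparator(R_P, R_Q):
    """Compare every reason of P with every reason of Q (illustrative sketch).

    R_P, R_Q : lists of reason vectors (all of the same length d)
    Returns the concatenation [s_mul; s_sub], a vector of length 2d.
    """
    d = len(R_P[0])
    # Element-wise product for every (i, j) reason pair
    mul = [[p * q for p, q in zip(rp, rq)] for rp in R_P for rq in R_Q]
    # Assumed form of "subtraction": element-wise squared difference
    sub = [[(p - q) ** 2 for p, q in zip(rp, rq)] for rp in R_P for rq in R_Q]
    # Global max-pooling: keep the largest value per dimension over all pairs
    s_mul = [max(v[t] for v in mul) for t in range(d)]
    s_sub = [max(v[t] for v in sub) for t in range(d)]
    return s_mul + s_sub

# Toy example: two reasons for P, one reason for Q, d = 2
s = stance_comparator([[1.0, 2.0], [3.0, 0.0]], [[2.0, 1.0]])
```

With κ reasons per utterance this compares κ × κ pairs, so the cost grows quadratically in κ; for the small values of κ discussed in the analysis (one or two reasons) this is negligible.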
(Dis)agreement Classifier: Finally, a classifier consisting of a two-layer feedforward network followed by a softmax layer produces the (dis)agreement class probability $\hat{y} = \{\hat{y}_1, \hat{y}_2, \hat{y}_3\}$ based on the comparison result s: $\hat{y} = \mathrm{softmax}(\mathrm{FeedForward}(s))$.
Optimisation: To train our model, we use the multi-class cross-entropy loss,
$$\mathcal{L}(\Theta) = -\sum_{i=1}^{N} \sum_{j=1}^{3} y^{(i)}_j \log \hat{y}^{(i)}_j + \lambda \sum_{\theta \in \Theta} \theta^2, \quad (4)$$
where N is the size of the training set, y ∈ {Agree, Disagree, Neither} is the ground-truth label indicator for each class, and ŷ is the predicted class probability. λ is the coefficient for L2-regularisation. Θ denotes the set of all trainable parameters in our model. Minimising Eq. 4 encourages the comparison results between the extracted reasons from P and Q to be stance-predictive.
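As a worked illustration of this objective, the following sketch computes a cross-entropy loss with an L2 penalty over a toy batch. It assumes that the loss is summed (not averaged) over the N training examples and that the penalty is the sum of squared parameters scaled by λ; the function name `regularised_cross_entropy` and the flat parameter list are hypothetical.

```python
import math

def regularised_cross_entropy(probs, labels, params, lam):
    """Multi-class cross-entropy plus L2 penalty (illustrative sketch).

    probs  : list of predicted class distributions, one per example
    labels : list of gold class indices (0 = Agree, 1 = Disagree, 2 = Neither)
    params : flat list of trainable parameter values (assumed layout)
    lam    : L2-regularisation coefficient
    """
    # With one-hot gold labels, only the gold class contributes to the sum
    cross_entropy = -sum(math.log(p[y]) for p, y in zip(probs, labels))
    l2_penalty = lam * sum(theta * theta for theta in params)
    return cross_entropy + l2_penalty

# One example predicted as Agree with probability 0.5, two parameters
loss = regularised_cross_entropy([[0.5, 0.25, 0.25]], [0], [1.0, 2.0], 0.1)
```

Here the cross-entropy term is ln 2 and the penalty is 0.1 × (1 + 4) = 0.5, so the toy loss is ln 2 + 0.5.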

Related Work
Our work is mostly related to the task of detecting agreement and disagreement in online discussions. Recent studies have mainly focused on classifying (dis)agreement in dialogues (Abbott et al., 2011; Wang and Cardie, 2014; Misra and Walker, 2013; Allen et al., 2014). In these studies, various features (e.g., structural, linguistic) and/or specialised lexicons are proposed to recognise (dis)agreement in different dialogic scenarios. In contrast, we detect stance (dis)agreement between independent utterances where dialogic features are absent.
Stance classification has recently received much attention in the opinion mining community. Different approaches have been proposed to classify stances of individual utterances in ideological forums (Murakami and Raymond, 2010; Somasundaran and Wiebe, 2010; Gottopati et al., 2013; Qiu et al., 2015) and social media (Augenstein et al., 2016; Du et al., 2017; Mohammad et al., 2017). In our work, we classify (dis)agreement relationships between a pair of stance-bearing utterances.
Reason information has been found useful in argumentation mining (Lippi and Torroni, 2016), where studies leverage stance and reason signals for various argumentation tasks (Hasan and Ng, 2014; Boltuzic and Snajder, 2014; Sobhani et al., 2015). We study how to exploit the reason information to better understand the stance, thus addressing a different task.
Our work is also related to the tasks on textual relationship inference, such as textual entailment (Bowman et al., 2015), paraphrase detection (Yin and Schütze, 2015), and question answering (Wang et al., 2016). Unlike the textual relationships addressed in those tasks, the relationships between utterances expressing stances do not necessarily contain any rephrasing or entailing semantics, but they do carry discourse signals (e.g., reasons) related to stance expressing.

Setup
Dataset: The evaluation of our model requires a corpus of agreed/disagreed utterance pairs. For this, we adapted a popular corpus for stance detection, i.e., a collection of tweets expressing stances from SemEval-2016 Task 6. It contains tweets with stance labels (Favour, Against, and None) on five topics, i.e., Climate Change is a Real Concern (CC), Hillary Clinton (HC), Feminist Movement (FM), Atheism (AT), and Legalization of Abortion (LA). We generated utterance pairs by randomly sampling from those tweets as follows: Agreement samples: 20k pairs labelled as (Favour, Favour) or (Against, Against); Disagreement samples: 20k pairs as (Favour, Against), (Favour, None), or (Against, None); Unknown samples: 10k pairs as (None, None) (footnote 1).
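The pairing scheme above can be sketched as follows. This is an illustration under stated assumptions, not the authors' sampling script: it assumes pairs are drawn within a single topic, that unordered pairs are not repeated, and that the "unknown" pairs correspond to the Neither class. The names `label_pair` and `sample_pairs`, the quota dictionary and the `max_tries` guard are hypothetical.

```python
import random

def label_pair(stance_a, stance_b):
    """Map two tweet-level stances to a pair-level label (assumed mapping)."""
    if stance_a == "None" and stance_b == "None":
        return "Neither"   # the "unknown" samples in the text
    if stance_a == stance_b:
        return "Agree"     # (Favour, Favour) or (Against, Against)
    return "Disagree"      # (Favour, Against), (Favour, None), (Against, None)

def sample_pairs(tweets, quotas, seed=0, max_tries=100000):
    """Randomly pair tweets on one topic until each label quota is met.

    tweets : list of (text, stance) tuples, stance in {"Favour", "Against", "None"}
    quotas : dict mapping pair label to the number of pairs wanted
    """
    rng = random.Random(seed)
    pairs = {label: [] for label in quotas}
    seen = set()
    for _ in range(max_tries):
        if all(len(pairs[label]) >= q for label, q in quotas.items()):
            break
        i, j = rng.sample(range(len(tweets)), 2)  # two distinct tweets
        key = (min(i, j), max(i, j))
        if key in seen:
            continue                              # skip repeated pairs
        label = label_pair(tweets[i][1], tweets[j][1])
        if label in quotas and len(pairs[label]) < quotas[label]:
            seen.add(key)
            pairs[label].append((tweets[i][0], tweets[j][0]))
    return pairs

# Toy corpus: three tweets per stance on a single topic
toy = [(f"tweet {s} {k}", s) for s in ("Favour", "Against", "None") for k in range(3)]
toy_pairs = sample_pairs(toy, {"Agree": 2, "Disagree": 2, "Neither": 1})
```

The quotas in the text (20k, 20k and 10k pairs) would be passed in the same way; the smaller quota for the (None, None) class reflects the smaller pool of none-stance tweets.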
Baselines: We compared our method with the following baselines: 1) BiLSTM: a base model for our task, where only the RNN encoder is used to encode the input; 2) DeAT (Parikh et al., 2016): a popular attention-based model for natural language inference; 3) BiMPM (Wang et al., 2017): a more recent natural language inference model where two pieces of text are matched from multiple perspectives based on pooling and attention.
Footnote 1: Fewer unknown pairs were sampled because the original corpus inherently contains fewer none-stance tweets.
Training details: An 80%/10%/10% split was used for training, validation and test sets. All hyper-parameters were tuned on the validation set. The word embeddings were initialised with the 200-dimensional GloVe word vectors pre-trained on the 27B Twitter corpus and kept static. The hidden sizes of the LSTM and FeedForward layers were set to 100. A light dropout (0.2) was applied to DeAT and a heavy dropout (0.8) to the rest. ADAM was used as the optimiser and the learning rate was set to $10^{-4}$. Early stopping was applied with the patience value set to 7.
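The early-stopping rule with a patience of 7 can be illustrated with a short sketch. The monitored quantity is not stated in the text; the code assumes a validation score where higher is better (for example, validation macro F1), and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_scores, patience=7):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores.

    Training stops once `patience` consecutive epochs have passed without
    improving on the best score seen so far (assumed: higher is better).
    """
    best_score, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Patience never exhausted: training ran through all epochs
    return len(val_scores) - 1, best_epoch

# Score peaks at epoch 1 and never improves afterwards
stop, best = early_stopping_epoch([0.50, 0.60] + [0.55] * 10, patience=7)
```

In this toy run the best score occurs at epoch 1, so training stops at epoch 8 and the checkpoint from epoch 1 would be the one evaluated on the test set.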

Results
Table 2 shows the results of our method and all the baselines on tasks with different topics. We can first observe that the proposed RCN consistently outperformed all the baselines across all topics. Although modest, all the improvements of RCN over the baselines are statistically significant at p < 0.05 with a two-tailed t-test. Among these methods, BiLSTM performed the worst, showing that only using the RNN encoder for sequence encoding is not sufficient for obtaining optimal results. DeAT and BiMPM performed similarly well; both used attention to compare the utterances at a fine-grained level, resulting in a boost of 2% to 5% over BiLSTM. Finally, RCN performed the best, with relative improvements from 2.1% to 10.4% over the second best. As all the compared methods shared the same RNN encoding layers, the fact that RCN performed the best empirically demonstrates the efficacy of its unique reason encoder and stance comparator in boosting performance.
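For reference, the two quantities used in this comparison, macro-averaged F1 (the metric reported in Table 2) and relative improvement, can be computed as in the sketch below. The function names and the toy labels are hypothetical; the numbers are illustrative and are not results from the paper.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, as a fraction."""
    return (new_score - old_score) / old_score

classes = ["Agree", "Disagree", "Neither"]
gold = ["Agree", "Agree", "Disagree", "Neither"]
pred = ["Agree", "Disagree", "Disagree", "Neither"]
toy_f1 = macro_f1(gold, pred, classes)
toy_gain = relative_improvement(0.66, 0.60)  # illustrative scores only
```

In the toy example the per-class F1 scores are 2/3, 2/3 and 1, giving a macro F1 of 7/9, and moving from 0.60 to 0.66 is a relative improvement of 10%.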

Analysis
In this section, we study what has been learned in the reason encoder of RCN. In particular, we show the attentive activations in the reason encoder (i.e., A in Eq. 2), and see if reason-related contents could draw more attention from RCN.
Table 3: The heatmaps of the attention weights assigned by the attention layer in the reason encoder to three tweet-pair examples. In each example, we show the text of each tweet, the topic, the correct (dis)agreement label, and the stance of each tweet (F: Favour, A: Against).
Example 1 (topic HC)
  Reason 1, first tweet: @HillaryClinton is a liar & corrupt. Period. End of story.
  Reason 1, second tweet: @HillaryClinton lies just for the fun of it, its CRAZY!!!!!
Example 2: Disagree, LA, (F, A)
  Reason 1, first tweet: I would never expect an 11 year old girl to have to carry a pregnancy to term
  Reason 1, second tweet: Actually, child murder is far worse these days. We live in more savage times.
Example 3: Agree, CC, (F, F)
  Reason 1, first tweet: Living an unexamined #life may be easier but leads to disastrous consequences.
  Reason 1, second tweet: There's no more normal rains anymore. Always storms, heavy and flooding.
  Reason 2, first tweet: Living an unexamined #life may be easier but leads to disastrous consequences.
  Reason 2, second tweet: There's no more normal rains anymore. Always storms, heavy and flooding.
Visualising attention signals in tweets: Table 3 shows the attention activations on three examples of tweet pairs chosen from our test set. For the first two, we set the number of reasons to be attended to as one. It can be seen that the parts of the tweets that received large attention weights (the highlighted words in Table 3) were quite relevant to the respective topics; liar, corrupt, and lie are words appearing in news about Hillary Clinton; girl, pregnancy, and murder are common words in the text about Legalization of Abortion. Also, most of the highlighted words have concrete meanings and are useful for understanding why the stances were taken. The last example shows a case where two reasons were attended to. We observe a similar trend as before: the highlighted contents were topic-specific and stance-revealing. Moreover, since one more reason dimension was added to be inferred in this case, RCN was able to focus on different parts of a tweet corresponding to the two reasons.
Visualising learned reasons: We also visualised the reasons learned by our model, represented as the words assigned the largest attention weights in our results (i.e., 1.0). Table 4 shows samples of such reason words. We see that the reason words have strong correlations with the respective topics, and, more importantly, they reflect different reason aspects regarding a topic, such as economy vs. community on Climate Change is a Real Concern and culture vs. justice on Legalization of Abortion.
In summary, the visualisations in Tables 3 and 4 both show that the attention mechanism employed by RCN is effective in finding different reason aspects that contribute to stance comparison.

Conclusion and Future Work
In this paper, we identify (dis)agreement between stances expressed in paired utterances. We exploit the reasons behind the stances and propose a reason comparing network (RCN) to capture the reason information to infer the stance (dis)agreement. A quantitative analysis shows the effectiveness of RCN in recognising stance (dis)agreement on various topics. A visualisation analysis further illustrates the ability of RCN to discover useful reason aspects for the stance comparison.
In the future, this work can be extended in several ways. First, it is necessary to evaluate our model on more stance data with different linguistic properties (e.g., the much longer and richer stance utterances in posts or articles). Second, it is important to show how the learned embedded reasons can help downstream applications such as stance detection. Finally, it would be insightful to further visualise the reasons in the embedded space with more advanced visualisation tools.

Figure 1: The architecture of RCN.

Table 1: The task of detecting stance (dis)agreement between utterances towards a topic of discussion.

Table 2: Classification performance of the compared methods on various topics, measured by the averaged macro F1-score over ten runs on the test data. Two-tailed t-test: ** p < 0.01; * p < 0.05.

Table 4: The reason words learned on various topics.