Discrete Argument Representation Learning for Interactive Argument Pair Identification

In this paper, we focus on identifying interactive argument pairs from two posts with opposite stances on a certain topic. Considering that opinions are exchanged from different perspectives of the topic under discussion, we study discrete representations for arguments to capture the varying aspects of argumentation language (e.g., the debate focus and the participant behavior). Moreover, we utilize a hierarchical structure to model post-wise information, incorporating contextual knowledge. Experimental results on a large-scale dataset collected from CMV show that our proposed framework significantly outperforms competitive baselines. Further analyses reveal why our model yields superior performance and prove the usefulness of our learned representations.


Introduction
Arguments play a central role in decision making on social issues. Striving to automatically understand human arguments, computational argumentation has become a growing field in natural language processing. It can be analyzed at two levels: monological argumentation and dialogical argumentation. Existing research on monological argumentation covers argument structure prediction (Stab and Gurevych, 2014), claim generation (Bilu and Slonim, 2016), essay scoring (Taghipour and Ng, 2016), etc. Recently, dialogical argumentation has become an active topic.
However, it is non-trivial to extract interactive argument pairs holding opposite stances. Returning to the example: given argument b1, which contains only four words, it is difficult, without richer contextual information, to understand why it has an interactive relationship with a1. In addition, without modeling the debating focuses of arguments, a model is likely to wrongly predict that b2 has an interactive relationship with a4 because they share more words. Motivated by these observations, we propose to explore discrete argument representations to capture varying aspects (e.g., the debate focus) of argumentation language and to learn context-sensitive argumentative representations for the automatic identification of interactive argument pairs. For argument representation learning, different from previous methods focusing on modeling continuous argument representations, we obtain discrete latent representations via discrete variational autoencoders and investigate their effects on the understanding of dialogical argumentative structure. For context representation modeling, we employ a hierarchical neural network to explore what content an argument conveys and how arguments interact with each other in the argumentative structure. To the best of our knowledge, we are the first to explore discrete representations for argumentative structure understanding. For model evaluation, we construct a dataset collected from CMV, which is built as part of our work and has been publicly released. Experimental results show that our proposed model significantly outperforms competitive baselines. Further analysis of the discrete latent variables reveals why our model yields superior performance. Finally, we show that the representations learned by our model can successfully boost the performance of argument persuasiveness evaluation.

Task Definition and Dataset Collection
In this section, we first define our task of interactive argument pair identification, followed by a description of how we collect the data for this task.

Task Definition
Given an argument q from the original post, a candidate set of replies consisting of one positive reply r^+ and several negative replies r^-_1 ∼ r^-_u, and their corresponding argumentative contexts, our goal is to automatically identify which reply has an interactive relationship with the quotation q.
We formulate the task of identifying interactive argument pairs as a pairwise ranking problem. In practice, we calculate the matching score S(q, r) for each reply in the candidate set with the quotation q and treat the one with the highest matching score as the winner.
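This ranking step can be sketched as follows. The snippet below is a minimal illustration only: the toy word-overlap scorer stands in for the learned matching function S(q, r), and all strings are invented examples.

```python
def identify_reply(quotation, candidates, score):
    """Return the candidate reply with the highest matching score S(q, r)."""
    return max(candidates, key=lambda r: score(quotation, r))

def overlap_score(q, r):
    # toy scorer: count of shared word types; NOT the model's learned S(q, r)
    return len(set(q.lower().split()) & set(r.lower().split()))

candidates = [
    "I disagree that family values lead to great results.",
    "The weather was nice yesterday.",
]
best = identify_reply("family values lead to great results",
                      candidates, overlap_score)
# the candidate sharing more words with the quotation wins under this scorer
```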

Dataset Collection
Our data collection is built on the CMV dataset released by Tan et al. (2016). In CMV, users submit posts to elaborate their perspectives on a specific topic, and other users are invited to argue for the other side to change the posters' stances. The original dataset was crawled using the Reddit API. Discussion threads from the period between January 2013 and May 2015 are collected as the training set, while threads between May 2015 and September 2015 form the test set. In total, there are 18,363 and 2,263 discussion threads in the training set and test set, respectively.
An observation on CMV shows that when users reply to a certain argument in the original post, they quote the argument first and then directly write a responsive argument, forming a quotation-reply pair. Figure 2 shows how quotation-reply pairs can be identified:

Original Post: ... Strong family values in society lead to great results. I want society to take positive aspects of the early Americans and implement that into society. This would be a huge improvement over what we have now. ...
User Post: > I want society to take positive aspects of the early Americans and implement that into society.
What do you believe those aspects to be? ...

Inspired by this finding, we extract interactive argument pairs with the quotation-reply relation. In general, the content of posts in CMV is informal, making it difficult to parse an argument at a finer grain into premises, conclusions and other components. Therefore, following the previous setting in Ji et al. (2018), we treat each sentence as an argument. Specifically, we only consider quotations containing one argument and view the first sentence after the quotation as the reply. We treat the extracted quotation-reply pairs as positive samples, and randomly select four replies from other posts that are also related to the original post to pair with the quotation as negative samples. In detail, each instance in our dataset includes the quotation, one positive reply, four negative replies, and the posts where they appear; these posts serve as the argumentative contexts mentioned below. In addition, we remove quotations from the argumentative contexts of replies.
We keep words with a frequency higher than 15, which yields a vocabulary of 20,692 distinct entries. To assure the quality of quotation-reply pairs, we only keep instances where the number of words in the quotation and the replies ranges from 7 to 45. We use the instances extracted from the training set and test set of Tan et al. (2016) for training and testing, respectively. The number of instances in the training and test sets is 11,565 and 1,481, respectively. We randomly select 10% of the training instances to form the development set. The statistics of our dataset are shown in Table 1.

Table 1: Overview statistics of the constructed dataset (mean and standard deviation). arg., q, p_r, n_r represent argument, quotation, positive reply and negative reply, respectively. q-p_r represents the quotation-reply pair between posts.
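A minimal sketch of the length filter described above; the example sentences are invented, and `keep_instance` is a hypothetical helper name, not from the released code.

```python
def keep_instance(quotation, replies, lo=7, hi=45):
    """Keep an instance only if the quotation and every reply have lo..hi words."""
    texts = [quotation] + list(replies)
    return all(lo <= len(t.split()) <= hi for t in texts)

ok = keep_instance(
    "I want society to take positive aspects of the early Americans.",
    ["What do you believe those aspects to be in practice today?"],
)
too_short = keep_instance(
    "Why?",
    ["Because I said so and that settles it completely, full stop."],
)
# ok is True (both texts fall in the 7..45 range); too_short is False
```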
To further demonstrate that quotation-reply pairs have interactive relationships, we randomly select 100 instances from the test set and hire two trained annotators, both fluent English speakers, to identify interactive argument pairs. The accuracy of the two annotators is 0.83 and 0.93, respectively. The inter-annotator agreement measured by Cohen's Kappa (Carletta, 1996) is 0.82. This confirms the quality of the constructed dataset.

Proposed Model
The overall architecture of our model is shown in Figure 3(a). It takes a quotation, a reply and their corresponding argumentative contexts as inputs, and outputs a real value as the matching score. It mainly consists of three components, namely, Discrete Variational AutoEncoders, Argumentative Context Modeling, and Argument Matching and Scoring. We learn discrete argument representations via the DVAE and employ a hierarchical architecture to obtain the argumentative context representations. The Argument Matching and Scoring component integrates semantic features of the quotation and the reply to calculate the matching score.

Discrete Variational AutoEncoders
We employ discrete variational autoencoders (DVAEs) (Rolfe, 2017) to reconstruct arguments via auto-encoding and obtain argument representations based on discrete latent variables that capture different aspects of argumentation language.
Encoder. Given an argument x with words w_1, w_2, ..., w_T, we first embed each word into a dense vector, obtaining w_1, w_2, ..., w_T correspondingly. Then we use a bi-directional GRU (Wang et al., 2018) to encode the argument.
We obtain the hidden state for a given word w_t by concatenating the forward and backward hidden states. Finally, we take the last hidden state h_T as the continuous representation of the argument.

Discrete Latent Variables. We introduce z as a set of K-way categorical variables z = {z_1, z_2, ..., z_M}, where M is the number of variables. Each z_i is independent, and the calculation process below extends straightforwardly to every latent variable. First, we calculate the logits l_i as follows:

l_i = W_l h_T + b_l

where W_l ∈ R^{K×E} is the weight matrix, E is the dimension of hidden units in the encoder, and b_l is a bias vector. After obtaining the logits l_i, we can calculate the posterior distribution and the discrete code of z_i:

q(z_i = k | x) = softmax(l_i)_k,   z_i = argmax_k q(z_i = k | x)

However, using discrete latent variables makes end-to-end training challenging. To alleviate this problem, we use the recently proposed Gumbel-Softmax trick (Lu et al., 2017) to create a differentiable estimator for categorical variables. During training we draw samples g_1, g_2, ..., g_K from the Gumbel distribution: g_k ∼ -log(-log(u)), where u ∼ U(0, 1) are uniform samples. Then, we compute the tempered softmax of l_i to get ω_i ∈ R^K:

ω_i = softmax((l_i + g) / τ)

where τ is a temperature hyper-parameter. With a low temperature τ, the vector ω_i is close to the one-hot vector representing the maximum index of l_i; with a higher temperature, ω_i is smoother.
Then we map the latent samples to the initial state of the decoder as follows:

h^0_dec = Σ_{i=1}^{M} W_{e_i}^T ω_i

where W_{e_i} ∈ R^{K×D} is the embedding matrix and D is the dimension of hidden units in the decoder. Finally, we use a GRU decoder to reconstruct the argument given h^0_dec.

Discrete Argument Representations. Through the auto-encoding process described above, we can reconstruct the argument. The representation we seek should capture varying aspects of argumentation language and contain the salient features of the argument. q(z_i | x) gives the probability distribution of z_i over K categories, which encodes salient features of the argument along varying aspects. Therefore, we take the posterior distributions of the discrete latent variables z as the discrete argument representation.
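To make the discrete-latent step concrete, here is an illustrative NumPy sketch, not the paper's implementation: logits for one K-way variable z_i, a Gumbel-Softmax sample used at training time, and the posterior q(z_i | x) used as the discrete representation. h_T, W_l and b_l are random stand-ins for the encoder state and weights.

```python
import numpy as np

rng = np.random.default_rng(0)
K, E = 5, 8                       # categories per variable, encoder dim

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gumbel_softmax(logits, tau=1.0):
    # g_k ~ -log(-log(u)), u ~ U(0, 1): the Gumbel noise from the paper
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))
    return softmax((logits + g) / tau)

h_T = rng.normal(size=E)          # last encoder hidden state (stand-in)
W_l = rng.normal(size=(K, E))
b_l = rng.normal(size=K)

logits = W_l @ h_T + b_l          # l_i = W_l h_T + b_l
posterior = softmax(logits)       # q(z_i | x): the discrete representation
omega = gumbel_softmax(logits, tau=0.5)  # differentiable, near one-hot sample
code = int(np.argmax(logits))     # discrete code of z_i at inference time
```

With a low temperature such as 0.5, `omega` concentrates near the one-hot vector at `code`, which is what lets the gradient flow through an (almost) discrete choice.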

Argumentative Context Modeling
Here, we introduce contextual information of the quotation and the reply to help identify interactive argument pairs. The argumentative context contains a list of arguments; following the previous setting in Ji et al. (2018), we consider each sentence in the context as an argument. Inspired by Dong et al. (2017), we employ a hierarchical architecture to obtain argumentative context representations.

Argument-level CNN. Given an argument and its word embeddings {e_1, e_2, ..., e_n}, we employ a convolution layer to incorporate context information at the word level:

s_i = W_s [e_i ; e_{i+1} ; ... ; e_{i+ws-1}] + b_s

where W_s and b_s are the weight matrix and bias vector, ws is the window size of the convolution layer, and s_i is the feature representation. Then, we conduct an attention pooling operation over all words to get the argument embedding vector:

m_i = tanh(W_m s_i + b_m)
u_i = exp(W_u · m_i) / Σ_j exp(W_u · m_j)
a = Σ_i u_i s_i

where W_m and W_u are a weight matrix and a weight vector, b_m is the bias vector, m_i and u_i are the attention vector and attention weight of the i-th word, and a is the argument representation.

Document-level BiGRU. Given the argument embeddings {a_1, a_2, ..., a_N}, we employ a bi-directional GRU to incorporate contextual information at the argument level.
Finally, we employ an average pooling over arguments to obtain the context representation C.
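The argument-level encoder can be sketched in NumPy as below. All weights are random stand-ins, and the document-level BiGRU is replaced here by plain average pooling over the per-argument vectors to keep the sketch short, so this is an approximation of the architecture, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(1)
d, ws = 6, 3                         # embedding dim, conv window size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conv_attention_pool(E, W_s, b_s, W_m, b_m, w_u):
    """One argument: windowed conv + attention pooling -> argument vector a."""
    n = E.shape[0]
    # s_i = W_s [e_i; ...; e_{i+ws-1}] + b_s  (valid windows only)
    S = np.stack([W_s @ E[i:i + ws].ravel() + b_s for i in range(n - ws + 1)])
    M = np.tanh(S @ W_m.T + b_m)     # m_i = tanh(W_m s_i + b_m)
    u = softmax(M @ w_u)             # attention weights over words
    return u @ S                     # a = sum_i u_i s_i

W_s = rng.normal(size=(d, ws * d)); b_s = rng.normal(size=d)
W_m = rng.normal(size=(d, d)); b_m = rng.normal(size=d)
w_u = rng.normal(size=d)

# a context of 3 arguments with different lengths (random word embeddings)
args = [rng.normal(size=(L, d)) for L in (5, 7, 4)]
arg_vecs = [conv_attention_pool(E, W_s, b_s, W_m, b_m, w_u) for E in args]
C = np.mean(arg_vecs, axis=0)        # average pooling -> context representation
```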

Argument Matching and Scoring
Once the representations of the quotation and the reply are generated, three matching methods are applied to analyze the relevance between the two arguments. We conduct element-wise product and element-wise difference to get the semantic features f_p = R_q * R_r and f_d = R_q − R_r. Furthermore, to evaluate the relevance between each word in the reply and the discrete representation of the quotation, we propose a quotation-guided attention to obtain a new representation of the reply.

Quotation-Guided Attention. We conduct a dot product between R_q and each hidden state h^r_j of the reply, and a softmax layer is used to obtain the attention distribution:

v_j = exp(R_q · h^r_j) / Σ_k exp(R_q · h^r_k)

Based on the attention probability v_j of the j-th word in the reply, the new representation of the reply is constructed as:

f_r = Σ_j v_j h^r_j

After obtaining the discrete representations, the argumentative context representations, and the semantic matching features f_p, f_d, f_r of the quotation and the reply, we use two fully connected layers to obtain a higher-level representation H. Finally, the matching score S is obtained by a linear transformation.
where W H and W S stand for the weight matrices, while b H and b S are weight vectors.
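The matching features can be sketched as follows; R_q, R_r and the reply hidden states are random stand-ins, so this illustrates only the feature computation, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
D, T = 6, 4                        # hidden dim, reply length

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

R_q = rng.normal(size=D)           # quotation representation
R_r = rng.normal(size=D)           # reply representation
H_r = rng.normal(size=(T, D))      # reply hidden states h^r_1 .. h^r_T

f_p = R_q * R_r                    # element-wise product
f_d = R_q - R_r                    # element-wise difference
v = softmax(H_r @ R_q)             # quotation-guided attention over reply words
f_r = v @ H_r                      # attention-weighted reply representation
```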

Joint Learning
The proposed model contains three modules, i.e., the DVAE, argumentative context modeling and argument matching, which are trained jointly. We define the loss function of the overall framework to combine the two effects:

L = L_dvae + λ L_match

where λ is a hyper-parameter balancing the two loss terms. The first loss term is defined on the DVAE, with cross-entropy as the reconstruction loss. We apply regularization on the KL cost term to address the posterior collapse issue. Due to space limitations, we leave out the derivation details and refer readers to Zhao et al. (2018).
The second loss term is defined on the argument matching. We formalize this as a ranking task and utilize a hinge loss for training:

L_match = Σ_{i=1}^{u} max(0, γ − S(q, r^+) + S(q, r^-_i))

where u is the number of negative replies in each instance, γ is a margin parameter, S(q, r^+) is the matching score of the positive pair, and S(q, r^-_i) is the matching score of the i-th negative pair.
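The hinge ranking loss for one instance can be sketched directly from this definition (the scores below are made-up numbers for illustration):

```python
def ranking_loss(s_pos, s_negs, gamma=10.0):
    """sum_i max(0, gamma - S(q, r+) + S(q, r-_i)), gamma = 10 in the paper."""
    return sum(max(0.0, gamma - s_pos + s_neg) for s_neg in s_negs)

# positive scored more than gamma above every negative -> zero loss
zero = ranking_loss(25.0, [10.0, 5.0, 3.0, 1.0])
# a negative within the margin contributes gamma - (s_pos - s_neg) = 10 - 4 = 6
loss = ranking_loss(12.0, [8.0])
```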

Training Details
We use GloVe (Pennington et al., 2014) word embeddings with a dimension of 50. The number of discrete latent variables M is 5, and the number of categories K for each latent variable is also 5. The encoder GRU has 200 hidden units, while the decoder has 400. We set the batch size to 32, the filter size to 5, the number of filters to 100, the dropout probability to 0.5, and the temperature τ to 1. The hyper-parameters in the loss function are set to γ = 10 for the margin and λ = 1 for balancing discrete argument representation learning and argument matching.
The proposed model is optimized by SGD with learning rate decay, starting from an initial learning rate of 0.1. We evaluate our model on the development set at every epoch to select the best model. During training, we run our model for 200 epochs with early stopping (Caruana et al., 2000).

Comparison Models
For baselines, we consider simple models that rank argument pairs by cosine similarity measured with two types of word vectors: TF-IDF scores (henceforth TF-IDF) and pre-trained word2vec embeddings (henceforth WORD2VEC). We also compare with neural models from related areas: MALSTM (Mueller and Thyagarajan, 2016), a popular method for sentence-level semantic matching, and CBCA-WOF (Ji et al., 2018), the state-of-the-art model for evaluating the persuasiveness of argumentative comments, tailored to fit our task. In addition, we compare with several ablations to study the contribution of our components. We first consider MATCH_rnn, which uses a BiGRU to learn argument representations and matches arguments without modeling their contexts. We then compare with ablations that adopt different context modeling methods: a BiGRU over the context words (henceforth MATCH_rnn+C_b), which ignores the argument interaction structure, and a hierarchical neural network (henceforth MATCH_rnn+C_h), which models argument interactions with a BiGRU and the words therein with a CNN. Finally, we compare with MATCH_ae+C_h and MATCH_vae+C_h, which employ an auto-encoder (AE) and a variational auto-encoder (VAE), respectively, in place of the DVAE module of our full model.

Results and Discussions
To evaluate the performance of different models, we first show the overall performance of different models for argument pair identification. Then, we conduct three analyses including hyper-parameters sensitivity analysis, discrete latent variables analysis and error analysis to study the impact of hyperparameters, explain why DVAE performs well on interactive argument pair identification and analyze the major causes of errors. Finally, we apply our model to a downstream task to investigate the usefulness of discrete argument representations.

Overall Performance Comparison
The overall results of different models are shown in Table 2.

Hyper-Parameter Sensitivity Analysis
We investigate the impact of two hyper-parameters on our model, namely the number of discrete latent variables M and the number of categories K for each latent variable in the DVAE. To study their impact, we set each of them to 1, 3, 5, 7 and 9 while keeping the other hyper-parameters the same as in our best model, and report P@1 for the different settings. As shown in Figure 4, the curves obtained by varying the two parameters follow a similar pattern: as the number increases, P@1 first grows gradually, peaks at 5, and then drops gradually. When K and M are relatively large, say larger than 3, our model always outperforms the VAE, the most competitive baseline, indicating the effectiveness of discrete representations for interactive argument pair identification.

Figure 5: Visualization of the posterior distributions of discrete latent variables z_1 ∼ z_5. The posterior distributions of z_1 ∼ z_5 for the Positive reply are more similar to those of the Quotation than those of the Negative replies.

Discrete Latent Variables Analysis
Here, we investigate why the DVAE performs best on interactive argument pair identification. Given an argument, we set M = 5, K = 5 and learn the corresponding discrete code set Z_code(1) ∼ Z_code(5).
Using the best model, we select the correctly matched instances in the dataset and cluster all quotations and corresponding replies by their discrete code sets. We obtain 2,272 clusters, of which 119 contain more than 100 arguments, and we find that arguments with the same discrete code set are semantically related.
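The clustering step above groups arguments by their discrete code tuple (z_1, ..., z_5), so arguments in a cluster agree on every latent variable. A minimal sketch, where the code tuples and argument texts are made-up stand-ins for the DVAE output:

```python
from collections import defaultdict

def cluster_by_code(args_with_codes):
    """Group argument texts by their discrete code tuple (z_1, ..., z_M)."""
    clusters = defaultdict(list)
    for text, code in args_with_codes:
        clusters[tuple(code)].append(text)
    return clusters

clusters = cluster_by_code([
    ("taxes should fund schools", (0, 3, 1, 4, 2)),
    ("school funding needs taxes", (0, 3, 1, 4, 2)),
    ("the weather is nice",        (2, 2, 0, 1, 4)),
])
# two clusters; the two semantically related arguments share one code tuple
```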
To show more intuitively why the DVAE performs well on our task, we select a case from our dataset, shown in Table 3, and employ the DVAE to learn discrete representations that capture the varying aspects z_1 ∼ z_5. The posterior distributions of the discrete latent variables z_1 ∼ z_5 for the quotation and the replies are shown in Figure 5.
As shown in Figure 5, each subgraph shows the distribution of z_i over the K categories for the quotation and the corresponding replies. We find that the posterior distributions of z_1 ∼ z_5 for the Positive reply are more similar to those of the Quotation than those of the Negative replies. This finding indicates that when two arguments are more semantically related, their posterior distributions on each aspect z_i are more similar. It further explains why the Positive reply has an interactive relationship with the Quotation and why the DVAE performs well on interactive argument pair identification.
Quotation: I bet that John Boehner would deal with congress as president more easily than Joe Biden due to his constant interaction with it.
Positive reply: Do you think that have anything to do with the fact that Boehner is a republican, and congress is controlled by republicans?
Negative reply 1: I would propose that the title of vice president be kept, but to remove their right to succession for presidency.
Negative reply 2: Does Biden have the same level of respect from foreign nations needed to guide the country?
Negative reply 3: He did lose however, so perhaps people do put weight into the vp choice.
Negative reply 4: I don't know why you think this can be ignored.

Error Analysis
Here, we inspect the outputs of our model to identify the major causes of errors. There are two major issues.
- Latent Space Error. The values of M and K may not cover the latent space of all arguments in the dataset. Natural language is complex and diverse; if the latent space cannot fully capture the semantic information of the arguments, our model fails. Considering that the number of aspects may vary across topics, a universal setting of K and M is not ideal.
- Attention Error. In our model, we employ a quotation-guided attention to evaluate the relevance between each word in the reply and the discrete representation of the quotation. If the attention focuses on unimportant words, errors result. It might be useful to utilize the discrete representation to further regulate the attention procedure.

Models               Pairwise accuracy
Tan et al. (2016)    65.70
Ji et al. (2018)     70.45
Our model            84.50

Table 4: The performance of different models on the task of argumentative comment persuasiveness evaluation on the dataset of Tan et al. (2016). Numbers for the two comparison models are copied from their original papers.

Effectiveness on Argumentative Comments Persuasiveness Evaluation
To further investigate the usefulness of our learned representations, we apply them to a downstream task: persuasiveness evaluation for argumentative comments (Tan et al., 2016; Ji et al., 2018). It takes two arguments as input (one original and one reply) and outputs a score evaluating the quality of the reply. The reasons for choosing this task are twofold. First, both tasks focus on dialogical arguments. Second, both tasks can be formulated as pairwise ranking problems. The performance of the different models is shown in Table 4. Note that we use the original CMV dataset and follow the previous setup of Tan et al. (2016) and Ji et al. (2018).
We find that our model outperforms the state-of-the-art method (Ji et al., 2018) by a large margin, which indicates that our learned representations can benefit downstream tasks.

Related Work
In this section, we introduce two major areas related to our work: dialogical argumentation and argument representation learning.

Dialogical Argumentation
Computational argumentation is a growing subfield of natural language processing in which arguments are analyzed in various respects. Previous work mainly focuses on analyzing the argumentative structure of texts. Recently, dialogical argumentation has become an active topic. Dialogical argumentation refers to a series of interactive arguments on a given topic, involving argument retraction, view exchange, and so on. Existing research covers discourse structure prediction, dialog summarization (Hsueh and Moore, 2007), etc. There have been several attempts to address tasks related to analyzing the relationships between arguments (Wang and Cardie, 2014; Persing and Ng, 2017) and evaluating the quality of persuasive arguments (Habernal and Gurevych, 2016). Gottipati et al. (2013) use sentiment lexicons as a preprocessing step and propose a probabilistic graphical model to predict the stance of arguments in their dataset. Park et al. (2011) design several argumentation-motivated features for debate stance classification in Korean newswire discourse. Sridhar et al. (2015) consider the joint stance classification of arguments and the relations among them, and find that a multi-level model works better, and that combining post-level and author-level collective modeling of both stance and disagreement can bring further performance improvements.
Wang and Cardie (2014) create a dispute corpus from Wikipedia and use sentiment analysis to predict the dispute labels of arguments. Wei et al. (2016) collect a dataset from CMV and analyze the correlation between disputing quality and disputation behaviors. Other work analyzes disputation actions in online debates: given an original argument and an argument disputing it, the goal is to evaluate the quality of the disputing comment based on the original argument and the discussed topic. Habernal and Gurevych (2016) crowdsource the UKPConvArg1 corpus to study what makes an informal social media argument convincing, and use an SVM and a bidirectional LSTM to experiment on their annotated datasets. Tan et al. (2016) pay attention to belief change in the ChangeMyView subreddit, in which an original poster challenges others to change his/her opinion. They construct datasets from CMV and employ logistic regression to predict which reply in a pair is more persuasive. In addition, Persing and Ng (2017) annotate a corpus with persuasiveness scores and the errors arguments contain to analyze why arguments are unpersuasive.
Previous work mainly focuses on analyzing interactions between two arguments in a debate; research on interactions between posts remains limited. In this work, we propose a novel task of identifying interactive argument pairs from argumentative posts to further understand the interactions between posts. Our work is also related to similar tasks such as question answering and sentence alignment, which focus on designing attention mechanisms to learn sentence representations (Wang et al., 2017a) and their relations with others (Wang et al., 2017b). Our task is inherently different because our target arguments naturally occur in the complex interaction context of dialogues, which requires additional effort to understand the discourse structure therein.

Argument Representation Learning
Argument representation learning for natural language has been studied widely in the past few years. Prior approaches learn argument representations from both labelled and unlabelled data.
There have been attempts to use labeled or structured data to learn argument representations. Wieting et al. (2016) introduce a large sentential paraphrase dataset and use the paraphrase data to learn an encoder that maps synonymous phrases to similar embeddings. Later work explores the use of machine translation to obtain more paraphrase data, via back-translation of bilingual argument pairs, for learning paraphrastic embeddings, showing how neural back-translation can be used to generate paraphrases. Hermann and Blunsom (2013) apply a language-specific encoder to each argument and represent an argument by the mean vector of its words; they minimize the inner product between paired arguments in different languages as the training objective and do not rely on word alignments. Conneau et al. (2017) propose InferSent and prove that NLI is an effective task for pre-training and transfer learning to obtain generic argument representations. They train argument encoders to identify one of three relationships between two given arguments: entailment, neutral and contradiction. Results show that the argument representations learned by this task perform strongly on downstream transfer tasks.
Given the availability of practically unlimited textual data, learning argument representations via unsupervised methods is an attractive proposition. Kiros et al. (2015) present the Skip-Thought model for learning representations by predicting the previous and next arguments, a generalization of the skip-gram model (Mikolov et al., 2013). Exploiting the relatedness inherent in adjacent arguments, the model uses an encoder to encode a particular argument and a decoder to decode the words in adjacent arguments. Bowman et al. (2016) introduce variational autoencoders to incorporate distributed latent representations of entire arguments. In addition, Hill et al. (2016) propose the FastSent model, which uses the bag-of-words of an argument to predict the adjacent arguments. Logeswaran and Lee (2018) propose Quick-Thoughts to exploit the closeness of adjacent arguments, formulating argument representation learning as a classification problem.
Previous work focuses on learning continuous argument representations that lack interpretability. In this work, we study discrete argument representations that capture varying aspects of argumentation language.

Conclusion and Future Work
In this paper, we propose a novel task of interactive argument pair identification from two posts with opposite stances on a certain topic. We examine contexts of arguments and induce latent representations via discrete variational autoencoders. Experimental results on the dataset show that our model significantly outperforms the competitive baselines. Further analyses reveal why our model yields superior performance and prove the usefulness of discrete argument representations.
Future work will proceed in two directions. First, we will apply our model to other dialogical argumentation tasks, such as debate summarization. Second, we will utilize neural topic models to learn discrete argument representations and further improve their interpretability.