ECNU at SemEval-2018 Task 12: An End-to-End Attention-based Neural Network for the Argument Reasoning Comprehension Task

This paper presents our submission to SemEval-2018 Task 12: the Argument Reasoning Comprehension Task. We investigate an end-to-end attention-based neural network to represent the two lexically close candidate warrants. On the one hand, we extract their differing parts and use them as attention vectors to obtain distinguishable representations. On the other hand, we use their surroundings (i.e., claim, reason, debate context) as another attention vector to obtain contextual representations, which serve as the final clues for selecting the correct warrant. Our model achieves 60.4% accuracy and ranks 3rd among 22 participating systems.


Introduction
Reasoning is a crucial part of natural language argumentation. In order to comprehend an argument, one must analyze its warrant, which explains why its claim follows from its premises (aka reasons) (Habernal et al., 2018a). SemEval-2018 Task 12 provides the argument reasoning comprehension task (Habernal et al., 2018b). Given a reason and a claim, along with the title and a short description of the debate they occur in, the goal is to identify the correct warrant from two candidates. Figure 1 gives an example. The abstract structure of an argument is: reason → (since) warrant → (therefore) claim.
The challenging factor is that both candidate warrants are plausible and lexically very close while leading to contradicting claims (Habernal et al., 2018a). Here we give three examples of pairs of candidate warrants:

Ex1: A huge pandemic would (not) be a great news story.
Ex2: The role of a citizen and a supreme court justice are inseparable / separable.
Ex3: The rest of the comments can be skipped easily / make the section unreadable.

The differences are negative words, antonyms, or opposite phrases. It is therefore important to emphasize the differing parts to obtain distinguishable representations of the two warrants, which express opposite meanings.
To address this factor, we propose an end-to-end attention-based neural network. On the one hand, we extract the differing parts of the two warrants and use them as attention vectors to obtain the warrants' distinguishable representations. On the other hand, we represent their surroundings (i.e., reason, claim, debate context) as another attention vector to obtain the contextual representations.

System Architecture
Formally, given an instance containing two candidate warrants (W0, W1) and the context around the warrants (i.e., the reason R, the claim C, the debate title T, and additional debate information I), the goal is to choose the correct warrant y ∈ {0, 1}, where y = 0 means W0 is the correct answer, and y = 1 otherwise.

Overview
The network is inspired by the Siamese network (Mueller and Thyagarajan, 2016): the two candidate warrants are modeled with the same structure and shared parameters. Figure 2 illustrates our system architecture.
First, we extract the differing parts of the two warrants (Sec. 2.2) and encode each component with a CNN and a Bi-LSTM (Sec. 2.3). Next, intra-temporal attention is adopted to obtain distinguishable representations of the warrants; we apply the same strategy to the claim. The intra-temporal attention is introduced in Sec. 2.4.
After that, we concatenate the representations of the surroundings (i.e., reason, claim, debate context, and warrant) as another attention vector to obtain the contextual representations of the warrants.
Last, we adopt a dense layer to obtain the probability distribution over the two candidate warrants (Sec. 2.5). The contextual representations work as hidden clues to select the correct warrant.

Extract the Different Part
The two candidate warrants are lexically close (since they often mean the opposite), so we extract the differing part between them to serve as an attention vector that guides the neural network to generate distinguishable representations of the warrants.
To do this, we remove the longest common prefix and suffix and take the remaining part as the different part, denoted Diff_W0 and Diff_W1. Taking the cases from Sec. 1 as examples, this extracts "not be" as Diff_W0 and "be" as Diff_W1 in Ex1; "inseparable" as Diff_W0 and "separable" as Diff_W1 in Ex2; and "can be skipped easily" as Diff_W0 and "make the section unreadable" as Diff_W1 in Ex3. Note that if the remaining part is empty, we use the word after the prefix as the different part.
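The following is a minimal Python sketch of this heuristic, assuming whitespace-tokenized warrants; the function name and edge-case handling are illustrative rather than our exact implementation:

```python
def extract_diff(w0, w1):
    """Strip the longest common prefix and suffix of two token lists and
    return the remaining (differing) spans; a sketch, not our exact code."""
    # Longest common prefix length.
    p = 0
    while p < min(len(w0), len(w1)) and w0[p] == w1[p]:
        p += 1
    # Longest common suffix length, not overlapping the prefix.
    s = 0
    while s < min(len(w0), len(w1)) - p and w0[-1 - s] == w1[-1 - s]:
        s += 1
    diff_w0, diff_w1 = w0[p:len(w0) - s], w1[p:len(w1) - s]
    # If a remaining part is empty, fall back to the word after the prefix.
    diff_w0 = diff_w0 or w0[p:p + 1]
    diff_w1 = diff_w1 or w1[p:p + 1]
    return diff_w0, diff_w1

# Ex2: the differing parts are "inseparable" vs. "separable".
print(extract_diff("the roles are inseparable".split(),
                   "the roles are separable".split()))
# -> (['inseparable'], ['separable'])
```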
Similarly, we also obtain the different part between the claim and its opposite, denoted Diff_Claim. Opposite claims always exist in debates, since the reasoning chains R → W → C and R → ¬W → ¬C both exist. We collected the claims and warrants under the same debate: if the warrants express opposite meanings, then the two claims are opposite. Besides, the organizers also provide a similar dataset in "data/train-w-swapdoubled.tsv".

Context Representation
To incorporate the contextual information of each component in a debate, we combine a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to encode the input word vectors. The CNN is good at dealing with spatially related data, such as "sometimes warranted" and "rarely warranted", while the RNN is good at modeling temporal signals. Instead of a vanilla RNN, we use a Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) to alleviate the problem of long-term dependencies.
Given a sentence $S = \{w_i\}_1^n$, we first map each word $w_i$ into its vector representation $x_i \in \mathbb{R}^d$ via a look-up table of word embeddings, where $d$ is the dimension of the word embeddings.
Then, we apply the CNN to the input sequence $\{x_i\}_1^n$ to obtain the spatial representations $\{\hat{x}_i\}_1^n$:

$$\hat{x}_{i,j} = f(w_j \cdot x_{i:i+k-1} + b_j), \quad j = 1, \dots, m$$

where $k$ is the window size, $w_j$ and $b_j$ are the parameters of the $j$-th filter, $m$ is the number of filters, and $f$ is a non-linear activation. We pad the sequence before the convolution operation, so the resulting spatial representations $\hat{x}_i \in \mathbb{R}^m$ have the same length as the input sequence.
After that, we utilize a bi-directional LSTM (Bi-LSTM) to obtain the temporal information. For each time step $t$, the LSTM unit computes:

$$i_t = \sigma(W_i \hat{x}_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f \hat{x}_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o \hat{x}_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c \hat{x}_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the element-wise sigmoid function, $\odot$ is the element-wise product, and $i_t$, $f_t$, $o_t$, $c_t$ denote the input gate, forget gate, output gate, and memory cell, respectively.
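For concreteness, the following PyTorch sketch shows one way to implement this CNN + Bi-LSTM encoder, with the window sizes, filter number, and hidden size taken from the Parameters Setting section; the layer choices and names are assumptions, not our released code:

```python
import torch
import torch.nn as nn

class CNNBiLSTMEncoder(nn.Module):
    """Sketch of the context encoder: parallel 1-D convolutions over the
    word embeddings followed by a Bi-LSTM. Layer choices are assumptions."""
    def __init__(self, emb_dim=300, n_filters=50, windows=(1, 2, 3), hidden=50):
        super().__init__()
        # Padding keeps each convolution's output at least seq_len long.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in windows)
        self.bilstm = nn.LSTM(n_filters * len(windows), hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, seq_len, emb_dim)
        c = x.transpose(1, 2)                   # to (batch, emb_dim, seq_len)
        # Trim to seq_len so all window sizes align position by position.
        feats = [torch.relu(conv(c))[:, :, :x.size(1)] for conv in self.convs]
        h, _ = self.bilstm(torch.cat(feats, dim=1).transpose(1, 2))
        return h                                # (batch, seq_len, 2 * hidden)

# Example: encode a batch of 2 sentences of 12 (already embedded) words.
h = CNNBiLSTMEncoder()(torch.randn(2, 12, 300))  # -> shape (2, 12, 100)
```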

Intra-Temporal Attention
Inspired by Habernal et al. (2018a), we use an intra-temporal attention function to attend over specific parts of the input sequence. This kind of attention encourages the model to generate different representations according to the attention vector. Habernal et al. (2018a) have shown that such intra-temporal attention outperforms standard attention. We define $v_a$ as the attention vector and $h_t$ as the hidden state at time step $t$:

$$e_t = w^\top (v_a \odot h_t), \quad a_t = \frac{\exp(e_t)}{\sum_{k=1}^{n} \exp(e_k)}, \quad r = \sum_{t=1}^{n} a_t h_t$$

where $a_t$ is the attention weight over the hidden state $h_t$, $\odot$ is element-wise multiplication, $w$ is a learned vector, and $r$ is the attended representation.
We first apply the intra-temporal attention over W0 and W1, using Diff_W0 and Diff_W1 as the attention vectors, so as to obtain differentiated warrant representations. As a result, the model can easily distinguish the two candidate warrants. Similarly, we apply the attention over the claim to make the claim representation distinguishable.
Moreover, we adopt another intra-temporal attention over W0 and W1, with the concatenation of the {claim, reason, debate context} representations as the attention vector. The candidate warrants thus receive information from the claim, reason, and debate context, and the model can select the correct warrant that satisfies the reasoning chain R → W → C.
Finally, we obtain two attended warrant vectors, att_W0 and att_W1.
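A minimal PyTorch sketch of this attention mechanism, matching the equations above; the exact scoring function is our assumption:

```python
import torch
import torch.nn as nn

class IntraTemporalAttention(nn.Module):
    """Sketch: score each hidden state h_t against the attention vector
    v_a, softmax over time, and return the weighted sum of hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, 1, bias=False)   # the learned vector w

    def forward(self, h, v_a):      # h: (batch, T, dim), v_a: (batch, dim)
        e = self.w(h * v_a.unsqueeze(1)).squeeze(-1)    # e_t = w^T(v_a ⊙ h_t)
        a = torch.softmax(e, dim=-1)                    # attention weights a_t
        return torch.bmm(a.unsqueeze(1), h).squeeze(1)  # r = sum_t a_t h_t

# Example: attend over W0's hidden states with Diff_W0's representation.
att = IntraTemporalAttention(dim=100)
r = att(torch.randn(2, 12, 100), torch.randn(2, 100))   # -> shape (2, 100)
```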

Output
To compute the probability distribution over the two candidate warrants, we employ a feed-forward neural network with one dense layer and apply the softmax function to predict the probability.
For optimization, cross-entropy is used as the loss function, since we are handling a binary classification problem:

$$\mathcal{L} = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

where $y_i$ is the gold label of instance $i$ and $p_i$ is the predicted probability that $W_1$ is correct.
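A minimal PyTorch sketch of the output layer and loss, with the 100-dimensional warrant vectors (Bi-LSTM hidden size 50, bidirectional) and the 25-unit dense layer from the Parameters Setting section; the activation and other details are assumptions:

```python
import torch
import torch.nn as nn

# A shared dense layer scores each attended warrant vector; softmax over
# the two scores gives the distribution; cross-entropy is the loss.
dense = nn.Sequential(nn.Linear(100, 25), nn.ReLU(), nn.Linear(25, 1))

def log_probs(att_w0, att_w1):                    # each: (batch, 100)
    logits = torch.cat([dense(att_w0), dense(att_w1)], dim=-1)  # (batch, 2)
    return torch.log_softmax(logits, dim=-1)

log_p = log_probs(torch.randn(4, 100), torch.randn(4, 100))
y = torch.tensor([0, 1, 1, 0])                    # gold labels y_i
loss = nn.NLLLoss()(log_p, y)                     # cross-entropy on log-probs
```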

Datasets
SemEval-2018 provided 1,970 instances for the argument reasoning comprehension task (Habernal et al., 2018b). The instances are divided into three sets based on the years the debates are taken from. Table 1 lists the statistics of the datasets, including the number of debate topics in each set.

Parameters Setting
The word embeddings are initialized with the 300-dimensional pre-trained word2vec vectors (Mikolov et al., 2013) and are not fine-tuned during training. The window sizes of the CNN are (1, 2, 3) and the number of filters is 50. The hidden sizes of the Bi-LSTM and Att-LSTM are set to 50, and the dense layer in the output has 25 units. We train the model using Adam (Kingma and Ba, 2014) with gradient clipping (the max norm is set to 30) and a batch size of 32. The networks are regularized by dropout (the dropout ratio equals 0.8). We ran each model three times with random initializations. Our code is available at https://github.com/rgtjf/SemEval2018-Task12.
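A minimal PyTorch sketch of this training setup (frozen embeddings, Adam, gradient clipping, dropout); the toy model merely stands in for the full network:

```python
import torch
import torch.nn as nn

# Toy stand-in for the full network; only the optimization setup matters.
emb = nn.Embedding.from_pretrained(torch.randn(5000, 300), freeze=True)
model = nn.Sequential(nn.Dropout(0.8), nn.Linear(300, 2))
opt = torch.optim.Adam(model.parameters())        # Adam (Kingma and Ba, 2014)

x = torch.randint(0, 5000, (32,))                 # a batch of 32 word ids
y = torch.randint(0, 2, (32,))                    # binary gold labels
loss = nn.functional.cross_entropy(model(emb(x)), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=30)
opt.step()
```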

Results on Training Data
Table 2 shows the results of each component of our attention-based neural network. We have the following findings:
(1) Compared with the intra-warrant attention (w/ context) provided by the organizers, our basic model obtains a 2.8% improvement by sharing parameters in the Bi-LSTM. This indicates that the neural network needs sufficient training data, and parameter sharing can alleviate this demand.
(2) All three improvements achieve higher accuracy. This suggests that using the different part as the attention vector yields discriminative representations, which is beneficial for choosing the correct answer.
(3) The introduction of the CNN does not seem to improve the performance of the model. A possible reason is that the RNN can in principle learn the same computational function and capture the spatial information by itself.
(4) The ensemble of the three networks further improves the performance. Therefore, we configure the ensemble model as our final submission.


Results on Test Data

Table 3 lists the results of the top three systems and several baselines provided by the organizers. We find that: (1) Our model outperforms the intra-warrant attention w/ context model by 2% in terms of accuracy, which demonstrates the effectiveness of the proposed attention-based model. (2) Compared with GIST and blcu_nlp, our result is comparable to blcu_nlp but worse than GIST. Both of those systems use a pre-trained ESIM model (Chen et al., 2017) trained on the SNLI (Bowman et al., 2015) and MultiNLI (Nangia et al., 2017) datasets, whereas our model only uses the training dataset and does not require any extra resources. However, this is also a limitation of our model, because the small dataset is insufficient to learn all of its parameters.

Conclusion
In this work, we propose an end-to-end neural network for the argument reasoning comprehension task. We stack a CNN and an RNN to represent each component in a debate and extract the differing parts of the warrants and the claim as attention vectors to obtain their distinguishable representations. Moreover, we use another attention network to incorporate the information of the reason, claim, and debate context into the contextual representations of the warrants for the final decision. Our model achieves 60.4% accuracy and ranks 3rd among 22 participating systems.