YNU Deep at SemEval-2018 Task 12: A BiLSTM Model with Neural Attention for Argument Reasoning Comprehension

This paper describes the system submitted to SemEval-2018 Task 12 (The Argument Reasoning Comprehension Task). Enabling a computer to understand a text well enough to answer comprehension questions remains a challenging goal of NLP. We propose a Bidirectional LSTM (BiLSTM) model that reads two sentences separated by a delimiter to determine which warrant is correct. We extend this model with a neural attention mechanism that encourages the model to reason over the given claims and reasons. Officially released results show that our system ranks 6th among 22 submissions to this task.


Introduction
Machine comprehension of text is an important problem in natural language processing. Traditional approaches to machine comprehension are based on either hand-engineered grammars (Riloff and Thelen, 2000) or information extraction methods (Poon et al., 2010).
Recently, recurrent neural networks (RNNs) with long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) have been successfully applied to a wide range of NLP tasks, such as machine translation (Sutskever et al., 2014), constituency parsing (Vinyals et al., 2015), language modeling (Zaremba et al., 2014) and machine comprehension. A potential issue with LSTM models is that the network must compress all the necessary information of a source sentence into a fixed-length vector (Bahdanau et al., 2014). This may make it difficult for the network to cope with long sentences. To address this issue, attention mechanisms have been successfully added to LSTMs. The Attentive Reader used a tanh layer to compute the attention between document and question embeddings. This allows a model to focus on the aspects of a document that it believes are helpful for answering a question. Attention-based LSTM models have achieved state-of-the-art results in machine comprehension tasks (Kadlec et al., 2016; Chen et al., 2016; Tseng et al., 2016).
The argument reasoning comprehension task was presented by Habernal et al. (2018). The problem can be described as follows: given an argument consisting of a claim and a reason, the goal is to select the correct warrant that explains the reasoning of this particular argument. Compared to traditional machine comprehension tasks, argument reasoning comprehension requires models to possess extra reasoning abilities. Some models increase the depth of the network, continuously updating the representations of the documents and questions to realize the reasoning process (Sukhbaatar et al., 2015; Tseng et al., 2016; Dhingra et al., 2017; Sordoni et al., 2016).
In this paper, we use a BiLSTM model to encode the reason and claim pairs (reason-claim) and warrants. Then a word-to-sentence neural attention mechanism is implemented to improve the model performance.
The rest of the paper is organized as follows: Section 2 provides the details of the proposed model, Section 3 discusses the experimental settings and results, and Section 4 draws conclusions.

System Description
First, we concatenate the reason-claim and the warrants with a delimiter, and then encode the reason-claim with a BiLSTM. A second BiLSTM with different parameters encodes the delimiter and the warrants, but its memory state is initialized with the last cell state of the first BiLSTM. The attention mechanism is computed from the last output vector of the second BiLSTM and the output vector at each time step produced by the first BiLSTM. We then apply a tanh activation to obtain the final representation. Finally, we predict the correct label via a fully connected layer and a softmax activation.

Figure 1: Our BiLSTM model with neural attention for argument reasoning comprehension, which basically follows the attention model described in Rocktäschel et al. (2015).
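The input preparation described above can be sketched as follows. This is only an illustrative sketch: the delimiter token `<sep>`, the helper name `build_inputs`, the whitespace tokenization, and the example sentences are all assumptions made here, not details given in the paper.

```python
# Hypothetical delimiter token; the paper does not name the one it uses.
DELIMITER = "<sep>"

def build_inputs(reason, claim, warrants, delimiter=DELIMITER):
    """Return one (reason_claim_tokens, delimiter_plus_warrant_tokens) pair per warrant.

    The first element is what the first BiLSTM would read; the second is what
    the second BiLSTM would read. Whitespace splitting stands in for a real tokenizer.
    """
    reason_claim = reason.split() + claim.split()
    return [(reason_claim, [delimiter] + warrant.split()) for warrant in warrants]

# Made-up example instance, for illustration only.
pairs = build_inputs(
    reason="Cycling produces no emissions .",
    claim="Cities should build more bike lanes .",
    warrants=["Lower emissions are desirable .", "Lower emissions are not desirable ."],
)
for reason_claim, warrant_side in pairs:
    print(reason_claim, warrant_side)
```

Each candidate warrant yields its own pair, so the same reason-claim encoding can be scored against every warrant.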

LSTM & BiLSTM
Recurrent Neural Networks (RNNs) have been widely used to handle variable-length sequence input. RNNs are networks with loops in them, allowing information to persist. A potential issue with RNNs is that they become unable to connect earlier information as the length of the document grows. The LSTM (Hochreiter and Schmidhuber, 1997) is a popular RNN variant that mitigates the vanishing gradient problem. LSTMs have three gates: an input gate, a forget gate and an output gate. Gates are a way to optionally let information through. With these gates, LSTMs can remember information for long periods of time and mitigate the long-term dependency problem. Given an input vector $x_t$ at time step $t$, the previous output $h_{t-1}$ and cell state $c_{t-1}$, an LSTM with hidden state size $k$ computes the next output $h_t$ and new cell state $c_t$ as:

$$
\begin{aligned}
H &= [x_t; h_{t-1}] \\
i_t &= \sigma(W^{(i)} H + b^{(i)}) \\
f_t &= \sigma(W^{(f)} H + b^{(f)}) \\
o_t &= \sigma(W^{(o)} H + b^{(o)}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W^{(g)} H + b^{(g)}) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $W^{(i)}, W^{(f)}, W^{(o)}, W^{(g)}$ are trained weight matrices, $b^{(i)}, b^{(f)}, b^{(o)}, b^{(g)} \in \mathbb{R}^k$ are trained biases, and $\sigma$ and $\odot$ denote the sigmoid function and the element-wise multiplication of two vectors, respectively.
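A single LSTM step of this kind can be sketched in NumPy as follows. The sketch assumes the standard LSTM formulation; the function name `lstm_step`, the stacked parameter layout (`W` holding the input, forget, output and candidate blocks in that order) and the toy dimensions are illustrative choices, not taken from our implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One standard LSTM step.

    W has shape (4k, d + k) and b has shape (4k,); the four row blocks hold the
    input gate, forget gate, output gate and candidate parameters (assumed layout).
    """
    k = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:k])            # input gate
    f = sigmoid(z[k:2 * k])        # forget gate
    o = sigmoid(z[2 * k:3 * k])    # output gate
    g = np.tanh(z[3 * k:4 * k])    # candidate cell update
    c_t = f * c_prev + i * g       # new cell state
    h_t = o * np.tanh(c_t)         # new output
    return h_t, c_t

# Toy run over a random sequence of 4 input vectors (d = 5, k = 3).
rng = np.random.default_rng(0)
d, k = 5, 3
W = rng.normal(scale=0.1, size=(4 * k, d + k))
b = np.zeros(4 * k)
h, c = np.zeros(k), np.zeros(k)
for x in rng.normal(size=(4, d)):
    h, c = lstm_step(x, h, c, W, b)
```

Because the output is a sigmoid gate times a tanh, every component of `h` stays strictly inside (-1, 1).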
A unidirectional LSTM has the drawback of not using contextual information from future tokens. A BiLSTM exploits both the previous and the future context by processing the sequence in two directions, generating two independent sequences of LSTM output vectors. One processes the input sequence in the forward direction, while the other processes the input in the backward direction. The output at each time step is the concatenation of the two output vectors from both directions.
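The bidirectional wrapping can be illustrated with a short NumPy sketch. To keep it brief, a plain tanh recurrent cell stands in for the LSTM cell; that substitution, the function names `run_rnn` and `bidirectional`, and the toy dimensions are assumptions made for illustration only.

```python
import numpy as np

def run_rnn(xs, W, U):
    """Run a plain tanh recurrent cell (a stand-in for an LSTM) over xs of shape (T, d)."""
    h = np.zeros(U.shape[0])
    outputs = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        outputs.append(h)
    return np.stack(outputs)  # shape (T, k)

def bidirectional(xs, fwd_params, bwd_params):
    """Concatenate forward and backward outputs at each time step."""
    forward = run_rnn(xs, *fwd_params)
    # Read the sequence right to left, then flip back so time steps line up.
    backward = run_rnn(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)  # shape (T, 2k)

rng = np.random.default_rng(0)
T, d, k = 6, 4, 3
xs = rng.normal(size=(T, d))
W_f, U_f = rng.normal(size=(k, d)), rng.normal(size=(k, k))
W_b, U_b = rng.normal(size=(k, d)), rng.normal(size=(k, k))
outputs = bidirectional(xs, (W_f, U_f), (W_b, U_b))
print(outputs.shape)
```

The two directions use separate parameters, and the backward outputs are flipped so that row `t` of the result describes token `t` from both sides.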

Attention
The LSTM alleviates the vanishing gradient problem, but the problem persists over long-range contexts. The attention mechanism is introduced to address this issue. Attention is the idea of freeing the encoder-decoder architecture from the fixed-length internal representation. This is achieved by keeping the intermediate outputs of the encoder LSTM and training the model to pay selective attention to these outputs and relate them to items in the output sequence. Attention-based models have achieved state-of-the-art performance on many natural language processing tasks.
Let $C \in \mathbb{R}^{k \times T}$ be a matrix consisting of the output vectors $[h_1, h_2, \ldots, h_T]$ produced by the first BiLSTM when reading the $T$ words of the reason-claim, where $k$ is a hyperparameter denoting the number of hidden units of the LSTM. Moreover, let $h_{T+N}$ be the last output vector after the reason-claim and warrant have been processed by the two BiLSTMs, respectively. The attention mechanism produces a vector $\alpha$ of attention weights and a weighted representation $r$ of the reason-claim via:

$$
\begin{aligned}
M &= \tanh(W_c C + W_h h_{T+N} \otimes e_T) \\
\alpha &= \mathrm{softmax}(W_m^{\top} M) \\
r &= C \alpha^{\top}
\end{aligned}
$$

where $e_T$ is a vector of ones, $W_c, W_h \in \mathbb{R}^{k \times k}$ are trained projection matrices, and $W_m \in \mathbb{R}^k$ is a trained parameter vector. The final sentence-pair representation is obtained from a non-linear combination of the attention-weighted representation $r$ of the reason-claim and the last output vector:

$$
h^{*} = \tanh(W_i r + W_j h_{T+N})
$$

where $W_i, W_j \in \mathbb{R}^{k \times k}$ are trained projection matrices.
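The attention computation can be sketched in NumPy as follows, assuming the Rocktäschel-style formulation that the model follows. The function name `attend`, the intermediate names `M`, `alpha` and `h_star`, and the random toy parameters are illustrative assumptions; only `C`, `W_c`, `W_h`, `W_m`, `W_i`, `W_j` and `r` correspond to symbols in the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(C, h_last, W_c, W_h, w_m, W_i, W_j):
    """Word-to-sentence attention over the reason-claim outputs C (shape k x T)."""
    T = C.shape[1]
    # Repeat the projected last output vector across all T columns (outer product with ones).
    M = np.tanh(W_c @ C + np.outer(W_h @ h_last, np.ones(T)))   # (k, T)
    alpha = softmax(w_m @ M)                                    # (T,) attention weights
    r = C @ alpha                                               # (k,) weighted representation
    h_star = np.tanh(W_i @ r + W_j @ h_last)                    # (k,) final representation
    return alpha, r, h_star

rng = np.random.default_rng(0)
k, T = 4, 7
C = rng.normal(size=(k, T))
h_last = rng.normal(size=k)
W_c, W_h, W_i, W_j = (rng.normal(scale=0.5, size=(k, k)) for _ in range(4))
w_m = rng.normal(size=k)
alpha, r, h_star = attend(C, h_last, W_c, W_h, w_m, W_i, W_j)
```

The weights in `alpha` sum to one, so `r` is a convex combination of the per-word output vectors of the reason-claim.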

Experiments
The organizers provided training, development, and test sets containing 1210, 316, and 444 instances, respectively. We combine the reason and claim into one sentence, against which each warrant is judged as correct or not. The word tokenizer we adopted is TweetTokenizer from the Natural Language Toolkit (NLTK).
We compare two word embedding tools, Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Out-of-vocabulary words in the data sets are randomly initialized by sampling values uniformly from (-0.25, 0.25). We additionally experiment with bidirectional LSTMs. Given the small scale of the data sets, we run each model 10 times and take the average as the final result. We also use data augmentation, such as shuffling the sentence order, to expand the data set. Specifically, we randomize the word order of the reason-claims and the warrants to double the data set. A random seed is set to ensure that our results are reproducible.
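The out-of-vocabulary initialization and the shuffle augmentation can be sketched with the Python standard library as follows. The helper names, the `(reason_claim, warrant, label)` example layout, the seed value 42 and the embedding dimension are assumptions chosen for illustration; they are not details of our actual pipeline.

```python
import random

def init_oov_vector(dim, rng):
    """Sample an embedding for an out-of-vocabulary word uniformly from -0.25 to 0.25."""
    return [rng.uniform(-0.25, 0.25) for _ in range(dim)]

def shuffle_augment(examples, rng):
    """Return the original examples plus one word-shuffled copy of each (doubling the data)."""
    augmented = list(examples)
    for reason_claim, warrant, label in examples:
        shuffled_rc = reason_claim[:]     # copy so the original stays intact
        shuffled_w = warrant[:]
        rng.shuffle(shuffled_rc)
        rng.shuffle(shuffled_w)
        augmented.append((shuffled_rc, shuffled_w, label))
    return augmented

# A fixed seed makes both the sampling and the shuffling repeatable.
rng = random.Random(42)
examples = [
    (["cycling", "is", "clean", "so", "build", "lanes"], ["clean", "is", "good"], 1),
    (["cycling", "is", "clean", "so", "build", "lanes"], ["clean", "is", "bad"], 0),
]
augmented = shuffle_augment(examples, rng)
oov_vector = init_oov_vector(300, rng)
```

Shuffling keeps the bag of words and the label of every example and discards only the word order.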

Model            Dev Acc   Test Acc
BiLSTM           0.690     0.583
BiLSTM+Shuffle   0.642     0.570

Table 2: Performance of models with and without shuffling. Both models are based on the attention-based BiLSTM + GloVe architecture.
The results show that data augmentation like shuffling the sentence order does not have much

Conclusion and Future Work
In this paper, we present a BiLSTM model for argument reasoning comprehension. We adopt a word-to-sentence attention mechanism to improve the model's performance. In the future, we will utilize external knowledge to enhance the reasoning ability of our models. We will also pay more attention to the generalization of models on small data sets.