ITNLP-ARC at SemEval-2018 Task 12: Argument Reasoning Comprehension with Attention

Reasoning is an important topic with many applications in the field of natural language processing. Semantic Evaluation (SemEval) 2018 Task 12, “The Argument Reasoning Comprehension Task”, is devoted to research on natural language reasoning. For this task, we propose a novel argument reasoning comprehension system, ITNLP-ARC, which uses neural network techniques to solve the problem. In our system, an LSTM model encodes both the premise sentences and the warrant sentences. An attention model merges the two premise sentence vectors. By comparing the similarity between the attention vector and each of the two warrant vectors, we choose the one with the higher similarity as our system’s final answer.


Introduction
Reasoning is a very challenging but fundamental part of Natural Language Inference (NLI) (Chen et al., 2017), and many relevant tasks have been proposed, such as Recognizing Textual Entailment (RTE). Stanford University provided the Stanford Natural Language Inference (SNLI) corpus to support the natural language inference task. It contains two kinds of sentences, the premise sentence and the hypothesis sentence, and the mission is to judge whether the premise entails the hypothesis. Semantic Evaluation (SemEval) 2018 Task 12, The Argument Reasoning Comprehension Task, gives an argument consisting of a claim, a reason and two warrants. The goal is to select the correct warrant that explains the reasoning of this particular argument. Two options are given and only one is correct. Compared with the SNLI task (Bowman et al., 2015; Shen et al., 2018), it is more challenging, because it has abundant premise information (the reason, the claim and the text information), and the two candidate warrants have high semantic textual similarity (Habernal et al., 2017). In this task, we need to find an effective method to extract important information from these premise sentences.
Natural language reasoning can be applied in various fields such as question answering, information retrieval and so on. With the development of neural networks in natural language processing, sentence representation and reasoning have been widely researched and have taken significant steps forward. To deal with sequence problems, recurrent neural networks (RNNs) (Mikolov et al., 2010, 2011) introduced the concept of a hidden state, which can extract features from sequence-shaped data and convert them to an output; an RNN can thus encode a sentence into a fixed-length vector representation. In recent years, the long short-term memory (LSTM) network (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997), BiLSTM (Pennington et al., 2014a) and the gated recurrent unit (GRU) (Cho et al., 2014) have been widely used to obtain sentence representation vectors, and have achieved better results than traditional methods. The attention model, also known as the alignment model, pays more attention to the interaction between two sentences (Zheng et al., 2018; Gao et al., 2018), and is usually applied in information extraction, relation extraction, text summarization and machine translation. In machine translation, the attention model can focus on one or a few words of the input when generating each new word, making the translation more accurate. Rocktäschel et al. (2015) extend a neural word-by-word attention mechanism to encourage reasoning over entailment of pairs of words and phrases.
In our system, we use a long short-term memory network to encode sentences. To make full use of the information in the reason and the claim, we use an attention model to obtain an attention sentence vector. Then, we compute the similarity between each warrant sentence vector and the attention sentence vector, and take the warrant with the higher similarity as the answer. To make the system more accurate, we use an ensemble result as our final answer.

Method
The dataset consists of four items, the reason, the claim, the warrant and the alternative warrant (R, C, W, AW), plus two additional pieces of information: debateTitle and debateInfo. Let R be a reason for a claim C, both of which are propositions extracted from debateTitle and debateInfo. There are two warrants (AW, W) that justify the use of the reason R as support for the claim C. In this task, we choose the correct warrant using this premise information. In our system, we encode each sentence with an LSTM and merge two sentences with attention. Then we choose the warrant (AW or W) with the higher similarity between the warrant vector and the attention vector as our answer. The system's neural network model is shown in Fig 1. We build the system from five parts; the following is a detailed description.

LSTM
The long short-term memory (LSTM) network is a variant of the RNN, and it has been successfully applied to various kinds of NLP tasks. It alleviates the RNN's problems of gradient vanishing and gradient explosion and is good at dealing with sequence-shaped data. The LSTM controls its memory cell through an input gate, an output gate and a forget gate. The input is a sentence sequence X = {x_1, x_2, ..., x_n}, where x_i is the word vector of the i'th word in the sentence, and the output is the sequence of hidden states H = {h_1, h_2, ..., h_n}. Here, we use pretrained global vectors (GloVe) (Pennington et al., 2014b) to initialize the embedding layer, and the word embedding dimension is 300. The formulas for the LSTM are:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function and ⊙ is element-wise multiplication. In our experiment, we encode the reason sentence and the claim sentence with one LSTM encoder, and the warrant sentence with another. We try the LSTM's last output, mean pooling and max pooling as the sentence vector representation.
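As an illustration, the encoding step and the three pooling strategies can be sketched in plain numpy (a minimal single-layer LSTM with illustrative shapes and random parameters, not our actual Tensorflow implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_encode(X, W, U, b, k):
    """Run a single-layer LSTM over X (n_words x d); return all hidden states (n_words x k).
    W: (d, 4k), U: (k, 4k), b: (4k,) hold input/forget/output/candidate parameters stacked."""
    h = np.zeros(k)
    c = np.zeros(k)
    H = []
    for x in X:
        z = x @ W + h @ U + b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g            # memory cell update
        h = o * np.tanh(c)           # hidden state
        H.append(h)
    return np.stack(H)

rng = np.random.default_rng(0)
d, k, n = 300, 200, 5                # embedding dim, LSTM dim, sentence length
X = rng.normal(size=(n, d))          # stand-in for GloVe-embedded words
W = rng.normal(scale=0.01, size=(d, 4 * k))
U = rng.normal(scale=0.01, size=(k, 4 * k))
b = np.zeros(4 * k)

H = lstm_encode(X, W, U, b, k)
last_output = H[-1]                  # LSTM's last output
mean_pooled = H.mean(axis=0)         # mean pooling
max_pooled = H.max(axis=0)           # max pooling
```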

Attention
In the argument reasoning comprehension task, the claim sentence is extracted from the title and the information, and it supports the result. Therefore, the claim has a great impact on the reason sentence. We therefore use an attention model to focus on the words that the reason and the claim share, and thus obtain a better premise sentence representation. In this task, we use two kinds of attention model to merge the reason and claim vector representations. Let R ∈ R^{k×l_r} be a matrix consisting of the reason's LSTM output vectors R = {r_1, r_2, ..., r_{l_r}}, and C ∈ R^{k×l_c} be a matrix consisting of the claim's LSTM layer output vectors C = {c_1, c_2, ..., c_{l_c}}, where l_r is the length of the reason, l_c is the length of the claim, and k is the LSTM's output dimension.
One attention model is the seq-attention model. In our system, we represent the claim sentence vector as c ∈ R^k, where c is the LSTM's last output, mean pooling or max pooling. Then we calculate the similarity between the claim sentence vector c and each of the reason's output vectors {r_1, r_2, ..., r_{l_r}} as the attention weight, using the product of the two vectors as the similarity weight. Finally, we obtain the weighted reason sentence vector. The calculation process is as follows:

α = softmax(c^T R)
Rtt* = R α^T

where α is the attention weight and Rtt* ∈ R^k is the attention vector. The other kind of attention uses matrices to calculate the weights between the claim sentence vector and the reason sentence vectors: each sentence vector is given a weight matrix, and the attention vector is obtained by learning the weight matrices. The formulas are:

M = tanh(W_y R + W_h C_n ⊗ e_{l_r})
α = softmax(w^T M)
Rtt* = tanh(W_p R α^T + W_x C_n)

where W_y ∈ R^{k×k}, W_h ∈ R^{k×k}, w ∈ R^k, W_p ∈ R^{k×k} and W_x ∈ R^{k×k} are randomly initialized matrices, e_{l_r} ∈ R^{l_r} is a vector of ones (so that C_n ⊗ e_{l_r} repeats C_n over the l_r positions), C_n is the LSTM's last output, max pooling or mean pooling, and α is the attention weight. Rtt* ∈ R^k is the attention vector.
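The seq-attention variant can be sketched in numpy as follows (a minimal illustration; we assume here that the raw dot-product similarities are normalized with a softmax, and the dimensions are toy values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def seq_attention(R, c):
    """R: (l_r, k) reason LSTM outputs; c: (k,) claim sentence vector.
    Dot-product similarity gives the attention weights; returns the
    weighted reason vector Rtt* and the weights alpha."""
    alpha = softmax(R @ c)     # (l_r,) one weight per reason word
    return alpha @ R, alpha    # Rtt* in R^k

rng = np.random.default_rng(1)
k, l_r = 4, 6
R = rng.normal(size=(l_r, k))
c = rng.normal(size=k)
rtt, alpha = seq_attention(R, c)
```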

Text Similarity
There are many ways to calculate the similarity of text vectors, such as cosine distance, dot product and so on. In our system, we use a bilinear form to calculate the similarity between the attention vector and a warrant vector. The formula is:

sim(Rtt*, W) = Rtt*^T W_m W

where Rtt* is the attention vector, W_m ∈ R^{k×k} is a randomly initialized weight matrix, and W is the warrant sentence vector obtained from the LSTM's last output, max pooling or mean pooling.
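A minimal numpy sketch of this bilinear similarity and the resulting answer selection (dimensions and initialization are illustrative):

```python
import numpy as np

def bilinear_sim(rtt, W_m, w):
    """Bilinear similarity Rtt*^T W_m W between the attention vector
    and a warrant sentence vector."""
    return rtt @ W_m @ w

rng = np.random.default_rng(2)
k = 5
rtt = rng.normal(size=k)           # attention vector
W_m = rng.normal(size=(k, k))      # randomly initialized weight matrix
w0 = rng.normal(size=k)            # warrant 0 sentence vector
w1 = rng.normal(size=k)            # warrant 1 sentence vector

# The warrant with the higher similarity is chosen as the answer.
choice = 0 if bilinear_sim(rtt, W_m, w0) > bilinear_sim(rtt, W_m, w1) else 1
```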

Ensemble
Since neural networks have a large number of random parameters, we try different random initializations and change the network layer dimensions to adjust the network structure. To make the prediction more accurate, we run the program many times and use majority voting to obtain the final result.
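The voting step can be sketched as follows (a toy illustration with hypothetical per-run predictions):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-run answer lists (0 or 1 per instance).
    Returns the per-instance majority answer across runs."""
    ensembled = []
    for votes in zip(*predictions):
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled

# Three runs with different random initializations, four instances each.
runs = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
final = majority_vote(runs)   # majority answer per instance: [0, 1, 1, 0]
```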

Loss Function and Evaluation
We treat this task as a classification problem and use log-loss as our loss function. The formula is:

L = −(1/N) Σ_{i=1}^{N} [y_i log(h_i) + (1 − y_i) log(1 − h_i)]

where y_i is the label of the i'th instance and h_i is the probability calculated by the system.
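A minimal sketch of this loss (the clipping constant `eps` is an implementation detail we add to avoid log(0)):

```python
import math

def log_loss(y, h, eps=1e-12):
    """Binary cross-entropy: -(1/N) * sum(y_i*log(h_i) + (1-y_i)*log(1-h_i))."""
    total = 0.0
    for yi, hi in zip(y, h):
        hi = min(max(hi, eps), 1.0 - eps)   # clip probabilities away from 0 and 1
        total += yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
    return -total / len(y)

y = [1, 0, 1]           # gold labels
h = [0.9, 0.2, 0.6]     # predicted probabilities
loss = log_loss(y, h)
```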
We also treat it as a ranking problem, choosing the top-1 of the sorted results as the answer. The loss function is:

max(0, 1 − sim(r, wa) + sim(r, w)) (16)

where sim(r, wa) is the similarity between the premise and the correct warrant, and sim(r, w) is the similarity between the premise and the incorrect warrant.
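A sketch of this margin-based ranking loss:

```python
def ranking_loss(sim_true, sim_false, margin=1.0):
    """max(0, margin - sim(r, wa) + sim(r, w)): zero once the correct
    warrant outscores the incorrect one by at least the margin."""
    return max(0.0, margin - sim_true + sim_false)

# Margin satisfied: correct warrant scores far higher, so no loss.
no_loss = ranking_loss(2.0, 0.5)
# Margin violated: the incorrect warrant scores higher, so loss > margin.
some_loss = ranking_loss(0.3, 0.4)
```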
Systems are scored using accuracy:

accuracy = (number of correctly predicted instances) / (total number of instances)

Experiments and Results

Table 1 shows the parameter settings in our system.

Table 1: Parameter settings.
lstm input unit: 300
lstm output unit: 200
lstm input dropout: 0.6
lstm output dropout: 0.6
epoch: 40

Because we use Tensorflow to build our system, each sentence needs to be set to a fixed length. Sentences longer than 30 words are truncated from the back, and sentences shorter than 30 words are padded with zeros at the end.
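The fixed-length preprocessing can be sketched as follows (assuming sentences are already converted to lists of token ids, with 0 as the padding id):

```python
def pad_or_truncate(token_ids, max_len=30, pad_id=0):
    """Truncate sentences longer than max_len from the back; pad shorter
    ones with zeros at the end, so every input has a fixed length."""
    if len(token_ids) > max_len:
        return token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))

short = pad_or_truncate([5, 7, 9])              # padded up to 30 ids
long_ = pad_or_truncate(list(range(1, 41)))     # truncated down to 30 ids
```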
In our system, we build the argument reasoning comprehension model with neural networks. We try the LSTM's last output, max pooling and mean pooling to represent the sentence vector, and use two kinds of attention to merge the reason and the claim. Because the neural networks contain a large number of randomly initialized parameters, we run our system ten times and average the accuracy. Table 2 shows the accuracy with the log-loss function, and Table 3 shows the accuracy with the ranking loss function. From Table 2 and Table 3, we can conclude that mean pooling performs better than the last output and max pooling. Table 4 shows the accuracy of the ensemble of all neural network models, which is our system's final result.

Conclusion and Future Works
We propose a neural network model to address reasoning in NLP. We use an attention model and a bilinear function to calculate the similarity between the premise and the warrant. Our system's final result achieved an accuracy of 0.5521. From the experiments, we can see that the training accuracy and the development accuracy are much higher than the test accuracy, which may be due to overfitting. Decreasing the learning rate and using batch normalization may reduce the overfitting; we will try this in future work.