SNU_IDS at SemEval-2018 Task 12: Sentence Encoder with Contextualized Vectors for Argument Reasoning Comprehension

We present a novel neural architecture for the Argument Reasoning Comprehension task of SemEval 2018. It is a simple neural network consisting of three parts, collectively judging whether the logic built on a set of given sentences (a claim, reason, and warrant) is plausible or not. The model utilizes contextualized word vectors pre-trained on large machine translation (MT) datasets as a form of transfer learning, which can help to mitigate the lack of training data. Quantitative analysis shows that simply leveraging LSTMs trained on MT datasets outperforms several baselines and non-transferred models, achieving accuracies of about 70% on the development set and about 60% on the test set.


Introduction
The Argument Reasoning Comprehension Task (Habernal et al., 2018) is a newly released task that tackles the core of reasoning in natural language argumentation, highlighting the importance of implicit warrants.
Even though the task can be regarded as simple binary classification, it is challenging in several respects. First, the task requires human-level reasoning to judge whether a claim supported by a reason and a warrant is logically correct. Second, common knowledge, which is not present in the input sentences themselves, is often required to solve the problem. Third, even though each instance of the data is helpful, the amount of training data is relatively small for training prevailing complex neural models such as convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014) and recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) with (or without) attention mechanisms (Liu et al., 2016; Lin et al., 2017).
In this paper, we propose a new architecture named SECOVARC (Sentence Encoder with COntextualized Vectors for Argument Reasoning Comprehension) to deal with this complicated task. The main idea behind our model is that transfer learning can remedy the difficulties listed above. With experimental results and analysis, we show that a simple neural model enhanced by transferred knowledge can be competitive with complex models trained only on the given data.

Argument Reasoning Comprehension
The argument reasoning comprehension task comes with a new dataset. The goal is to choose the correct implicit warrant from two candidates, given a natural language argument consisting of a reason and a claim. The dataset consists of about 2K crowdsourced instances, each of which has a title and a short description of the debate from which the claim, the reason, and the two candidate warrants arose. For more details, refer to Habernal et al. (2018).

Transfer Learning in NLP
Transfer learning is a classic technique in machine learning that seeks to transfer beneficial knowledge from external resources to target models. It is well known to be effective, especially when training data is scarce.
An important example of the success of transfer learning in natural language processing (NLP) is pre-trained word representations such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), on which most modern NLP models are built. Furthermore, some recent works (Collobert et al., 2011; Mou et al., 2016; Min et al., 2017) concentrate on pre-training more sophisticated neural modules on top of word embeddings, showing that transfer learning can be a key to boosting the performance of NLP systems.

Unsupervised Sentence Representation
Following the success of unsupervised word representations, another line of research has emerged to facilitate transfer learning at the sentence level. The idea is that a generic sentence encoder, pre-trained in an unsupervised way, can generate sentence representations suitable for downstream tasks.
For instance, Kiros et al. (2015) propose an approach called Skip-Thought vectors that abstracts the skip-gram model of Word2Vec (Mikolov et al., 2013) to the sentence level. Many other unsupervised methods (Le and Mikolov, 2014; Dai and Le, 2015; Hill et al., 2016; Gan et al., 2017; Chen, 2017) have also been introduced as ways of building sentence representations.

Supervised Sentence Representation
Despite several attempts at learning sentence representations in an unsupervised manner, no consensus has been established so far on which method is best and can be adopted as a standard.
Meanwhile, sentence encoders trained on labeled datasets have been proposed as an alternative and shown to outperform the previous models even with a limited amount of data. Conneau et al. (2017) propose InferSent, which uses a simple bidirectional LSTM (Long Short-Term Memory; Hochreiter and Schmidhuber, 1997) with max-pooling trained on the Stanford Natural Language Inference dataset (SNLI; Bowman et al., 2015). McCann et al. (2017) propose CoVe and demonstrate that the encoder of a sequence-to-sequence model trained for machine translation can be reused as a generic sentence encoder.
In this paper, we focus on supervised pre-training with external data as an instantiation of transfer learning.

Model
In this section, we describe SECOVARC (Figure 1), which takes a set of three sentences (a claim, a reason, and a warrant) as input and outputs a score between 0 and 1 indicating how reasonable the claim is when it is based on the reason and the warrant.

Model Design
Before going into the details, we explain the motivation behind our model design decisions.
First, we let the model accept only one warrant instead of two candidates. This decision comes from the intuition that the model may learn to reason better when it judges whether the logic constructed on a claim, a reason, and a warrant is plausible, instead of just choosing the more probable of the two candidates.
Second, as mentioned earlier, one of the main concerns behind the model design is the lack of training data. To alleviate this problem, we use transfer learning while keeping the model as simple as possible (e.g. without introducing complex components such as attention mechanisms).

Model Specification
In this part, we describe the details of the proposed model, which is composed of three layers.

Encoding Layer
The encoding layer is the first part of our model and is in charge of encoding the three input sentences into corresponding sentence representations. In detail, it accepts the sequence of $n$ words $(w_1, w_2, \ldots, w_n)$ of one sentence at a time and outputs a fixed-length sentence representation $s$. Note that the same generic encoder is used to encode each input sentence.
Formally, each one-hot encoded word $w_i \in \mathbb{R}^{V}$ of the input sentence is converted into the corresponding word vector $x_i \in \mathbb{R}^{d_w}$ by a word embedding matrix $E \in \mathbb{R}^{d_w \times V}$. Then, the sequence of word vectors $x = [x_1, x_2, \ldots, x_n]$ is combined into $s \in \mathbb{R}^{d_s}$ by an encoder. Although many choices of encoder are possible, we use CoVe (McCann et al., 2017) with a pooling operation. CoVe is a two-layer Bi-LSTM pre-trained on large MT datasets, and it yields meaningful, contextualized sentence representations that would not be obtained if the encoder were trained from scratch.
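As a result, the representations of the claim ($s_c$), reason ($s_r$), and warrant ($s_w$) are derived from $x_c$, $x_r$, and $x_w$ as follows, where $\mathrm{Encoder}$ denotes CoVe followed by the pooling operation.

$$s_c = \mathrm{Encoder}(x_c), \quad s_r = \mathrm{Encoder}(x_r), \quad s_w = \mathrm{Encoder}(x_w)$$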
Among the various options for the pooling operation, we use max-pooling, which selects the maximum value over each dimension of the output, and last-pooling, which simply selects the last state of the output. We refer to the model using the CoVe encoder with max-pooling as SECOVARC-max and to the one with last-pooling as SECOVARC-last.
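The two pooling operations can be illustrated with a minimal sketch. It assumes the encoder output is available as a list of per-word state vectors; the function names `max_pool` and `last_pool` and the toy values are ours, not part of the original implementation. For a bidirectional encoder, what counts as the "last state" depends on how the two directions are combined, so the sketch simply takes the output at the final time step.

```python
def max_pool(states):
    """Max-pooling: element-wise maximum over all time steps."""
    return [max(dim_values) for dim_values in zip(*states)]


def last_pool(states):
    """Last-pooling: the encoder output at the final time step."""
    return list(states[-1])


# Toy encoder output for a 3-word sentence with 3-dimensional states
# (illustrative values, not actual CoVe outputs).
states = [
    [0.1, -0.3, 0.5],
    [0.7, 0.2, -0.1],
    [0.0, 0.4, 0.3],
]

s_max = max_pool(states)    # [0.7, 0.4, 0.5]
s_last = last_pool(states)  # [0.0, 0.4, 0.3]
```

Both functions return a vector whose length equals the state dimension, regardless of sentence length, which is what makes the sentence representation fixed-length.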

Localization Layer
Although all of the input sentences (i.e. the claim, reason, and warrant) are encoded by the same universal encoder, they need to be differentiated so that each sentence representation keeps its own role. For this reason, the localization layer is introduced to project (or 'localize') each $s$ onto its own semantic space.
We implement this layer simply as three separate fully-connected layers, following the intuition that our model should be simple. Therefore, the set of sentence representations $(s_c, s_r, s_w)$ is mapped to a set of localized features $(v_c, v_r, v_w)$, each in $\mathbb{R}^{d_f}$, with one fully-connected layer per sentence role.
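A minimal sketch of the localization layer follows. It assumes plain affine maps without an activation function, since the text does not specify one, and it uses toy dimensions instead of the actual sizes. The helper names (`make_linear`, `apply_linear`, `localize`) are hypothetical; the initialization range is the one given under Training Details.

```python
import random


def make_linear(d_in, d_out, rng):
    """One fully-connected layer: weights from U(-0.005, 0.005), zero biases."""
    weights = [[rng.uniform(-0.005, 0.005) for _ in range(d_in)]
               for _ in range(d_out)]
    biases = [0.0] * d_out
    return weights, biases


def apply_linear(layer, s):
    """Affine map W s + b (no activation; this is an assumption)."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, s)) + b
            for row, b in zip(weights, biases)]


def localize(layers, s_c, s_r, s_w):
    """Project each sentence representation with its own layer."""
    v_c = apply_linear(layers["claim"], s_c)
    v_r = apply_linear(layers["reason"], s_r)
    v_w = apply_linear(layers["warrant"], s_w)
    return v_c, v_r, v_w


D_S, D_F = 6, 3  # toy sizes for illustration only
rng = random.Random(0)
layers = {role: make_linear(D_S, D_F, rng)
          for role in ("claim", "reason", "warrant")}

# Feeding the same vector through all three layers gives three different
# outputs, because the layers do not share parameters.
s = [1.0, -0.5, 0.25, 0.0, 2.0, -1.0]
v_c, v_r, v_w = localize(layers, s, s, s)
```

Keeping the three layers separate is what lets an identical encoder output play different roles depending on whether it came from the claim, the reason, or the warrant.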

Output Layer
The output layer collects all features extracted by the previous layer and computes a final score between 0 and 1. To help the model make correct decisions, we introduce heuristic features such as $|v_w - v_r - v_c|$ and $v_w \odot v_r \odot v_c$ (where $\odot$ denotes element-wise multiplication), inspired by the work of Mou et al. (2015) for the SNLI task.
In the end, the final feature $v_f \in \mathbb{R}^{5 d_f}$ used to compute a score $y \in \mathbb{R}$ ($0 \le y \le 1$) is the concatenation of five vectors: $v_c$, $v_r$, $v_w$, and the two heuristic features. Then, logistic regression (for simplicity) is performed on $v_f$ to compute the final score.
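The feature construction and scoring step can be sketched as follows. The order in which the five vectors are concatenated is an assumption, as are the function names `build_features` and `score`; the toy vectors and the all-zero regression weights are for illustration only.

```python
import math


def build_features(v_c, v_r, v_w):
    """Concatenate v_c, v_r, v_w with the two heuristic features."""
    diff = [abs(w - r - c) for c, r, w in zip(v_c, v_r, v_w)]
    prod = [w * r * c for c, r, w in zip(v_c, v_r, v_w)]
    # Concatenation order is an assumption; the result has length 5 * d_f.
    return v_c + v_r + v_w + diff + prod


def score(v_f, weights, bias):
    """Logistic regression on the final feature: sigmoid(w . v_f + b)."""
    z = sum(a * x for a, x in zip(weights, v_f)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy localized features with d_f = 2.
v_c, v_r, v_w = [1.0, 2.0], [0.5, -1.0], [2.0, 3.0]
v_f = build_features(v_c, v_r, v_w)

# With all-zero weights the score is exactly 0.5, the sigmoid midpoint.
y = score(v_f, weights=[0.0] * len(v_f), bias=0.0)
```

For these toy inputs the two heuristic parts of `v_f` are `[0.5, 2.0]` (absolute difference) and `[1.0, -6.0]` (element-wise product).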
During training, the score can be used directly to optimize the model. At test time, on the other hand, we derive $y_1$ and $y_2$ from the trained model such that $y_1 = \mathrm{SECOVARC}(c, r, w_1)$ and $y_2 = \mathrm{SECOVARC}(c, r, w_2)$, where $c$, $r$, $w_1$, and $w_2$ are the claim, the reason, the first warrant, and the second warrant, respectively. Then, we select the warrant with the greater score as the final decision.
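The test-time decision rule can be sketched as below. The scoring function is passed in as a parameter; `toy_score` is a hypothetical stand-in based on word overlap, used only so that the example runs, and is not the trained model. The text does not say how ties are handled, so resolving a tie in favor of the second warrant is an arbitrary assumption.

```python
def choose_warrant(score_fn, claim, reason, warrant_1, warrant_2):
    """Return 1 or 2, the index of the warrant with the greater score."""
    y_1 = score_fn(claim, reason, warrant_1)
    y_2 = score_fn(claim, reason, warrant_2)
    # Ties go to warrant 2 (arbitrary choice, not specified in the text).
    return 1 if y_1 > y_2 else 2


def toy_score(claim, reason, warrant):
    """Hypothetical scorer: share of warrant words seen in claim or reason."""
    context = set(claim.lower().split()) | set(reason.lower().split())
    words = warrant.lower().split()
    return sum(word in context for word in words) / len(words)


decision = choose_warrant(
    toy_score,
    claim="cats make good pets",
    reason="cats are quiet",
    warrant_1="quiet pets are good",
    warrant_2="loud animals disturb neighbors",
)
```

Because each warrant is scored independently, the same model that was trained on single (claim, reason, warrant) triples can be reused unchanged for the two-way choice.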

Data Manipulation
As our model accepts only one warrant at a time, data pre-processing is necessary before training. We transform the original data so that the correct warrant has a target score of 1 and the incorrect warrant has a target score of 0. Note that this pre-processing procedure has the side effect of doubling the number of training instances.
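This pre-processing can be sketched as follows. The field names (`claim`, `reason`, `warrants`, `correct`) are hypothetical and do not reflect the actual file format of the dataset.

```python
def expand_instance(instance):
    """Turn one two-warrant instance into two single-warrant examples."""
    examples = []
    for index in (0, 1):
        examples.append({
            "claim": instance["claim"],
            "reason": instance["reason"],
            "warrant": instance["warrants"][index],
            # Target is 1 for the correct warrant and 0 for the other one.
            "label": 1.0 if index == instance["correct"] else 0.0,
        })
    return examples


def expand_dataset(instances):
    """Apply the expansion to every instance; the output is twice as long."""
    return [ex for inst in instances for ex in expand_instance(inst)]


original = [{
    "claim": "claim text",
    "reason": "reason text",
    "warrants": ["first warrant", "second warrant"],
    "correct": 1,
}]
training_examples = expand_dataset(original)
```

Each original instance contributes exactly one positive and one negative example, so the resulting training set is balanced by construction.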

Training Details
The dimension of a word vector ($d_w$) is fixed to 300, and the dimensions of the other vectors are set to $d_s = 600$ and $d_f = 300$. We use 840B GloVe to initialize the word embedding matrix. Other model weights, except for those of the CoVe encoder, are sampled from the uniform distribution $U(-0.005, 0.005)$, and biases are initialized to 0.
Our model is trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and a batch size of 64. The maximum number of training epochs is limited to 10, and we choose the best model based on development accuracy. All parameters in the model, including the word vectors, are fine-tuned during training.
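For reference, the training settings stated in this section can be collected in one place, together with a sketch of the model selection rule. The dictionary keys and the function `select_best_epoch` are our own naming, the accuracy values are made up for illustration, and preferring the earliest epoch on ties is an assumption.

```python
# Hyper-parameters as stated in the Training Details section.
HYPERPARAMS = {
    "word_dim": 300,          # dimension of a word vector
    "sentence_dim": 600,      # d_s
    "feature_dim": 300,       # d_f
    "init_range": 0.005,      # weights sampled from U(-0.005, 0.005)
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 64,
    "max_epochs": 10,
    "l2_weight": 1e-5,
    "dropout": 0.1,
}


def select_best_epoch(dev_accuracies):
    """Return (epoch index, accuracy) of the best development accuracy.

    On ties the earliest epoch is kept (an assumption).
    """
    best = max(range(len(dev_accuracies)), key=lambda e: dev_accuracies[e])
    return best, dev_accuracies[best]


# Made-up development accuracies for four epochs.
best_epoch, best_acc = select_best_epoch([0.60, 0.70, 0.65, 0.70])
```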
For regularization, the L2-norm of the parameters is added to the cross-entropy objective with a weight of 1e-5, and dropout (Srivastava et al., 2014) is applied with p = 0.1.

Experimental Results
Table 1 shows the accuracies of the variants of our model and the baselines (Habernal et al., 2018) on the development set and the test set. Due to the instability of results caused by random initialization, we report the mean and standard deviation of 20 experimental runs (with the same hyperparameters) for each model.
The reported results show that SECOVARC-last (w/ heuristics) outperforms all the baselines on the development set, with a mean accuracy of 70.6%. However, it is SECOVARC-max (w/ heuristics) that performs best on the test set, with a mean accuracy of 59.2%. We submitted an instance obtained from SECOVARC-last (w/ heuristics) and achieved the official result of 56.5% on the leaderboard.
[...] layer, except for the test accuracy of SECOVARC-last.

Table 2: Experiment on the possibility of transfer learning in the case of the argument reasoning comprehension task. Note that the heuristic methods are employed for all models.

Does transfer learning really work?
Even with the promising outcome of SECOVARC, the question remains of how to show the effectiveness of transfer learning for the task. To this end, we conduct additional experiments with three baselines called BoW, Bi-LSTM-last, and Bi-LSTM-max. Bi-LSTM-last and Bi-LSTM-max have the same architecture as SECOVARC, but the Bi-LSTMs in the encoding layer are randomly initialized rather than pre-trained. BoW differs from our proposed model in that it uses the average of the word vectors as the sentence representation instead of CoVe with pooling. Table 2 reports the comparison of the baselines and the variants of our model. The results show that our model consistently outperforms the baselines, which are trained from scratch. Moreover, the smaller standard deviations of the SECOVARC variants indicate that transfer learning can lead to more stable and successful training.
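The BoW sentence representation described above amounts to a per-dimension average of the word vectors. A minimal sketch, with a hypothetical function name and toy 2-dimensional vectors:

```python
def bow_sentence_vector(word_vectors):
    """Average the word vectors of a sentence, dimension by dimension."""
    n = len(word_vectors)
    return [sum(dim_values) / n for dim_values in zip(*word_vectors)]


# Two toy word vectors; real inputs would be 300-dimensional GloVe vectors.
s_bow = bow_sentence_vector([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

Unlike the Bi-LSTM encoders, this representation ignores word order, which makes it a useful lower bound for judging how much the contextual encoder contributes.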

Conclusion
In this paper, we present a novel neural architecture called SECOVARC, which utilizes a two-layer Bi-LSTM pre-trained on a large amount of machine translation data. We demonstrate that a neural model for the argument reasoning comprehension task can benefit from transfer learning when it is properly designed.
As future work, recent generic sentence encoders such as those of Subramanian et al. (2018) and Peters et al. (2018) could be applied in place of CoVe. Alternatively, the data itself could be expanded directly with sophisticated rules or heuristics.