Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference

The RepEval 2017 Shared Task aims to evaluate natural language understanding models for sentence representation, in which a sentence is represented as a fixed-length vector with neural networks and the quality of the representation is tested with a natural language inference task. This paper describes our system (alpha) that is ranked among the top in the Shared Task, on both the in-domain test set (obtaining a 74.9% accuracy) and on the cross-domain test set (also attaining a 74.9% accuracy), demonstrating that the model generalizes well to the cross-domain data. Our model is equipped with intra-sentence gated-attention composition which helps achieve a better performance. In addition to submitting our model to the Shared Task, we have also tested it on the Stanford Natural Language Inference (SNLI) dataset. We obtain an accuracy of 85.5%, which is the best reported result on SNLI when cross-sentence attention is not allowed, the same condition enforced in RepEval 2017.


Introduction
The RepEval 2017 Shared Task aims to evaluate language understanding models for sentence representation with natural language inference (NLI) tasks, where a sentence is represented as a fixed-length vector.
Modeling inference in human language is very challenging but is a basic problem in natural language understanding. Specifically, NLI is concerned with determining whether a hypothesis sentence h can be inferred from a premise sentence p.
Most previous top-performing neural network models on NLI use attention between a premise and its hypothesis, while how much information can be encoded in a fixed-length vector without such cross-sentence attention deserves further understanding. In this paper, we describe the model we submitted to the RepEval 2017 Shared Task (Nangia et al., 2017), which achieves the top performance on both the in-domain and cross-domain test sets.

Related Work
Natural language inference (NLI), also named recognizing textual entailment (RTE), includes a large body of early work on rather small datasets with more conventional methods (Dagan et al., 2005; MacCartney, 2009). More recently, large datasets have become available, which makes it possible to train natural language inference models based on neural networks (Bowman et al., 2015; Williams et al., 2017).
Natural language inference models based on neural networks fall mainly into two kinds: sentence encoder-based models and cross-sentence attention-based models. Among them, the Enhanced Sequential Inference Model (ESIM) with cross-sentence attention represents the state of the art (Chen et al., 2016b). However, in this paper we principally concentrate on sentence encoder-based models. Many researchers have studied sentence encoder-based models for natural language inference (Bowman et al., 2015; Vendrov et al., 2015; Mou et al., 2016; Bowman et al., 2016; Munkhdalai and Yu, 2016a,b; Liu et al., 2016; Lin et al., 2017). It is, however, not very clear whether the potential of sentence encoder-based models has been well exploited. In this paper, we demonstrate that the proposed model based on gated-attention can achieve a new state-of-the-art performance for natural language inference among sentence encoder-based models.

Methods
We present here the proposed natural language inference networks, which are composed of the following major components: word embedding, sequence encoder, composition layer, and the top-layer classifier. Figure 1 shows a view of the architecture of our neural language inference network.
Figure 1: A view of our neural language inference network.

Word Embedding
In our notation, a sentence (premise or hypothesis) is denoted as $x = (x_1, \ldots, x_l)$, where $l$ is the length of the sentence. We concatenate embeddings learned at two different levels to represent each word in the sentence: the character composition and the holistic word-level embedding. The character composition feeds all characters of each word into a convolutional neural network (CNN) with max-pooling (Kim, 2014) to obtain representations $c = (c_1, \ldots, c_l)$. In addition, we use the pre-trained GloVe vectors (Pennington et al., 2014) for each word as the holistic word-level embedding $w = (w_1, \ldots, w_l)$. Therefore, each word is represented as a concatenation of the character-composition vector and the word-level embedding, $e = ([c_1; w_1], \ldots, [c_l; w_l])$. This is performed on both the premise and the hypothesis, resulting in two matrices: $e^p \in \mathbb{R}^{n \times d_w}$ for the premise and $e^h \in \mathbb{R}^{m \times d_w}$ for the hypothesis, where $n$ and $m$ are the lengths of the premise and hypothesis, respectively, and $d_w$ is the embedding dimension.
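The word representation described above can be illustrated with a small pure-Python sketch. It is not the released implementation: the function names `char_cnn_max_pool` and `embed_word`, the zero-padding of short words, and the omission of a bias term and non-linearity are simplifying assumptions made only for illustration.

```python
def char_cnn_max_pool(char_vecs, filters):
    """Return one feature per filter: the maximum filter response over all
    character windows of a single (non-empty) word.

    char_vecs: list of character embedding vectors (one per character).
    filters:   list of filters; a filter of width k is a list of k weight
               vectors, each as long as a character embedding.
    """
    dim = len(char_vecs[0])
    features = []
    for filt in filters:
        width = len(filt)
        # Assumption: zero-pad short words so every filter sees one window.
        padded = list(char_vecs) + [[0.0] * dim] * max(0, width - len(char_vecs))
        responses = []
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            responses.append(sum(
                w * x
                for w_vec, x_vec in zip(filt, window)
                for w, x in zip(w_vec, x_vec)
            ))
        features.append(max(responses))  # max-pooling over character positions
    return features


def embed_word(char_vecs, filters, word_vec):
    """Concatenate the character-composition vector c_i with the word vector w_i."""
    return char_cnn_max_pool(char_vecs, filters) + list(word_vec)


# Toy example: a 3-character word, 2-dim character embeddings, two filters.
chars = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
toy_filters = [
    [[1.0, 1.0]],              # one filter of width 1
    [[1.0, 0.0], [0.0, 1.0]],  # one filter of width 2
]
e_i = embed_word(chars, toy_filters, word_vec=[0.5, -0.5])
```

In this toy setting the first two entries of `e_i` come from the character CNN and the last two are the word-level vector, mirroring the concatenation $[c_i; w_i]$.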

Sequence Encoder
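To represent words and their context in a premise and hypothesis, sentence pairs are fed into sentence encoders to obtain hidden vectors ($h^p$ and $h^h$). We use stacked bidirectional LSTMs (BiLSTMs) as the encoders. Shortcut connections are applied, which concatenate word embeddings and input hidden states at each layer in the stacked BiLSTM except for the bottom layer.

$$h^p = \mathrm{BiLSTM}(e^p) \in \mathbb{R}^{n \times 2d}, \qquad h^h = \mathrm{BiLSTM}(e^h) \in \mathbb{R}^{m \times 2d} \tag{1}$$

where $d$ is the dimension of the hidden states of the LSTMs. A BiLSTM concatenates a forward and a backward LSTM run over a sequence, starting from the left and the right end, respectively. The hidden states of a unidirectional LSTM ($\overrightarrow{h_t}$ or $\overleftarrow{h_t}$) are calculated as follows:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
u_t &= \tanh(W_u x_t + U_u h_{t-1} + b_u) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot u_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $\sigma$ is the sigmoid function, $\odot$ is the element-wise multiplication of two vectors, and $W$, $U$, and $b$ (with the corresponding subscripts) are parameters to be learned.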
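As an illustration of the encoder, one step of a unidirectional LSTM in its standard formulation (Hochreiter and Schmidhuber, 1997) and a single bidirectional layer built from it can be sketched in plain Python. This is a sketch under stated assumptions rather than the released code: the names `lstm_step`, `run_lstm`, and `bilstm`, the `params` layout, and the zero initial states are hypothetical, and the stacking with shortcut connections is not shown.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. `params` maps 'i', 'f', 'u', 'o' to (W, U, b)."""
    pre = {k: _affine(W, x_t, U, h_prev, b) for k, (W, U, b) in params.items()}
    i_t = [_sigmoid(z) for z in pre["i"]]    # input gate
    f_t = [_sigmoid(z) for z in pre["f"]]    # forget gate
    o_t = [_sigmoid(z) for z in pre["o"]]    # output gate
    u_t = [math.tanh(z) for z in pre["u"]]   # candidate update
    c_t = [f * c + i * u for f, c, i, u in zip(f_t, c_prev, i_t, u_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t, {"i": i_t, "f": f_t, "o": o_t}


def run_lstm(xs, params, d):
    """Run an LSTM over a sequence; the zero initial state is an assumption."""
    h, c = [0.0] * d, [0.0] * d
    states = []
    for x_t in xs:
        h, c, _ = lstm_step(x_t, h, c, params)
        states.append(h)
    return states


def bilstm(xs, fwd_params, bwd_params, d):
    """Concatenate forward and backward hidden states: n vectors of size 2d."""
    fwd = run_lstm(xs, fwd_params, d)
    bwd = run_lstm(xs[::-1], bwd_params, d)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]
```

`lstm_step` also returns the gate activations, since the composition layer makes use of the gates of the top-layer BiLSTM.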

Composition Layer
To transform sentences into fixed-length vector representations and reason using those representations, we need to compose the hidden vectors obtained by the sequence encoder layer (h p and h h ).
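We propose intra-sentence gated-attention to obtain a fixed-length vector. Illustrated by the case of the hidden states of the premise $h^p$,

$$v_g = \sum_{t=1}^{n} \frac{\|i_t\|_2}{\sum_{j=1}^{n} \|i_j\|_2} h^p_t$$

or

$$v_g = \sum_{t=1}^{n} \frac{\|1 - f_t\|_2}{\sum_{j=1}^{n} \|1 - f_j\|_2} h^p_t$$

or

$$v_g = \sum_{t=1}^{n} \frac{\|o_t\|_2}{\sum_{j=1}^{n} \|o_j\|_2} h^p_t$$

where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates of the top-layer BiLSTM at position $t$, and $\|\cdot\|_2$ indicates the $\ell_2$-norm, which converts vectors to scalars. The idea of gated-attention is inspired by the fact that humans only remember the important parts after they read a sentence. Liu et al. (2016) and Lin et al. (2017) proposed a similar "inner-attention" mechanism, but it is calculated by an extra MLP layer, which requires more computation than ours.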
We also use average-pooling and max-pooling to obtain fixed-length vectors $v_a$ and $v_m$ as in Chen et al. (2016b). Then, the final fixed-length vector representation of the premise is the concatenation of the gated-attention vector $v_g$ with $v_a$ and $v_m$:

$$v_p = [v_g; v_a; v_m]$$

As for the hidden states of the hypothesis $h^h$, we obtain $v_h$ through a similar calculation. Consequently, both the premise and the hypothesis are fed into the composition layer to obtain their fixed-length vector representations ($v_p$ and $v_h$, respectively).
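The composition layer can be sketched as follows. The sketch assumes that the gated-attention weight of each position is the $\ell_2$-norm of a gate vector at that position, normalized over the sentence; the function names are hypothetical, and the caller decides which gate vectors to pass in (for example, the input gates of the top-layer BiLSTM).

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def gated_attention(hidden, gates):
    """Weighted sum of hidden states; weights are normalized gate norms.

    hidden: list of n hidden vectors (one per position).
    gates:  list of n gate vectors (one per position).
    """
    norms = [l2_norm(g) for g in gates]
    total = sum(norms)  # assumed positive: sigmoid gate values lie in (0, 1)
    dim = len(hidden[0])
    return [
        sum((norm / total) * h[k] for norm, h in zip(norms, hidden))
        for k in range(dim)
    ]


def average_pool(hidden):
    n = len(hidden)
    return [sum(column) / n for column in zip(*hidden)]


def max_pool(hidden):
    return [max(column) for column in zip(*hidden)]


def compose(hidden, gates):
    """Concatenate gated-attention, average-pooled, and max-pooled vectors."""
    return gated_attention(hidden, gates) + average_pool(hidden) + max_pool(hidden)
```

Because the weights are derived from gates that the BiLSTM already computes, this form of attention adds no trainable parameters of its own, in contrast to an attention mechanism scored by an extra MLP layer.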

Top-layer Classifier
Our inference model feeds the resulting vectors obtained above to the final classifier to determine the overall inference relationship. In our models, we compute the absolute difference and the element-wise product for the tuple $[v_p, v_h]$. The absolute difference and element-wise product are then concatenated with the original vectors $v_p$ and $v_h$ (Mou et al., 2016):

$$v_{inp} = [v_p; v_h; |v_p - v_h|; v_p \odot v_h]$$
We then feed the vector $v_{inp}$ into a final multilayer perceptron (MLP) classifier. In our experiments, the MLP has 2 hidden layers with ReLU activation and shortcut connections, followed by a softmax output layer. The entire model (all four components described above) is trained jointly by minimizing the cross-entropy loss on the training set.
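The classifier input and the training objective can be illustrated with a short sketch. The function names are hypothetical, and the two hidden MLP layers between the features and the logits are omitted; only the feature construction, the softmax output, and the per-example cross-entropy loss are shown.

```python
import math


def match_features(v_p, v_h):
    """Build [v_p; v_h; |v_p - v_h|; v_p * v_h] from two equal-length vectors."""
    abs_diff = [abs(a - b) for a, b in zip(v_p, v_h)]
    product = [a * b for a, b in zip(v_p, v_h)]
    return list(v_p) + list(v_h) + abs_diff + product


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold_index):
    """Negative log-probability of the gold label for one example."""
    return -math.log(probs[gold_index])


# Toy example with 2-dimensional sentence vectors and three classes.
v_inp = match_features([1.0, -2.0], [3.0, 0.5])
loss = cross_entropy(softmax([0.0, 0.0, 0.0]), gold_index=0)
```

The input to the MLP is four times as long as a single sentence vector, and a uniform prediction over the three labels gives a loss of $\ln 3$.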

Experimental Setup
Data. RepEval 2017 uses the Multi-Genre NLI corpus (MultiNLI) (Williams et al., 2017), which focuses on three basic relationships between a premise and a potential hypothesis: the premise entails the hypothesis (entailment), they contradict each other (contradiction), or they are not related (neutral). The corpus has ten genres, such as fiction, letters, and telephone speech. The training set covers only five of these genres; therefore, there are in-domain and cross-domain development/test sets. The SNLI corpus (Bowman et al., 2015), which includes content from the single genre of image captions, can be used as an additional training/development set. However, we do not use SNLI as additional training/development data in our experiments.
Training. We use the in-domain development set to select models for testing. To help replicate our results, we publish our code at https://github.com/lukecq1231/enc_nli (the core code is also used or adapted for a summarization task (Chen et al., 2016a) and a question-answering task (Zhang et al., 2017)). We use Adam (Kingma and Ba, 2014) for optimization. The stacked BiLSTM has 3 layers, and all hidden states of the BiLSTMs and the MLP have 600 dimensions. The character embedding has 15 dimensions, and the CNN filter lengths are 1, 3, and 5, each with 100 dimensions. We use pre-trained GloVe-840B-300D vectors (Pennington et al., 2014) as our word-level embeddings and fix these embeddings during training. Out-of-vocabulary (OOV) words are initialized randomly with Gaussian samples.

Results
Table 1 shows the results of different models. The results of the first group of models are copied from Williams et al. (2017). The first sentence encoder is based on continuous bag of words (CBOW), the second is based on BiLSTM, and the third model is the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2016b) reimplemented by Williams et al. (2017), which represents the state of the art on the SNLI dataset. However, ESIM uses attention between sentence pairs and is therefore not a sentence encoder-based model. The second group of models are the results of other teams that participated in the RepEval 2017 Shared Task competition (Nangia et al., 2017).
In addition, we also use our own implementation of ESIM, which achieves an accuracy of 76.8% on the in-domain test set and 75.8% on the cross-domain test set, representing state-of-the-art results. After removing the cross-sentence attention and adding our gated-attention model, we achieve accuracies of 73.5% and 73.6%, which rank first on the cross-domain test set and second on the in-domain test set among the single models.
When ensembling our models, we obtain accuracies of 74.9% and 74.9%, which rank first on both test sets. Our ensembling is performed by averaging five models trained with different parameter initializations.
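A minimal sketch of such an ensemble is given below. It assumes that averaging is applied to the class-probability outputs of the individual models for each example, which is one common way to average models and may differ from the released code; the function name `ensemble_predict` and the toy probabilities are hypothetical.

```python
def ensemble_predict(model_probs):
    """Average per-model class probabilities and return (label index, average).

    model_probs: one probability vector per model, all for the same example.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    averaged = [
        sum(probs[k] for probs in model_probs) / n_models
        for k in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda k: averaged[k])
    return best, averaged


# Toy example: five models, three classes (values are made up for illustration).
toy_outputs = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
    [0.6, 0.2, 0.2],
    [0.1, 0.6, 0.3],
]
label, averaged = ensemble_predict(toy_outputs)
```

In the toy example, three of the five models do not individually pick class index 1, yet it has the highest averaged probability and is the ensemble prediction.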
We compare the performance of using different gates in gated-attention in the fourth group of Table 1. Note that we use attention based on the input gate in all other experiments.
To understand the importance of the different elements of the proposed model, we remove some crucial elements from our single model. If we remove the gated-attention, the accuracies drop to 72.8% and 73.6%. If we remove the character-composition vector, the accuracies drop to 72.9% and 73.5%. If we remove the word-level embedding, the accuracies drop to 65.6% and 66.0%. If we remove the absolute difference and element-wise product of the sentence representation vectors, the accuracies drop to 69.7% and 69.2%.
In addition to testing on this shared task, we have also applied our best single system (without ensembling) to the SNLI dataset; our model achieves an accuracy of 85.5%, which is the best result reported on SNLI when cross-sentence attention is not allowed, outperforming all previous models under this condition. The previous state-of-the-art sentence encoder-based model (Munkhdalai and Yu, 2016b), called neural semantic encoders (NSE), achieved an accuracy of only 84.6% on SNLI. Table 2 shows the results of previous models and the proposed model.

Summary and Future Work
We describe our system, which encodes a sentence into a fixed-length vector for natural language inference and achieves the top performance on both RepEval 2017 and the SNLI dataset. The model is equipped with a novel intra-sentence gated-attention component. It uses only a common stacked BiLSTM as the building block, together with the intra-sentence gated-attention, to compose the fixed-length representations. Our model could be used on other sentence encoding tasks. Future work on NLI includes exploring the usefulness of external resources such as WordNet and contrasting-meaning embedding (Chen et al., 2015).
In the LSTM of the sequence encoder, the weight matrices are parameters to be learned. For each input vector $x_t$ at time step $t$, the LSTM applies a set of gating functions (the input gate $i_t$, forget gate $f_t$, and output gate $o_t$) together with a memory cell $c_t$ to control message flow and track long-distance information (Hochreiter and Schmidhuber, 1997), and generates a hidden state $h_t$ at each time step.
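In the gated-attention composition, $i_t$, $f_t$, and $o_t$ are the input gate, forget gate, and output gate in the BiLSTM of the top layer. Note that each gate is the concatenation of the corresponding gates of the forward and backward LSTMs, i.e., $i_t = [\overrightarrow{i_t}; \overleftarrow{i_t}]$.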

Table 1: Accuracies of the models on MultiNLI. Note that * indicates that the model participated in the competition on June 15th, 2017.