Research on attention memory networks as a model for learning natural language inference

Natural Language Inference (NLI) is a fundamentally important task in natural language processing with many applications. It is concerned with classifying the logical relation between two sentences. In this paper, we propose attention memory networks (AMNs) to recognize entailment and contradiction between two sentences. In our model, an attention memory neural network (AMNN) has a variable-sized encoding memory and supports semantic compositionality. AMNN captures sentence-level semantics and reasons about the relation between the sentence pairs; we then use a Sparsemax layer over the generated matching vectors for classification. Our experiments on the Stanford Natural Language Inference (SNLI) Corpus show that our model outperforms the state of the art, achieving an accuracy of 87.4% on the test data.


Introduction
Natural Language Inference (NLI) refers to the problem of determining entailment and contradiction relationships between two sentences. The challenge in NLI, also known as Recognizing Textual Entailment (RTE), is to correctly decide whether a sentence (called a hypothesis) entails, contradicts, or is neutral with respect to another sentence (referred to as a premise). Given a premise sentence, the task is to judge whether the hypothesis can be inferred (Entailment), the hypothesis cannot be true (Contradiction), or the truth is unknown (Neutral). A few examples are illustrated in Table 1.
NLI is the core of natural language understanding and has wide applications in NLP, e.g., automatic text summarization (Yan et al., 2011a; Yan et al., 2011b) and question answering (Harabagiu and Hickl, 2006). Moreover, NLI is also related to other sentence pair modeling tasks, including relation recognition of discourse units (Liu et al., 2016) and paraphrase detection (Hu et al., 2014).
Bowman released the Stanford Natural Language Inference (SNLI) corpus to encourage more learning-centered approaches to NLI (Bowman et al., 2015). The published SNLI corpus makes it possible to apply deep learning methods to NLI problems. A number of neural-network models for text similarity tasks, including NLI, have been published in recent years (Hu et al., 2014; Wang and Jiang, 2015; Rocktaschel et al., 2016; Yin et al., 2016). The core of these models is a deep sentence encoder, for example a convolutional network (LeCun et al., 1990) or a long short-term memory network (Hochreiter and Schmidhuber, 1997), with the goal of deeper semantic encoding. Recurrent neural networks (RNNs) equipped with internal short-term memories, such as long short-term memories (LSTMs), have achieved notable success in sentence encoding. LSTMs are powerful because they learn to control their short-term memories. However, the short-term memories in LSTMs are part of the training parameters. This imposes practical difficulties in training and in modeling long sequences with LSTMs.
In this paper, we propose a deep learning framework for natural language inference, which mainly consists of two layers. As we can see from Figure 1, from top to bottom these are: (A) the sentence encoding layer (Figure 1a); (B) the sentence matching layer (Figure 1b). In the sentence encoding layer, we introduce an attention memory neural network (AMNN), which has a variable-sized encoding memory and naturally supports semantic compositionality. The encoding memory evolves over time, and its size can be altered depending on the length of the input sequence. In the sentence matching layer, we directly model the relation between the two sentences to extract relations between premise and hypothesis, without generating separate sentence representations. In addition, we introduce Sparsemax (Martins and Astudillo, 2016), a new activation function similar to the traditional Softmax but able to output sparse probability distributions; we then present a new smooth and convex loss function, the Sparsemax loss, which is the Sparsemax analogue of the logistic loss. We explain the two layers in detail in the following subsections.

Premise                                      Hypothesis                       Label
A person throwing a yellow ball in the air.  The ball sails through the air.  Entailment
A person throwing a yellow ball in the air.  The person throws a square.      Contradiction
A person throwing a yellow ball in the air.  The ball is heavy.               Neutral

Proposed Approach
In our model, we adopt a two-step strategy to classify the relation between two sentences. Concretely, our model comprises two parts:

• The sentence encoding layer (Figure 1a). This part is mainly a sentence semantic encoder, aiming to capture the general semantics of sentences.

• The sentence matching layer (Figure 1b). This part introduces how vector representations are combined to capture the relation between the premise and hypothesis for classification.

The sentence encoding layer: AMNN
In this layer, we introduce an attention memory neural network (AMNN), which implements an attention controller and a variable-sized encoding memory, and naturally supports semantic compositionality. AMNN has four main components: Input, Output, and Attention memory modules, and an encoding memory. We examine each module in detail and give intuitions about its formulation. Suppose we are given a set {X_i, Y_i}_{i=1}^N, where the input X_i is a sequence w_1^i, w_2^i, ..., w_{T_i}^i of tokens, and Y_i can be an output sequence. The encoding memory M ∈ R^{d×l} has a variable number of slots, where d is the embedding dimension and l is the length of the input sequence. Each memory slot vector m_t ∈ R^d corresponds to the vector representation of w_t. In particular, the memory is initialized with the raw embedding vectors at time t = 0. As the Attention memory module reads more input content over time, the initial memory evolves and refines the encoded sequence.
The Input module reads an embedding vector. The Attention memory module looks for the slots related to the input by computing the semantic similarity between each memory slot and the hidden state. We calculate the similarity by the dot product and transform the similarity scores into a fuzzy key vector by normalizing with the Softmax function. Since our key vector is fuzzy, the slot to be composed is retrieved by taking a weighted sum of all the slots. In this process, our memory is analogous to the soft attention mechanism. We compose the retrieved slot with the current hidden state and map the resulting vector to the encoder output space. Finally, we write the new representation to the memory location pointed to by the key vector.
In our recurrent network, we use a gated recurrent network (Cho et al., 2014a; Chung et al., 2014). We also explored the more complex LSTM (Hochreiter and Schmidhuber, 1997), but it performed similarly and is more computationally expensive. Both work much better than the standard tanh RNN, and we postulate that the main strength comes from having gates that allow the model to suffer less from the vanishing gradient problem (Hochreiter and Schmidhuber, 1997).
Concretely, let v_l ∈ R^l and v_d ∈ R^d be vectors of ones. Given an input function f^GRU_input, an output function f^GRU_output, the key vector a_t, the output state h_t, and the encoding memory M_t at time step t, the computation is

    o_t = f^GRU_input(w_t)                               (1)
    a_t = Softmax(o_t^T M_{t-1})                         (2)
    m_t = M_{t-1} a_t                                    (3)
    h_t = f^GRU_output([o_t; m_t])                       (4)
    M_t = M_{t-1} ⊙ (1 − v_d ⊗ a_t) + h_t ⊗ a_t          (5)

where the input function f^GRU_input and the output function f^GRU_output are neural networks whose weights are training parameters of the model. We abbreviate the above computation as M_t = GRU(x_t, M_{t-1}). In Equation (5), 1 is a matrix of ones, ⊙ denotes the element-wise product, and ⊗ denotes the outer product, which duplicates its left vector l or d times to form a matrix. The function f^GRU_input sequentially maps the word embedding w_t to the internal space of the memory. Equations (2), (3), (4), and (5) then retrieve a memory slot m_t that is semantically associated with the current input word w_t, combine the slot m_t with the input, transform the composed vector to the encoding memory space, and rewrite the resulting new representation into the slot location of the memory. The slot location (ranging from 1 to l) is defined by the key vector a_t, which the Input module emits by attending over the memory slots. In GRU(x_t, M_{t-1}), the slot that was retrieved is erased and the new representation is written in its place. The Attention memory module performs this iterative process until all words in the input sequence are read, performing the input and output operations at every time step. The encoding memories {M_t}_{t=1}^T and output states {h_t}_{t=1}^T are further used for the tasks.
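The read-compose-write loop just described can be sketched in NumPy. This is an illustrative sketch of the memory mechanics only: the tanh-linear maps W_in, W_comp, and W_out are hypothetical stand-ins for the GRU input/output functions, not the trained model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def amnn_step(w_t, M, W_in, W_comp, W_out):
    """One read-compose-write step over the encoding memory.

    w_t : (d,)    current word embedding
    M   : (d, l)  encoding memory, one slot per input token
    W_* : toy linear stand-ins for the input, composition, and output maps
    Returns the new hidden state h_t and the updated memory.
    """
    o_t = np.tanh(W_in @ w_t)         # map the input into memory space
    a_t = softmax(o_t @ M)            # fuzzy key over the l slots
    m_t = M @ a_t                     # soft (weighted-sum) retrieval
    h_t = np.tanh(W_out @ np.tanh(W_comp @ np.concatenate([o_t, m_t])))
    # erase the retrieved slot, then write the new representation there
    M_new = M * (1.0 - a_t[None, :]) + np.outer(h_t, a_t)
    return h_t, M_new

# toy usage: d = 4 embedding dims, l = 3 slots
d, l = 4, 3
rng = np.random.default_rng(0)
M = rng.normal(size=(d, l))           # memory initialized with raw embeddings
W_in, W_out = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_comp = rng.normal(size=(d, 2 * d))
h, M = amnn_step(rng.normal(size=d), M, W_in, W_comp, W_out)
```

Note how the memory keeps its (d, l) shape across steps, so a longer input simply allocates more slots.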

The sentence matching layer
Combining sentence encodings: In this part, we introduce how the vector representations of the individual sentences are combined to capture the relation between the premise and hypothesis. Three matching methods were applied to extract relations:

• Concatenation of the two representations
• Element-wise product

• Element-wise difference
This matching architecture was first used by (Mou et al., 2015). The first matching method follows the most standard procedure of Siamese architectures, while the latter two are certain measures of similarity or closeness. The results of the matching process are concatenated (Figure 1b), given by

    V_c = [V_p; V_h; V_p ∘ V_h; V_p − V_h],

where V_p and V_h are the sentence vectors of the premise and hypothesis, respectively; ∘ denotes the element-wise product; semicolons refer to column vector concatenation. V_c is the generated matching vector of the matching layer.
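The three matching heuristics can be sketched in a few lines of NumPy:

```python
import numpy as np

def match(v_p, v_h):
    """Build the matching vector from premise and hypothesis encodings:
    concatenation, element-wise product, and element-wise difference."""
    return np.concatenate([v_p, v_h, v_p * v_h, v_p - v_h])

v_p = np.array([0.5, -1.0, 2.0])   # toy 3-dim premise encoding
v_h = np.array([1.0,  0.0, 2.0])   # toy 3-dim hypothesis encoding
v_c = match(v_p, v_h)              # shape (12,): four 3-dim blocks
```

The matching vector is four times the dimension of a single sentence encoding, which the subsequent non-linear projection then reduces.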
We would like to point out that, with a subsequent linear transformation, element-wise difference is a special case of concatenation. If we assume the subsequent transformation takes the form of W[V_p; V_h], where W = [W_1 W_2] is the weight matrix for the concatenated sentence representations, then element-wise difference can be viewed as W_0(V_p − V_h) = [W_0 −W_0][V_p; V_h] (where W_0 is the weight matrix corresponding to element-wise difference). Thus, our third heuristic can be absorbed into the first one in terms of model capacity. However, as will be shown in the experiments, explicitly specifying this heuristic significantly improves performance, indicating that the optimization differs despite the same model capacity. Moreover, word embedding studies show that the linear offset of vectors can capture relationships between two words (Mikolov et al., 2013b), but this has not been exploited in sentence-pair relation recognition. Although element-wise distance is used to detect paraphrases in (He et al., 2015), it mainly reflects similarity information. Our study verifies that the vector offset is useful in capturing generic sentence relationships, akin to the word analogy task.

Sparsemax transformation: In this part, we introduce the Sparsemax transformation, which has similar properties to the traditional Softmax but is able to output sparse probability distributions. This transformation was introduced by (Martins and Astudillo, 2016). Let Δ^{K−1} := {p ∈ R^K | 1^T p = 1, p ≥ 0} be the (K−1)-dimensional simplex. We are interested in functions that map vectors in R^K to probability distributions in Δ^{K−1}. Such functions are useful for converting a vector of real weights (e.g., label scores) to a probability distribution (e.g., posterior probabilities of labels). The Sparsemax function is defined componentwise as Sparsemax_i(z) := [z_i − τ(z)]_+, where τ(z) is the threshold that makes the output sum to one; equivalently, Sparsemax(z) is the Euclidean projection of z onto the simplex. Sparsemax has the distinctive feature that it can return sparse posterior distributions, that is, it may assign exactly zero probability to some of its output variables. This property makes it appealing to be
used as a filter for large output spaces, to predict multiple labels, or as a component to identify which of a group of variables are potentially relevant for a decision, making the model more interpretable. Crucially, this is done while preserving most of the attractive properties of Softmax: Sparsemax is also simple to evaluate, it is even cheaper to differentiate, and it can be turned into a convex loss function.
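A minimal NumPy implementation of the sparsemax projection, following the closed-form threshold computation of Martins and Astudillo (2016), is sketched below:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.
    Unlike softmax, scores far below the rest get exactly zero mass."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # sort scores descending
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum         # coordinates in the support
    k_z = k[support][-1]                        # support size
    tau = (cumsum[support][-1] - 1.0) / k_z     # threshold tau(z)
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.2, 0.8, -1.0])   # -> [0.7, 0.3, 0.0]: third label zeroed out
```

Note that the output still sums to one, but the low-scoring coordinate receives exactly zero probability, which softmax can never produce.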
We present the Sparsemax loss, a new loss function that is the Sparsemax analogue of logistic regression. It is convex, everywhere differentiable, and can be regarded as a multi-class generalization of the Huber classification loss, an important tool in robust statistics (Zhang, 2004). We apply the Sparsemax loss to train multi-label linear classifiers. Finally, we use a Sparsemax layer over the output of a non-linear projection of the generated matching vector for classification.
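The Sparsemax loss for a gold label k has the closed form L(z; k) = −z_k + (1/2) Σ_{j∈S(z)} (z_j² − τ²(z)) + 1/2, where S(z) is the support of the sparsemax projection and τ(z) its threshold. A sketch under that definition:

```python
import numpy as np

def sparsemax_loss(z, gold):
    """Sparsemax loss (Martins and Astudillo, 2016) for a single example.
    Reaches zero when sparsemax puts all its mass on the gold label."""
    z = np.asarray(z, dtype=float)
    # recompute the support and threshold as in the sparsemax projection
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_z
    in_support = z > tau
    return -z[gold] + 0.5 * np.sum(z[in_support] ** 2 - tau ** 2) + 0.5
```

A useful sanity check: when the gold score dominates by more than 1 (so sparsemax outputs a point mass on it), the loss is exactly zero, mirroring how the logistic loss approaches but never reaches zero.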

Dataset
To evaluate the performance of our model, we conducted our experiments on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015). The dataset consists of 549,367/9,842/9,824 premise-hypothesis pairs for the train/dev/test sets, with a target label indicating their relation. Each pair consists of a premise and a hypothesis, manually labeled with one of the labels ENTAILMENT, CONTRADICTION, or NEUTRAL. We used the provided training, development, and test splits.

Hyper-Parameter Settings
In this section, we provide details about training the neural network. The model is implemented using the open-source TensorFlow framework (Abadi et al., 2015). The training objective of our model is the cross-entropy loss, and we use mini-batch stochastic gradient descent with RMSprop (Hinton, 2012) for optimization. We set the batch size to 128, the initial learning rate to 3e-4, and the l2 regularizer strength to 3e-5; we train each model for 60 epochs and fix the dropout rate at 0.3 for all dropout layers. In our neural layers, we used pretrained 300D GloVe 840B vectors (Pennington et al., 2014) to initialize the word embeddings. Out-of-vocabulary words in the training set are randomly initialized by sampling values uniformly from (−0.02, 0.02). These embeddings are not updated during training. Each hyper-parameter setting was run on a single machine with 10 asynchronous gradient-update threads, using Adagrad (Duchi et al., 2011) for optimization.
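The embedding initialization described above can be sketched as follows. The `glove` lookup dictionary and the `init_embeddings` helper are hypothetical stand-ins for illustration, and the uniform range (−0.02, 0.02) is an assumption based on the stated initialization:

```python
import numpy as np

def init_embeddings(vocab, glove, dim=300, rng=None):
    """Build the (frozen) embedding matrix: pretrained GloVe vectors where
    available, small uniform noise for out-of-vocabulary words."""
    rng = rng or np.random.default_rng(0)
    E = np.empty((len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in glove:
            E[i] = glove[word]                           # pretrained vector
        else:
            E[i] = rng.uniform(-0.02, 0.02, size=dim)    # OOV: uniform noise
    return E  # held fixed (not updated) during training
```

Freezing the embeddings keeps the parameter count low and prevents the rare-word vectors from drifting away from the pretrained space.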

Related Work
Language inference or entailment recognition can be viewed as a task of sentence pair modeling. Most neural networks in this field involve a sentence-level model, followed by one or a few matching modules.
Our method is motivated by the central role played by sentence-level modeling (Yin and Schutze, 2015; Mou et al., 2016; Wan et al., 2015; Parikh et al., 2016; Liu et al., 2016; Rocktaschel et al., 2016) and by previous approaches to semantic encoding (Graves et al., 2014; Weston et al., 2015; Sukhbaatar et al., 2015; Kumar et al., 2016; Bahdanau et al., 2015). (Yin and Schutze, 2015) and (Mou et al., 2016) apply convolutional neural networks (CNNs) as the individual sentence model, where a set of feature detectors over successive words is designed to extract local features. (Wan et al., 2015) and (Liu et al., 2016) build sentence pair models upon recurrent neural networks (RNNs) to iteratively integrate information along a sentence. Attention and external memory, the key parts of our approach, were originally proposed to extend deep neural networks with an external memory in the Neural Turing Machine (NTM) (Graves et al., 2014). The NTM implements a centralized controller and a fixed-size random access memory; the controller uses attention mechanisms to access the memory. The work of (Sukhbaatar et al., 2015) combines soft attention with Memory Networks (MemNNs) (Weston et al., 2015). Although MemNNs are designed with non-writable memories, this work constructed layered memory representations and showed promising results on both artificial and real question answering tasks. Another variation of MemNNs is the Dynamic Memory Network (DMN) (Kumar et al., 2016), which is equipped with an episodic memory and is flexible in different settings.
In contrast, our use of external memory is based on a variable-sized semantic encoder, and our method uses the attention mechanism to access the external memory. The size of the memory can be altered depending on the input length, i.e., we use a larger memory for long sequences and a smaller memory for short sequences. Our models are suitable for NLI and can be trained easily by any gradient descent optimizer.

Conclusion and future work
In this paper, we proposed attention memory networks (AMNs) to solve the natural language inference (NLI) problem. First, we presented the attention memory neural network (AMNN), which uses an attention mechanism and has a variable-sized semantic memory. AMNN captures sentence-level semantics; we then directly model the relation by combining the two sentence vectors to aggregate information between premise and hypothesis. Finally, we introduced Sparsemax, a new activation function similar to the traditional Softmax but able to output sparse probability distributions, and used a Sparsemax layer over the generated matching vector for output. Attention memory networks over the premise provide further improvements to the predictive abilities of the system, resulting in a new state-of-the-art accuracy for natural language inference on the Stanford Natural Language Inference corpus.
Our model can be easily adapted to other sentence-matching models. There are several directions for future work: (1) employ this architecture on other sentence matching tasks, such as text summarization, paraphrase and text similarity, and question answering; (2) try more heuristic matching methods to make full use of the individual sentence vectors; (3) extend AMNN to produce encoding memories and representation vectors of entire documents.

Figure 1 :
Figure 1: High-level architectures of attention memory neural networks. (a) The sentence encoding layer: individual sentence modeling via AMNN. (b) The sentence matching layer: sentence pair modeling, after which a Sparsemax layer is applied for output.

Table 1 :
Three NLI examples from SNLI. Relations between a premise and a hypothesis: Entailment, Contradiction, and Neutral (irrelevant).

Table 2 :
Train/test accuracies on the SNLI dataset and the approximate number of trained parameters (excluding embeddings) for