Long Short-Term Memory-Networks for Machine Reading

Machine reading, the automatic understanding of text, remains a challenging task of great value for NLP applications. We propose a machine reader which processes text incrementally from left to right, linking the current word to previous words stored in memory and implicitly discovering lexical dependencies that facilitate understanding. The reader is equipped with a Long Short-Term Memory architecture, which differs from previous work in that it has a memory tape (instead of a memory cell) for adaptively storing past information without severe information compression. We also integrate our reader with a new attention mechanism in an encoder-decoder architecture. Experiments on language modeling, sentiment analysis, and natural language inference show that our model matches or outperforms the state of the art.


Introduction
How can a sequence-level network induce relations which are presumed latent during text processing? How can a recurrent network attentively memorize longer sequences in a way that humans do? In this paper we design a machine reader that automatically learns to understand text. The term machine reading is related to a wide range of tasks, from answering reading comprehension questions (Clark et al., 2013), to fact and relation extraction, ontology learning (Poon and Domingos, 2010), and textual entailment (Dagan et al., 2005). Rather than focusing on a specific task, we develop a general-purpose reading simulator, drawing inspiration from human language processing and the fact that language comprehension is incremental, with readers continuously extracting the meaning of utterances on a word-by-word basis.
In order to understand texts, our machine reader should provide facilities for extracting and representing meaning from natural language text, storing meanings internally, and working with stored meanings to derive further consequences. Ideally, such a system should be robust, open-domain, and degrade gracefully in the presence of semantic representations which may be incomplete, inaccurate, or incomprehensible. It would also be desirable to simulate the behavior of English speakers who process text sequentially, from left to right, fixating nearly every word while they read (Rayner, 1998) and creating partial representations for sentence prefixes (Konieczny, 2000; Tanenhaus et al., 1995).
Language modeling tools such as recurrent neural networks (RNNs) align well with human reading behavior (Frank and Bod, 2011). RNNs treat each sentence as a sequence of words and recursively compose each word with its previous memory, until the meaning of the whole sentence has been derived. In practice, however, sequence-level networks are met with at least three challenges. The first one concerns model training problems associated with vanishing and exploding gradients (Hochreiter, 1991; Bengio et al., 1994), which can be partially ameliorated with gated activation functions, such as the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), and gradient clipping (Pascanu et al., 2013). The second issue relates to memory compression problems. As the input sequence gets compressed and blended into a single dense vector, sufficiently large memory capacity is required to store past information. As a result, the network generalizes poorly to long sequences while wasting memory on shorter ones. Finally, it should be acknowledged that sequence-level networks lack a mechanism for handling the structure of the input. This imposes an inductive bias which is at odds with the fact that language has inherent structure.

[Figure 1: Illustration of our model while reading the sentence "The FBI is chasing a criminal on the run." Red represents the current word being fixated, blue represents memories. Shading indicates the degree of memory activation.]

In this paper, we develop a text processing system which addresses these limitations while maintaining the incremental, generative property of a recurrent language model. Recent attempts to render neural networks more structure-aware have seen the incorporation of external memories in the context of recurrent neural networks (Sukhbaatar et al., 2015). The idea is to use multiple memory slots outside the recurrence to piece-wise store representations of the input; read and write operations for each slot can be modeled as an attention mechanism with a recurrent controller. We also leverage memory and attention to empower a recurrent network with stronger memorization capability and, more importantly, the ability to discover relations among tokens. This is realized by inserting a memory network module in the update of a recurrent network together with attention for memory addressing. The attention acts as a weak inductive module discovering relations between input tokens, and is trained without direct supervision. As a point of departure from previous work, the memory network we employ is internal to the recurrence, thus strengthening the interaction of the two and leading to a representation learner which is able to reason over shallow structures.

The resulting model, which we term Long Short-Term Memory-Network (LSTMN), is a reading simulator that can be used for sequence processing tasks. Figure 1 illustrates the reading behavior of the LSTMN. The model processes text incrementally while learning which past tokens in the memory relate to the current token being processed, and to what extent. As a result, the model induces undirected relations among tokens as an intermediate step of learning representations. We validate the performance of the LSTMN on language modeling, sentiment analysis, and natural language inference. In all cases, we train LSTMN models end-to-end with task-specific supervision signals, achieving performance comparable or superior to state-of-the-art models and outperforming vanilla LSTMs.

Related Work
Our machine reader is a recurrent neural network exhibiting two important properties: it is incremental, simulating human behavior, and performs shallow structure reasoning over input streams.
Recurrent neural networks (RNNs) have been successfully applied to various sequence modeling and sequence-to-sequence transduction tasks. The latter have assumed several guises in the literature, such as machine translation, sentence compression (Rush et al., 2015), and reading comprehension. A key contributing factor to their success has been the ability to handle well-known problems with exploding or vanishing gradients (Bengio et al., 1994), leading to models with gated activation functions (Hochreiter and Schmidhuber, 1997), and more advanced architectures that enhance the information flow within the network (Koutník et al., 2014; Chung et al., 2015; Yao et al., 2015).
A remaining practical bottleneck for RNNs is memory compression: since the inputs are recursively combined into a single memory representation which is typically too small in terms of parameters, it becomes difficult to accurately memorize sequences (Zaremba and Sutskever, 2014). In the encoder-decoder architecture, this problem can be sidestepped with an attention mechanism which learns soft alignments between the decoding states and the encoded memories. In our model, memory and attention are added within a sequence encoder, allowing the network to uncover lexical relations between tokens.
The idea of introducing a structural bias to neural models is by no means new. For example, it is reflected in the work of Socher et al. (2013a), who apply recursive neural networks for learning natural language representations. In the context of recurrent neural networks, efforts to build modular, structured neural models date back to Das et al. (1992), who connect a recurrent neural network with an external memory stack for learning context-free grammars. Recently, Memory Networks have been proposed to explicitly segregate memory storage from the computation of neural networks in general. These models are trained end-to-end with a memory addressing mechanism closely related to soft attention (Sukhbaatar et al., 2015) and have been applied to machine translation (Meng et al., 2015). Differentiable data structures such as stacks, queues, and deques, controlled by a recurrent neural network, have also been proposed as memories. Tran et al. (2016) combine the LSTM with an external memory block component which interacts with its hidden state. Kumar et al. (2016) employ a structured neural network with episodic memory modules for natural language and also visual question answering (Xiong et al., 2016).
Similar to the above work, we leverage memory and attention in a recurrent neural network for inducing relations between tokens, as a module in a larger network responsible for representation learning. As a property of soft attention, all intermediate relations we aim to capture are soft and differentiable. This is in contrast to shift-reduce type neural models (Bowman et al., 2016), where the intermediate decisions are hard and induction is more difficult. Finally, note that our model captures undirected lexical relations and is thus distinct from work on dependency grammar induction (Klein and Manning, 2004), where the learned head-modifier relations are directed.

The Machine Reader
In this section we present our machine reader, which is designed to process structured input while retaining the incrementality of a recurrent neural network. The core of our model is a Long Short-Term Memory (LSTM) unit with an extended memory tape that explicitly simulates the human memory span. The model performs implicit relation analysis between tokens with an attention-based memory addressing mechanism at every time step. In the following, we first review the standard Long Short-Term Memory and then describe our model.

Long Short-Term Memory
A Long Short-Term Memory (LSTM) recurrent neural network processes a variable-length sequence $x = (x_1, x_2, \cdots, x_n)$ by incrementally adding new content into a single memory slot, with gates controlling the extent to which new content should be memorized, old content should be erased, and current content should be exposed. At time step $t$, the memory $c_t$ and the hidden state $h_t$ are updated with the following equations:

$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \Big( W \cdot [h_{t-1}, x_t] \Big) \quad (1)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t \quad (2)$$
$$h_t = o_t \odot \tanh(c_t) \quad (3)$$

where $i$, $f$, and $o$ are gate activations. Compared to the standard RNN, the LSTM uses additive memory updates and separates the memory $c$ from the hidden state $h$, which interacts with the environment when making predictions.
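For concreteness, the following is a minimal numpy sketch of this update, not the authors' implementation; the joint weight matrix W, the bias b, and the toy dimensions are illustrative assumptions, and the single stacked projection mirrors the compact form of Equation (1).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM update (Equations 1-3): gates computed from [h_{t-1}, x_t],
        additive memory update, and gated exposure of the memory."""
        d = h_prev.shape[0]
        z = W @ np.concatenate([h_prev, x_t]) + b   # joint projection for all gates
        i = sigmoid(z[0:d])                         # input gate
        f = sigmoid(z[d:2*d])                       # forget gate
        o = sigmoid(z[2*d:3*d])                     # output gate
        c_hat = np.tanh(z[3*d:4*d])                 # candidate memory content
        c_t = f * c_prev + i * c_hat                # Equation (2)
        h_t = o * np.tanh(c_t)                      # Equation (3)
        return h_t, c_t

    # toy usage with illustrative sizes (word embedding 150, hidden 300)
    rng = np.random.default_rng(0)
    x_dim, h_dim = 150, 300
    W = rng.normal(scale=0.1, size=(4 * h_dim, h_dim + x_dim))
    b = np.zeros(4 * h_dim)
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    h, c = lstm_step(rng.normal(size=x_dim), h, c, W, b)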

Long Short-Term Memory-Network
The first question that arises with LSTMs is the extent to which they are able to memorize sequences under recursive compression. LSTMs can produce a list of state representations during composition; however, the next state is always computed from the current state. That is to say, given the current state $h_t$, the next state $h_{t+1}$ is conditionally independent of states $h_1, \cdots, h_{t-1}$ and tokens $x_1, \cdots, x_t$. While the recursive state update is performed in a Markov manner, it is assumed that LSTMs maintain unbounded memory (i.e., the current state alone summarizes well the tokens it has seen so far). This assumption may fail in practice, for example when the sequence is long or when the memory size is not large enough. Another undesired property of LSTMs concerns modeling structured input. An LSTM aggregates information on a token-by-token basis in sequential order, but there is no explicit mechanism for reasoning over structure and modeling relations between tokens.
Our model aims to address both limitations. Our solution is to modify the standard LSTM structure by replacing the memory cell with a memory network. The resulting Long Short-Term Memory-Network (LSTMN) stores the contextual representation of each input token in a unique memory slot, and the size of the memory grows with time until an upper bound of the memory span is reached. This design enables the LSTM to reason about relations between tokens with a neural attention layer and then perform non-Markov state updates. Although it is feasible to apply both write and read operations to the memories with attention, we concentrate on the latter. We conceptualize the read operation as attentively linking the current token to previous memories and selecting useful content when processing it. Although not the focus of this work, the significance of the write operation can be analogously justified as a way of incrementally updating previous memories, e.g., to correct wrong interpretations when processing garden path sentences (Ferreira and Henderson, 1991).
The architecture of the LSTMN is shown in Figure 2 and the formal definition is provided as follows. The model maintains two sets of vectors: a hidden state tape used to interact with the environment (e.g., for computing attention), and a memory tape used to represent what is actually stored in memory. Therefore, each token is associated with a hidden vector and a memory vector. Let $x_t$ denote the current input; $C_{t-1} = (c_1, \cdots, c_{t-1})$ denotes the current memory tape, and $H_{t-1} = (h_1, \cdots, h_{t-1})$ the previous hidden tape. At time step $t$, the model computes the relation between $x_t$ and $x_1, \cdots, x_{t-1}$ through $h_1, \cdots, h_{t-1}$ with an attention layer:

$$a_i^t = v^\top \tanh(W_h h_i + W_x x_t + W_{\tilde{h}} \tilde{h}_{t-1}) \quad (4)$$
$$s_i^t = \mathrm{softmax}(a_i^t) \quad (5)$$

This yields a probability distribution over the hidden state vectors of previous tokens. We can then compute adaptive summary vectors for the previous hidden tape and memory tape, denoted by $\tilde{h}_t$ and $\tilde{c}_t$ respectively:

$$\begin{bmatrix} \tilde{h}_t \\ \tilde{c}_t \end{bmatrix} = \sum_{i=1}^{t-1} s_i^t \cdot \begin{bmatrix} h_i \\ c_i \end{bmatrix} \quad (6)$$

and use them for computing the values of $c_t$ and $h_t$ in the recurrent update as:

$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \Big( W \cdot [\tilde{h}_t, x_t] \Big) \quad (7)$$
$$c_t = f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t \quad (8)$$
$$h_t = o_t \odot \tanh(c_t) \quad (9)$$

where $v$, $W_h$, $W_x$ and $W_{\tilde{h}}$ are the new weight terms of the network.
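A minimal numpy sketch of one LSTMN step is given below. It is an illustration of Equations (4)-(9) rather than the authors' code; the parameter names (v, W_h, W_x, W_ht, the joint gate matrix W) and the handling of the empty tape at the first time step are assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def lstmn_step(x_t, H, C, h_tilde_prev, params):
        """One LSTMN update (Equations 4-9).
        H, C: lists of previous hidden and memory vectors (the two tapes)."""
        v, W_h, W_x, W_ht, W, b = params
        d = W_h.shape[0]
        if len(H) == 0:                              # nothing to attend to yet
            h_tilde, c_tilde = np.zeros(d), np.zeros(d)
        else:
            # attention over the previous hidden states (Equations 4-5)
            scores = np.array([v @ np.tanh(W_h @ h_i + W_x @ x_t + W_ht @ h_tilde_prev)
                               for h_i in H])
            s = softmax(scores)
            # adaptive summaries of the hidden and memory tapes (Equation 6)
            h_tilde = s @ np.stack(H)
            c_tilde = s @ np.stack(C)
        # gated update driven by the adaptive summary (Equations 7-9)
        z = W @ np.concatenate([h_tilde, x_t]) + b
        i, f, o = sigmoid(z[0:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
        c_hat = np.tanh(z[3*d:4*d])
        c_t = f * c_tilde + i * c_hat
        h_t = o * np.tanh(c_t)
        return h_t, c_t, h_tilde

After each step the caller appends $h_t$ and $c_t$ to the tapes $H$ and $C$, and passes the returned summary as $\tilde{h}_{t-1}$ for the next step, so the memory grows with the sequence until the chosen memory span is reached.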
A key idea behind the LSTMN is to use attention for inducing relations between tokens. These relations are soft and differentiable, and form components of a larger representation learning network. Although it is appealing to provide direct supervision for the attention layer, e.g., with evidence collected from a dependency treebank, we treat it as a submodule optimized within the larger network on a downstream task. It is also possible to have a more structured relational reasoning module by stacking multiple memory and hidden layers in an alternating fashion, resembling a stacked LSTM (Graves, 2013) or a multi-hop memory network (Sukhbaatar et al., 2015). This can be achieved by feeding the output $h_t^k$ of the lower layer $k$ as input to the upper layer $k{+}1$. The attention at the $(k{+}1)$th layer is then computed analogously to Equation (4), with the lower-layer output taking the role of the input:

$$a_i^{t,k+1} = v^\top \tanh(W_h h_i^{k+1} + W_l h_t^k + W_{\tilde{h}} \tilde{h}_{t-1}^{k+1}) \quad (10)$$

Skip-connections (Graves, 2013) can be applied to feed $x_t$ to upper layers as well.
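As a sketch of this stacking scheme (assuming the lstmn_step function from the previous sketch, and hypothetical per-layer parameter shapes), each layer keeps its own tapes and receives the lower layer's output as its input:

    def stacked_lstmn_step(x_t, layer_states, layer_params):
        """One step of a multi-layer LSTMN: layer k+1 takes h_t of layer k as input."""
        inp = x_t
        new_states = []
        for (H, C, h_tilde_prev), params in zip(layer_states, layer_params):
            h_t, c_t, h_tilde = lstmn_step(inp, H, C, h_tilde_prev, params)
            new_states.append((H + [h_t], C + [c_t], h_tilde))  # grow this layer's tapes
            inp = h_t                                            # feed the output upward
        return inp, new_states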

Modeling Two Sequences with LSTMN
Natural language processing tasks such as machine translation and textual entailment are concerned with modeling two sequences rather than a single one. A standard tool for modeling two sequences with recurrent networks is the encoder-decoder architecture, where the second sequence (also known as the target) is processed conditioned on the first one (also known as the source). In this section we explain how to combine the LSTMN, which applies attention for intra-relation reasoning, with the encoder-decoder network, whose attention module learns the inter-alignment between the two sequences. Figures 3a and 3b illustrate two types of combination. We describe the models more formally below.
Shallow Attention Fusion Shallow fusion simply treats the LSTMN as a separate module that can be readily used in an encoder-decoder architecture, in lieu of a standard RNN or LSTM. As shown in Figure 3a, both encoder and decoder are modeled as LSTMNs with intra-attention. Meanwhile, inter-attention is triggered when the decoder reads a target token, in the same way as in standard attention-based encoder-decoder models.

Deep Attention Fusion Deep fusion combines inter- and intra-attention (initiated by the decoder)
when computing state updates. We use different notation to represent the two sets of attention. As in the previous section, $C$ and $H$ denote the target memory tape and hidden tape, which store representations of the target symbols processed so far; the computation of intra-attention follows Equations (4)-(9). Additionally, we use $A = [a_1, \cdots, a_m]$ and $Y = [g_1, \cdots, g_m]$ to represent the source memory tape and hidden tape, with $m$ being the length of the source sequence conditioned upon. We compute inter-attention between the input at time step $t$ and the tokens in the entire source sequence as follows:

$$b_i^t = u^\top \tanh(W_g g_i + W_x x_t + W_{\tilde{g}} \tilde{g}_{t-1}) \quad (11)$$
$$p_i^t = \mathrm{softmax}(b_i^t) \quad (12)$$

After that we compute the adaptive representations of the source memory tape $\tilde{a}_t$ and hidden tape $\tilde{g}_t$ as:

$$\begin{bmatrix} \tilde{g}_t \\ \tilde{a}_t \end{bmatrix} = \sum_{i=1}^{m} p_i^t \cdot \begin{bmatrix} g_i \\ a_i \end{bmatrix} \quad (13)$$

We can then transfer the adaptive source representation $\tilde{a}_t$ to the target memory with another gating operation $r_t$, analogous to the gates in Equation (7).
The new target memory includes the inter-alignment $r_t \odot \tilde{a}_t$, the intra-relation $f_t \odot \tilde{c}_t$, and the new input information $i_t \odot \hat{c}_t$:

$$c_t = r_t \odot \tilde{a}_t + f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t \quad (14)$$
$$h_t = o_t \odot \tanh(c_t) \quad (15)$$

As shown in the equations above and in Figure 3b, the major change of deep fusion lies in the recurrent storage of the inter-alignment vector in the target memory network, as a way to help the target network review source information.
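The sketch below illustrates the deep-fusion update of Equations (11)-(15); it is not the authors' implementation, and the extra gate r_t is assumed to be computed from the concatenation of the intra-attention summary and the current input with its own weights W_r, b_r, analogously to the other gates.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def deep_fusion_update(x_t, h_tilde, c_tilde, g_tilde_prev, A, G, params):
        """Deep attention fusion for one decoder step (Equations 11-15).
        A, G: source memory and hidden tapes as arrays of shape (m, d);
        h_tilde, c_tilde: intra-attention summaries of the target tapes (Equation 6)."""
        u, W_g, W_x2, W_gt, W, W_r, b, b_r = params
        d = W_g.shape[0]
        # inter-attention over the source hidden tape (Equations 11-12)
        scores = np.array([u @ np.tanh(W_g @ g_i + W_x2 @ x_t + W_gt @ g_tilde_prev)
                           for g_i in G])
        p = softmax(scores)
        # adaptive source representations (Equation 13)
        g_tilde = p @ G
        a_tilde = p @ A
        # standard LSTMN gates plus an extra gate r_t for the source memory
        z = W @ np.concatenate([h_tilde, x_t]) + b
        i, f, o = sigmoid(z[0:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
        c_hat = np.tanh(z[3*d:4*d])
        r = sigmoid(W_r @ np.concatenate([h_tilde, x_t]) + b_r)
        # target memory mixes inter-alignment, intra-relation and new input (Equation 14)
        c_t = r * a_tilde + f * c_tilde + i * c_hat
        h_t = o * np.tanh(c_t)                       # Equation (15)
        return h_t, c_t, g_tilde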

Experiments
In this section we present our experiments for evaluating the performance of the LSTMN machine reader. We start with language modeling as it is a natural testbed for our model. We then assess the model's ability to extract meaning representations for generic sentence classification tasks such as sentiment analysis. Finally, we examine whether the LSTMN can recognize the semantic relationship between two sentences by applying it to a natural language inference task. Our code is available at https://github.com/cheng6076/SNLI-attention.

[Table 1: Language model perplexity on the Penn Treebank. The size of memory is 300 for all models.]

Language Modeling
Our language modeling experiments were conducted on the English Penn Treebank dataset. Following common practice (Mikolov et al., 2010), we trained on sections 0-20 (1M words), used sections 21-22 for validation (80K words), and sections 23-24 for testing (90K words). The dataset contains approximately 1 million tokens and has a vocabulary size of 10K. The average sentence length is 21. We use perplexity as our evaluation metric:

$$\mathrm{PPL} = \exp(\mathrm{NLL}/T)$$

where NLL denotes the negative log likelihood of the entire test set and $T$ the corresponding number of tokens. We used stochastic gradient descent for optimization with an initial learning rate of 0.65, which decays by a factor of 0.85 per epoch if no significant improvement is observed on the validation set. We renormalize the gradient if its norm is greater than 5. The mini-batch size was set to 40. The dimension of the word embeddings was set to 150 for all models.
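To make the metric concrete, a small sketch of the perplexity computation follows; the per-token log probabilities are made-up inputs.

    import numpy as np

    def perplexity(token_log_probs):
        """Corpus-level perplexity: the exponential of the average negative
        log likelihood per token, i.e. PPL = exp(NLL / T)."""
        nll = -np.sum(token_log_probs)   # negative log likelihood of the test set
        T = len(token_log_probs)         # number of tokens
        return np.exp(nll / T)

    # toy usage with made-up per-token probabilities (natural log)
    print(perplexity(np.log([0.1, 0.2, 0.05, 0.3])))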
In this suite of experiments we compared the LSTMN against a variety of baselines. The first one is a Kneser-Ney 5-gram language model (KN5), which generally serves as a non-neural baseline for the language modeling task. We also present perplexity results for the standard RNN and LSTM models. In addition, we implemented more sophisticated LSTM architectures, such as a stacked LSTM (sLSTM), a gated-feedback LSTM (gLSTM; Chung et al. (2015)), and a depth-gated LSTM (dLSTM; Yao et al. (2015)). The gated-feedback LSTM has feedback gates connecting the hidden states across multiple time steps as an adaptive control of the information flow. The depth-gated LSTM uses a depth gate to connect memory cells of vertically adjacent layers. In general, both gLSTM and dLSTM are able to capture long-term dependencies to some degree, but they do not explicitly keep past memories. We set the number of layers to 3 in this experiment, mainly to agree with the language modeling experiments of Chung et al. (2015). Also note that there are no single-layer variants of gLSTM and dLSTM; they have to be implemented as multi-layer systems. The hidden unit size of the LSTMN and all comparison models (except KN5) was set to 300.
The results of the language modeling task are shown in Table 1. Perplexity results for KN5 and RNN are taken from Mikolov et al. (2015). As can be seen, the single-layer LSTMN outperforms these two baselines and the LSTM by a significant margin. Amongst all deep architectures, the three-layer LSTMN also performs best.

[Figure 4: Examples of intra-attention (language modeling) on four validation sentences: "he sits down at the piano and plays", "our view is that we may see a profit decline", "products <unk> have to be first to be winners", "everyone in the world is watching us very closely". Bold lines indicate higher attention scores. Arrows denote which word is being focused on when attention is computed, not the direction of the relation.]

We can study the memory activation mechanism of the machine reader by visualizing the attention scores. Figure 4 shows four sentences sampled from the Penn Treebank validation set. Although the reader is explicitly encouraged to attend to any memory slot, much of the attention focuses on recent memories. This agrees with the linguistic intuition that long-term dependencies are relatively rare. As illustrated in Figure 4, the model captures some valid lexical relations (e.g., the dependency between sits and at, sits and plays, everyone and is, is and watching). Note that arcs here are undirected and are different from the directed arcs denoting head-modifier relations in dependency graphs.

Sentiment Analysis
Our second task concerns the prediction of sentiment labels of sentences. We used the Stanford Sentiment Treebank (Socher et al., 2013a), which contains fine-grained sentiment labels (very positive, positive, neutral, negative, very negative) for 11,855 sentences. Following previous work on this dataset,
we used 8,544 sentences for training, 1,101 for validation, and 2,210 for testing. The average sentence length is 19.1. In addition, we also performed a binary classification task (positive, negative) after removing the neutral label. This resulted in 6,920 sentences for training, 872 for validation and 1,821 for testing. Table 2 reports results on both fine-grained and binary classification tasks. We experimented with 1- and 2-layer LSTMNs. For the latter model, we predict the sentiment label of the sentence based on the averaged hidden vector passed to a 2-layer neural network classifier with ReLU as the activation function. The memory size for both LSTMN models was set to 168 to be compatible with previous LSTM models (Tai et al., 2015) applied to the same task. We used pretrained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize the word embeddings. The gradients for words with GloVe embeddings were scaled by 0.35 in the first epoch, after which all word embeddings were updated normally.
We used Adam (Kingma and Ba, 2015) for optimization with the two momentum parameters set to 0.9 and 0.999 respectively. The initial learning rate was set to 2E-3. The regularization constant was 1E-4 and the mini-batch size was 5. A dropout rate of 0.5 was applied to the neural network classifier. We compared our model with a wide range of top-performing systems. Most of these models (including ours) are LSTM variants (third block in Table 2), recursive neural networks (first block), or convolutional neural networks (CNNs; second block). Recursive models assume the input sentences are represented as parse trees and can take advantage of annotations at the phrase level. LSTM-type models and CNNs are trained on sequential input, with the exception of CT-LSTM (Tai et al., 2015) which operates over tree-structured network topologies such as constituent trees. For comparison, we also report the performance of the paragraph vector model (PV; Le and Mikolov (2014); see Table 2, second block) which neither operates on trees nor sequences but learns distributed document representations parameterized directly.
The results in Table 2 show that both 1- and 2-layer LSTMNs outperform the LSTM baselines while achieving numbers comparable to the state of the art. The number of layers for our models was set to be comparable to previously published results. On the fine-grained and binary classification tasks our 2-layer LSTMN performs close to the best system T-CNN (Lei et al., 2015). Figure 5 shows examples of intra-attention for sentiment words. Interestingly, the network learns to associate sentiment-important words such as though and fantastic or not and good.

Natural Language Inference
The ability to reason about the semantic relationship between two sentences is an integral part of text understanding. We therefore evaluate our model on recognizing textual entailment, i.e., determining whether a premise-hypothesis pair stands in an entailment, contradiction, or neutral relation. For this task we used the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), which contains premise-hypothesis pairs and target labels indicating their relation. After removing sentences with unknown labels, we end up with 549,367 pairs for training, 9,842 for development and 9,824 for testing. The vocabulary size is 36,809 and the average sentence length is 22. We performed lower-casing and tokenization for the entire dataset.
[Figure 5: Examples of intra-attention (sentiment analysis) on the sentences "it 's tough to watch but it 's a fantastic movie" and "although i did n't hate this one , it 's not very good either". Bold lines (red) indicate attention between sentiment-important words.]

Recent approaches use two sequential LSTMs to encode the premise and the hypothesis respectively, and apply neural attention to reason about their logical relationship (Rocktäschel et al., 2016; Wang and Jiang, 2016). Furthermore, Rocktäschel et al. (2016) show that a non-standard encoder-decoder architecture which processes the hypothesis conditioned on the premise significantly boosts performance. We use a similar approach to tackle this task with LSTMNs. Specifically, we use two LSTMNs to read the premise and hypothesis, and then match them by comparing their hidden state tapes. We perform average pooling over the hidden state tape of each LSTMN, and concatenate the two averages to form the input to a 2-layer neural network classifier with ReLU as the activation function.
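Below is a minimal sketch of this matching step; the layer sizes, weight names, and random inputs are illustrative assumptions rather than the paper's configuration. The two hidden state tapes are mean-pooled, concatenated, and passed through a 2-layer ReLU classifier over the three labels.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def nli_classify(premise_H, hypothesis_H, W1, b1, W2, b2):
        """Match two LSTMN encodings: average-pool each hidden state tape,
        concatenate the two averages, then apply a 2-layer ReLU classifier."""
        pooled = np.concatenate([premise_H.mean(axis=0), hypothesis_H.mean(axis=0)])
        hidden = np.maximum(0.0, W1 @ pooled + b1)   # ReLU layer
        return softmax(W2 @ hidden + b2)             # entailment / contradiction / neutral

    # toy usage with hypothetical dimensions (hidden size 100, classifier size 200)
    rng = np.random.default_rng(1)
    premise_H = rng.normal(size=(12, 100))     # 12 premise tokens
    hypothesis_H = rng.normal(size=(7, 100))   # 7 hypothesis tokens
    W1, b1 = rng.normal(scale=0.1, size=(200, 200)), np.zeros(200)
    W2, b2 = rng.normal(scale=0.1, size=(3, 200)), np.zeros(3)
    print(nli_classify(premise_H, hypothesis_H, W1, b1, W2, b2))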
We used pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize the word embeddings. Out-of-vocabulary (OOV) words were initialized randomly with Gaussian samples (µ=0, σ=1). We only updated OOV vectors in the first epoch, after which all word embeddings were updated normally. The dropout rate was selected from [0.1, 0.2, 0.3, 0.4]. We used Adam (Kingma and Ba, 2015) for optimization with the two momentum parameters set to 0.9 and 0.999 respectively, and the initial learning rate set to 1E-3. The mini-batch size was set to 16 or 32. For a fair comparison against previous work, we report results with different hidden/memory dimensions (i.e., 100, 300, and 450).
We compared variants of our model against different types of LSTMs (see the second block in Table 3). Specifically, these include a model which encodes the premise and hypothesis independently with two LSTMs (Bowman et al., 2015), a shared LSTM (Rocktäschel et al., 2016), a word-by-word attention model (Rocktäschel et al., 2016), and a matching LSTM (mLSTM; Wang and Jiang (2016)). The latter sequentially processes the hypothesis, and at each position tries to match the current word with an attention-weighted representation of the premise (rather than basing its predictions on whole sentence embeddings). We also compared our models with a bag-of-words baseline which averages the pre-trained embeddings for the words in each sentence and concatenates them to create features for a logistic regression classifier (first block in Table 3).

[Table 3 (partial): Models, hidden unit size h, parameter count, and SNLI test accuracy — BOW concatenation: 59.8; LSTM (Bowman et al., 2015): h=100, 221k, 77.6; LSTM-att (Rocktäschel et al., 2016): h=100, 252k, 83.5; mLSTM (Wang and Jiang, 2016): h=300.]

LSTMNs achieve better performance compared to LSTMs (with and without attention; second block in Table 3). We also observe that fusion is generally beneficial, and that deep fusion slightly improves over shallow fusion. One explanation is that with deep fusion the inter-attention vectors are recurrently memorized by the decoder with a gating operation, which also improves the information flow of the network. With standard training, our deep fusion yields state-of-the-art performance on this task. Although encouraging, this result should be interpreted with caution since our model has substantially more parameters than related systems. We could compare different models using the same number of total parameters; however, this would inevitably introduce other biases, e.g., the number of hyper-parameters would differ.

Conclusions
In this paper we proposed a machine reading simulator to address the limitations of recurrent neural networks when processing inherently structured input. Our model is based on a Long Short-Term Memory architecture embedded with a memory network, explicitly storing contextual representations of input tokens without recursively compressing them. More importantly, an intra-attention mechanism is employed for memory addressing, as a way to induce undirected relations among tokens. The attention layer is not optimized with a direct supervision signal but together with the entire network on downstream tasks. Experimental results across three tasks show that our model yields performance comparable or superior to the state of the art.
Although our experiments focused on LSTMs, the idea of building more structure-aware neural models is general and can be applied to other types of networks. When direct supervision is provided, similar architectures can be adapted to tasks such as dependency parsing and relation extraction. In the future, we hope to develop more linguistically plausible neural architectures able to reason over nested structures, and neural models that learn to discover compositionality with weak or indirect supervision.