Recurrent Memory Networks for Language Modeling

Recurrent Neural Networks (RNNs) have obtained excellent results in many natural language processing (NLP) tasks. However, understanding and interpreting the source of this success remains a challenge. In this paper, we propose the Recurrent Memory Network (RMN), a novel RNN architecture that not only amplifies the power of RNNs but also facilitates our understanding of their internal functioning and allows us to discover underlying patterns in data. We demonstrate the power of the RMN on language modeling and sentence completion tasks. On language modeling, the RMN outperforms the Long Short-Term Memory (LSTM) network on three large German, Italian, and English datasets. Additionally, we perform an in-depth analysis of various linguistic dimensions that the RMN captures. On the Sentence Completion Challenge, for which it is essential to capture sentence coherence, our RMN obtains 69.2% accuracy, surpassing the previous state of the art by a large margin.


Introduction
Recurrent Neural Networks (RNNs) (Elman, 1990; Mikolov et al., 2010) are remarkably powerful models for sequential data. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), a specific architecture of RNN, has a track record of success in many natural language processing tasks such as language modeling (Józefowicz et al., 2015), dependency parsing (Dyer et al., 2015), and sentence compression (Filippova et al., 2015).
Within the context of natural language processing, a common assumption is that LSTMs are able to capture certain linguistic phenomena. Evidence supporting this assumption mainly comes from evaluating LSTMs in downstream applications: Bowman et al. (2015) carefully design two artificial datasets where sentences have explicit recursive structures. They show empirically that, while processing the input linearly, LSTMs can implicitly exploit recursive structures of languages. Filippova et al. (2015) find that using explicit syntactic features within LSTMs in their sentence compression model hurts the performance of the overall system. They then hypothesize that a basic LSTM is powerful enough to capture syntactic aspects which are useful for compression.
Understanding and explaining which linguistic dimensions are captured by an LSTM is non-trivial. This is because the sequences of input histories are compressed into several dense vectors by the LSTM's components, whose purposes with respect to representing linguistic information are not evident. To our knowledge, the only attempt to better understand the reasons for an LSTM's performance and limitations is the work of Karpathy et al. (2015), by means of visualization experiments and cell activation statistics in the context of character-level language modeling.
Our work is motivated by the difficulty in understanding and interpreting existing RNN architectures from a linguistic point of view. We propose the Recurrent Memory Network (RMN), a novel RNN architecture that combines the strengths of both LSTM and Memory Network (Sukhbaatar et al., 2015). In the RMN, the Memory Block component (a variant of Memory Network) accesses the most recent input words and selectively attends to words that are relevant for predicting the next word given the current LSTM state. By looking at the attention distribution over history words, our RMN allows us not only to interpret the results but also to discover underlying dependencies present in the data.
In this paper, we make the following contributions:
1. We propose a novel RNN architecture that complements LSTM in language modeling. We demonstrate that our RMN outperforms competitive LSTM baselines in terms of perplexity on three large German, Italian, and English datasets.
2. We perform an analysis along various linguistic dimensions that our model captures. This is possible only because the Memory Block allows us to look into its internal states and its explicit use of additional inputs at each time step.
3. We show that, with a simple modification, our RMN can be successfully applied to NLP tasks other than language modeling. On the Sentence Completion Challenge (Zweig and Burges, 2012), our model achieves an impressive 69.2% accuracy, surpassing the previous state of the art of 58.9% by a large margin.

Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have shown impressive performance on many sequential modeling tasks due to their ability to encode unbounded input histories. However, training simple RNNs is difficult because of the vanishing and exploding gradient problems (Bengio et al., 1994; Pascanu et al., 2013). A simple and effective solution for exploding gradients is gradient clipping, proposed by Pascanu et al. (2013). To address the more challenging problem of vanishing gradients, several variants of RNNs have been proposed. Among them, Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) are widely regarded as the most successful variants.
In this work, we focus on LSTMs because they have been shown to outperform GRUs on language modeling tasks (Józefowicz et al., 2015). In the following, we will detail the LSTM architecture used in this work.

Long Short-Term Memory
Notation: Throughout this paper, we denote matrices, vectors, and scalars using bold uppercase (e.g., W), bold lowercase (e.g., b), and lowercase (e.g., α) letters, respectively. In the LSTM used in this work, $x_t$ is the input vector at time step $t$, $h_{t-1}$ is the LSTM hidden state at the previous time step, and $W_*$ and $b_*$ are weights and biases. The symbol $\odot$ denotes the Hadamard product, or element-wise multiplication.
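For reference, a commonly used LSTM formulation that is compatible with the notation above can be sketched as follows. This is an assumption for illustration, not necessarily the exact variant used in this work: the weight names $W_{x*}$ and $W_{h*}$, the candidate vector $j_t$, the memory cell $c_t$, and the choice of nonlinearity for each gate are all illustrative, with $\sigma$ the logistic sigmoid and $\odot$ the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) && \text{(output gate)} \\
j_t &= \tanh(W_{xj} x_t + W_{hj} h_{t-1} + b_j) && \text{(candidate update)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot j_t && \text{(memory cell)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```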
Despite the popularity of LSTM in sequential modeling, its design is not straightforward to justify, and understanding why it works remains a challenge (Hermans and Schrauwen, 2013; Chung et al., 2014; Greff et al., 2015; Józefowicz et al., 2015; Karpathy et al., 2015). There have been a few recent attempts to understand the components of an LSTM from an empirical point of view: Greff et al. (2015) carry out a large-scale experiment on eight LSTM variants. The results from their 5,400 experimental runs suggest that forget gates and output gates are the most critical components of LSTMs. Józefowicz et al. (2015) evaluate over ten thousand RNN architectures and find that the initialization of the forget gate bias is crucial to the LSTM's performance. While these findings are important in helping to choose appropriate LSTM architectures, they do not shed light on what information is captured by the hidden states of an LSTM.
Bowman et al. (2015) show that a vanilla LSTM, such as the one described above, performs reasonably well compared to a recursive neural network (Socher et al., 2011) that explicitly exploits tree structures on two artificial datasets. They find that LSTMs can effectively exploit recursive structure in the artificial datasets. In contrast to these simple datasets, which contain only a few logical operations, natural languages exhibit highly complex patterns. The extent to which linguistic assumptions about syntactic structures and compositional semantics are reflected in LSTMs is rather poorly understood. Thus, it is desirable to have a more principled mechanism allowing us to inspect recurrent architectures from a linguistic perspective. In the following section, we propose such a mechanism.

Recurrent Memory Network
It has been demonstrated that RNNs can retain input information over a long period. However, existing RNN architectures make it difficult to analyze exactly what information is retained in their hidden states at each time step, especially when the data has complex underlying structures, which is common in natural language. Motivated by this difficulty, we propose a novel RNN architecture called Recurrent Memory Network (RMN). On linguistic data, the RMN allows us not only to qualify which linguistic information is preserved over time and why this is the case, but also to discover dependencies within the data (Section 5). Our RMN consists of two components: an LSTM and a Memory Block (MB) (Section 3.1). The MB takes the hidden state of the LSTM and compares it to the most recent inputs using an attention mechanism (Gregor et al., 2015; Graves et al., 2014). Thus, analyzing the attention weights of a trained model can give us valuable insight into the information that is retained over time in the LSTM.
In the following, we describe in detail the MB architecture and the combination of the MB and the LSTM to form an RMN.

Memory Block
Figure 1: A graphical representation of the MB.
The Memory Block (Figure 1) is a variant of Memory Network (Sukhbaatar et al., 2015) with one hop (or a single-layer Memory Network). At time step $t$, the MB receives two inputs: the hidden state $h_t$ of the LSTM and a set $\{x_i\}$ of the $n$ most recent words, including the current word $x_t$. We refer to $n$ as the memory size. Internally, the MB consists of two lookup tables $M$ and $C$ of size $|V| \times d$, where $|V|$ is the size of the vocabulary. With a slight abuse of notation, we denote by $M_i = M(\{x_i\})$ and $C_i = C(\{x_i\})$ the $n \times d$ matrices whose rows are the input memory embeddings $m_i$ and the output memory embeddings $c_i$ of the elements of the set $\{x_i\}$. We use the matrix $M_i$ to compute an attention distribution over the set $\{x_i\}$:
$$p_t = \mathrm{softmax}(M_i h_t) \qquad (1)$$
When dealing with data that exhibits a strong temporal relationship, such as natural language, an additional temporal matrix $T \in \mathbb{R}^{n \times d}$ can be used to bias attention with respect to the position of the data points. In this case, equation 1 becomes
$$p_t = \mathrm{softmax}((M_i + T) h_t) \qquad (2)$$
We then use the attention distribution $p_t$ to compute a context vector representation of $\{x_i\}$:
$$s_t = C_i^{\top} p_t \qquad (3)$$
Finally, we combine the context vector $s_t$ and the hidden state $h_t$ by a function $g(\cdot)$ to obtain the output $h^m_t$ of the MB. Instead of using a simple addition function $g(s_t, h_t) = s_t + h_t$ as in Sukhbaatar et al. (2015), we propose to use a gating unit that decides how much it should trust the hidden state $h_t$ and the context $s_t$ at time step $t$. Our gating unit is a form of Gated Recurrent Unit (Chung et al., 2014), in which $z_t$ is an update gate and $r_t$ is a reset gate. The choice of the composition function $g(\cdot)$ is crucial for the MB, especially when one of its inputs comes from the LSTM. The simple addition function might overwrite the information within the LSTM's hidden state and therefore prevent the MB from keeping track of information in the distant past. The gating function, on the other hand, can control the degree of information that flows from the LSTM to the MB's output.
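The Memory Block computation described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the GRU-style gate parameterization (the matrices named `W_z`, `U_z`, `W_r`, `U_r`, `W`, `U`), the alignment of temporal rows with memory slots, and all helper names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_sum(A, a, B, b):
    """Return A a + B b element-wise, as a list."""
    return [p + q for p, q in zip(matvec(A, a), matvec(B, b))]


def memory_block(h_t, word_ids, M, C, T, gate):
    """One MB step over the most recent word ids (assumed oldest first)."""
    n, d = len(word_ids), len(h_t)
    # Input memory embeddings plus temporal bias. Row i of T is paired with
    # memory slot i (an assumption); short histories use only the first rows.
    keys = [[m + t for m, t in zip(M[w], T[i])] for i, w in enumerate(word_ids)]
    p_t = softmax(matvec(keys, h_t))  # attention over the n words
    # Context vector: attention-weighted sum of output memory embeddings.
    s_t = [sum(p_t[i] * C[word_ids[i]][k] for i in range(n)) for k in range(d)]
    # GRU-style composition (assumed form): z_t update gate, r_t reset gate.
    z_t = [sigmoid(v) for v in gate_sum(gate["W_z"], s_t, gate["U_z"], h_t)]
    r_t = [sigmoid(v) for v in gate_sum(gate["W_r"], s_t, gate["U_r"], h_t)]
    reset_h = [r * h for r, h in zip(r_t, h_t)]
    cand = [math.tanh(v) for v in gate_sum(gate["W"], s_t, gate["U"], reset_h)]
    h_m = [(1.0 - z) * h + z * c for z, h, c in zip(z_t, h_t, cand)]
    return p_t, h_m


def random_matrix(rows, cols, rng):
    """Uniform initialization in (-0.05, 0.05)."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
V, d, n = 10, 4, 3  # toy vocabulary size, embedding size, memory size
M, C, T = (random_matrix(V, d, rng), random_matrix(V, d, rng),
           random_matrix(n, d, rng))
gate = {k: random_matrix(d, d, rng)
        for k in ("W_z", "U_z", "W_r", "U_r", "W", "U")}
p_t, h_m = memory_block([0.1, -0.2, 0.3, 0.0], [7, 2, 5], M, C, T, gate)
```

The returned `p_t` is the attention distribution that the later analysis inspects, and `h_m` is the MB output; with the update gate near zero, the output stays close to the LSTM state, which is the behavior the gating argument above relies on.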

RMN Architectures
As explained above, our proposed MB receives the hidden state of the LSTM as one of its inputs. This leads to an intuitive combination of the two units by stacking the MB on top of the LSTM. We call this architecture Recurrent-Memory (RM). The RM architecture, however, does not allow interaction between Memory Blocks at different time steps. To enable this interaction, we can stack one more LSTM layer on top of the RM. We call this architecture Recurrent-Memory-Recurrent (RMR).

Language Model Experiments
Language models play a crucial role in many NLP applications such as machine translation and speech recognition. Language modeling also serves as a standard test bed for newly proposed models (Sukhbaatar et al., 2015; Kalchbrenner et al., 2015). We conjecture that, by explicitly accessing history words, RMNs will offer better predictive power than the existing recurrent architectures. We therefore evaluate our RMN architectures against state-of-the-art LSTMs in terms of perplexity.
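Since perplexity is the evaluation measure used throughout, a short sketch of its standard definition may help: the exponential of the average negative log-likelihood per token. The function name is illustrative, and the input is assumed to be natural-log probabilities that a model assigns to each test token.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k words at each step.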

Data
We evaluate our models on three languages: English, German, and Italian. We are especially interested in German and Italian because of their larger vocabularies and complex agreement patterns. Table 1 summarizes the data used in our experiments.

Table 1: Data used in our experiments (columns: Lang, Train, Dev, Test).
… (Bojar et al., 2015). For German, we use the first 6M tokens from the News Commentary data and 16M tokens from News Crawl 2014 for training. For development and test data, we use the remaining part of the News Commentary data concatenated with the WMT 2009-2014 test sets. Finally, for Italian, we use a selection of 29M tokens from the PAISÀ corpus (Lyding et al., 2014), mainly including Wikipedia pages and, to a minor extent, Wikibooks and Wikinews documents. For development and test, we randomly draw documents from the same corpus.

Setup
Our baselines are a 5-gram language model with Kneser-Ney smoothing, a Memory Network (MemN) (Sukhbaatar et al., 2015), a vanilla single-layer LSTM, and two stacked LSTMs with two and three layers, respectively. N-gram models have been used intensively in many applications for their excellent performance and fast training. Chen et al. (2015) show that an n-gram model outperforms a popular feed-forward language model (Bengio et al., 2003) on a one billion word benchmark (Chelba et al., 2013). While taking longer to train, RNNs have been proven superior to n-gram models.
We compare these baselines with our two model architectures: RMR and RM. For each of our models, we consider two settings: with or without temporal matrix (+tM or -tM), and linear vs. gating composition function. In total, we experiment with eight RMN variants.
For all neural network models, we set the dimension of the word embeddings, the LSTM hidden states, its gates, and the memory input and output embeddings to 128. The memory size is set to 15. The bias of the LSTM's forget gate is initialized to 1 (Józefowicz et al., 2015), while all other parameters are initialized uniformly in (−0.05, 0.05). The initial learning rate is set to 1 and is halved at each epoch after the fourth epoch. All models are trained for 15 epochs with standard stochastic gradient descent (SGD). During training, we rescale the gradients whenever their norm is greater than 5 (Pascanu et al., 2013).
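The learning-rate schedule and gradient rescaling above can be sketched as follows. This reflects one reading of the description: epochs are assumed to be 1-indexed, the rate stays at 1 through epoch 4 and is halved at every later epoch, and the norm is taken over all gradient values jointly. Function names are illustrative.

```python
import math


def learning_rate(epoch, initial=1.0, hold_epochs=4):
    """Rate for a 1-indexed epoch: constant at first, then halved each epoch."""
    return initial * 0.5 ** max(0, epoch - hold_epochs)


def clip_by_norm(grads, max_norm=5.0):
    """Rescale a flat list of gradient values if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


schedule = [learning_rate(e) for e in range(1, 16)]  # 15 training epochs
```

Under this reading, the rate in the final (15th) epoch is 0.5 ** 11, so most of the learning happens in the first several epochs.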
Sentences with the same length are grouped into buckets. Then, mini-batches of 20 sentences are drawn from each bucket. We do not use truncated back-propagation through time; instead, gradients are fully back-propagated from the end of each sentence to its beginning. When feeding in a new mini-batch, the hidden states of the LSTMs are reset to zeros, which ensures that the data is properly modeled at the sentence level. For our RMN models, instead of using padding, at time step $t < n$ we use a slice $T[1:t] \in \mathbb{R}^{t \times d}$ of the temporal matrix $T \in \mathbb{R}^{n \times d}$.
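The bucketing scheme can be sketched as below. This is an illustrative sketch only: the shuffling within and across buckets, the handling of a final smaller batch, and the function name are assumptions not specified in the text.

```python
import random
from collections import defaultdict


def bucketed_minibatches(sentences, batch_size=20, seed=0):
    """Group sentences by length, then cut each bucket into mini-batches."""
    buckets = defaultdict(list)
    for sentence in sentences:
        buckets[len(sentence)].append(sentence)
    rng = random.Random(seed)
    batches = []
    for same_length in buckets.values():
        rng.shuffle(same_length)  # assumed: random order within a bucket
        for start in range(0, len(same_length), batch_size):
            batches.append(same_length[start:start + batch_size])
    rng.shuffle(batches)  # assumed: batches are visited in random order
    return batches


toy_corpus = [["w"] * length for length in (3, 3, 3, 5, 5)]
toy_batches = bucketed_minibatches(toy_corpus, batch_size=2)
```

Because every sentence in a batch has the same length, no padding is needed, which matches the choice of slicing the temporal matrix rather than padding short histories.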

Results
Perplexities on the test data are given in Table 2. All RMN variants largely outperform n-gram and MemN models, and most RMN variants also outperform the competitive LSTM baselines. The best results overall are obtained by RM with temporal matrix and gating composition (+tM-g).
Our results agree with the hypothesis of mitigating prediction error by explicitly using the last n words in RNNs (Karpathy et al., 2015). We further observe that using a temporal matrix always benefits the RM architectures. This can be explained by seeing the RM as a principled way to combine an LSTM and a neural n-gram model. By contrast, RMR works better without a temporal matrix, but its overall performance is not as good as that of RM. This suggests that we need a better mechanism to address the interaction between MBs, which we leave to future work. Finally, the proposed gating composition function outperforms the linear one in most cases.
For historical reasons, we also run a stacked three-layer LSTM and an RM(+tM-g) on the much smaller Penn Treebank dataset (Marcus et al., 1993) with the same settings described above. The respective perplexities are 126.1 and 123.5.

Attention Analysis
The goal of our RMN design is twofold: (i) to obtain better predictive power and (ii) to facilitate understanding of the model and discover patterns in data. In Section 4, we have validated the predictive power of the RMN and below we investigate the source of this performance based on linguistic assumptions of word co-occurrences and dependency structures.

Positional and lexical analysis
As a first step towards understanding RMN, we look at the average attention weights of each history word position in the MB of our two best model variants (Figure 3). One can see that the attention mass tends to concentrate at the rightmost position (the current word) and decreases when moving further to the left (less recent words). This is not surprising since the success of n-gram language models has demonstrated that the most recent words provide important information for predicting the next word. Between the two variants, the RM average attention mass is less concentrated to the right. This can be explained by the absence of an LSTM layer on top, meaning that the MB in the RM architecture has to pay more attention to the more distant words in the past. The remaining analyses described below are performed on the RM(+tM-g) architecture as this yields the best perplexity results overall.
Beyond average attention weights, we are interested in those cases where attention focuses on distant positions. To this end, we randomly sample 100 words from the test data and visualize attention distributions over the last 15 words. Figure 4 shows the attention distributions for random samples of German and Italian. Again, in many cases attention weights concentrate around the last word (bottom row). However, we observe that many long-distance words also receive noticeable attention mass. Interestingly, for many predicted words, attention is distributed evenly over memory positions, possibly indicating cases where the LSTM state already contains enough information to predict the next word.
To explain the long-distance dependencies, we first hypothesize that our RMN mostly memorizes frequent co-occurrences. We run the RM(+tM-g) model on the German development and test sentences, and select those pairs of (most-attended-word, word-to-predict) where the MB's attention concentrates on a word more than six positions to the left. Then, for each set of pairs with equal distance, we compute the mean frequency of corresponding co-occurrences seen in the training data (Table 3). The lack of correlation between frequency and memory location suggests that RMN does more than simply memorize frequent co-occurrences.
Table 3: Mean frequency (µ) of (most-attended-word, word-to-predict) pairs grouped by relative distance (d).
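The grouping step behind Table 3 can be sketched as follows. This is a hypothetical reconstruction of the bookkeeping only: how co-occurrences are counted in the training data is not specified here, so `cooccurrence_counts` is assumed to be a precomputed mapping from word pairs to training-set counts, and the toy numbers are invented for illustration.

```python
from collections import defaultdict


def mean_frequency_by_distance(pairs, cooccurrence_counts):
    """Average training-set co-occurrence count per relative distance.

    pairs: iterable of (attended_word, predicted_word, distance).
    cooccurrence_counts: dict mapping (attended_word, predicted_word) to a
    count; unseen pairs count as zero.
    """
    totals = defaultdict(lambda: [0, 0])  # distance -> [sum of counts, pairs]
    for attended, predicted, distance in pairs:
        entry = totals[distance]
        entry[0] += cooccurrence_counts.get((attended, predicted), 0)
        entry[1] += 1
    return {d: total / count for d, (total, count) in sorted(totals.items())}


# Toy input with invented counts, only to show the shape of the computation.
toy_pairs = [("hängt", "ab", 7), ("findet", "statt", 7), ("kehrte", "zurück", 9)]
toy_counts = {("hängt", "ab"): 10, ("findet", "statt"): 30}
toy_means = mean_frequency_by_distance(toy_pairs, toy_counts)
```

If the model were only memorizing frequent pairs, one would expect the per-distance means to fall off systematically with distance; the comparison in Table 3 is what argues against that.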
Previous work (Hermans and Schrauwen, 2013; Karpathy et al., 2015) studied this property of LSTMs by analyzing simple cases of closing brackets. By contrast, RMN allows us to discover more interesting dependencies in the data. We manually inspect those high-frequency pairs to see whether they display certain linguistic phenomena. We observe that RMN captures, for example, separable verbs and fixed expressions in German. Separable verbs are frequent in German: they typically consist of preposition+verb constructions, such as ab+hängen ('to depend') or aus+schließen ('to exclude'), and can be spelled together (abhängen) or apart as in 'hängen von der Situation ab' ('depend on the situation'), depending on the grammatical construction. Figure 5a shows a long-dependency example for the separable verb abhängen (to depend). When predicting the verb's particle ab, the model correctly attends to the verb's core hängt occurring seven words to the left. Figures 5b and 5c show fixed expression examples from German and Italian, respectively: schlüsselrolle ... spielen (play a key role) and insignito ... titolo (awarded title). Here too, the model correctly attends to the key word despite its long distance from the word to predict.
Figure 5a. Predictions (with scores): ab (-1.8), und (-2.1), "," (-2.5), "." (-2.7), von (-2.8)
Sentence: wie wirksam die daraus resultierende strategie sein wird , hängt daher von der genauigkeit dieser annahmen
Gloss: how effective the from-that resulting strategy be will, depends therefore on the accuracy of-these measures
Translation: how effective the resulting strategy will be, therefore, depends on the accuracy of these measures
Figure 5b. Predictions (with scores): spielen (-1.9), gewinnen (-3.0), finden (-3.4), haben (-3.4), schaffen (-3.4)
Sentence: … die lage versetzen werden , eine schlüsselrolle bei der eindämmung der regionalen ambitionen chinas zu
Gloss: … the position place will, a key-role in the curbing of-the regional ambitions China's to
Translation: …which will put him in a position to play a key role in curbing the regional ambitions of China
Figure 5c. Predictions (with scores): titolo (-2.9), re (-3.0), <unk> (-3.1), leone (-3.6)
Sentence: ... che fu insignito nel 1692 dall' Imperatore Leopoldo I del
Other interesting examples found by the RMN in the test data include: German: findet statt (takes place), kehrte zurück (came back), fragen antworten (questions answers), kämpfen gegen (fight against), bleibt erhalten (remains intact), verantwortung übernimmt (takes responsibility); Italian: sinistra destra (left right), latitudine longitudine (latitude longitude), collegata tramite (connected through), sposò figli (got-married children), insignito titolo (awarded title).

Syntactic analysis
It has been conjectured that RNNs, and LSTMs in particular, model text so well because they capture syntactic structure implicitly. Unfortunately, this has been hard to prove, but with our RMN model we can get closer to answering this important question. We produce dependency parses for our test sets using the parser of Sennrich et al. (2013) for German and that of Attardi et al. (2009) for Italian. Next, we look at how much attention mass is concentrated by the RM(+tM-g) model on different dependency types. Figure 6 shows, for each language, a selection of ten dependency types that are often long-distance. Dependency direction is marked by an arrow: e.g., →mod means that the word to predict is a modifier of the attended word, while mod← means that the attended word is a modifier of the word to predict. White cells denote combinations of position and dependency type that were not present in the test data.
While in most cases the closest positions are attended the most, we can see that some dependency types also receive noticeably more attention than the average (ALL) on the long-distance positions. In German, this is mostly visible for the head of separable verb particles (→avz), which nicely supports our observations in the lexical analysis (Section 5.1). Other attended dependencies include: auxiliary verbs (→aux) when predicting the second element of a complex tense (hat ... gesagt / has said); subordinating conjunctions (konj←) when predicting the clause-final inflected verb (dass sie sagen sollten / that they should say); control verbs (→obji) when predicting the infinitive verb (versucht ihr zu helfen / tries to help her). Out of the Italian dependency types selected for their frequent long-distance occurrences (bottom of Figure 6), the most attended are argument heads (→arg), complement heads (→comp), object heads (→obj) and subjects (subj←). This suggests that RMN is mainly capturing predicate argument structure in Italian. Notice that syntactic annotation is never used to train the model, but only to analyze its predictions.
We can also use RMN to discover which complex dependency paths are important for word prediction. To mention just a few examples, high attention on the German path [subj←,→kon,→cj] indicates that the model captures morphological agreement between coordinate clauses in non-trivial constructions of the kind: spielen die Kinder im Garten und singen / the children play in the garden and sing. In Italian, high attention on the path [→obj,→comp,→prep] denotes cases where the semantic relatedness between a verb and its object does not stop at the object's head, but percolates down to a prepositional phrase attached to it (passò buona parte della sua vita / spent a large part of his life). Interestingly, both local n-gram context and immediate dependency context would have missed these relations. While much remains to be explored, our analysis shows that RMN discovers patterns far more complex than pairs of opening and closing brackets, and suggests that the network's hidden state captures to a large extent the underlying structure of text.

Sentence Completion Challenge
The Microsoft Research Sentence Completion Challenge (Zweig and Burges, 2012) has recently become a test bed for advancing statistical language modeling. We choose this task to demonstrate the effectiveness of our RMN in capturing sentence coherence. The test set consists of 1,040 sentences selected from five Sherlock Holmes novels by Conan Doyle. For each sentence, a content word is removed and the task is to identify the correct missing word among five given candidates. The task is carefully designed to be non-solvable for local language models such as n-gram models. The best reported result is 58.9% accuracy, which is far below the human accuracy of 91% (Zweig and Burges, 2012).
As a baseline, we use a stacked three-layer LSTM. Our models are two variants of RM(+tM-g), each consisting of three LSTM layers followed by an MB. The first variant (unidirectional-RM) uses the n words preceding the word to predict; the second (bidirectional-RM) uses the n words preceding and the n words following the word to predict as MB input. We include bidirectional-RM in the experiments to show the flexibility of utilizing future context in RMN.
We train all models on the standard training data of the challenge, which consists of 522 novels from Project Gutenberg, preprocessed similarly to Mnih and Kavukcuoglu (2013). After sentence splitting, tokenization, and lowercasing, we randomly select 19,000 sentences for validation. Training and validation sets include 47M and 190K tokens, respectively. The vocabulary size is about 64,000.
We initialize and train all the networks as described in Section 4.2. Moreover, for regularization, we place dropout (Srivastava et al., 2014) after each LSTM layer, as suggested by Pham et al. (2014). The dropout rate is set to 0.3 in all the experiments. Table 4 summarizes the results. It is worth mentioning that our LSTM baseline outperforms a dependency RNN making explicit use of syntactic information (Mirowski and Vlachos, 2015) and performs on par with the best published result … bring additional advantage to the model. Mnih and Kavukcuoglu (2013) also report a similar observation. We believe that RMN may achieve further improvements with hyper-parameter optimization.
Table 4: Accuracy on 1,040 test sentences. We use perplexity to choose the best model. Dimension of word embeddings, LSTM hidden states, and gate g parameters are set to d.
My morning's work has not been ___ , since it has proved that he has the very strongest motives for standing in the way of anything of the sort
a) invisible b) neglected ♦♣ c) overlooked d) wasted e) deliberate
That is his ___ fault , but on the whole he's a good worker
a) main b) successful c) mother's ♣ d) generous e) favourite ♦
Figure 7: Examples of sentence completion. The correct option is in boldface. Predictions by the LSTM baseline and by our best RMN model are marked by ♦ and ♣ respectively.
Figure 7 shows some examples where our best RMN beats the already very competitive LSTM baseline, or where both models fail. We can see that in some sentences the necessary clues to predict the correct word occur only to its right. While this seems to conflict with the worse result obtained by the bidirectional-RM, it is important to realize that prediction corresponds to the whole sentence probability. Therefore, a badly chosen word can have a negative effect on the score of future words. This appears to be particularly true for the RMN due to its ability to directly access (distant) words in the history. The better performance of unidirectional versus bidirectional-RM may indicate that the attention in the memory block can be distributed reliably only on words that have already been seen and summarized by the current LSTM state. In future work, we may investigate whether different ways to combine two RMNs running in opposite directions further improve accuracy on this challenging task.
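Selecting an answer by whole-sentence probability, as discussed above, can be sketched as follows. The scoring function is a stand-in: `sentence_log_prob` is a hypothetical callable that returns the language model's log-probability of a complete token sequence, and the toy scorer and sentence below are invented purely to exercise the code.

```python
def complete_sentence(tokens, blank_index, candidates, sentence_log_prob):
    """Return the candidate whose filled-in sentence scores highest."""
    best_word, best_score = None, float("-inf")
    for word in candidates:
        filled = tokens[:blank_index] + [word] + tokens[blank_index + 1:]
        score = sentence_log_prob(filled)  # scores every word, not just the gap
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy stand-in for a trained model: a fixed per-word log-score table.
toy_scores = {"sat": -1.0, "ran": -2.0, "blue": -6.0}


def toy_log_prob(tokens):
    return sum(toy_scores.get(token, -3.0) for token in tokens)


choice = complete_sentence(["the", "cat", "___", "down"], 2,
                           ["sat", "ran", "blue"], toy_log_prob)
```

Because the whole sequence is scored, a candidate also changes the probabilities of the words that follow it, which is the effect the paragraph above points to when explaining why a badly chosen word hurts the score of later words.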

Conclusion
We have proposed the Recurrent Memory Network (RMN), a novel recurrent architecture for language modeling. Our RMN outperforms LSTMs in terms of perplexity on three large datasets and allows us to analyze its behavior from a linguistic perspective. We find that RMNs learn important co-occurrences regardless of their distance. Even more interestingly, our RMN implicitly captures certain dependency types that are important for word prediction, despite being trained without any syntactic information. Finally, RMNs obtain excellent performance at modeling sentence coherence, setting a new state of the art on the challenging sentence completion task.