Improved Language Modeling by Decoding the Past

Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our Past Decode Regularization (PDR) method improves perplexity on the Penn Treebank dataset by up to 1.8 points and on the WikiText-2 dataset by up to 2.3 points, over strong regularized baselines using a single softmax. With a mixture-of-softmax model, we show gains of up to 1.0 perplexity point on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character level language modeling.


INTRODUCTION
Language modeling is a fundamental task in natural language processing. Given a sequence of tokens, its joint probability distribution can be modeled using the auto-regressive conditional factorization. This leads to a convenient formulation where a language model has to predict the next token given a sequence of tokens as context. Recurrent neural networks are an effective way to compute distributed representations of the context by sequentially operating on the embeddings of the tokens. These representations can then be used to predict the next token as a probability distribution over a fixed vocabulary using a linear decoder followed by Softmax.
Starting from the work of Mikolov et al. (2010), there has been a long list of works that seek to improve language modeling performance using more sophisticated recurrent neural networks (RNNs) (Zaremba et al. (2014); Zilly et al. (2017); Zoph & Le (2016); Mujika et al. (2017)). However, in more recent work, vanilla LSTMs (Hochreiter & Schmidhuber (1997)) with a relatively large number of parameters have been shown to achieve state-of-the-art performance on several standard benchmark datasets, both in word-level and character-level perplexity (Merity et al. (2018a;b); Melis et al. (2018)). A key component in these models is the use of several forms of regularization, e.g. variational dropout on the token embeddings (Gal & Ghahramani (2016)), dropout on the hidden-to-hidden weights in the LSTM (Wan et al. (2013)), norm regularization on the outputs of the LSTM, and classical dropout (Srivastava et al. (2014)). By carefully tuning the hyperparameters associated with these regularizers, combined with optimization algorithms like NT-ASGD (a variant of Averaged SGD), it is possible to achieve very good performance. Each of these regularizations addresses a different part of the LSTM model and is a general technique that could be applied to any other sequence modeling problem.
In this paper, we propose a regularization technique that is specific to language modeling. One unique aspect of language modeling using LSTMs (or any RNN) is that at each time step t, the model takes as input a particular token x_t from a vocabulary W and, using the hidden state of the LSTM (which encodes the context up to x_t), predicts a probability distribution w_{t+1} on the next token x_{t+1} over the same vocabulary as output. Since x_t can be mapped to a trivial probability distribution over W, this operation can be interpreted as transforming distributions over W (Inan et al. (2016)). Clearly, the output distribution is dependent on and is a function of x_t and the context further in the past, and encodes information about them. We ask the following question: how much information is it possible to decode about the input distribution (and hence x_t) from the output distribution w_{t+1}?
In general, it is impossible to decode x_t unambiguously. Even if the language model is perfect and predicts x_{t+1} with probability 1, there could be many tokens preceding it. However, in this case the number of possibilities for x_t will be limited, as dictated by the bigram statistics of the corpus and the language in general. We argue that biasing the language model such that it is possible to decode more information about the past tokens from the predicted next-token distribution is beneficial. We incorporate this intuition into a regularization term in the loss function of the language model. The symmetry in the inputs and outputs of the language model at each step lends itself to a simple decoding operation. It can be cast as a (pseudo) language modeling problem in "reverse", where the future prediction w_{t+1} acts as the input and the last token x_t acts as the target of prediction. The token embedding matrix and the weights of the linear decoder of the main language model can be reused in the past decoding operation. We only need a few extra parameters to model the nonlinear transformation performed by the LSTM, which we do using a simple stateless layer. We compute the cross-entropy loss between the decoded distribution for the past token and x_t, and add it to the main loss function after suitable weighting. The extra parameters used in past decoding are discarded at inference time. We call our method Past Decode Regularization, or PDR for short.
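The observation that bigram statistics bound what can be decoded about x_t can be made concrete on a toy corpus. The sketch below is purely illustrative, with made-up data and our own variable names:

```python
import numpy as np

# toy vocabulary and corpus (made-up data for illustration)
vocab = ["the", "cat", "sat", "on", "mat"]
corpus = ["the", "cat", "sat", "on", "the", "mat"]
idx = {w: i for i, w in enumerate(vocab)}

# bigram counts: bigram[i, j] = number of times token j follows token i
bigram = np.zeros((len(vocab), len(vocab)), dtype=int)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[idx[prev], idx[nxt]] += 1

# if a perfect model predicts "the" as the next token, the possible
# predecessors are exactly the tokens with a nonzero count in its column
predecessors = [vocab[i] for i in np.nonzero(bigram[:, idx["the"]])[0]]
print(predecessors)  # only "on" precedes "the" in this toy corpus
```

Even in this trivial example, a confident next-token prediction narrows the previous token down to a small set, which is the information PDR encourages the model to retain.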
We conduct extensive experiments on four benchmark datasets for word level and character level language modeling by combining PDR with existing LSTM based language models and achieve new state-of-the-art performance on three of them.

PAST DECODE REGULARIZATION (PDR)
Let X = (x_1, x_2, \cdots, x_t, \cdots, x_T) be a sequence of tokens. In this paper, we will experiment with both word level and character level language modeling; therefore, tokens can be either words or characters. The joint probability P(X) factorizes into

P(X) = \prod_{t=0}^{T-1} P(x_{t+1} | x_1, x_2, \cdots, x_t)   (1)

Let c_t = (x_1, x_2, \cdots, x_t) denote the context available to the language model for x_{t+1}. Let W denote the vocabulary of tokens, each of which is embedded into a vector of dimension d. Let E denote the token embedding matrix of dimension |W| \times d and e_w denote the embedding of w \in W. An LSTM computes a distributed representation of c_t in the form of its output hidden state h_t, which we assume has dimension d as well. The probability that the next token is w can then be calculated using a linear decoder followed by a Softmax layer as

P_\theta(w | c_t) = \mathrm{Softmax}(h_t E^\top + b)\big|_w = \frac{\exp(h_t e_w^\top + b_w)}{\sum_{w' \in W} \exp(h_t e_{w'}^\top + b_{w'})}   (2)

where b_{w'} is the entry corresponding to w' in a bias vector b of dimension |W| and \big|_w represents projection onto w. Here we assume that the weights of the decoder are tied with the token embedding matrix E (Inan et al. (2016); Press & Wolf (2017)). To optimize the parameters \theta of the language model, the loss function to be minimized during training is set as the cross-entropy between the predicted distribution P_\theta(w | c_t) and the actual next token x_{t+1}, averaged over all time steps:

L = -\frac{1}{T} \sum_{t=0}^{T-1} \log P_\theta(x_{t+1} | c_t)   (3)
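Assuming these definitions, the tied-weight decoder of Eq. (2) and the cross-entropy loss of Eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration with our own shapes and names, not the authors' code:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_distribution(h, E, b):
    # Eq. (2): tied-weight decoder; h is (T, d), E is (|W|, d), b is (|W|,)
    return softmax(h @ E.T + b)

def lm_loss(h, targets, E, b):
    # Eq. (3): mean cross-entropy of the true next tokens x_{t+1}
    probs = next_token_distribution(h, E, b)  # (T, |W|)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))
```

Note that the same matrix E serves as both the embedding lookup and the decoder weights, which is the weight tying assumed in the text.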
Note that Eq. (2), when applied to all w \in W, produces a 1 \times |W| vector w_{t+1}, encapsulating the prediction the language model has about the next token x_{t+1}. Since this is dependent on and conditioned on c_t, w_{t+1} clearly encodes information about it, in particular about the last token x_t in c_t. In turn, it should be possible to infer or decode some limited information about x_t from w_{t+1}. We argue that by biasing the model to be more accurate in recalling information about past tokens, we can help it predict the next token better.
To this end, we define the following decoding operation to compute a probability distribution over w_c \in W as the last token in the context:

w_t^r = \mathrm{Softmax}(f_{\theta_r}(w_{t+1} E) E^\top + b'_{\theta_r})   (4)

Here f_{\theta_r} is a non-linear function that maps vectors in R^d to vectors in R^d and b'_{\theta_r} is a bias vector of dimension |W|, together constituting the parameters \theta_r. In effect, we are decoding the past, i.e. the last token x_t in the context. This produces a vector w_t^r of dimension 1 \times |W|. The cross-entropy loss with respect to the actual last token x_t can then be computed as

L_{PDR} = -\frac{1}{T} \sum_{t=0}^{T-1} \log \left( w_t^r \big|_{x_t} \right)   (5)

Here PDR stands for Past Decode Regularization. L_{PDR} captures the extent to which the decoded distribution of tokens differs from the actual tokens x_t in the context. Note the symmetry between Eqs. (2) and (4). The "input" in the latter case is w_{t+1} and the "context" is provided by a nonlinear transformation of w_{t+1}E. Different from the former, the context in Eq. (4) does not preserve any state information across time steps, as we want to decode only using w_{t+1}. The term w_{t+1}E can be interpreted as a "soft" token embedding lookup, where the token vector w_{t+1} is a probability distribution instead of a unit vector.
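The past decoding of Eq. (4) and the loss of Eq. (5) mirror the forward decoder. The sketch below (our own naming, a hedged illustration rather than the authors' implementation) takes f_{\theta_r} to be a single fully connected layer followed by Tanh, the simple choice described in the text, and reuses E for both the soft embedding lookup and the output projection:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode_past(w_next, E, W1, b1, b_r):
    # Eq. (4): "soft" embedding lookup w_{t+1} E, then f_{theta_r}
    # (one fully connected layer + tanh), then the tied decoder E^T + bias
    soft_emb = w_next @ E               # (T, d)
    ctx = np.tanh(soft_emb @ W1 + b1)   # f_{theta_r}: R^d -> R^d
    return softmax(ctx @ E.T + b_r)     # (T, |W|) distribution over x_t

def pdr_loss(w_next, prev_tokens, E, W1, b1, b_r):
    # Eq. (5): mean cross-entropy against the actual last tokens x_t
    w_r = decode_past(w_next, E, W1, b1, b_r)
    return -np.mean(np.log(w_r[np.arange(len(prev_tokens)), prev_tokens]))
```

The extra parameters here are only W1, b1 and b_r; as stated below, they are used during training and discarded at inference time.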
We add \lambda_{PDR} L_{PDR} to the loss function in Eq. (3) as a regularization term, where \lambda_{PDR} is a positive weighting coefficient, to construct the following new loss function for the language model:

L' = L + \lambda_{PDR} L_{PDR}   (6)
Thus, equivalently, PDR can also be viewed as a method of defining an augmented loss function for language modeling. The choice of \lambda_{PDR} dictates the degree to which we want the language model to incorporate our inductive bias, i.e. decodability of the last token in the context. If it is too large, the model will fail at its primary task of predicting the next token. If it is zero or too small, the model will retain less information about the last token, which hampers its predictive performance. In practice, we choose \lambda_{PDR} by a search based on validation set performance.
Note that the trainable parameters \theta_r associated with PDR are used only during training to bias the language model and are not used at inference time. This also means that it is important to control the complexity of the nonlinear function f_{\theta_r} so as not to overly bias the training. As a simple choice, we use a single fully connected layer of size d followed by a Tanh nonlinearity as f_{\theta_r}. This introduces only a few extra parameters and a small increase in training time compared to a model not using PDR.
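Concretely, with the word level dimensions used later (d = 400 and a PTB vocabulary of roughly 10K), this choice of f_{\theta_r} adds d·d + d weights plus a bias of size |W|. A quick check of the overhead (our own arithmetic; the ~24M parameter figure for AWD-LSTM on PTB is an assumed rough reference, not a number from this paper):

```python
d, V = 400, 10_000            # embedding size and approximate PTB vocabulary
extra = d * d + d + V         # W1 and b1 of f_{theta_r}, plus the bias b'
awd_lstm_params = 24_000_000  # assumed rough parameter count on PTB
overhead = extra / awd_lstm_params
print(extra, f"{100 * overhead:.2f}%")  # a fraction of a percent, consistent
                                        # with the "less than 1%" figure below
```

Since all of these parameters belong to the decoding branch, they can simply be dropped once training is finished.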

EXPERIMENTS
We present extensive experimental results to show the efficacy of PDR for language modeling on four standard benchmark datasets, two each for word level and character level language modeling. For word level language modeling, we evaluate our method on the Penn Treebank (PTB) (Mikolov et al. (2010)) and WikiText-2 (WT2) (Merity et al. (2016)) datasets. For character level language modeling, we use the Penn Treebank Character (PTBC) (Mikolov et al. (2010)) dataset and the Hutter Prize Wikipedia dataset (Hutter (2018)), also known as Enwik8. Key statistics for these datasets are presented in Table 1.
As mentioned in the introduction, some of the best existing results on these datasets are obtained by using relatively large LSTMs with extensive regularization (Merity et al. (2018a;b)). We apply our regularization technique to these models, the so-called AWD-LSTM. We largely follow their experimental procedure and incorporate their dropouts and regularizations in our experiments. The relative contributions of these existing techniques and of PDR are analyzed later in Section 6. Each token is embedded using a token embedding matrix (initialized randomly and updated during training) with embedding dimension d. We use a 3-layer LSTM, where the hidden dimension of the last layer, i.e. the output dimension, is also d. This facilitates the use of weight tying between the decoder layer and the token embedding matrix. The PDR regularization term is computed according to Eqs. (4) and (5). We call our model AWD-LSTM+PDR.
For completeness, we briefly mention the set of dropouts and regularizations reused from AWD-LSTM in our experiments. They are the following.
1. Embedding dropout - Variational or locked dropout applied to the token embedding matrix.
2. Word dropout - Dropout applied to entire tokens.
3. LSTM layer dropout - Dropout between layers of the LSTM.
4. LSTM weight dropout - Dropout applied to the hidden-to-hidden connections in the LSTM.
5. LSTM output dropout - Dropout applied to the final output of the LSTM.
6. Alpha/beta regularization - Activation and temporal activation regularization applied to the LSTM states.
7. Weight decay - L2 regularization on the parameters of the model.
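Two of the items above can be made concrete in a few lines. The NumPy sketch below (our own naming, illustrative only) shows variational/locked dropout, which draws one mask per sequence and reuses it at every time step, and DropConnect-style weight dropout on the recurrent weights:

```python
import numpy as np

def locked_dropout(x, p, rng):
    # Variational ("locked") dropout: one mask per sequence, shared
    # across all T time steps; x has shape (T, B, d)
    mask = (rng.random((1,) + x.shape[1:]) > p) / (1.0 - p)
    return x * mask

def weight_drop(W_hh, p, rng):
    # DropConnect on the hidden-to-hidden weight matrix (item 4 above),
    # applied once per forward pass rather than per time step
    return W_hh * ((rng.random(W_hh.shape) > p) / (1.0 - p))
```

Both functions use inverted dropout scaling (division by 1 - p) so that no rescaling is needed at evaluation time, when they are simply not applied.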
Note that these regularizations are applied to the input, hidden state and output of the LSTM and can be used in any sequence modeling task. Our proposed regularization, PDR, is specific to language modeling and acts on the predicted next-token distribution.
In addition to the 7 hyperparameters associated with the techniques above, PDR has an associated weighting coefficient \lambda_{PDR}. For our experiments, we set \lambda_{PDR} = 0.001, determined by a coarse search on the PTB and WT2 validation sets. For the remaining hyperparameters, we perform a light search in the vicinity of those reported for AWD-LSTM.
For both PTB and WT2, we use a 3-layered LSTM with 1150, 1150 and 400 hidden dimensions. The token (in this case word) embedding dimension is set to d = 400. For training the models, we follow the same procedure as AWD-LSTM, i.e. a combination of SGD and NT-ASGD, with an initial learning rate of 30. The model is first trained for 750 epochs followed by finetuning. The batch size is set to 40 for PTB and 80 for WT2.
For the PTBC dataset, we use a 3-layered LSTM with 1000, 1000 and 200 hidden dimensions. The character embedding dimension is d = 200 and we use the Adam optimizer (Kingma & Ba (2015)) with a learning rate of 2.7e-3, which is decreased by a factor of 10 at epochs 300 and 400, out of a total of 500 epochs. For the Enwik8 dataset, we use an LSTM with 1850, 1850 and 400 hidden dimensions. The characters are embedded in d = 400 dimensions and we use the Adam optimizer with a learning rate of 1e-3, which is decreased by a factor of 10 at epochs 25 and 35, out of a total of 50 epochs. For each of the datasets, AWD-LSTM+PDR has less than 1% more parameters than the corresponding AWD-LSTM model, and only during training. The maximum time overhead due to the additional computation is less than 3%.

RESULTS ON WORD LEVEL LANGUAGE MODELING
Table 2 shows the results on PTB. Our method (AWD-LSTM+PDR) achieves a perplexity of 55.6 on the PTB test set, which improves on the current state-of-the-art (AWD-LSTM) by an absolute 1.7 points. The advantage of better information retention due to PDR carries over when combined with a continuous cache pointer (Grave et al. (2016)), where our method yields an absolute improvement of 1.2 over AWD-LSTM. Notably, when coupled with dynamic evaluation (Krause et al. (2018)), the perplexity is decreased further to 49.3. To the best of our knowledge, ours is the first method to achieve a sub-50 perplexity on the PTB test set without the use of multiple softmaxes (Yang et al. (2017)). Note that, for both the cache pointer and dynamic evaluation, we coarsely tune the associated hyperparameters on the validation set. PTB is a restrictive dataset with a vocabulary of 10K words, and achieving good perplexity requires considerable regularization. The fact that PDR can improve upon existing heavily regularized models is empirical evidence of its distinctive nature and its effectiveness in improving language models.

Table 3 shows the perplexities achieved by AWD-LSTM+PDR on WT2.
This dataset is considerably more complex than PTB with a vocabulary of more than 33K words. Our method improves over the current state-of-the-art by a significant 2.3 points, achieving a perplexity of 63.5. The gains are maintained with the use of cache pointer (2.4) and with the use of dynamic evaluation (1.7).

RESULTS ON CHARACTER LEVEL LANGUAGE MODELING
The results on PTBC are shown in Table 4. Our method achieves a bits-per-character (BPC) performance of 1.169 on the PTBC test set, improving on the current state-of-the-art by 0.006 or 0.5%. It is notable that even with this highly processed dataset and a small vocabulary of only 51 tokens, our method improves on already highly regularized models. Finally, we present results on Enwik8 in Table 5.

ANALYSIS OF PDR
In this section, we analyze PDR by probing its performance in several ways and comparing it with current state-of-the-art models that do not use PDR.
To verify that PDR can indeed act as a form of regularization, we perform the following experiment. We take the models for PTB and WT2, turn off all dropouts and regularizations, and compare their performance with that of a model with only PDR turned on. The results, shown in Table 6, validate the premise of PDR. The model with only PDR turned on achieves validation perplexities better by 2.4 and 5.1 points on PTB and WT2, respectively, compared to the model without any regularization. Thus, biasing the LSTM by decoding the distribution of past tokens from the predicted next-token distribution can indeed act as a regularizer, leading to better generalization performance.
Next, we plot histograms of the negative log-likelihood (NLL) of the correct context tokens x_t in the past decoded vector w_t^r, computed using our best models on the PTB and WT2 validation sets, in Fig. 1(a). Indeed, for both datasets, the NLL values are significantly peaked near 0, which means that the past decoding operation is able to recover a significant amount of information about the last token in the context.
To investigate the effect of changing hyperparameters on PDR, we pick 60 sets of random hyperparameters in the vicinity of those reported by Merity et al. (2018a) and compute the validation set perplexity after training (without finetuning) on PTB, for both AWD-LSTM+PDR and AWD-LSTM. Their histograms are plotted in Fig. 1(b). The perplexities for models with PDR are distributed slightly to the left of those without PDR, and there appear to be more instances of perplexities in the higher range for models without PDR. Note that there are certainly hyperparameter settings where adding PDR does not lower the validation perplexity, as is generally the case for any regularization method.

COMPARISON WITH AWD-LSTM
To show the qualitative difference between AWD-LSTM+PDR and AWD-LSTM, we compare their respective best models. In Fig. 2(a), we plot a histogram of the entropy of the predicted next-token distribution w_{t+1} for all the tokens in the validation set of PTB. The distributions for the two models are slightly different, with some identifiable patterns. In particular, PDR tends to reduce the entropy of the predicted distribution when it is in the higher range of 8 and above, pushing it into the range of 5-8. This shows that one way PDR biases the language model is by reducing the entropy of the predicted next-token distribution. Indeed, one way to reduce the cross-entropy between x_t and w_t^r in Eq. (5) is by making w_{t+1} less spread out. This tends to benefit the language model when its predictions are correct.
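The entropies plotted in Fig. 2(a) can be computed directly from the predicted distributions w_{t+1}. A minimal sketch (our own code; if entropy is measured in nats, a uniform distribution over PTB's 10K-word vocabulary would sit near ln(10000) ≈ 9.2, so the 8-and-above range would correspond to nearly uniform predictions):

```python
import numpy as np

def entropy_nats(p, eps=1e-12):
    # Shannon entropy, in nats, of each row of p (a batch of
    # probability distributions over the vocabulary)
    return -(p * np.log(p + eps)).sum(axis=-1)
```

Applied to every w_{t+1} on the validation set, the resulting values can be binned into the histogram of Fig. 2(a).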
We also compare the training curves for the two models on PTB in Fig. 2(b). Although the two models use slightly different hyperparameters, the regularization effect of PDR is apparent, with a lower validation perplexity but higher training perplexity. The corresponding trends for WT2, shown in Fig. 2, have similar characteristics.

ABLATION STUDIES
We perform a set of ablation experiments on the best AWD-LSTM+PDR models for PTB and WT2 to understand the relative contribution of PDR and the other regularizations used in the model. The results are shown in Table 7. In both cases, removing PDR leads to a noticeable degradation in validation set performance, albeit a smaller one than for the other forms of regularization. This is not surprising, as PDR does not influence the LSTM directly.

RELATED WORK
Our method builds on the work of using sophisticated regularization techniques to train LSTMs for language modeling. In particular, the AWD-LSTM model achieves state-of-the-art performance on the four datasets considered in this paper (Merity et al. (2018a;b)). The work of Melis et al. (2018) also achieves similar results with highly regularized LSTMs. While most of these regularization techniques are applied directly to the LSTM, ours acts on the predicted next-token distribution. The symmetry in the inputs and outputs of a language model is also exploited in weight tying (Inan et al. (2016); Press & Wolf (2017)) when the token embedding dimension and the LSTM output dimension are equal. Our method can be used with untied weights as well. Regularizing the training of an LSTM by combining the main objective function with auxiliary tasks has been successfully applied to several tasks in NLP (Radford et al. (2018); Rei (2017)); in fact, a popular choice for the auxiliary task is language modeling itself. This in turn is related to multi-task learning (Collobert & Weston (2008)).
Specialized architectures like Recurrent Highway Networks (Zilly et al. (2017)) and NAS (Zoph & Le (2016)) have been successfully used to achieve competitive performance in language modeling. The former makes the hidden-to-hidden transition function more complex, allowing for more refined information flow. Such architectures are especially important for character level language modeling, where strong results have been shown using Fast-Slow RNNs (Mujika et al. (2017)), a two-level architecture in which the slowly changing recurrent network tries to capture longer-range dependencies. The use of historical information can greatly help language models deal with long range dependencies, as shown by Merity et al. (2016) and Krause et al. (2018). Finally, the recent work of Yang et al. (2017) uses multiple Softmax functions to address the low-rank problem of the decoding layer matrices.