Revisiting Simple Neural Probabilistic Language Models

Recent progress in language modeling has been driven not only by advances in neural architectures, but also by hardware and optimization improvements. In this paper, we revisit the neural probabilistic language model (NPLM) of Bengio et al. (2003), which simply concatenates word embeddings within a fixed window and passes the result through a feed-forward network to predict the next word. When scaled up to modern hardware, this model (despite its many limitations) performs much better than expected on word-level language model benchmarks. Our analysis reveals that the NPLM achieves lower perplexity than a baseline Transformer with short input contexts but struggles to handle long-term dependencies. Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer, which results in small but consistent perplexity decreases across three word-level language modeling datasets.


Introduction
Over the past decade, state-of-the-art neural architectures for language modeling (LM) have transitioned from simple recurrent neural networks (Mikolov et al., 2011) to LSTMs (Zaremba et al., 2014) and finally to Transformers (Vaswani et al., 2017). This progress is not due solely to LM-specific advances, however, as general-purpose upgrades such as residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) have enabled scaling to huge datasets and model sizes (Kaplan et al., 2020) on powerful GPUs.
In this paper, we revisit the neural probabilistic language model (NPLM) of Bengio et al. (2003), the first (and simplest) neural architecture proposed for language modeling, through the lens of modern architecture design, hardware, and optimization. Given an input sequence of tokens, the NPLM first concatenates the previous n token embeddings and then passes the result through a feed-forward network to predict the next token. Due to its small context window and lack of parameter sharing, the NPLM has been rendered obsolete, discarded in favor of LSTMs and Transformers.

Figure 1: The NPLM of Bengio et al. (2003), which concatenates token embeddings within a fixed local window and feeds them to a stack of feed-forward layers to predict the next token. Our modified version additionally concatenates representations of the distant context, which are computed by applying a weighted average to token representations outside the local window.
To what extent are its limitations mitigated by modern design and optimization choices? To answer this question, we design an upgraded NPLM featuring increased depth and window size n that incorporates residual connections, layer normalization, and dropout. We also add global context representations to the concatenation layer, computed by applying simple aggregation functions to embeddings outside of the local context window. These modifications substantially improve the NPLM: on the WIKITEXT-103 benchmark dataset, the original NPLM of Bengio et al. (2003) reaches a validation perplexity of 216, compared to 31.7 for our implementation, and 25.0 for a Transformer baseline.
Can we improve Transformer language models by hybridizing them with NPLMs? Interestingly, we discover that our NPLM actually outperforms the Transformer when given shorter input contexts (Figure 2), although it is unable to take full advantage of longer contexts. Inspired by this result, we create two simple variants of the Transformer, one in which the first self-attention layer is replaced with the NPLM's concatenation layer, and the other in which self-attention in the first layer is constrained to a small local window. These adjustments result in small but consistent perplexity decreases compared to a baseline Transformer across three word-level language modeling datasets (the first variant obtains 24.1 validation perplexity on WIKITEXT-103). Our qualitative analysis shows that the modified Transformers are better at predicting rare tokens and named entities, especially those that have already appeared in the context.

Neural probabilistic language models
Modern neural language models (NLMs) compute the conditional probability of a token w_t given preceding (or prefix) tokens w_{<t} by first computing a dense vector representation of the prefix and then feeding it into a classifier to predict the next word. More concretely, a composition function g is applied to the sequence of token embeddings x_{<t} associated with the prefix, which results in a dense vector z = g(x_{<t}). A softmax classifier then takes z as input and produces a distribution P(w_t | w_{<t}) over the vocabulary. Transformers (Vaswani et al., 2017) are currently the most popular choice for the composition function g.
NPLM definition: First introduced by Bengio et al. (2003), the NPLM uses a simple composition function reminiscent of n-gram language modeling. It concatenates the last k prefix embeddings and passes the result through a feed-forward layer:

z = f(W[x_{t-k}; ...; x_{t-1}] + b)    (1)

where [;] denotes concatenation and f is a non-linearity (tanh in the original model). The NPLM has many intuitive limitations: (1) it ignores the global context provided by prefix tokens more than k positions away; (2) it uses a different set of parameters for each position in the prefix window; and (3) it has a relatively small number of parameters, which limits its expressivity.
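As a minimal sketch, the concatenation-plus-feed-forward composition of Equation 1 can be written in a few lines of plain Python. The function and variable names here are ours, and the weight matrix W and bias b stand in for the learned parameters:

```python
import math

def nplm_compose(prefix_embeds, W, b, k=5):
    """Concatenate the last k prefix embeddings and apply a single
    feed-forward layer with a tanh non-linearity (Equation 1)."""
    # Flatten the last k embedding vectors into one long vector.
    concat = [v for emb in prefix_embeds[-k:] for v in emb]
    # z_i = tanh(b_i + sum_j W[i][j] * concat[j])
    return [math.tanh(bi + sum(wij * cj for wij, cj in zip(row, concat)))
            for row, bi in zip(W, b)]

# Toy example: k = 2, embedding dim 2, hidden dim 2 (shapes are illustrative).
prefix = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
W = [[0.1, 0.2, 0.3, 0.4],   # one row per hidden unit,
     [0.0, 0.0, 0.0, 0.0]]   # k * embedding_dim columns per row
b = [0.0, 1.0]
z = nplm_compose(prefix, W, b, k=2)
```

In the full model, z would feed a vocabulary-sized softmax classifier; the original NPLM used k = 5.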

A modern update to the NPLM
To what extent are these limitations mitigated after scaling up the NPLM using modern advances in architecture design, hardware, and optimization? In this section, we describe a series of such modifications (summarized in Table 1).

Increased depth and dimensionality:
We pass the concatenated representation into a multi-layer network instead of a single layer, and we also substantially increase the embedding and hidden layer dimensionality to 410 and 2100 respectively. WIKITEXT-103 validation perplexity drops from 216 for the original one-layer NPLM (32M parameters) to 41.9 for a 16-layer NPLM with 148M parameters (no global prefix embeddings).
Better optimization for deep networks: To improve gradient flow across the multi-layer network, we apply residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) at each layer. We additionally apply dropout (Srivastava et al., 2014), use rectified linear units (ReLU) instead of the tanh non-linearity, and train our NPLM with the Adam optimizer (Kingma and Ba, 2015). These modifications are crucial for training our 16-layer NPLM: without residual connections, we reach a perplexity of 660, while using standard SGD instead of Adam yields a perplexity of 418.5.
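The combined effect of these upgrades can be illustrated with a hypothetical single feed-forward sub-layer that applies layer normalization, ReLU, and a residual connection (dropout, layer-norm gain/bias parameters, and the actual layer widths are omitted; all names are ours):

```python
def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def residual_ffn_block(x, weight):
    """One sub-layer: out = x + ReLU(weight @ layer_norm(x))."""
    h = layer_norm(x)
    out = [sum(w * v for w, v in zip(row, h)) for row in weight]
    out = [max(0.0, v) for v in out]               # ReLU instead of tanh
    return [xi + oi for xi, oi in zip(x, out)]     # residual connection

x = [1.0, 2.0, 3.0]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
y = residual_ffn_block(x, identity)
```

Because the residual path carries x through unchanged, gradients can flow past every layer even in a 16-layer stack.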
Increased window size: While hardware considerations limited the window size k of the original NPLM to just five tokens, modern GPUs allow us to quickly train models with much larger memory footprints. We train models up to k = 50 (Figure 2) and observe perplexity drop from 87 at k = 3 before eventually plateauing around 40 at k = 50. The plot also shows that Transformers take far better advantage of longer inputs.
Tied weights and adaptive softmax: The original NPLM computes probabilities of all words in the vocabulary. For datasets with a large vocabulary, we use adaptive softmax (Grave et al., 2017) to speed up training and decrease the memory footprint. We also tie token embeddings with weights in the softmax layer (Press and Wolf, 2017) to further reduce model size. Without these modifications, our 16-layer NPLM does not fit in GPU memory, precluding training.

Global context representation: Prior research demonstrates the effectiveness of representing large chunks of text using averaged token embeddings (Iyyer et al., 2015; Wieting et al., 2016). We leverage this work by applying a simple learned kernel (i.e., a 1-D convolution) to the prefix embeddings (beyond just the previous k) and including the resulting vector as an extra embedding in the concatenation layer. We also experiment with replacing the learned kernel with a uniform average. Adding these simple global embeddings improves the NPLM considerably: our 16-layer model's perplexity drops from 41.9 to 31.7 with the kernel-derived embedding, while the uniform average achieves a perplexity of 37.7.
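A sketch of the global embedding, under our own simplified parameterization: a uniform average by default, or a weight vector standing in for the learned 1-D convolution kernel (the truncation-and-renormalization step is our assumption, not the paper's):

```python
def global_context_embedding(prefix_embeds, k, kernel=None):
    """Aggregate the embeddings *outside* the local window of the last k
    tokens into one extra embedding for the concatenation layer.
    kernel=None gives a uniform average; otherwise the (learned) kernel
    weights are truncated to the distant-context length and renormalized."""
    distant = prefix_embeds[:-k]
    if not distant:
        return None  # no distant context yet
    if kernel is None:
        weights = [1.0 / len(distant)] * len(distant)
    else:
        total = sum(kernel[:len(distant)])
        weights = [w / total for w in kernel[:len(distant)]]
    dim = len(distant[0])
    return [sum(w * emb[d] for w, emb in zip(weights, distant))
            for d in range(dim)]

# The last k=2 embeddings are local; the first two are averaged into one vector.
prefix = [[2.0, 0.0], [0.0, 2.0], [9.0, 9.0], [9.0, 9.0]]
g = global_context_embedding(prefix, k=2)
```

The resulting vector g is simply appended to the local concatenation before the feed-forward stack.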

Using NPLMs to improve Transformers
While our upgraded NPLM achieves a massive perplexity reduction compared to the original implementation, it is still ∼6 perplexity points short of the baseline Transformer LM. Are there any takeaways from our results that can be used to improve Transformer LMs? In this section, we begin with an analysis experiment on WIKITEXT-103 that shows NPLMs outperform Transformers when given shorter prefixes. Inspired by this result, we propose two variants of a Transformer LM that integrate elements of the NPLM, and discover that both of them decrease perplexity across three word-level language modeling datasets (Table 2).

NPLMs are better with short contexts
Since NPLMs only concatenate a small, fixed number of prefix tokens together, they are obviously unsuited to handle global context. While our upgraded variant addresses this issue to some extent by including aggregated global prefix embeddings in the concatenation layer, the perplexity gap between NPLMs and Transformer LMs remains large. Here, we attempt to understand how much of this difference can be attributed to the Transformer's ability to better model global context. In particular, we train different NPLM and Transformer LMs by truncating the input prefix length to between 3 and 50 tokens. Our NPLM models do not have any global context embeddings in these experiments, and both the NPLM and Transformer models are 16 layers with ∼148M parameters each. Figure 2 shows that NPLMs are actually better than Transformers when the input sequences are short (i.e., fewer than twenty prefix tokens), but as the prefixes get longer, NPLM perplexity plateaus while Transformer perplexity continually decreases. The plot shows that while multi-headed self-attention is effective for longer sequences, it may not be best for modeling shorter contexts.

Transformer variants
Inspired by these results, we investigate hybrid NPLM and Transformer models to better model both short- and long-range contexts. In particular, we create two variants of the Transformer by modifying only its first layer (L0), while keeping every other layer the same. In the first modification, Transformer-N, we simply replace the first self-attention block in L0 with the NPLM's local concatenation layer (Equation 1), without including any global embeddings. Wondering if the behavior of the concatenation layer can be replicated by self-attention, we also design Transformer-C, in which the self-attention window in L0 is constrained to the previous 5 tokens. This constraint is similar to the windowed attention approaches previously applied at all layers in prior Transformer variants (Beltagy et al., 2020; Roy et al., 2020).
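The Transformer-C constraint amounts to a windowed causal attention mask. A toy sketch (here with a window of 2 rather than the 5 used in L0; whether a token also attends to itself is our assumption):

```python
def local_attention_mask(seq_len, window):
    """Boolean causal mask: position i may attend to position j only if
    j is i itself or one of the previous `window` positions."""
    return [[0 <= i - j <= window for j in range(seq_len)]
            for i in range(seq_len)]

mask = local_attention_mask(6, window=2)
```

In practice such a mask would be applied by setting disallowed attention logits to negative infinity before the softmax.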

Models
We train 16-layer (16L) models on the larger WIKITEXT-103 and LAMBADA datasets, 12L models for ENWIK8, and 6L models for the small WIKITEXT-2 dataset. For each dataset, we scale embedding and hidden dimensionality to ensure that all models have roughly the same number of parameters. After tuning hyperparameters on the validation data, we set the number of locally concatenated tokens to 15 and the number of 1-D convolution kernels to 5.
Training details: Our NPLM is trained with dropout probability p = 0.2, while the other models use p = 0.1 on all datasets except for WIKITEXT-2, for which they use p = 0.3. For all models, we use the Adam optimizer with β_1 = 0.9 and β_2 = 0.999, and training is conducted on 1080Ti GPUs. During evaluation, we follow the methodology of Khandelwal et al. (2020) by providing extra prior context for the scored tokens: for instance, in a block of 512 tokens, only the last 128 tokens are scored, with the first 384 tokens serving as context. Detailed architecture, training, and evaluation configurations are included in Appendix B.

Table 2 shows that Transformer-N improves over the baseline Transformer across all three word-level language modeling benchmarks, with the biggest perplexity drop coming on the small WIKITEXT-2 dataset, although character-level perplexity on ENWIK8 is unchanged. (The relatively high WIKITEXT-2 perplexities are likely because we did not apply the separate regularization that Merity et al. (2017) show is useful for such a small dataset.) Transformer-C also outperforms the baseline Transformer, but by smaller margins than Transformer-N; we do not observe improvements when using local attention at all layers.
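The block-based evaluation scheme borrowed from Khandelwal et al. (2020) can be sketched as follows; the function name is ours, and handling of the very first tokens of the corpus is simplified away:

```python
def eval_blocks(tokens, block_size=512, n_scored=128):
    """Split a token stream into overlapping evaluation blocks: each block
    holds block_size tokens, only its last n_scored tokens are scored,
    and consecutive blocks therefore advance by n_scored tokens."""
    context = block_size - n_scored
    blocks = []
    for start in range(0, len(tokens) - block_size + 1, n_scored):
        block = tokens[start:start + block_size]
        blocks.append((block[:context], block[context:]))
    return blocks

blocks = eval_blocks(list(range(1024)))
```

Every scored token thus sees at least 384 tokens of prior context, at the cost of running the model ~4x more often than disjoint blocks would require.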

Results and analysis
Narrower window size in L0 is better: We examine WIKITEXT-103 validation perplexity as a function of Transformer-C window size. Figure 3 shows drops of ∼1 perplexity point with window sizes of 2-4, which disappear as the window size is increased. This experiment supports the importance of focusing on local context at lower layers.
Hybrid models improve at predicting entities and rare words: To obtain a more fine-grained understanding of our models, we turn to the long-distance dependency prediction task in LAMBADA (Paperno et al., 2016), a manually-annotated subset of the full dataset in which correctly predicting a token is possible only when longer contexts are provided. Table 3 shows that our upgraded NPLM achieves less than 1% accuracy (argmax prediction) on the test set but 30% on a control set that does not test long-term dependencies. As the baseline Transformer reaches over 30% accuracy on the test set, this result shows that the convolutional kernels in our modernized NPLM are ineffective at modeling long-range context.

Table 3: NPLM and Transformer variants on LAMBADA target word accuracy (%). Variants perform better on context-frequent (CF) tokens that appear at least twice in previous context, low-frequency (LF) tokens with frequency < 1500, and named entities (Ent).
On the other hand, both Transformer-N and Transformer-C outperform the baseline Transformer (Table 3) by over 1.5% on the test set. To better understand these improvements, we perform a fine-grained analysis of the tokens for which these models improve over the Transformer. This analysis reveals that the gains stem mainly from three types of target tokens: (1) context-frequent (CF) tokens that appear more than twice in the prefix; (2) low-frequency (LF) tokens with frequency below 1500; and (3) named entity tokens (Ent) detected by the spaCy (Honnibal et al., 2020) NER tagger. The three right-most columns of Table 3 show that both Transformer variants are more accurate at predicting these tokens, which demonstrates the benefits of enforcing local focus at the first layer.
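The first two analysis categories can be reproduced with a simple labeling function; the thresholds mirror the text above, the names are our own, and the entity category (detected with spaCy in our experiments) is omitted here:

```python
def analysis_categories(target, prefix_tokens, corpus_freq,
                        cf_min=2, lf_cutoff=1500):
    """Label a target token: CF if it already appears at least cf_min
    times in the prefix, LF if its training-corpus frequency is below
    lf_cutoff. (A token can carry both labels, or neither.)"""
    cats = set()
    if prefix_tokens.count(target) >= cf_min:
        cats.add("CF")
    if corpus_freq.get(target, 0) < lf_cutoff:
        cats.add("LF")
    return cats

# A rare surname repeated in the prefix is both context-frequent and
# low-frequency; "Kessler" and its count here are illustrative.
cats = analysis_categories("Kessler",
                           ["Mr.", "Kessler", "met", "Ms.", "Kessler"],
                           {"Kessler": 40})
```

Running the function over all LAMBADA targets yields the per-category accuracies reported in Table 3.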

Related work
The NPLM in this paper is based entirely on the original formulation from Bengio et al. (2003). The variants in our analysis are based on the Transformer model (Vaswani et al., 2017) and Transformer LMs (Baevski and Auli, 2019; Dehghani et al., 2019; Dai et al., 2019; Sukhbaatar et al., 2019; Khandelwal et al., 2020; Wang et al., 2019; Press et al., 2020a; Mandava et al., 2020; Press et al., 2020b). The constrained local attention in Transformer-C is adopted at all layers of models such as Longformer (Beltagy et al., 2020) and Big Bird (Zaheer et al., 2020) due to its sparsity. Our work conceptually resembles that of Chiu and Rush (2020), who modernize HMM language models, as well as simple RNN-based language models (Merity et al., 2018). Our linguistic analysis is inspired by experiments from Khandelwal et al. (2018).

Conclusion
We discover that general-purpose advances in neural architecture design, hardware, and optimization significantly improve the NPLM, a classic language model. An analysis of our upgraded NPLM inspires us to hybridize it with a modern Transformer LM and obtain perplexity decreases across three word-level LM datasets.

Ethics statement
Misuse of language models: Our research involves training large language models on publicly available benchmark datasets. These models share the same risks faced by many pretrained language models, such as being used maliciously to generate unfaithful, biased, or offensive output.
Energy costs: We train our models and variants on 4 GeForce GTX 1080 Ti GPUs for all datasets except WIKITEXT-2, for which we use only one GPU. The Transformer and its variants take longer to train (40h, 102h, and 108h on WIKITEXT-103, LAMBADA, and ENWIK8 respectively). Our modernized NPLM does not have an attention module and therefore trains faster (32h, 45h, and 88h on the same datasets). The energy costs of training and tuning these models, as well as doing exploratory experiments in the initial stages of the project, cannot be ignored. That said, compared to Transformer models, the modernized NPLM has significantly reduced training time, and hence carbon costs. We hope our work contains useful insights for future research that aims to develop simpler and more efficient language models.