Improving Neural Language Models with Weight Norm Initialization and Regularization

Embedding and projection matrices are commonly used in neural language models (NLMs) as well as in other sequence processing networks that operate on large vocabularies. We examine such matrices in fine-tuned language models and observe that an NLM learns word vectors whose norms are related to the word frequencies. We show that by initializing the weight norms with scaled log word counts, together with other techniques, lower perplexities can be obtained in early epochs of training. We also introduce a weight norm regularization loss term, whose hyperparameters are tuned via a grid search. With this method, we are able to significantly improve perplexities on two word-level language modeling tasks (without dynamic evaluation): from 54.44 to 53.16 on Penn Treebank (PTB) and from 61.45 to 60.13 on WikiText-2 (WT2).


Introduction
A language model (LM) measures how likely a certain sequence of words is for a given language. It does so by calculating the probability of occurrence of that sequence, which can be learned from monolingual text data. Many models in machine translation and automatic speech recognition benefit from the use of an LM (Corazza et al., 1995; Peter et al., 2017).
While count-based LMs (Katz, 1987; Kneser and Ney, 1995) provided the best results in the past, substantial improvements were achieved with the introduction of neural networks in the field of language modeling (Bengio et al., 2003). Different types of architectures such as feedforward neural networks (Schwenk, 2007) and recurrent neural networks (Mikolov et al., 2010) have since been used for language modeling. Currently, variants of long short-term memory (LSTM) networks provide the best results.
In natural language processing, words are typically represented by high-dimensional one-hot vectors. To reduce dimensionality and to be able to learn relationships between words, they are mapped into a lower-dimensional, continuous embedding space. Mathematically, this is done by multiplying the one-hot vector with the embedding matrix. Similarly, to receive a probability distribution over the vocabulary, a mapping from an embedding space is performed by a projection matrix followed by a softmax operation. These two matrices can be tied together in order to reduce the number of parameters and improve the results of NLMs (Inan et al., 2017;Press and Wolf, 2017).
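The lookup-by-multiplication view described above can be illustrated in a few lines of numpy; the vocabulary size, embedding dimension, and matrix values are arbitrary toy choices:

```python
import numpy as np

# Toy vocabulary of V = 5 words, embedding dimension d = 3.
V, d = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))  # embedding matrix, one row per word

# One-hot vector for the word with index 2.
x = np.zeros(V)
x[2] = 1.0

# Multiplying the one-hot vector with E simply selects that word's row.
embedded = x @ E
assert np.allclose(embedded, E[2])

# With tied weights, the projection matrix is this same matrix E:
# logits over the vocabulary are E @ h for a hidden state h,
# followed by a softmax to obtain a probability distribution.
h = rng.normal(size=d)
logits = E @ h  # shape (V,)
```

Because the multiplication reduces to a row lookup, practical implementations use an indexed embedding table rather than an explicit matrix product, while the tied projection still uses the full matrix.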
Since the row vectors in the embedding and projection matrices are effectively word vectors in a continuous space, we investigate such weight vectors in well-trained and fine-tuned NLMs. We observe that the learned word vector generally has a greater norm for a frequent word than for an infrequent word. We then specifically examine the weight vector norm distribution and design initialization and normalization strategies to improve NLMs.
Our contribution is twofold:
• We identify that word vectors learned by NLMs have a weight norm distribution that resembles the logarithm of the word counts. We correspondingly develop a weight initialization strategy to aid NLM training.
• We design a weight norm regularization loss term that increases the generalization ability of the model. Applying this loss term, we achieve state-of-the-art results on the Penn Treebank (PTB) and WikiText-2 (WT2) language modeling tasks.
Related Work

Melis et al. (2018) investigated different NLM architectures and regularization methods with the use of a black-box hyperparameter tuner. In particular, the LSTM architecture was compared to two more recent recurrent approaches, namely recurrent highway networks (Zilly et al., 2017) and neural architecture search (Zoph and Le, 2017). They found that the standard LSTM architecture outperforms other models if properly regularized. Merity et al. (2017a) used various regularization methods such as activation regularization (Merity et al., 2017b) in an LSTM model. They also introduced a variant of the averaged stochastic gradient method, where the averaging trigger is not tuned by the user but relies on a non-monotonic condition instead. With these and further regularization and optimization methods, improved results on PTB and WT2 were achieved.
To further improve this network architecture, Yang et al. (2018) introduced the mixture of softmaxes (MoS) model, claiming that the calculation of the output probabilities with a single softmax layer is a bottleneck. In their approach, several output probabilities are calculated and then combined via a weighted sum. The LSTM-MoS architecture provides state-of-the-art results on PTB and WT2 at the time of writing and is used as the baseline model for comparisons in this work.
Other works proposed to tie the embedding and projection matrices. Press and Wolf (2017) investigated the effects of weight tying, analyzed update rules after tying and showed that tied matrices evolve in a similar way as the projection matrix. Inan et al. (2017) were motivated by the fact that with a classification setup over the vocabulary, inter-word information is not utilized to its full potential. They also provided theoretical justification on why it is appropriate to tie the above-mentioned matrices.
Besides using the word embedding matrix, there are other approaches to represent word sequences. Zhang et al. (2015) proposed a new embedding method called fixed-sized ordinally-forgetting encoding (FOFE), which allows them to encode variable-length sentences into fixed-length vectors almost uniquely.
Additionally, Salimans and Kingma (2016) introduced a weight normalization reparametrization trick on weight matrices, which separates the norm and the angle of a vector. This can speed up the convergence of stochastic gradient descent and also allows for explicit scaling of gradients in the amplitude and direction. They also discussed the connections between weight normalization and batch normalization.
On top of one-hot representations of words, Irie et al. (2015) used additional information to represent word sequences. It is shown that the use of a long-context bag-of-words as an additional feature for language modeling can narrow the gap between feedforward NLMs and recurrent NLMs.

Neural Language Modeling
In a NLM, the probability of a word sequence x_1^T is factorized such that the (n-1) preceding words x_{j-n+1}^{j-1} are considered for the prediction of the next word x_j:

p(x_1^T) = \prod_{j=1}^{T} p(x_j | x_{j-n+1}^{j-1}).    (1)

This is typically done by using a recurrent neural network, e.g. a stack of LSTM layers, to encode the input sequence as

h_{t-1} = \mathrm{LSTM}(E^T [x_{t-n+1}, x_{t-n+2}, ..., x_{t-1}]),    (2)

where E^T is the transposed embedding matrix, [x_{t-n+1}, x_{t-n+2}, ..., x_{t-1}] are the one-hot encoded preceding words and \mathrm{LSTM}(\cdot) returns the last hidden state of the last LSTM layer. The probability distribution over the next word x_t is then calculated as

p(x_t = k | x_{t-n+1}^{t-1}) = \frac{\exp(W_k h_{t-1})}{\sum_{k'=1}^{V} \exp(W_{k'} h_{t-1})},    (3)

with V being the vocabulary size, k = 1, 2, ..., V, and W_k being the k-th row vector in the projection matrix W. For training the neural network, the cross-entropy error criterion, which is equivalent to the maximum likelihood criterion, is used. For the i-th sequence of words x_1^{t_i}, the cross-entropy loss L_i is defined as

L_i = -\log p(x_{t_i} = y_i | x_1^{t_i - 1}),    (4)

with y_i being the true label of x_{t_i}. The total loss is then calculated as

L = \frac{1}{N} \sum_{i=1}^{N} L_i,    (5)

where N is the total number of sequences. A language model is normally scored by perplexity (ppl). For a given test corpus x_1^T = x_1 x_2 ... x_T, the ppl is calculated as

\mathrm{ppl} = \left[ \prod_{t=1}^{T} p(x_t | x_1^{t-1}) \right]^{-1/T},    (6)

which measures how likely a given sentence is according to the prediction of the model. In the above formulation, we have an embedding matrix E and a projection matrix W. When the two matrices are tied and one-hot vectors are used to represent words, the rows of these matrices are the word vectors of the corresponding words. In particular, we focus on the norms of the row vectors and study their relationship with word counts and how to regularize them.
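The quantities above can be made concrete in a small numpy sketch. The hidden state here is a random stand-in for the LSTM output, and the per-token probabilities used for the perplexity are invented toy values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

V, d = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(V, d))  # projection matrix (tied with the embedding)
h = rng.normal(size=d)       # stands in for the last LSTM hidden state

# Distribution over the next word (cf. Equation 3).
p = softmax(W @ h)

# Cross-entropy loss for a true label y (cf. Equation 4).
y = 2
L_i = -np.log(p[y])

# Perplexity is the inverse geometric mean of the per-token
# probabilities (cf. Equation 6); toy probabilities for 3 tokens:
probs = np.array([0.25, 0.5, 0.125])
ppl = np.prod(probs) ** (-1.0 / len(probs))  # close to 4.0 for these values
```

A uniform model over a vocabulary of size V would assign probability 1/V to every token and thus reach a perplexity of exactly V, which is the usual sanity check for such an implementation.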

Weight Norm Initialization
We first train models on PTB and WT2 as described in (Yang et al., 2018) and plot the norms of learned weight vectors of the embedding matrix in Figure 1.
When the words are ranked by their counts and placed on the x-axis from frequent to infrequent, it can be seen that the word vector norms follow a downward trend as well. Log unigram counts are also plotted for comparison. As can be seen, the norm distribution follows a similar trend as the log counts.

It is important to note that the logit for word x_k and context h_t is calculated as W_k h_t (see Equation 3), which can be rewritten as

W_k h_t = ||W_k||_2 \, ||h_t||_2 \cos\theta,    (7)

where \theta denotes the angle between W_k and h_t. Therefore, one intuition from the aforementioned observation is that, for a frequent word, the network tends to learn a weight vector W_k with a greater norm to maximize likelihood. This motivates our approach to initialize the weight norms with scaled log counts rather than uniformly random values in a specific range.

Because we wish to initialize the weight norms explicitly with the scaled logarithm of the word counts, it is helpful to look at a weight vector's magnitude and direction separately. For this purpose, we use a reparameterization technique on the weight vectors as described in (Salimans and Kingma, 2016):

W_k = \frac{g_k}{||v_k||_2} v_k,    (8)

where k = 1, 2, ..., V, g_k = ||W_k||_2, and v_k is a vector proportional to W_k. Reparameterizing the weight vectors makes it easy to implement the weight norm initialization as

g_k = \sigma \log c_k,    (9)

where c_k denotes the unigram word count of word k and \sigma is a scalar applied to the log counts. We sample each component of v_k from a continuous uniform distribution in [-r, r], where r is a hyperparameter specifying the initialization range. With this, no constraint on the weight vector direction is imposed during initialization.

Additionally, we adopt an adaptive gradient strategy which regularizes the gradients in g_k:

\widetilde{\frac{\partial L}{\partial g_k}} = \begin{cases} \left(1 - (1 - \gamma)\frac{t}{\tau}\right) \frac{\partial L}{\partial g_k}, & t \leq \tau \\ \gamma \, \frac{\partial L}{\partial g_k}, & t > \tau \end{cases}    (10)

That is, when epoch t is no greater than a specified epoch \tau, the regularized gradient in g_k linearly decays to \gamma (\gamma \leq 1) times the unregularized gradient \partial L / \partial g_k; otherwise, we directly use the discounted gradient.
In analogy to learning rate decay, this adaptive gradient strategy anneals the word vector norm updates in each step. The intuition behind such a strategy is that after a certain number of epochs, the weight norms should no longer change drastically from the initialized scaled log counts.
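A minimal numpy sketch of the initialization and the annealing schedule, using invented toy word counts; the exact linear form of the schedule (decaying from 1 at the start to γ at epoch τ, constant afterwards) is a reconstruction from the description above:

```python
import numpy as np

def init_weight_norm(counts, d, sigma=0.5, r=0.1, seed=0):
    """Initialize reparameterized word vectors W_k = g_k * v_k / ||v_k||_2,
    with norms g_k set to sigma * log(count_k) and directions v_k drawn
    uniformly from [-r, r]^d (no constraint on the direction)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    g = sigma * np.log(counts)                       # scaled log counts
    v = rng.uniform(-r, r, size=(len(counts), d))    # direction vectors
    W = g[:, None] * v / np.linalg.norm(v, axis=1, keepdims=True)
    return g, v, W

def norm_grad_scale(epoch, tau=100, gamma=0.1):
    """Factor applied to dL/dg_k: decays linearly from 1 at epoch 0
    to gamma at epoch tau, then stays at gamma."""
    if epoch <= tau:
        return 1.0 - (1.0 - gamma) * epoch / tau
    return gamma

# Toy unigram counts, ordered from frequent to rare.
counts = [10000, 1000, 100, 10]
g, v, W = init_weight_norm(counts, d=8)
norms = np.linalg.norm(W, axis=1)  # equals g, i.e. 0.5 * log(counts)
```

By construction, frequent words start with larger row norms than rare ones, matching the distribution observed in the trained models.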

Weight Norm Regularization
Weight regularization (WR) is a well-established method to combat overfitting in neural networks, which is especially important on smaller datasets (Krogh and Hertz, 1992). The idea is to push weights in the network towards zero where gradients are not significant. Typically, WR is implemented by adding an extra term to the loss function L_0 which penalizes the norm of all weights in the network. For example, L_2-regularization is implemented as

L = L_0 + \lambda \sum_{w} w^2,    (11)

with the sum going over all weights w in the network and \lambda being the regularization strength. However, this method is not perfect, as it affects every weight in the network equally and may lead to hidden units' weights getting stuck near zero. In this work, we add a constraint specifically on the embedding and projection matrices, whose weights are shared. Since the row vectors in both matrices are word vectors, it seems appropriate to put constraints explicitly on their norms instead of on each individual weight parameter in the matrices.
We propose to add a regularization term to the standard loss function L_0 in the form of

L = L_0 + \rho \sum_{j=1}^{V} (||W_j||_2 - \nu)^2,    (12)

where \nu, \rho \geq 0 are two scalars and W_j is the j-th row vector of the projection matrix W. The L_2-norms of the row vectors are pushed towards \nu, while \rho is the regularization strength. This punishes the row vectors for adopting norms other than \nu, in the hope of reducing the effect of overfitting on the training data. The choice of a soft regularization loss term instead of hard-fixing the weight norms in the forward pass is motivated by the weight norm distribution shown in Figure 1. It can be seen that NLMs tend to learn non-equal weight norms for words with different counts. Therefore, hard-fixing the weight norms may limit the network's ability to learn.
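The penalty term can be sketched in a few lines of numpy; the matrix here is a random toy stand-in for the tied embedding/projection matrix:

```python
import numpy as np

def weight_norm_penalty(W, nu=2.0, rho=1e-3):
    """Soft weight norm regularizer: rho * sum_j (||W_j||_2 - nu)^2
    over the rows of W, pushing each row norm towards nu."""
    norms = np.linalg.norm(W, axis=1)
    return rho * np.sum((norms - nu) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))  # toy tied embedding/projection matrix
penalty = weight_norm_penalty(W)
# The total training loss is then L = L0 + penalty.
```

Note that the penalty vanishes exactly when every row already has norm ν, so rows are pulled towards a common norm but never hard-fixed to it.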

Experiment Setup
The experiments are conducted on two popular language modeling datasets. The number of tokens and the vocabulary size of each dataset are summarized in Table 1. The smaller one is the PTB corpus with preprocessing from Mikolov et al. (2010), which has a comparatively small vocabulary size of 10k. With a smaller number of sentences, this dataset is a good choice for performing hyperparameter optimization. The second corpus, WT2, which was introduced by Merity et al. (2016), has over three times the vocabulary size of PTB.
We use the network structure introduced by Yang et al. (2018) with the same hyperparameter values to ensure comparability. Several regularization techniques are used in this setup, such as dropout and weight decay. Furthermore, the embedding and projection matrices are tied by default. For optimization, we adopt the same strategy as described in (Merity et al., 2017a): a conservative non-monotonic criterion is used to switch from stochastic gradient descent (SGD) to averaged stochastic gradient descent (ASGD) (Polyak and Juditsky, 1992). For more details of the network structure, refer to (Yang et al., 2018).

Weight Norm Initialization
We tune the hyperparameter σ and use a value of σ = 0.5 to scale the logarithm of the word counts. The initialization range r is set to 0.1 for both the reparametrized direction vectors and the baseline word vectors. Empirically, we set γ = 0.1 and τ = 100 for the adaptive gradient method. We train models on the PTB and WikiText-2 corpora, with the resulting test perplexities shown in Table 3 and Table 4 respectively. Perplexities on both PTB and WT2 in early epochs, as well as the relative perplexity improvements over the baseline models, are summarized in Table 2.
First, we notice a significant improvement after the first epoch of training using weight norm initialization: about 10% perplexity reduction is achieved on both datasets. This could be beneficial when one wants to train on large datasets and/or can only train for a limited number of epochs. Second, the perplexity improvements decay to around 1% after 40 epochs. This is in agreement with our expectation, because apart from the reduced gradient in g_k, a weight norm initialized model is not fundamentally different from the baseline model, and no major difference should be seen if we train for long enough. It is important to note that with only weight norm initialization, both models eventually converge to perplexities that are slightly worse than the baseline. We also notice that the epochs after which the optimizer is switched from SGD to ASGD differ between weight norm initialized models and baseline models. (In Table 3 and Table 4, † indicates the use of dynamic evaluation.)

Weight Norm Regularization
In order to tune the hyperparameters \rho and \nu introduced in Section 5, we perform a grid search on the PTB dataset, the results of which are shown in Figure 2. If the norm constraint \nu becomes too large, perplexity worsens significantly, as seen in the case of \nu = 64. A model with a \nu-value of 2 provides the best result in most cases. We hypothesize that a value of \nu that is too small results in the logit being close to zero, as shown in Equation 7. For the regularization strength \rho, we find that \rho = 10^{-3} gives the best result on the PTB test data. Larger or smaller values can hurt the performance of the system, depending also on the value of \nu. It should be noted that the optimized value of \rho is significantly larger than the scaling factor s_{wd} of the weight decay term, which was optimized to be 1.2 × 10^{-6} by Merity et al. (2017a).
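Such a grid search is straightforward to organize in code. The sketch below uses a cheap synthetic objective (`toy_ppl`, an invented stand-in for "train a model with (ρ, ν) and return its validation perplexity"); the grid values loosely follow the ranges discussed above:

```python
import itertools
import math

def grid_search(rhos, nus, evaluate):
    """Evaluate every (rho, nu) combination and return the best
    (perplexity, rho, nu) triple found."""
    best = None
    for rho, nu in itertools.product(rhos, nus):
        ppl = evaluate(rho, nu)
        if best is None or ppl < best[0]:
            best = (ppl, rho, nu)
    return best

# Hypothetical placeholder objective; its minimum is placed at
# rho = 1e-3, nu = 2 purely for illustration.
def toy_ppl(rho, nu):
    return 55.0 + (math.log10(rho) + 3) ** 2 + (math.log2(nu) - 1) ** 2

rhos = [1e-4, 1e-3, 1e-2]
nus = [1, 2, 4, 8, 16, 32, 64]
best_ppl, best_rho, best_nu = grid_search(rhos, nus, toy_ppl)
```

In practice each `evaluate` call is a full training run, so the grid is kept coarse and the search is performed on the smaller PTB corpus.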
The resulting weight norm distributions of the projection matrices' row vectors are shown in Figure 3a and Figure 3b for models trained on PTB and WT2 respectively. Our effort of pushing the norms towards a value of ν = 2.0 resulted in a noticeably smaller average norm, as well as an overall narrower distribution.
With the tuned parameter values \rho = 10^{-3} and \nu = 2.0, we improve the previous state-of-the-art result by 1.28 ppl on PTB and by 1.32 ppl on WT2 (without considering dynamic evaluation (Krause et al., 2018); see Table 3 and Table 4). This is achieved without increasing the number of trainable parameters in the network or slowing down the training process.

Conclusion
The word embedding matrix and the output projection matrix are important components of LSTM-based LMs. They are also widely used in other NLP models where one-hot vectors of words need to be mapped into a lower-dimensional space. Given the one-hot nature of word representations, row vectors in such matrices are then the corresponding word vectors. We specifically study the norms of these learned word vectors, the distribution of the norms, and their relationship with word counts. We show that with a simple initialization strategy, together with a reparametrization technique, it is possible to obtain significantly lower perplexity in early epochs of training. By using a weight norm regularization loss term, we obtain significant improvements on standard language modeling tasks: a 2.4% ppl reduction on PTB and 2.1% on WT2.
We propose three directions to investigate further. First, in this work we use scaled logarithm of word counts to initialize the weight norms. It is a logical next step to use smoothing techniques on the word counts and study the effects of such initializations. Second, we currently apply the same norm constraint on different words. Altering the loss function and regularizing the weight norms to word counts (and smoothed word counts) is worth examining as well. Finally, our focus so far is on weight norms. It is a more exciting and challenging task to study the pairwise inner products, and single out the effects of angular differences.
We also plan to expand our regularization and initialization techniques to the field of neural machine translation. Embedding and projection matrices are also present in neural machine translation networks, which could potentially benefit from our methods as well. It seems natural to use our methods on the transformer architecture introduced by Vaswani et al. (2017), in which the embedding matrices at source and target sides, plus the projection matrix, are three-way tied.

Acknowledgments
This work has received funding from the European Research Council (ERC) (under the European Union's Horizon 2020 research and innovation programme, grant agreement No 694537, project "SEQCLAS") and the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/8-1, project "CoreTec"). The GPU computing cluster was supported by DFG (Deutsche Forschungsgemeinschaft) under grant INST 222/1168-1 FUGG.
The work reflects only the authors' views and none of the funding agencies is responsible for any use that may be made of the information it contains.