Efficient Language Modeling with Automatic Relevance Determination in Recurrent Neural Networks

Reducing the number of parameters is one of the most important goals in Deep Learning. In this article we propose an adaptation of Doubly Stochastic Variational Inference for Automatic Relevance Determination (DSVI-ARD) to neural network compression. We find this method especially useful in language modeling tasks, where the large number of parameters in the input and output layers is often excessive. We also show that DSVI-ARD can be applied together with encoder-decoder weight tying, allowing us to achieve even better sparsity and performance. Our experiments demonstrate that more than 90% of the weights in both encoder and decoder layers can be removed with a minimal quality loss.


Introduction
The problem of neural network compression has recently gained interest as the number of parameters (and hence the memory footprint) of modern neural networks has increased drastically. Moreover, only a few weights prove to be relevant for prediction, while the majority are de facto redundant (Han et al., 2015).
In this paper we suggest an adaptation of a Bayesian approach called Automatic Relevance Determination (ARD) to neural network compression in language modeling tasks, where the first and the last linear layers often have an enormous size. We derive the Doubly Stochastic Variational Inference (DSVI) algorithm for non-iid (not independent and identically distributed) objects, a common case in language modeling, and use it to optimize our models.

* These two authors contributed equally; the ordering of their names was chosen arbitrarily. The work was done when the first author was an intern at the Samsung R&D Institute.
Furthermore, we extend this approach so that it can be applied together with the weight tying technique (Press and Wolf, 2017; Inan et al., 2016), i.e., using the same set of parameters for the weight matrices of both the first and the last layers, which has proved highly efficient.

Related work
Most of the work on neural network compression can be roughly divided into two categories: matrix decomposition approaches (Lu et al., 2016; Arjovsky et al., 2016; Tjandra et al., 2017; Grachev et al., 2019) and pruning techniques (Han et al., 2015; Narang et al., 2017). From this point of view, methods based on Bayesian techniques (Louizos et al., 2017; Molchanov et al., 2017) can be considered a more mathematically justified version of pruning.
We focus on pruning applied to word-level language modeling, as this task usually involves a large vocabulary and hence huge weight matrices in the first and the last layers. Chirkova et al. (2018) also consider Bayesian pruning in language modeling, though their approach is based on the Variational Dropout (VD) technique, which has been shown to be poorly theoretically justified (Hron et al., 2018), whereas ARD does not suffer from these issues while maintaining similar efficacy (Kharitonov et al., 2018).
Finally, to the best of our knowledge, combining DSVI-ARD (or other Bayesian pruning techniques) with weight tying has not been considered previously.

Language modeling with neural networks
The language modeling problem is one of the classical NLP problems and has various applications such as machine translation and text classification. It is usually formulated as probabilistic prediction of a word sequence $(w_1, \dots, w_T) = (w_i)_{i=1}^{T}$:

$$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} p(w_t \mid w_{t-T_0}, \dots, w_{t-1}). \quad (1)$$

The last expression in (1) is an approximation, as it is almost always impossible to compute the conditional probabilities exactly for a word history of arbitrary length; therefore, the computation is performed only within a context of a fixed size $T_0$.
Nowadays, approaches that perform this approximation for whole words involve various RNN architectures (Mikolov, 2012) such as LSTM or GRU (Hochreiter and Schmidhuber, 1997; Cho et al., 2014). In word-level models the input layer maps words from the vocabulary $V$ to some vector representation, and the output layer does the reverse: from a vector to a distribution over words in the vocabulary. This makes the sizes of these layers proportional to the vocabulary size $|V|$, which is in the tens of thousands in a typical use case.
Assume LSTM cells are used as recurrent units. The total number of parameters in the network can then be computed as

$$N_{\text{total}} = 4LD(2D + 1) + |V|(2D + 1), \quad (2)$$

where $L$ is the number of recurrent hidden layers and $D$ is the hidden layer size (for simplicity we let all hidden layers be of the same size); the first term counts the recurrent layers and the second the encoder and decoder. Various designs (Bengio and Senecal, 2003; Chen et al., 2016) have been proposed to reduce this number, but in word-level language modeling tasks with a significantly large vocabulary the second term in the sum (2) still makes the largest contribution. Sometimes the softmax (decoder) layer alone can occupy up to a third of the whole network's memory.
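As a rough illustration (a sketch of the two-term structure of (2), not an exact accounting for any particular implementation), the recurrent and vocabulary-dependent parameter counts can be compared for a PTB-sized model:

```python
def lstm_lm_params(V, D, L):
    """Approximate parameter count of an LSTM language model:
    recurrent layers vs. the vocabulary-dependent encoder/decoder.
    Bias conventions may differ between implementations."""
    recurrent = 4 * L * D * (2 * D + 1)  # 4 gates: input, hidden, bias
    encoder = V * D                      # input embedding
    decoder = V * D + V                  # softmax weights + biases
    return recurrent, encoder + decoder

rec, vocab = lstm_lm_params(V=10_000, D=650, L=2)
# With a PTB-sized vocabulary the vocabulary-dependent term dominates.
```

For $|V| = 10{,}000$, $D = 650$, $L = 2$ the vocabulary-dependent part is roughly twice the size of the recurrent part, consistent with the decoder alone occupying a large share of the network's memory.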
The following section describes the technique that performs efficient reduction of parameters in linear layers. This technique can be applied to the decoder layer of RNN (or to both decoder and encoder layers in tied-weight setting) providing the overall network compression with a negligible drop in quality.

DSVI-ARD
In this section we describe how the Doubly Stochastic Variational Inference algorithm for Automatic Relevance Determination (DSVI-ARD), originally proposed by Titsias and Lázaro-Gredilla (2014), can be adapted to the multiclass classification task and thus leveraged in neural network training for compressing dense layers (in particular, decoder layers in RNNs).

Automatic Relevance Determination
We formulate the multiclass classification problem in a Bayesian framework that provides a useful tool for feature selection: the so-called Automatic Relevance Determination (ARD).¹ Consider a discriminative probabilistic model given a training dataset $(X, Y) = \{(x_n, y_n)\}_{n=1}^{N}$ of $N$ independent objects:

$$p(Y, W \mid X, \Lambda) = \prod_{n=1}^{N} p(y_n \mid W, x_n)\, p(W \mid \Lambda). \quad (3)$$

Here $x_n \in \mathbb{R}^D$ is the feature vector of the $n$-th object, $y_n \in \{1, \dots, K\}$ is the label of the $n$-th object's class, $W \in \mathbb{R}^{K \times D}$ is the matrix of model parameters, and $\Lambda \in \mathbb{R}^{K \times D}$ is the matrix of hyperparameters defining the prior distribution $p(W \mid \Lambda)$ over the parameters $W$.
The prior distribution $p(W \mid \Lambda)$ is an element-wise factorized Gaussian over each element $w_{ij}$ (zero mean and tunable variance $\lambda_{ij}$). The likelihood function $p(y_n \mid W, x_n)$ is the $y_n$-th element of the softmax of the linear logits $W x_n$. Such a likelihood is typical in classification tasks and is encountered, for instance, in multiclass logistic regression.
The two goals of the subsequent Bayesian inference are, first, obtaining the posterior distribution over the model parameters conditioned on the training dataset, $p(W \mid X, Y, \Lambda)$, which can be leveraged for prediction on the test set, and, second, optimal model selection, i.e., hyperparameter tuning. Both problems can be solved simultaneously by maximizing the Evidence Lower Bound (ELBO):

$$\mathcal{L}(q, \Lambda) = \mathbb{E}_{W \sim q(W)}\left[\log p(Y \mid W, X)\right] - \mathrm{KL}\left(q(W) \,\|\, p(W \mid \Lambda)\right). \quad (4)$$

The ELBO is a function of two variables, an arbitrary variational distribution over parameters $q(W)$ and the model hyperparameters $\Lambda$, and decomposes into two parts: the data term $\mathbb{E}_{W \sim q(W)}[\log p(Y \mid W, X)]$ and the negative KL-divergence $-\mathrm{KL}(q(W) \,\|\, p(W \mid \Lambda))$ between the variational approximation $q(W)$ and the prior distribution $p(W \mid \Lambda)$ (the KL-term). We show below that maximization of the ELBO with respect to both $q$ and $\Lambda$ solves the model selection problem while also fitting $q$ to the posterior.
The ELBO has several useful properties: $\mathcal{L}(q, \Lambda) \le \log p(Y \mid X, \Lambda)$ for all $q, \Lambda$ (it bounds the log-evidence $\log p(Y \mid X, \Lambda)$ from below), with equality if and only if $q(W) = p(W \mid X, Y, \Lambda)$. Hence maximization of the ELBO with respect to $q$ for fixed $\Lambda$ is equivalent to fitting $q$ to the posterior $p(W \mid X, Y, \Lambda)$, solving the first of the two Bayesian inference problems mentioned above.
Maximization of the evidence $p(Y \mid X, \Lambda)$ with respect to the hyperparameters $\Lambda$ is a well-known Bayesian model selection method, also known as empirical Bayes estimation (Carlin and Louis, 1997). A model with the highest evidence is considered "the best" in terms of both data fit and model complexity. Evidence maximization can be performed via ELBO maximization:

$$\max_{\Lambda} \log p(Y \mid X, \Lambda) = \max_{\Lambda} \max_{q} \mathcal{L}(q, \Lambda) = \max_{q, \Lambda} \mathcal{L}(q, \Lambda). \quad (5)$$

This double maximization procedure, as shown above, handles the model selection problem while also fitting $q$ to the posterior.
From the form of the ELBO it is clear that only the KL-term $\mathrm{KL}(q(W) \,\|\, p(W \mid \Lambda))$ depends on $\Lambda$, hence maximization of the ELBO with respect to $\Lambda$ is equivalent to minimization of the KL-term with respect to $\Lambda$. We now restrict the variational distribution $q$ to a factorized Gaussian:

$$q(W) = \prod_{i=1}^{K} \prod_{j=1}^{D} \mathcal{N}(w_{ij} \mid \mu_{ij}, \sigma_{ij}^2), \quad (6)$$

where $\mu, \sigma \in \mathbb{R}^{K \times D}$ are the variational parameters. In this case ELBO maximization (or, equivalently, KL-term minimization) with respect to $\Lambda$ can be performed analytically, with the solution

$$\lambda_{ij}^{*} = \mu_{ij}^2 + \sigma_{ij}^2. \quad (7)$$

After substituting $\Lambda^{*}$ from (7) into the ELBO (4) and taking into account the variational family restriction (6), we can rewrite the maximization problem (5) as

$$\max_{\mu, \sigma} \; \mathbb{E}_{W \sim q(W)}\left[\log p(Y \mid W, X)\right] - \frac{1}{2} \sum_{i,j} \log\left(1 + \frac{\mu_{ij}^2}{\sigma_{ij}^2}\right). \quad (8)$$

Equation (8) is the final form of the ARD ELBO maximization problem. The first term (the data term) induces the variational parameters to describe the observed data well by sharpening the variational distribution at the maximum likelihood point, while the second term (the KL-term) makes irrelevant parameters shrink. The joint maximization of both terms leads to a sparse solution (in the limit), at which all redundant features are zeroed. The following subsections describe how this can be performed in practice, particularly in application to recurrent neural networks.
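The closed-form KL-term obtained by substituting the optimal prior variance (7) into the ELBO is simple enough to compute directly; a minimal sketch (function name is illustrative):

```python
import numpy as np

def ard_kl(mu, sigma):
    """Per-weight KL-term of the ARD objective (8) after substituting
    the optimal prior variance lambda* = mu^2 + sigma^2 from (7):
    KL = 0.5 * log(1 + mu^2 / sigma^2)."""
    return 0.5 * np.log1p((np.asarray(mu) / np.asarray(sigma)) ** 2)

# A weight whose posterior mean is small relative to its posterior
# standard deviation pays almost no KL penalty, so nothing stops the
# optimization from shrinking it toward zero.
```

Note that the penalty depends only on the ratio $\mu_{ij}/\sigma_{ij}$: weights whose means are indistinguishable from their posterior noise are exactly the ones the regularizer drives to zero.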

DSVI
Doubly Stochastic Variational Inference (DSVI) is a method of stochastic gradient maximization of the ELBO with respect to the variational parameters. We provide the standard DSVI-ARD procedure in Algorithm 1. At each iteration two types of random variables are sampled: a mini-batch of objects $\{x_m, y_m\}_{m=1}^{M} \subseteq (X, Y)$ and a set of "proto-weights" $\epsilon_m \sim \mathcal{N}(0, I)$, $\epsilon_m \in \mathbb{R}^{K \times D}$, for each object in the mini-batch, which are used to obtain stochastic gradients of the log-likelihood with respect to the variational parameters via the reparametrization trick (RT) (Kingma et al., 2015). DSVI does not depend on a specific form of the log-likelihood function $\log p(y \mid W, x)$; it only requires its gradient $\nabla_W \log p(y \mid W, x)$, so the same procedure is applicable to different models with differentiable log-likelihoods. DSVI can also be regarded as efficient SGD minimization of the negative ELBO loss (8), which consists of the data term and the KL-regularizer.
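The reparametrization step at the core of Algorithm 1 can be sketched as follows (names are illustrative; the full algorithm also adds the KL-term gradients and a learning-rate schedule):

```python
import numpy as np

def rt_sample(mu, sigma, rng):
    """Reparametrization trick: W = mu + sigma * eps, eps ~ N(0, I).
    A gradient g = dlogp/dW then yields dlogp/dmu = g and
    dlogp/dsigma = g * eps by the chain rule."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, eps

rng = np.random.default_rng(0)
mu = np.zeros((3, 4))
sigma = np.ones((3, 4))
W, eps = rt_sample(mu, sigma, rng)

g_W = np.ones_like(W)   # placeholder for the log-likelihood gradient
g_mu = g_W              # data-term gradient w.r.t. mu
g_sigma = g_W * eps     # data-term gradient w.r.t. sigma
```

Because the randomness is isolated in `eps`, the sampled weights are a deterministic, differentiable function of the variational parameters, which is what makes backpropagation through the sampling step possible.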

DSVI-ARD in Recurrent Neural Networks
As noted above, DSVI can be applied to any probabilistic ARD model with a differentiable likelihood. A neural network with a softmax layer introduces a likelihood function similar to the one considered in (3). Hence, we suggest replacing the softmax output layer with an ARD layer for multiclass classification and training it with the DSVI algorithm, computing the log-likelihood gradients via backpropagation thanks to the use of the RT.

[Algorithm 2: Doubly stochastic variational inference for non-independent data. Input: log-likelihood $\log p(y \mid W, x)$, training dataset $(X, Y)$ of size $N$, learning rates $\{\rho_k\}$, mini-batch size $M$. Initialize the variational parameters $\mu^{(0)}, \sigma^{(0)}$.]
When training an RNN with a DSVI-ARD layer as a decoder (the softmax layer in this case) we face the question of the sampling strategy for the parameters: one sample per object, or one sample for the whole mini-batch of objects. The first strategy is typical for standard classification tasks and is implemented in the classical DSVI procedure (Algorithm 1). The second is more justified in the RNN case, because objects in one sequence (mini-batch) are not independent and are better processed with the same weights. We propose Algorithm 2, which is applicable in the case of non-iid objects in a mini-batch. In summary, it differs from the standard DSVI only in that the "proto-weights" $\epsilon \in \mathbb{R}^{K \times D}$ are sampled once for the whole mini-batch at each iteration.
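The difference between the two sampling strategies can be sketched as follows (helper names are illustrative; the single shared sample corresponds to Algorithm 2):

```python
import numpy as np

def sample_per_object(mu, sigma, batch_size, rng):
    """Algorithm 1 style: an independent weight sample per object."""
    eps = rng.standard_normal((batch_size,) + mu.shape)
    return mu + sigma * eps          # broadcasts over the batch axis

def sample_per_batch(mu, sigma, rng):
    """Algorithm 2 style: one proto-weight tensor shared by the whole
    mini-batch, so all tokens of a sequence see the same weights."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, sigma = np.zeros((5, 8)), np.ones((5, 8))
W_iid = sample_per_object(mu, sigma, batch_size=3, rng=rng)
W_shared = sample_per_batch(mu, sigma, rng)
```

Sharing one sample across the mini-batch keeps the weights fixed along each word sequence, matching the non-iid structure of language data.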
We also consider applying DSVI-ARD in a tied-weight setting. For this we slightly change the model so that both the encoder and decoder layers contribute to the likelihood via the same set of weights. Now (in the non-iid DSVI Algorithm 2) the same weight matrix is sampled for both layers, and their gradients with respect to the variational parameters are summed to obtain the joint gradient of $\log p(y \mid W, x)$ for the data term update $g_{\text{Data}}$. The KL-term remains the same, as no new random variables are added to the model, and neither its prior distribution nor its variational approximation changes. The only thing that varies is the likelihood of the model, i.e., the data term: now the encoder is also conditioned on the variational parameters $\mu$ and $\sigma$. This essentially means that the gradients with respect to the encoder's weights are propagated back to the variational parameters.

Experiments
We conducted several experiments to test the DSVI-ARD compression approach in language modeling. We used the LSTM and tied-weight LSTM models from Zaremba et al. (2014) and Inan et al. (2016), respectively, as our baselines. The experiments involved the same LSTM architecture with two hidden layers of size 650 and two datasets: PTB (Mikolov et al., 2010) and WikiText-2 (Merity et al., 2016). Each mini-batch of objects was constructed from bs word sequences (bs = 10 for evaluation and bs = 20 for training) of length bptt = 35.
We applied dropout after the embedding layer (except for the tied-weight ARD models, because ARD can be regarded as a special form of regularization by itself) and after the hidden layers, with the dropout rate as a hyperparameter. We used stochastic gradient descent (SGD) for optimization, with an adaptive learning rate decreasing from a starting value by a multiplicative factor (both hyperparameters) each time the validation perplexity stopped improving.
We also compared our approach to other compression techniques: matrix decomposition-based (Grachev et al., 2019) and VD-based (Chirkova et al., 2018). For the latter we used a similar model: a network with one LSTM layer of 256 hidden units.

Training and evaluation
The whole set of parameters of a model with DSVI-ARD layers can be divided into the variational parameters $\mu, \sigma$ and all the other network parameters (including the biases of the DSVI-ARD layers). Variational optimization is performed with the DSVI-ARD algorithm, which, in turn, only requires gradients of the log-likelihood and the KL-divergence. Therefore, overall model training is a standard gradient optimization of parameters based on backpropagation (specifically, BPTT in the RNN case) with the negative ELBO as the loss function.

Figure 1: Plots of the validation cross-entropy (red line) of an LSTM model with a DSVI-ARD softmax layer on the PTB dataset and its corresponding sparsity (blue line) for different possible threshold values $\log \lambda_{\text{thresh}}$ (top), and the distribution histogram of its prior log-variances $\log \lambda_{ij}$ (bottom). The density is displayed on a log scale due to a very sparse distribution. The threshold chosen for further model evaluation (the best in terms of perplexity on the validation set), $\log \lambda_{\text{thresh}}^{\text{opt}}$, is marked with a green dashed line.
For more efficient training we applied the KL-cost annealing technique (Sønderby et al., 2016). The idea is to multiply the KL-term in the ELBO by a variable weight, called the KL-weight, during training. The weight gradually increases from zero to one over the first several epochs. This technique allows the model to achieve better final performance, because such a training procedure can be considered as pre-training on data (while the data term in the ELBO dominates) followed by fair optimization of the true ELBO (once the KL-weight reaches one). We used a simple linear KL-weight increase strategy with the step selected as a hyperparameter.
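The linear annealing schedule can be written in a few lines (the step value here is purely illustrative, not the one used in the experiments):

```python
def kl_weight(epoch, step=0.25):
    """Linear KL-cost annealing: the KL-term multiplier grows from 0
    to 1 over the first few epochs; step is a hyperparameter."""
    return min(1.0, epoch * step)

# The effective training loss at a given epoch is then
#   -data_term + kl_weight(epoch) * kl_term
```

With the weight at zero the objective reduces to maximum-likelihood pre-training; once the weight reaches one, the true negative ELBO is optimized.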
During the evaluation of our models we do not sample parameters as in the training phase, but instead set the approximate posterior mean $\mu$ as the DSVI-ARD layer weights. We then zero out the weights whose prior log-variances fall below a certain threshold $\log \lambda_{\text{thresh}}$ (a hyperparameter selected on validation):

$$w_{ij} = \mu_{ij} \cdot \left[\log \lambda_{ij}^{*} \ge \log \lambda_{\text{thresh}}\right]. \quad (9)$$

This procedure provides the desired sparsity, as redundant weights are literally removed from the network.

Each experiment was conducted as follows. We trained several models for a number of epochs with different hyperparameter initializations (dropout rate, learning rate, etc.). Then we picked the best model in terms of cross-entropy (log-perplexity) on the validation set at the last training epoch. We did not zero weights during evaluation at this phase; in other words, $\log \lambda_{\text{thresh}} = -\infty$ in equation (9). After that, we performed threshold selection for the picked model: we iterated over possible values of $\log \lambda_{\text{thresh}}$ from the "leave-all" to the "remove-all" extreme and chose the value (denoted $\log \lambda_{\text{thresh}}^{\text{opt}}$) with the best validation perplexity. Finally, we evaluated the model on the test set using the chosen optimal threshold $\log \lambda_{\text{thresh}}^{\text{opt}}$.

In our results we report the achieved compression, perplexity, and accuracy² on the test set.
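The evaluation-time pruning step can be sketched as a stand-alone function (names are illustrative; the prior variance follows the optimal solution $\lambda^{*}_{ij} = \mu_{ij}^2 + \sigma_{ij}^2$):

```python
import numpy as np

def prune(mu, sigma, log_lam_thresh):
    """Set weights to the posterior mean mu, then zero those whose
    learned prior log-variance log(mu^2 + sigma^2) is below the
    threshold; returns the pruned matrix and its sparsity."""
    log_lam = np.log(mu ** 2 + sigma ** 2)
    W = np.where(log_lam >= log_lam_thresh, mu, 0.0)
    sparsity = 1.0 - np.count_nonzero(W) / W.size
    return W, sparsity

mu = np.array([[1e-3, 1.0], [2.0, -1e-4]])
sigma = np.array([[1e-3, 0.1], [0.1, 1e-4]])
W, s = prune(mu, sigma, log_lam_thresh=-5.0)  # keeps the two large weights
```

Sweeping `log_lam_thresh` from very small ("leave-all") to very large ("remove-all") and evaluating validation perplexity at each point is exactly the threshold-selection loop described above.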

Results
Table 1 summarizes all the results obtained in our experiments.
The comparison of DSVI-ARD with other dense layer compression approaches revealed that our models exhibit comparable perplexity while achieving much higher compression (in the case of Grachev et al. (2019)) and even surpass models based on similar Bayesian compression techniques (in the case of Chirkova et al. (2018)).
It can also be seen that encoder-decoder weight tying yields higher overall compression (from almost 45% to a 70% reduction of all model weights), due to the smaller number of parameters in the whole network, and even better results in both perplexity and accuracy on both datasets. Quality improvement from weight tying is common; however, we see that it particularly enhances the performance of our models (perplexity drops by almost 10 points on PTB and more than 16 points on WikiText-2), which supports the proposed combination of DSVI-ARD and weight tying.
One can argue that DSVI-ARD may lead to overpruning (Trippe and Turner, 2018), because in all our experiments (except the last one, which compares with the results of Chirkova et al. (2018)) a slight quality drop in terms of perplexity can be observed. However, we also report the test accuracy, in terms of which we achieve comparable or even better results than the original models. We suggest that this effect might be caused by an increase in prediction uncertainty (entropy) rather than by model quality deterioration.

Fig. 1 shows plots of the cross-entropy (log-perplexity) and sparsity for different thresholds $\log \lambda_{\text{thresh}}$ (essentially the compression-quality trade-off) and the log-scaled distribution histogram of the decoder layer's prior log-variances $\log \lambda_{ij}^{*}$ obtained for a DSVI-ARD LSTM model trained on the PTB dataset. We also provide the same distribution on the standard scale (Fig. 2) for comparison. The dashed green line marks the chosen threshold $\log \lambda_{\text{thresh}}^{\text{opt}}$, which yields the best validation perplexity. We can observe that the overwhelming majority of weights in the last layer are indeed redundant, i.e., have small prior variances, do not contribute to the inference, and can be removed without significantly harming model performance. We argue that DSVI-ARD eliminates weights that obstruct generalization, leaving only those actually necessary for correct prediction.

Conclusion
In this paper we adapted the DSVI-ARD algorithm for compressing recurrent neural networks. Our main contributions are extending DSVI-ARD to the case of non-iid objects and combining it with the weight tying technique. In our experiments with LSTM networks on language modeling tasks, we obtained substantial compression ratios at an acceptable quality loss. The proposed method turned out to be comparable to, or even to surpass, other compression techniques such as matrix decomposition and variational dropout.
There are several possible avenues for future work. An intriguing application of the proposed DSVI-ARD method for RNNs is the compression of current state-of-the-art models (Yang et al., 2017; Dai et al., 2019), which require enormous amounts of memory and computational resources. At the same time, one of the drawbacks of current Bayesian compression approaches is their limited expressive ability: most of them are based on oversimplified posterior approximations and prior distributions (e.g., factorized Gaussians), which may lead to overly rough estimates and overall model inefficiency. A rigorous study of this problem is required. Another possible direction is bringing the Bayesian framework into matrix decomposition-based methods as well. This fusion may lead to more effective and better justified compression techniques.