Bayesian Compression for Natural Language Processing

In natural language processing, many tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, whose size grows proportionally to the vocabulary size. We propose a Bayesian sparsification technique for RNNs that allows compressing the RNN by a factor of dozens or hundreds without time-consuming hyperparameter tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable.


Introduction
Recurrent neural networks (RNNs) are among the most powerful models for natural language processing, speech recognition and question-answering systems (Chan et al., 2016; Ha et al., 2017; Wu et al., 2016; Ren et al., 2015). For complex tasks such as machine translation (Wu et al., 2016), modern RNN architectures incorporate a huge number of parameters. To use these models on portable devices with limited memory, model compression is desirable.
There are many RNN compression methods based on specific weight matrix representations (Tjandra et al., 2017; Le et al., 2015) or sparsification (Narang et al., 2017; Wen et al., 2018). In this paper we focus on RNN compression via sparsification. One way to sparsify an RNN is pruning, where the weights with a small absolute value are eliminated from the model. Such methods are heuristic and require time-consuming hyperparameter tuning. There is another group of sparsification techniques based on the Bayesian approach. Molchanov et al. (2017) describe a model called SparseVD in which parameters controlling sparsity are tuned automatically during neural network training. However, this technique was not previously investigated for RNNs. In this paper, we apply SparseVD to RNNs taking into account the specifics of the recurrent network structure (Section 3.2). More precisely, we use the insight about using the same sample of weights for all timesteps in the sequence (Gal and Ghahramani, 2016; Fortunato et al., 2017). This modification makes the local reparametrization trick (Kingma et al., 2015; Molchanov et al., 2017) not applicable and changes the SparseVD training procedure.
* Equal contribution.
In natural language processing tasks, the majority of weights in RNNs are often concentrated in the first layer that is connected to the vocabulary, for example in the embedding layer. However, for some tasks most of the words are unnecessary for accurate predictions. In our model we introduce multiplicative weights for the words to perform vocabulary sparsification (Section 3.3). These multiplicative weights are zeroed out during training, which filters the corresponding unnecessary words out of the model. This boosts the RNN sparsification level even further.
To sum up, our contributions are as follows: (i) we adapt SparseVD to RNNs, explaining the specifics of the resulting model, and (ii) we generalize this model by introducing multiplicative weights for words to purposefully sparsify the vocabulary. Our results show that Sparse Variational Dropout leads to a very high level of sparsity in recurrent models without a significant quality drop. Models with additional vocabulary sparsification boost the compression rate on text classification tasks but do not help as much on language modeling tasks. In classification tasks the vocabulary is compressed by a factor of dozens, and the choice of words is interpretable.

arXiv:1810.10927v2 [cs.CL] 12 Dec 2018
Related work
Reducing RNN size is an important and rapidly developing area of research. There are three research directions: approximation of weight matrices (Tjandra et al., 2017; Le et al., 2015), reducing the precision of the weights (Hubara et al., 2016) and sparsification of the weight matrices (Narang et al., 2017; Wen et al., 2018). We focus on the last one. The most popular approach here is pruning: the weights of the RNN are cut off at some threshold. Narang et al. (2017) choose the threshold using several hyperparameters that control the frequency, the rate and the duration of weight elimination. Wen et al. (2018) propose to prune the weights in an LSTM by groups corresponding to each neuron, which accelerates the forward pass through the network.
Another group of sparsification methods relies on Bayesian neural networks (Molchanov et al., 2017; Neklyudov et al., 2017; Louizos et al., 2017). In Bayesian NNs the weights are treated as random variables, and our preference for sparse weights is expressed in a prior distribution over them. During training, the prior distribution is transformed into the posterior distribution over the weights, which is used to make predictions in the testing phase. Neklyudov et al. (2017) and Louizos et al. (2017) also introduce group Bayesian sparsification techniques that make it possible to eliminate neurons from the model.
The main advantage of the Bayesian sparsification techniques is that they have a small number of hyperparameters compared to pruning-based methods. Also, they lead to a higher sparsity level (Molchanov et al., 2017; Neklyudov et al., 2017; Louizos et al., 2017).
There are several works on Bayesian recurrent neural networks (Gal and Ghahramani, 2016; Fortunato et al., 2017), but these methods are hard to extend to achieve sparsification. We apply sparse variational dropout to RNNs taking into account their recurrent specifics, including some insights highlighted by Gal and Ghahramani (2016) and Fortunato et al. (2017).
3 Proposed method

Notations
In the rest of the paper, $x = [x_0, \dots, x_T]$ is an input sequence, $y$ is the true output and $\hat{y}$ is the output predicted by the RNN ($y$ and $\hat{y}$ may be single vectors, sequences, etc.), and $X, Y$ denotes a training set $\{(x^1, y^1), \dots, (x^N, y^N)\}$. All weights of the RNN except biases are denoted by $\omega$, while a single weight (an element of any weight matrix) is denoted by $w_{ij}$. Note that we detach biases and denote them by $B$ because we do not sparsify them.
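For definiteness, we will illustrate our model on an example architecture for the language modeling task, where $y = [x_1, \dots, x_T]$:
$$\text{embedding: } \tilde{x}_t = w^e_{x_t}; \qquad \text{recurrent: } h_{t+1} = \sigma(W^h h_t + W^x \tilde{x}_{t+1} + b^r); \qquad \text{fully-connected: } \hat{y}_t = \mathrm{softmax}(W^d h_t + b^d).$$
However, the model may be directly applied to any recurrent architecture.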

Sparse variational dropout for RNNs
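Following Kingma et al. (2015) and Molchanov et al. (2017), we put a fully factorized log-uniform prior over the weights:
$$p(\omega) = \prod_{w_{ij} \in \omega} p(w_{ij}), \qquad p(w_{ij}) \propto \frac{1}{|w_{ij}|},$$
and approximate the posterior with a fully factorized normal distribution:
$$q(\omega \mid \theta, \sigma) = \prod_{w_{ij} \in \omega} \mathcal{N}(w_{ij} \mid \theta_{ij}, \sigma_{ij}^2).$$
The task of posterior approximation, $\min_{\theta, \sigma, B} KL(q(\omega \mid \theta, \sigma) \,\|\, p(\omega \mid X, Y, B))$, is equivalent to variational lower bound optimization (Molchanov et al., 2017):
$$-\sum_{i=1}^{N} \int q(\omega \mid \theta, \sigma) \log p(y^i \mid x^i_0, \dots, x^i_T, \omega, B)\, d\omega + \sum_{w_{ij} \in \omega} KL(q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) \,\|\, p(w_{ij})) \to \min_{\theta, \sigma, B}. \tag{1}$$
Here the first term, a task-specific loss, is approximated with one sample from $q(\omega \mid \theta, \sigma)$. The second term is a regularizer that moves the posterior closer to the prior and induces sparsity. This regularizer can be very closely approximated analytically (Molchanov et al., 2017):
$$KL(q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) \,\|\, p(w_{ij})) \approx k\!\left(\frac{\sigma_{ij}^2}{\theta_{ij}^2}\right), \qquad k(\alpha) = k_1 - k_1\, \mathrm{sigm}(k_2 + k_3 \log \alpha) + \frac{1}{2} \log\!\left(1 + \frac{1}{\alpha}\right), \tag{2}$$
where $\mathrm{sigm}(\cdot)$ is the logistic sigmoid and $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$.
To make integral estimation unbiased, sampling from the posterior is performed with the use of the reparametrization trick (Kingma and Welling, 2014):
$$w_{ij} = \theta_{ij} + \sigma_{ij} \epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}(0, 1). \tag{3}$$
The important difference between RNNs and feed-forward networks is that the same weight variable is shared between different timesteps. Thus, we should use the same sample of weights for each timestep $t$ while computing the likelihood $p(y^i \mid x^i_0, \dots, x^i_T, \omega, B)$ (Gal and Ghahramani, 2016; Fortunato et al., 2017).
Kingma et al. (2015) and Molchanov et al. (2017) also use the local reparametrization trick (LRT), that is, sampling preactivations instead of individual weights. For example,
$$(W^x x_t)_i = \sum_j \theta^x_{ij} x_{tj} + \epsilon_i \sqrt{\sum_j (\sigma^x_{ij})^2 x_{tj}^2}, \qquad \epsilon_i \sim \mathcal{N}(0, 1).$$
Tied weight sampling makes LRT not applicable to weight matrices that are used in more than one timestep in the RNN.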
For the hidden-to-hidden matrix $W^h$, the linear combination $W^h h_t$ is not normally distributed because $h_t$ depends on $W^h$ from the previous timestep. As a result, the rule about the sum of independent normal distributions with constant coefficients is not applicable. In practice, a network with LRT on hidden-to-hidden weights cannot be trained properly.
For the input-to-hidden matrix $W^x$, the linear combination $W^x x_t$ is normally distributed. However, sampling the same $W^x$ for all timesteps and sampling the same preactivation noise $\epsilon_i$ for all timesteps are not equivalent. The same sample of $W^x$ corresponds to different samples of the noise $\epsilon_i$ at different timesteps because of the different $x_t$. Hence, theoretically, LRT is not applicable here. In practice, networks with LRT on input-to-hidden weights may give the same results, and in some experiments they even converge slightly faster.
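The non-equivalence for the input-to-hidden matrix can be checked on a toy case. The sketch below is only an illustration with made-up values, not code from the paper: it fixes one weight-noise sample for a single row of the matrix and computes the preactivation noise that LRT would need at each timestep to reproduce the same preactivation. The function name `implied_preactivation_noise` and all numbers are hypothetical.

```python
import math


def implied_preactivation_noise(sigma_row, eps_row, x):
    """Preactivation noise that LRT would need to match one weight sample.

    With W_ij = theta_ij + sigma_ij * eps_ij, the noisy part of (W x)_i is
    sum_j sigma_ij * eps_ij * x_j, while LRT writes the same quantity as
    eps_i * sqrt(sum_j sigma_ij^2 * x_j^2). Solving for eps_i gives the ratio.
    """
    noise = sum(s * e * xj for s, e, xj in zip(sigma_row, eps_row, x))
    std = math.sqrt(sum((s * xj) ** 2 for s, xj in zip(sigma_row, x)))
    return noise / std


# One row of the input-to-hidden matrix; the weight noise is sampled once
# and shared by all timesteps (toy values chosen by hand).
sigma_row = [1.0, 1.0]
eps_row = [1.0, -1.0]
x_t1 = [1.0, 0.0]  # one-hot input at the first timestep
x_t2 = [0.0, 1.0]  # one-hot input at the second timestep

eps_t1 = implied_preactivation_noise(sigma_row, eps_row, x_t1)
eps_t2 = implied_preactivation_noise(sigma_row, eps_row, x_t2)
```

Here the two timesteps select different weights of the row, so the implied preactivation noise changes sign between them even though the weight sample is the same.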
Since the training procedure is effective only with a 2D noise tensor, we propose to sample the noise on the weights per mini-batch, not per individual object.
To sum up, the training procedure is as follows. To perform the forward pass for a mini-batch, we first sample all weights $\omega$ following (3) and then apply the RNN as usual. Then the gradients of (1) are computed w.r.t. $\theta$, $\log \sigma$ and $B$.
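The forward pass of this procedure can be sketched in plain Python. This is a minimal illustration under several assumptions, not the authors' implementation: a tanh RNN stands in for the LSTM, the regularizer uses the analytic approximation of Molchanov et al. (2017) with its published constants, gradients are left to an autodiff framework, and every name (`sample_matrix`, `rnn_forward`, `forward_minibatch`, the `params` layout) is hypothetical.

```python
import math
import random

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # constants from Molchanov et al. (2017)


def kl_approx(theta, log_sigma):
    """Approximate KL(q || p) for one weight, with alpha = sigma^2 / theta^2."""
    log_alpha = 2.0 * log_sigma - 2.0 * math.log(abs(theta) + 1e-8)
    log_alpha = max(-20.0, min(20.0, log_alpha))  # clamp for numerical safety
    sigm = 1.0 / (1.0 + math.exp(-(K2 + K3 * log_alpha)))
    return K1 - K1 * sigm + 0.5 * math.log1p(math.exp(-log_alpha))


def sample_matrix(theta, log_sigma, rng):
    """Reparametrized sample W = theta + exp(log_sigma) * eps, eps ~ N(0, 1)."""
    return [[t + math.exp(ls) * rng.gauss(0.0, 1.0) for t, ls in zip(t_row, ls_row)]
            for t_row, ls_row in zip(theta, log_sigma)]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(W_h, W_x, xs):
    """Plain tanh RNN; the same W_h and W_x are reused at every timestep."""
    h = [0.0] * len(W_h)
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_h, h), matvec(W_x, x))]
        h = [math.tanh(p) for p in pre]
    return h


def forward_minibatch(params, batch, rng):
    """Sample all weights once per mini-batch, then run every sequence."""
    W_h = sample_matrix(params["theta_h"], params["log_sigma_h"], rng)
    W_x = sample_matrix(params["theta_x"], params["log_sigma_x"], rng)
    states = [rnn_forward(W_h, W_x, xs) for xs in batch]
    reg = sum(kl_approx(t, ls)
              for name in ("h", "x")
              for t_row, ls_row in zip(params["theta_" + name], params["log_sigma_" + name])
              for t, ls in zip(t_row, ls_row))
    return states, reg


# Toy setup: 2 hidden units, 3-dimensional one-hot inputs, log sigma set to -3.
params = {
    "theta_h": [[0.1, -0.2], [0.4, 0.3]],
    "log_sigma_h": [[-3.0, -3.0], [-3.0, -3.0]],
    "theta_x": [[0.5, -0.5, 0.1], [0.2, 0.6, -0.3]],
    "log_sigma_x": [[-3.0] * 3, [-3.0] * 3],
}
batch = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
         [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]]
states, reg = forward_minibatch(params, batch, random.Random(0))
```

The weights are drawn before the loop over timesteps, so every timestep of every sequence in the mini-batch sees the same sample. In a real training loop the task loss computed from `states` would be added to `reg` and differentiated with respect to the means, the log standard deviations and the biases.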
During the testing stage, we use the mean weights $\theta$ (Molchanov et al., 2017). Regularizer (2) causes the majority of the components of $\theta$ to approach 0, and the weights are sparsified. More precisely, we eliminate weights with a low signal-to-noise ratio (Molchanov et al., 2017).
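The test-time elimination step can be sketched as follows. The sketch assumes that the signal-to-noise ratio of a weight is $|\theta_{ij}| / \sigma_{ij}$ and uses a threshold of 0.05 as the default; the function name `prune_by_snr` and the toy values are hypothetical.

```python
def prune_by_snr(theta, sigma, tau=0.05):
    """Keep the mean weight where |theta| / sigma >= tau, set it to zero otherwise.

    The SNR definition |theta| / sigma is an assumption of this sketch.
    """
    return [[t if abs(t) / s >= tau else 0.0 for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(theta, sigma)]


# Toy posterior parameters of a 2 x 2 weight matrix.
theta = [[0.5, 0.001], [-0.2, 0.0]]
sigma = [[0.05, 0.05], [0.05, 0.05]]
pruned = prune_by_snr(theta, sigma)
```

Only the means are kept at test time, so the pruned matrix can be stored in a sparse format.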

Multiplicative weights for vocabulary sparsification
One of the advantages of Bayesian sparsification is that it generalizes easily to the sparsification of any group of weights without complicating the training procedure (Louizos et al., 2017). To do so, one introduces a shared multiplicative weight per group; eliminating this multiplicative weight eliminates the corresponding group. In our work we utilize this approach to achieve vocabulary sparsification. Precisely, we introduce multiplicative probabilistic weights $z \in \mathbb{R}^V$ for words in the vocabulary (here $V$ is the size of the vocabulary). The forward pass with $z$ looks as follows:
1. sample a vector $z^i$ from the current approximation of the posterior for each input sequence $x^i$ from the mini-batch;
2. multiply each one-hot encoded token $x^i_t$ from the sequence $x^i$ elementwise by $z^i$ (here both $x^i_t$ and $z^i$ are $V$-dimensional);
3. continue the forward pass as usual.
We work with $z$ in the same way as with the other weights $W$: we use a log-uniform prior and approximate the posterior with a fully factorized normal distribution with trainable mean and variance. However, since $z$ is a one-dimensional vector, we can sample it individually for each object in a mini-batch to reduce the variance of the gradients. After training, we prune elements of $z$ with a low signal-to-noise ratio; subsequently, we do not use the corresponding words from the vocabulary and drop the corresponding columns of weights from the embedding or input-to-hidden weight matrices.
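The vocabulary sparsification steps can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the embedding table is stored with one row per word (so dropping a word removes a row here, which corresponds to a column in the paper's layout), the signal-to-noise ratio is taken to be $|\theta| / \sigma$, and the names `sample_z`, `embed_sequence` and `kept_words` are hypothetical.

```python
import math
import random


def sample_z(theta_z, log_sigma_z, rng):
    """Draw one multiplicative weight per word: z_v = theta_v + sigma_v * eps_v."""
    return [t + math.exp(ls) * rng.gauss(0.0, 1.0) for t, ls in zip(theta_z, log_sigma_z)]


def embed_sequence(tokens, embedding, z):
    """Embed token ids after scaling by z.

    Multiplying a one-hot token vector elementwise by z and then applying the
    embedding matrix is the same as scaling that word's embedding by z[token].
    """
    return [[z[tok] * e for e in embedding[tok]] for tok in tokens]


def kept_words(theta_z, log_sigma_z, tau=0.05):
    """Indices of words whose multiplicative weight has SNR |theta| / sigma >= tau."""
    return [v for v, (t, ls) in enumerate(zip(theta_z, log_sigma_z))
            if abs(t) / math.exp(ls) >= tau]


# Toy vocabulary of 3 words with 2-dimensional embeddings.
embedding = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
theta_z = [1.0, 0.0001, 0.8]
log_sigma_z = [-3.0, -3.0, -3.0]

rng = random.Random(0)
batch = [[2, 0], [1, 1, 2]]
# A separate z is sampled for every sequence in the mini-batch.
embedded = [embed_sequence(seq, embedding, sample_z(theta_z, log_sigma_z, rng)) for seq in batch]
vocabulary_after_pruning = kept_words(theta_z, log_sigma_z)
```

In this toy example the second word has a mean close to zero relative to its noise, so it is removed from the vocabulary and its embedding can be dropped.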
Experiments
We perform experiments with the LSTM architecture on two types of problems: text classification and language modeling. Three models are compared here: a baseline model without any regularization, the SparseVD model, and the SparseVD model with multiplicative weights for vocabulary sparsification (SparseVD-Voc).
To measure the sparsity level of our models we calculate the compression rate of individual weights as follows: $|w| / |w \neq 0|$, that is, the total number of weights divided by the number of non-zero weights. The sparsification of weights may lead not only to compression but also to the acceleration of RNNs through group sparsity. Hence, we report the number of remaining neurons in all layers: input (vocabulary), embedding and recurrent. To compute this number for the vocabulary layer in SparseVD-Voc we use the introduced variables $z_v$. For all other layers in SparseVD and SparseVD-Voc, we drop a neuron if all weights connected to this neuron are eliminated.
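Both metrics are simple to compute from the pruned weight matrices. The sketch below is an illustration with hypothetical names and a layout assumption: `incoming` has one row per neuron of the layer, and `outgoing` has one column per neuron of the same layer.

```python
def compression_rate(matrices):
    """Total number of weights divided by the number of non-zero weights.

    Raises ZeroDivisionError if every weight has been pruned.
    """
    total = sum(len(row) for W in matrices for row in W)
    nonzero = sum(1 for W in matrices for row in W for w in row if w != 0.0)
    return total / nonzero


def remaining_neurons(incoming, outgoing):
    """Count neurons that still have at least one non-zero connected weight."""
    kept = 0
    for j in range(len(incoming)):
        has_in = any(w != 0.0 for w in incoming[j])
        has_out = any(row[j] != 0.0 for row in outgoing)
        if has_in or has_out:
            kept += 1
    return kept


# Toy layer with 3 neurons, 2 inputs and 2 outputs after pruning.
incoming = [[0.0, 0.0], [0.3, 0.0], [0.0, 0.0]]
outgoing = [[0.0, 0.1, 0.0], [0.0, 0.0, 0.0]]

rate = compression_rate([incoming, outgoing])
neurons = remaining_neurons(incoming, outgoing)
```

In this example 2 of the 12 weights survive, and only the middle neuron keeps any connection.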
We optimize our networks using Adam (Kingma and Ba, 2015). Baseline networks overfit on all our tasks; therefore, we present results for them with early stopping. For all weights that we sparsify, we initialize $\log \sigma$ with -3. We eliminate weights with a signal-to-noise ratio less than $\tau = 0.05$. More details about the experimental setup are presented in Appendix A.

Text Classification
We evaluated our approach on two standard datasets for text classification: the IMDb dataset (Maas et al., 2011) for binary classification and the AGNews dataset (Zhang et al., 2015) for four-class classification. We set aside 15% and 5% of the training data for validation purposes, respectively. For both datasets, we use a vocabulary of the 20,000 most frequent words.
We use networks with one embedding layer of 300 units, one LSTM layer of 128 / 512 hidden units for IMDb / AGNews, and finally, a fully connected layer applied to the last output of the LSTM. The embedding layer is initialized with word2vec (Mikolov et al., 2013) / GloVe (Pennington et al., 2014), and the SparseVD and SparseVD-Voc models are trained for 800 / 150 epochs on IMDb / AGNews.
The results are shown in Table 1. SparseVD leads to a very high compression rate without a significant quality drop. SparseVD-Voc boosts the compression rate even further while still preserving the accuracy. Such high compression rates are achieved mostly because of the sparsification of the vocabulary: to classify texts we need to read only some important words from them. The remaining words in our models are mostly interpretable for the task (see Appendix B for the list of remaining words for IMDb). Figure 1 shows the only kept embedding component for the remaining words on IMDb. This component reflects the sentiment score of the words.

Language Modeling
We evaluate our models on the tasks of character-level and word-level language modeling on the Penn Treebank corpus (Marcus et al., 1993) according to the train/valid/test partition of Mikolov et al. (2011). The dataset has a vocabulary of 50 characters or 10,000 words.
To solve the character / word-level tasks we use networks with one LSTM layer of 1000 / 256 hidden units and a fully-connected layer with softmax activation to predict the next character or word. We train the SparseVD and SparseVD-Voc models for 250 / 150 epochs on the character-level / word-level tasks.
The results are shown in Table 2. To obtain these results we employ LRT on the last fully-connected layer. In our experiments with language modeling, LRT on the last layer accelerates the training without harming the final result. Here we do not get such extreme compression rates as in the previous experiment, but we are still able to compress the models several times while achieving better quality than the baseline because of the regularization effect of SparseVD. The vocabulary is not sparsified in the character-level task because there are only 50 characters and all of them matter. In the word-level task more than half of the words are dropped. However, since in language modeling almost all words are important, the sparsification of the vocabulary makes the task more difficult for the network and leads to a drop in quality and in the overall compression (the network needs more complex dynamics in the recurrent layer).

Table 1: Results on text classification tasks. Compression is equal to $|w| / |w \neq 0|$. The last two columns report the number of remaining neurons in the input, embedding and recurrent layers.

Table 2: Results on language modeling tasks. Compression is equal to $|w| / |w \neq 0|$. The last two columns report the number of remaining neurons in the input and recurrent layers.