MulCode: A Multiplicative Multi-way Model for Compressing Neural Language Model

Deploying deep neural networks on memory-constrained devices is challenging because of their large number of parameters. In an RNN-based language model, the input embedding layer and the Softmax layer usually dominate memory usage. For example, the input embedding and Softmax matrices of a model trained on the IWSLT-2014 German-to-English data set account for more than 80% of the total model parameters. To compress these embedding layers, we propose MulCode, a novel multi-way multiplicative neural compressor. MulCode learns an adaptively created matrix and its multiplicative compositions. Together with a prior-weighted loss, MulCode is more effective than state-of-the-art compression methods. On the IWSLT-2014 machine translation data set, MulCode achieves a 17 times compression rate for the embedding and Softmax matrices, and when combined with quantization it achieves a 41.38 times compression rate with very little loss in performance.


Introduction
Deep neural language models with a large number of parameters have achieved great success in a wide range of Natural Language Processing (NLP) applications. However, the sheer number of parameters in these models is the main obstacle to deploying them on memory-constrained devices. Given the over-parameterization of deep neural networks, their effective compression has been receiving increasing attention from the research community. Neural language models typically consist of three major components: the recurrent layers (e.g., an LSTM cell), an embedding layer for representing input tokens, and a Softmax layer for generating output tokens. The dimension of the recurrent layers is typically small and independent of the size of the input/output vocabulary. In contrast, the embedding layer and the Softmax layer each use a dense layer to map the vocabulary to a lower-dimensional space; this layer grows linearly with the vocabulary size and accounts for most of the memory consumption. For example, the One Billion Word language modeling task (Chelba et al., 2013) has a vocabulary size of around 800k, and more than 90% of the memory is used to store the input and Softmax matrices. Compressing the word embedding is therefore the key to reducing the memory usage of deep neural language models.
* The two authors contribute equally.
Several studies have addressed the compression of neural language models. In particular, quantization (Lin et al., 2016; Hubara et al., 2016) and group-wise low-rank approximation (Chen et al., 2018a) achieve state-of-the-art performance on tasks such as language modeling and machine translation. However, these methods are computationally inspired rather than motivated by intuitions about semantic composition, and thus suffer from poor interpretability. In contrast, deep compositional models (Shu and Nakayama, 2017; Chen et al., 2018b) represent the original word vector with a set of discrete pseudo words, decomposing the word vector into the sum of the corresponding pseudo word vectors, which are learned in an end-to-end manner. These methods embrace the long-standing and still most popular assumption (Foltz et al., 1998) on the compositionality of semantics, which states that the meaning of an element (e.g., a word, phrase, or sentence) can be obtained by adding its constituent parts.
Although they are more semantically meaningful, current deep compositional models have several limitations. First, additive composition can be counter-intuitive because addition does not bring new meaning beyond the individual elements to the constructed element (Pinker, 2003; Lapata, 2009, 2008). Second, existing deep code compressors have difficulty compressing the output Softmax layer because conflicting codes (i.e., different words with the same encoding) may make the language model less discriminative. Last, it is well known that the distribution of word frequencies can be approximated by a power law, and defining the importance of words in terms of their corpus frequency has also been validated in (Chen et al., 2018a). However, current deep compositional frameworks encode every word equally without considering frequency information.
In this paper, we propose MulCode, a novel and effective deep neural compressor that addresses the above issues. MulCode uses multiplicative composition instead of addition. The multiplicative factors introduce a larger capacity, which allows more complicated semantics to be encoded. From the perspective of semantic composition, this allows new information to be introduced into the base codebook vectors (the sub-elements that compose a word vector) while taking the frequency information of each token into consideration. Technically, MulCode follows the research line of dense matrix decomposition, which has been investigated in many applications (Memisevic, 2011; Novikov et al., 2015). It trains the composition with a loss weighted in proportion to the frequency of each word. MulCode also constructs its codebooks adaptively based on the frequency of each word.
MulCode outperforms state-of-the-art methods on compressing language models and neural machine translation models. For example, on the One-Billion-Word data set, MulCode achieves an 18.2 times compression rate on the embedding and Softmax matrices without losing much performance. On the IWSLT-14 DE-EN machine translation task, MulCode achieves 17 times compression with a BLEU score difference smaller than one percent. This can be further improved to a 41.38 times compression rate when combined with quantization.

Model Compression for Convolutional Neural Network
Low-rank Approximation The original word embedding vectors can be viewed as a large matrix whose rows are the words and whose columns are the vector values. One natural way to approximate it is to obtain a low-rank approximation of the matrix using SVD. Based on this idea, (Sainath et al., 2013) compressed the fully connected layers in neural networks. For convolution layers, (Jaderberg et al., 2014; Denton et al., 2014) applied higher-order tensor decomposition to compress CNNs. Similarly, (Howard et al., 2017) developed another structural approximation. (Kim et al., 2016) proposed an algorithm to select the rank for each layer. Other work reconstructed the weight matrices using a sparse plus low-rank approximation.
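As a minimal sketch of the SVD-based idea above (not the procedure of any cited work), the following NumPy snippet factors a toy matrix into two thin matrices and compares parameter counts. The matrix size, the rank, and the helper name `low_rank_factors` are illustrative assumptions.

```python
import numpy as np

def low_rank_factors(E, rank):
    """Return (A, B) such that A @ B is the best rank-`rank` approximation of E."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (V, rank): left singular vectors scaled by singular values
    B = Vt[:rank, :]             # (rank, d): top right singular vectors
    return A, B

rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 64))      # toy stand-in for a V x d embedding matrix
A, B = low_rank_factors(E, rank=16)

ratio = E.size / (A.size + B.size)       # parameters before / parameters after
rel_err = np.linalg.norm(E - A @ B) / np.linalg.norm(E)
```

Storing `A` and `B` instead of `E` trades reconstruction error (`rel_err`) for memory (`ratio`); a random matrix like this one is a worst case, since it has no low-rank structure to exploit.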
Pruning Many algorithms have been proposed to remove unimportant weights in deep neural networks. To do this, one needs to define the importance of each weight. (LeCun et al., 1990) showed that the importance can be estimated using the Hessian of the loss function. (Han et al., 2015b) considered adding $\ell_1$ or $\ell_2$ regularization and applied iterative thresholding to achieve very good compression rates. Later on, (Han et al., 2015a) demonstrated that CNNs can be compressed by combining pruning, weight sharing, and quantization.
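A single thresholding step of the kind mentioned above can be sketched as follows. This is a simplified illustration that uses weight magnitude as the importance score, which is an assumption made here for brevity; the function name and the sparsity level are hypothetical.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|.

    Returns the pruned matrix and the boolean mask of kept weights.
    """
    k = int(sparsity * W.size)               # number of weights to remove
    if k == 0:
        return W.copy(), np.ones_like(W, dtype=bool)
    # k-th smallest absolute value acts as the pruning threshold
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    mask = np.abs(W) > threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 50))
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
```

An iterative scheme would alternate this step with retraining of the surviving weights.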
Quantization Using lower precision representations to store parameters has been used for model compression. (Hubara et al., 2016) showed that a simple uniform quantization scheme effectively reduces both the model size and the prediction time of a deep neural net. (Lin et al., 2016) showed that non-uniform quantization can further improve the performance. Recently, several advanced quantization techniques have been proposed for CNN compression (Xu et al., 2018;Choi et al., 2018).
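For illustration, a simple uniform quantizer can be sketched as below. It is a generic min-max scheme written for this example, not the exact scheme of the cited works; the function names and the 8-bit setting are assumptions.

```python
import numpy as np

def quantize_uniform(W, bits=8):
    """Map W onto 2**bits evenly spaced levels between its min and max (bits <= 16)."""
    lo, hi = float(W.min()), float(W.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    dtype = np.uint8 if bits <= 8 else np.uint16
    codes = np.round((W - lo) / scale).astype(dtype)   # integer code per weight
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate float weights from integer codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 32))
codes, lo, scale = quantize_uniform(W, bits=8)
W_hat = dequantize(codes, lo, scale)
max_err = float(np.max(np.abs(W - W_hat)))   # bounded by about scale / 2
```

With 8-bit codes in place of 32-bit floats, the matrix itself shrinks by roughly a factor of four, at the cost of a rounding error of at most about half a quantization step per weight.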

Model Compression for Neural Language Models
Although model compression has been studied extensively for CNN models, less work has focused on compressing deep neural networks for NLP applications. In fact, directly applying popular approaches to NLP problems does not give satisfactory results. (Hubara et al., 2016) showed that naive quantization achieves less than a 3X compression rate on PTB data with no performance loss. (Lobacheva et al., 2017) showed that for word-level LSTM models, pruning achieves only 4 times compression with more than 5% performance loss. Most recently, (Chen et al., 2018a) proposed a decomposition scheme based on a block-wise low-rank approximation of the embedding matrix. Although this method achieves competitive empirical results, the learned model does not have a strong semantic interpretation.
Additive Composition One of the most popular assumptions about the composition of semantics is additive composition, which states that the meaning of a unit (e.g., a word, phrase, or sentence) can be obtained by summing the meanings of its constituents. At the word level, a word might be decomposed into a set of subword units, for example "disagree" = "dis" + "agree". Alternatively, a word can be represented by a set of relevant words, e.g., "king" = "man" + "crown". (Gittens et al., 2017) validated this assumption for a popular word embedding approach (i.e., Skip-Gram). Based on this assumption, (Chen et al., 2016) created a codebook by splitting the vocabulary into two disjoint sets based on word frequency: the most frequent words and the rest. Less frequent words are represented using a sparse linear combination of the vectors of more frequent words. Instead of using an explicit set of words, (Shu and Nakayama, 2017) designed the codebooks in a more data-driven fashion, where the selection of pseudo words and the code vectors are learned automatically using the Gumbel-Softmax trick (Jang et al., 2016). Following the same approach, (Chen et al., 2018b) proposed an additional training objective to integrate the learning of discrete codes with the training of the language model. Our method follows this research line but is based on multiplicative composition rather than addition. The effectiveness of multiplicative composition has been verified in some language modeling tasks (Lapata, 2009, 2008).
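The additive codebook scheme described above can be sketched in a few lines. The sizes, the random codebooks, and the code assignment below are made up for illustration; in the cited methods both the codebooks and the codes are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, d = 4, 8, 16                       # codebooks, codewords per codebook, vector dim
codebooks = rng.standard_normal((M, K, d))
code = np.array([3, 0, 7, 2])            # hypothetical code: one codeword index per codebook

def compose_additive(codebooks, code):
    """Word vector = sum over codebooks of the selected codeword."""
    num_books = codebooks.shape[0]
    return codebooks[np.arange(num_books), code].sum(axis=0)

word_vec = compose_additive(codebooks, code)
```

Each word then needs only `M` small integers instead of `d` floats, plus a share of the codebooks that are common to the whole vocabulary.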

Multi-way Multiplicative Codes
Compositional models start by defining a set of $d_m$-dimensional vectors, which serve as basic codes to compose the targeted embedding matrix. These vectors can be further separated into $M$ groups, each containing $K$ vectors. We call each group a codebook, and each $d_m$-dimensional vector in a codebook a codeword.
A 1-way codebook is then defined as a tensor in $\mathbb{R}^{1 \times M \times K \times d_m}$. Correspondingly, we define $N$-way codebooks as $U \in \mathbb{R}^{N \times M \times K \times d_m}$, where each of the $N$ ways consists of a set of $M$ codebooks, each codebook contains $K$ codewords, and each codeword is associated with a $d_m$-dimensional vector representing its semantics.
Since $U$ is a high-order tensor that is not memory-friendly, we model the composition of $U$ as applying a customized multiplicative operator to two tensors $C \in \mathbb{R}^{M \times K \times d_m}$ and $S \in \mathbb{R}^{N \times M \times d_m}$:
$$U = C \otimes S,$$
where $\otimes$ is a multiplicative operator defined by
$$U_{n,m,k} = C_{m,k} \circ \sigma(S_{n,m}),$$
$\sigma(\cdot)$ stands for the hyperbolic tangent function, and $\circ$ is the Hadamard (entry-wise) product. Tensor $C$ is referred to as the base codebook; it consists of $M$ codebooks, each containing $K$ codeword vectors of $d_m$ dimensions. $S$ is called the rescaling codebook. Each codeword in the base codebook is injected with new meaning by a code vector in the rescaling codebook.
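A minimal NumPy sketch of this composition is given below, assuming the operator rescales each base codeword entry-wise by the hyperbolic tangent of the matching rescaling vector, as the definitions of $\sigma$ and the Hadamard product above suggest. The tensor sizes and the function name `compose_codebooks` are illustrative.

```python
import numpy as np

def compose_codebooks(C, S):
    """Build the N-way codebook U from base codebook C and rescaling codebook S.

    C: (M, K, d_m), S: (N, M, d_m)  ->  U: (N, M, K, d_m)
    with U[n, m, k] = C[m, k] * tanh(S[n, m]) (entry-wise product).
    """
    return C[None, :, :, :] * np.tanh(S)[:, :, None, :]

rng = np.random.default_rng(0)
N, M, K, d_m = 3, 4, 5, 6                 # toy sizes
C = rng.standard_normal((M, K, d_m))
S = rng.standard_normal((N, M, d_m))
U = compose_codebooks(C, S)
```

Only `C` and `S` need to be stored; `U` can be materialized on the fly, which is where the saving over storing $N \times M \times K$ full vectors comes from.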
Although the rescaling codebook uses much less memory ($O(N M d_m)$) than simply creating an $N$ times larger set of vectors ($O(N M K d_m)$), it still introduces additional memory cost. We propose to reduce this cost further by allowing rescaling code vectors to be shared. We replace $S$ with $\tilde{S} \in \mathbb{R}^{N \times d_m}$; that is, for each of the $N$ ways, we use only one $d_m$-dimensional vector to rescale the base codebook $C$. The number of parameters in $S$ is thus reduced, and the computation of $U$ becomes
$$U_{n,m,k} = C_{m,k} \circ \sigma(\tilde{S}_n).$$
Now, with the $N$-way codebook defined by $C$ and $\tilde{S}$, given an embedding matrix $E$ with a vocabulary of $V$ words (i.e., $E \in \mathbb{R}^{V \times d_m}$), we can represent $E$ by finding an encoding $Q \in \{1, \dots, K\}^{V \times N \times M}$ that composes $E$ from $C$ and $\tilde{S}$. $Q_i$ contains the encoding of the corresponding word $V_i$. For each way $n$ and each codebook $m$, $Q_{i,n,m}$ indicates which codeword to use. Let the word vector for the $i$th word be $e_i$. We can construct $e_i$ by
$$e_i \approx \sum_{n=1}^{N} \sum_{m=1}^{M} U_{n,m,Q_{i,n,m}} = \sum_{n=1}^{N} \sum_{m=1}^{M} C_{m,Q_{i,n,m}} \circ \sigma(\tilde{S}_n).$$
The $N$-way discrete code can be learned in an end-to-end manner using the Gumbel-Softmax trick (Jang et al., 2016). We first compute an encoding vector for the original word vector $e_i$ by feeding it to a neural network
$$a_i = \phi\big(W' \sigma(W e_i + b) + b'\big),$$
where $W$, $W'$, $b$, and $b'$ are the parameters of the network, and $\phi$ is the softplus function. $a_i$ is a real-valued tensor of shape $N \times M \times K$, which is then fed to the Gumbel-Softmax to generate a continuous approximation of drawing discrete samples with respect to the last dimension:
$$\hat{D}_{i,n,m,k} = \frac{\exp\big((\log a_{i,n,m,k} + g_k)/\tau\big)}{\sum_{k'=1}^{K} \exp\big((\log a_{i,n,m,k'} + g_{k'})/\tau\big)},$$
where $g$ is a random noise vector sampled from the Gumbel distribution, and $\tau$ is the Softmax temperature controlling how close the sample vector is to a uniform distribution. $\hat{D}_i$ is an approximately one-hot selection among the $K$ codewords for each way $n$ and codebook $m$. We can then compute the approximation of each dimension of $e_i$ as
$$\hat{e}_{i,d} = U_{:,:,:,d} \cdot \hat{D}_i = \sum_{n,m,k} U_{n,m,k,d}\, \hat{D}_{i,n,m,k}.$$
Note that during training, $\hat{D}_i$ is generated as a continuous approximation of the $N$-way discrete code.
At the testing phase, the fixed encoding $Q_i$ of vector $e_i$ is computed directly as $Q_{i,n,m} = \arg\max_k a_{i,n,m,k}$. More details on the Gumbel-Softmax trick can be found in (Jang et al., 2016).
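The training-time relaxation and the test-time lookup can be contrasted in a short NumPy sketch. It assumes one shared rescaling vector per way, replaces the encoder output with random positive scores `a`, and uses illustrative sizes; all function names are hypothetical, and no gradient computation is shown.

```python
import numpy as np

def compose_shared(C, S_tilde):
    """C: (M, K, d), S_tilde: (N, d) -> U: (N, M, K, d), one rescaling vector per way."""
    return C[None, :, :, :] * np.tanh(S_tilde)[:, None, None, :]

def gumbel_softmax(a, tau, rng):
    """Relaxed one-hot samples over the last axis of positive scores a: (N, M, K)."""
    g = rng.gumbel(size=a.shape)                 # Gumbel(0, 1) noise
    z = (np.log(a) + g) / tau
    z = z - z.max(axis=-1, keepdims=True)        # for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def reconstruct_soft(U, D):
    """Training-time reconstruction: weighted sum over ways, codebooks, codewords."""
    return np.einsum('nmkd,nmk->d', U, D)

def reconstruct_hard(U, Q):
    """Test-time reconstruction from discrete codes Q: (N, M) with entries in [0, K)."""
    N, M = Q.shape
    picked = U[np.arange(N)[:, None], np.arange(M)[None, :], Q]   # (N, M, d)
    return picked.sum(axis=(0, 1))

rng = np.random.default_rng(0)
N, M, K, d = 2, 3, 4, 5
U = compose_shared(rng.standard_normal((M, K, d)), rng.standard_normal((N, d)))
a = rng.uniform(0.1, 2.0, size=(N, M, K))        # stand-in for the softplus encoder output
D = gumbel_softmax(a, tau=0.5, rng=rng)          # soft codes used during training
Q = a.argmax(axis=-1)                            # fixed discrete codes used at test time
e_soft, e_hard = reconstruct_soft(U, D), reconstruct_hard(U, Q)
```

When `D` is exactly one-hot, the two reconstructions coincide, which is why the soft version can serve as a differentiable surrogate during training.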

Group Adaptive Coding
Given that the rescaling codebook $\tilde{S}$ is small, memory consumption mainly consists of two parts: the code vectors $C$ and the discrete codes $Q$. On the one hand, the code vectors normally account for most of the memory usage when the vocabulary is relatively small. On the other hand, the size of the discrete codes grows linearly with the vocabulary size and with the logarithm of $K$. To further reduce memory usage, we propose to use codes of adaptive length and dimension to deal with the linear dependency on vocabulary size. The intuition is to encode frequent words with a longer code to achieve a relatively lower reconstruction loss, while representing rare words with fewer codes. To achieve this, we first sort the words by frequency and then split them into a fixed number of groups $G$. The $i$th group can access only a restricted subset of the codebooks. At the same time, we store a low-rank version of the code vectors for the codebooks that are mainly used to represent rare words: each such codeword is stored as $c_i W_{G_i}$, where $c_i \in \mathbb{R}^{d_{G_i}}$ and $W_{G_i} \in \mathbb{R}^{d_{G_i} \times d_m}$ is the linear transformation matrix shared by all the codebooks in the $i$th group. We resolve $d_{G_i}$ in an intuitive way, as shown in Algorithm 1.
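The memory accounting that motivates adaptive coding can be sketched as follows for the non-adaptive case, where every word uses all $N \times M$ codes. The sketch assumes 32-bit floats for the code vectors and $\lceil \log_2 K \rceil$ bits per discrete code; the function names are hypothetical.

```python
import math

def mulcode_bits(V, d_m, N, M, K, float_bits=32):
    """Bits needed for the base codebook, shared rescaling vectors, and discrete codes."""
    base = M * K * d_m * float_bits              # code vectors C
    rescale = N * d_m * float_bits               # one rescaling vector per way
    codes = V * N * M * math.ceil(math.log2(K))  # discrete codes Q, grows linearly with V
    return base + rescale + codes

def compression_rate(V, d_m, N, M, K, float_bits=32):
    """Ratio of the dense V x d_m matrix size to the compressed size."""
    return (V * d_m * float_bits) / mulcode_bits(V, d_m, N, M, K, float_bits)

small_vocab_bits = mulcode_bits(V=10, d_m=4, N=2, M=2, K=4)
```

For example, `compression_rate(793471, 1024, 8, 32, 32)` evaluates the rate for a vocabulary the size of OBW, where the embedding dimension of 1024 is an assumed value; in that regime the discrete codes, not the code vectors, dominate the total.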
In practice, high-frequency words tend to get access to codebooks of higher rank, while less frequent words can only access codebooks of lower dimension. Thanks to the long tail of rare words, this can save considerable memory. Normally, the targeted compression rate results in $d_{G_i} \ll \gamma_{G_i} \times K$, which guarantees a reduction in the number of parameters in the base codebook.
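To see when low-rank storage of codewords actually saves parameters, the counts can be compared directly. The sketch assumes a group stores `num_codewords` codewords of reduced dimension `d_g` plus one shared `d_g x d_m` projection; the names and the example numbers are hypothetical.

```python
def full_params(num_codewords, d_m):
    """Parameters when every codeword is stored at full dimension d_m."""
    return num_codewords * d_m

def low_rank_params(num_codewords, d_g, d_m):
    """Parameters when codewords have dimension d_g and share one d_g x d_m projection."""
    return num_codewords * d_g + d_g * d_m

def break_even_dim(num_codewords, d_m):
    """Low-rank storage is smaller exactly when d_g is below this value."""
    return num_codewords * d_m / (num_codewords + d_m)

saving = full_params(128, 512) - low_rank_params(128, 64, 512)
```

The shared projection is a fixed overhead, so the saving grows with the number of codewords that share it.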

Prior-weighted Reconstruction Loss
According to Zipf's law, word frequencies follow a power-law distribution. This means that a word in the long tail rarely appears in a sentence, which motivates modifying the learning objective so that the compressor focuses more on high-frequency words. In this paper, we use the distribution of the words in the training set as prior knowledge to guide the learning of the compressor. In addition, similar to (Lin et al., 2017), we also want each of the $N$-way encodings to focus on a different aspect of the original word embedding. The proposed training objective therefore weights the reconstruction error between $e_i$ and $\hat{e}_i$ by $\tilde{p}_i$ and adds a term on $\hat{v}_i$ that encourages the $N$ ways to differ, where $\tilde{p}_i$ represents the empirical distribution of word $V_i$ in the training set, $e_i$ is the original word vector, $\hat{e}_i$ is the reconstructed vector, and $\hat{v}_i \in \mathbb{R}^{N \times d_m}$ collects the reconstructed vectors from all $N$-way codebooks in matrix form (i.e., $\hat{v}_{i,n} = \sum_{m=1}^{M} C_{m,Q_{i,n,m}} \circ \sigma(\tilde{S}_n)$ is the reconstructed vector from the $n$th-way codebook). The new objective thus focuses on high-frequency words while allowing the $N$-way coding to encode different information.

Algorithm 1: Algorithm to resolve the dimension of adaptive codebooks towards achieving a targeted compression rate.
  Input: $d'$: the minimum dimension; $\delta$: targeted compression rate; $f_{G_i}$: frequency of group $G_i$; $o_{G_i}$: bits consumed by the discrete code matrix $Q$ for each word in $G_i$
  Sort set $G$ according to frequency
  $n \leftarrow \delta \times V \times d_m \times 32$ (bits)
  for $i \leftarrow 1$ to $|G|$ do
    compute a ratio based on frequency
    update $n$ by subtracting used bits
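One plausible form of such an objective is sketched below. The squared-error reconstruction term weighted by the empirical word distribution follows the description above; the diversity term is an assumption borrowed from the style of penalty used by Lin et al. (2017), namely the squared Frobenius norm of $\hat{v}_i \hat{v}_i^\top - I$, and the weight `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def prior_weighted_loss(E, E_hat, p, V_hat=None, lam=0.0):
    """Frequency-weighted reconstruction loss with an optional diversity penalty.

    E, E_hat: (V, d) original and reconstructed embeddings
    p:        (V,)   empirical word distribution (sums to 1)
    V_hat:    (V, N, d) per-way reconstructions (optional)
    lam:      weight of the assumed penalty ||v v^T - I||_F^2
    """
    recon = np.sum(p * np.sum((E - E_hat) ** 2, axis=1))
    if V_hat is None or lam == 0.0:
        return recon
    N = V_hat.shape[1]
    gram = np.einsum('vnd,vmd->vnm', V_hat, V_hat)            # (V, N, N) Gram matrices
    penalty = np.sum((gram - np.eye(N)) ** 2, axis=(1, 2))    # squared Frobenius norm
    return recon + lam * np.mean(penalty)

E = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([0.9, 0.1])                                      # word 0 is far more frequent
loss_frequent_wrong = prior_weighted_loss(E, np.array([[0.0, 0.0], [0.0, 1.0]]), p)
loss_rare_wrong = prior_weighted_loss(E, np.array([[1.0, 0.0], [0.0, 0.0]]), p)
```

The same reconstruction error costs nine times more on the frequent word than on the rare one, which is the behaviour the prior weighting is meant to produce.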

Data Sets and Models
Following the experimental protocol of (Chen et al., 2018a), we evaluate our proposed method on two important NLP tasks: language modeling and machine translation. Table 1 summarizes the key characteristics of the four data sets. One of the language models (Jozefowicz et al., 2016) is trained on the OBW data set, whose vocabulary size is 793,471. For LSTM-based language models, the input embedding matrix and the Softmax embedding matrix account for most of the memory usage (up to 91.2%). Therefore, we target compressing both the input and Softmax embedding matrices.

Implementation Details
We compress both the input embedding and Softmax matrices. We train MulCode using the Adam optimizer with a learning rate of 0.001. We split the vocabulary into 3 groups for the PTB data set and 8 groups for OBW. To resolve the dimension of the codebooks for adaptive coding, we use a targeted compression rate of δ = 0.2 for PTB-small and δ = 0.05 for the other three models. After approximation, we retrain the remaining parameters with the SGD optimizer and an initial learning rate of 0.01. Whenever the validation perplexity stops decreasing, we reduce the learning rate by an order of magnitude. We do not include results of fine-tuning on OBW because the retraining process takes too long (a few days), which is not compatible with our motivation of compressing given pre-trained embeddings.
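The learning-rate rule used during retraining can be written as a small helper. This sketch assumes the comparison is made against the previous validation result (rather than the best so far), which the text does not specify; the function name is hypothetical.

```python
def decay_on_plateau(lr, val_ppl_history, factor=0.1):
    """Return the learning rate to use next.

    The rate is multiplied by `factor` (one order of magnitude by default)
    when the latest validation perplexity did not improve on the previous one.
    """
    if len(val_ppl_history) >= 2 and val_ppl_history[-1] >= val_ppl_history[-2]:
        return lr * factor
    return lr

lr = 0.01                                   # initial SGD learning rate from the text
lr_after_improvement = decay_on_plateau(lr, [100.0, 90.0])
lr_after_plateau = decay_on_plateau(lr, [100.0, 90.0, 90.5])
```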
The compression rate and the corresponding performance form a trade-off curve: the more we compress, the larger the performance drop. In this paper, for BLEU, we report results of compressed models whose BLEU score is within 3 percent of the original score. For the PTB data set, we target a 3 percent drop in perplexity (PPL) after retraining. For the OBW data set, since it has a larger vocabulary, we report results within 10 percent of the PPL achieved by the uncompressed model. For each method, we tested various parameter settings and report the smallest model size that fulfills the above criteria.
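The reporting criterion for perplexity can be expressed as a simple selection rule. The candidate list below is invented purely to exercise the function and does not correspond to any reported result; the function name is hypothetical.

```python
def smallest_within_tolerance(candidates, baseline_ppl, tolerance):
    """Pick the smallest model whose perplexity is within `tolerance` of the baseline.

    candidates: list of (model_size, ppl) pairs; lower ppl is better.
    Returns the chosen pair, or None if no candidate qualifies.
    """
    limit = baseline_ppl * (1.0 + tolerance)
    ok = [c for c in candidates if c[1] <= limit]
    return min(ok, key=lambda c: c[0]) if ok else None

# Made-up (size, PPL) pairs, for illustration only.
toy_candidates = [(10.0, 101.0), (5.0, 102.5), (2.0, 110.0)]
chosen = smallest_within_tolerance(toy_candidates, baseline_ppl=100.0, tolerance=0.03)
```

For BLEU the comparison would be reversed, since higher scores are better.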
Note that some previous methods compress the model directly during the training phase (Khrulkov et al., 2019; Wen et al., 2017). In contrast, our problem setup follows (Chen et al., 2018a; Shu and Nakayama, 2017; Chen et al., 2018b): given a pre-trained model, we want to compress it with limited fine-tuning.

Comparison with Baseline Models
We refer to our proposed method as MulCode (Mul stands for both multi-way and multiplicative composition). We mainly compare with two state-of-the-art baseline compressors that target the embedding layer.
1. GroupReduce We refer to the results reported in (Chen et al., 2018a).
2. DeepCode The additive composition model of (Shu and Nakayama, 2017). We use the PyTorch code released by (Shu and Nakayama, 2017) to produce the results.
Table 2 summarizes the comparison between the proposed method and the state-of-the-art baselines for the four benchmark data sets and LSTM models. MulCode compresses the input embedding layer and the Softmax embedding layer by 6 to 18 times without a significant loss in performance.
In comparison, all the baseline models achieve a much lower compression rate on PTB-small, which has only 200 dimensions. This is reasonable, since the embedding layers of PTB-small contain less redundant information and are thus hard to compress. Compared with DeepCode, our method achieves a much higher compression rate for all four models. Our method also consistently and significantly outperforms GroupReduce.

Comparison with Quantization
Quantization has proven to be a strong baseline. In fact, the discrete coding of MulCode can be considered equivalent to a trainable quantization. At the same time, quantization and MulCode are not mutually exclusive: MulCode can be combined with quantization to achieve better performance. Specifically, the $M \times K$ base codebooks as well as the rescaling codebook can be quantized to further reduce memory usage. We summarize the results in Table 3. Quantized MulCode achieves more than 30.8 times compression for both the input embedding and Softmax matrices on OBW. In addition, on the machine translation task, it achieves 41.38X compression with a BLEU score drop of only around 1% after retraining. In particular, we observe that the effect of retraining is more prominent for MulCode and simple quantization than for GroupReduce. This implies that the local precision lost in the low-rank basis of GroupReduce is more difficult to recover. In contrast, thanks to its compositional property, the collective information in MulCode is more robust to imprecise local vectors.

Ablation Analysis
We summarize the ablation analysis in Table 4. The major technical features of MulCode are the rescaling codebooks (RC), the prior-weighted reconstruction loss (PRL), and group adaptive coding (GAC). We also consider using full rescaling codebooks (Full) in place of the one shared across the $M$ codebooks. We remove or add these features one at a time. First, the rescaling codebook is shown to contribute substantially to performance. Since removing the rescaling codebook is equivalent to falling back to pure additive composition, this also suggests that multiplicative composition is a better alternative to additive composition. Moreover, removing the rescaling codebook is equivalent to repeatedly selecting from the base codebook. This may force the codeword vectors to encode the original meaning of words as a mix of different aspects of the semantics, which may be hard to recover. In fact, we find it difficult to train a model that achieves a low perplexity loss (33) on OBW with the rescaling codebooks removed.
Removing the prior-weighted loss from the model results in only a marginal loss in performance, except for PTB-small. This indicates that, given the limited capacity of the codebooks, the performance of the compressed model benefits more from a lower reconstruction loss on the words with the highest frequencies. For example, MulCode achieves 3.5% less perplexity loss than the model without the prior-weighted reconstruction loss. Group adaptive coding is another key factor in reducing memory usage, and it works well with all four models. It reduces memory in two ways: by reducing the parameters in the code vectors and by reducing the number of discrete codes used to represent a word. For models with a very large vocabulary, it seems to achieve a higher memory reduction than for smaller models. We conjecture that this is largely due to the prominent presence of rare words in data sets like OBW. The significance of PRL and GAC confirms the importance of frequency in compressing natural words. Lastly, using the full tensor for the rescaling codebook does not seem to be necessary: it does not improve the performance of the compressed model, yet it causes a nontrivial drop in compression rate.
Figure 2 reports the perplexity of the compressed model with varying settings of $M$, $N$, and $K$. The starting point of this experiment is $M = 32$, $K = 32$, $N = 8$. We then adjust the three parameters one at a time while fixing the other two. In order to plot the results in a single figure, we set the x-axis to the memory usage instead of the values of the three parameters. As shown in Figure 2, adjusting $M$, $N$, and $K$ has similar effects on the PPL when the compression rate is low (≤7X). With a larger compression rate, the performance becomes much more sensitive to changes in $K$. This suggests that it is safer to keep $K$ high while tuning the other two parameters in order to secure reasonable performance of the compressed model.

Understanding Composed Codes
At the core of our approach is the multiplicative composition of two sets of codebooks. Hence, it is interesting to investigate what has been encoded in the $N$-way codebooks. One assumption is that each code in the base codebooks encodes a mix of information that can be disentangled by the rescaling codebook. We encode all the words of the OBW corpus with 32 × 8 × 4 codes. We compute the Hamming distance to example query words (shown in Table 5) for each of the $N$-way codes. We select the words with the smallest Hamming distance from the 10,000 most frequent words, which are likely to have low reconstruction errors.
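The per-way nearest-neighbour lookup described above can be sketched in plain Python. The tiny code table is invented for illustration and does not reproduce any learned codes; the function names are hypothetical.

```python
def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))

def nearest_by_way(codes, query, way, top=3):
    """Words whose code in the given way is closest (in Hamming distance) to the query's.

    codes: dict mapping word -> list of N codes (one per way), each a list of M integers.
    """
    q = codes[query][way]
    scored = [(hamming(q, c[way]), w) for w, c in codes.items() if w != query]
    return [w for _, w in sorted(scored)[:top]]

# Invented 2-way codes with M = 4 entries each, for illustration only.
toy_codes = {
    "bank":   [[1, 2, 3, 0], [0, 1, 1, 2]],
    "banks":  [[1, 2, 3, 1], [3, 3, 0, 2]],
    "soccer": [[0, 0, 1, 2], [0, 1, 1, 3]],
}
closest_way0 = nearest_by_way(toy_codes, "bank", way=0, top=1)
closest_way1 = nearest_by_way(toy_codes, "bank", way=1, top=1)
```

Because the distance is computed separately for each way, the same query word can have different neighbours in different ways, which is what makes the per-way inspection informative.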
Since the $N$-way codings are generated by selecting from the base codebooks and modified by the rescaling codebooks, each channel can be seen as a meaningful subspace. The results show that each of the $N$-way codings may have encoded a different subspace of the original meaning of words, including tenses (e.g., halt vs. halted), plurals (e.g., bank vs. banks), synonyms (e.g., soccer vs. football), co-occurrence (e.g., like vs. just), and topical relatedness (e.g., soccer vs. hockey). This supports the claim that the multiplicative composition used in our approach is able to introduce new information into the base codebook.

Conclusion
In this paper, we propose a novel compression method for neural language models. Our method applies multiplicative factors to end-to-end learned codebooks. It also takes the frequency information in the corpus into account by weighting the loss according to word importance. At the same time, the coding scheme is made adaptive based on the frequency information. Experimental results show that our method outperforms state-of-the-art compression methods by a large margin. In particular, on the IWSLT-14 data set, our method combined with quantization achieves a 41.38 times compression rate for both the embedding and Softmax matrices. This will facilitate the deployment of large neural language models on memory-constrained devices.