Improving Word Embedding Factorization for Compression using Distilled Nonlinear Neural Decomposition

Word embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a lot of memory, which poses a challenge for edge deployment. Embedding matrices typically contain most of the parameters of language models and about a third of the parameters of machine translation systems. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition and knowledge distillation. First, we initialize the weights of our decomposed matrices by learning to reconstruct the full pre-trained word embedding, and then fine-tune end-to-end, employing knowledge distillation on the factorized embedding. We conduct extensive experiments with various compression rates on machine translation and language modeling, using different datasets with a shared word-embedding matrix for both the embedding and vocabulary projection matrices. We show that the proposed technique is simple to replicate, with one fixed parameter controlling the compression size, and achieves higher BLEU scores on translation and lower perplexity on language modeling compared to complex, difficult-to-tune state-of-the-art methods.


Introduction
Deep learning models are the state of the art in NLP, Computer Vision, Speech Recognition and many other fields of Computer Science and Engineering. The remarkable deep learning revolution has been built on top of massive amounts of data (both labeled and unlabeled) and faster computation. In NLP, large pre-trained language models like BERT (Devlin et al., 2019) are state-of-the-art on a large number of downstream NLP problems.* The largest publicly available language models are trained with hundreds of billions of parameters (Brown et al., 2020). In machine translation, the state-of-the-art models have parameters in the order of billions. Data privacy and server cost are some major issues driving research towards deploying these models on edge devices. However, running these models on edge devices faces memory and latency issues due to limitations of the hardware. Thus, there has been considerable interest in reducing the memory footprint and improving the inference speed of these models (Sainath et al., 2013; Acharya et al., 2019; Shi and Yu, 2018; Jegou et al., 2010; Chen et al., 2018; Winata et al., 2019).

* Work done during an internship at Huawei Noah's Ark Lab.
The architecture of deep-learning-based language generation models can be broken down into three components. The first component is the embedding, which maps words in the vocabulary to continuous dense vector representations. In language modeling we typically have one dictionary, but machine translation has at least two dictionaries corresponding to a translation pair; we model these as a single dictionary with a common embedding matrix. The second component is a function f, typically a deep neural network (Schmidhuber, 2015; Krizhevsky et al., 2012; Mikolov et al., 2010), which maps the embedding representation for different NLP problems (machine translation, summarization, question answering and others) to the output space of f. The third component is the output layer, which maps the output of f to the vocabulary space, followed by a softmax function. Since the first and third components depend upon a large vocabulary size, they require a large number of parameters, which results in higher latency and larger memory requirements. For instance, the Transformer Base model (Vaswani et al., 2017) uses 37% of its parameters in the first and third components with a vocabulary size of 50k and parameter tying between the components. The percentage increases to 54% when parameters are not shared between the first and third components. Thus, an obvious step towards model compression is to reduce the parameters used by the embedding matrices.
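The 37%/54% figures above can be sanity-checked with a back-of-the-envelope parameter count. The sketch below is ours, not the paper's: it approximates Transformer Base by counting only the attention and feed-forward weight matrices (no biases or layer norms), and `transformer_base_params` is a hypothetical helper name.

```python
# Rough check that embedding matrices dominate Transformer Base parameters.
# Layer sizes follow Vaswani et al. (2017); this ignores biases and layer norms.

def transformer_base_params(vocab=50_000, d_model=512, d_ff=2048, layers=6, tied=True):
    # Self-attention: 4 projection matrices of size d_model x d_model.
    attn = 4 * d_model * d_model
    # Feed-forward: two matrices, d_model x d_ff and d_ff x d_model.
    ffn = 2 * d_model * d_ff
    # Encoder layers have 1 attention block; decoder layers have 2 (self + cross).
    body = layers * (attn + ffn) + layers * (2 * attn + ffn)
    emb = vocab * d_model if tied else 2 * vocab * d_model
    return emb, body

emb, body = transformer_base_params(tied=True)
print(f"tied:   embedding share = {emb / (emb + body):.0%}")   # roughly 37%
emb, body = transformer_base_params(tied=False)
print(f"untied: embedding share = {emb / (emb + body):.0%}")   # roughly 54%
```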
Recently, there has been considerable work on compressing word-embedding matrices (Sainath et al., 2013;Acharya et al., 2019;Shi and Yu, 2018;Jegou et al., 2010;Chen et al., 2018;Winata et al., 2019). These techniques have proven to perform atpar with the uncompressed models, but still suffer from a number of issues.
First, state-of-the-art embedding compression methods such as GroupReduce, Structured Embedding and Tensor Train decomposition (Shi and Yu, 2018; Chen et al., 2018; Khrulkov et al., 2019; Shu and Nakayama, 2018) require multiple hyper-parameters to be fine-tuned to optimize performance on each dataset. These hyper-parameters influence the number of parameters in the model, and thus the compression rate. This adds a layer of complexity when optimizing the model for different NLP problems. Additionally, Chen et al. (2018) requires an additional optimization step for grouping words, and lacks end-to-end training through back-propagation. Shi and Yu (2018) also requires an additional k-means clustering step for generating the quantization matrix. Thus, most of the current state-of-the-art systems are considerably more complicated to fine-tune for different NLP problems and datasets.
Second, all the state-of-the-art embedding compression models compress the input and output embedding separately. In practice, state-of-the-art NLP models (Vaswani et al., 2017;Lioutas and Guo, 2020) have shown better performance with parameter sharing between the two (Press and Wolf, 2017). Thus, there is a need for an exhaustive analysis of various embedding compression techniques, with parameter sharing.
Lastly, embedding compression models not based on linear SVD (Khrulkov et al., 2019; Shi and Yu, 2018) require the reconstruction of the entire embedding matrix, or additional computations, when used at the output layer. Thus, at runtime, the model either uses the same amount of memory as the uncompressed model or pays a higher computation cost. This makes linear SVD-based techniques more desirable for running models on edge devices.
In this paper, we introduce Distilled Embedding, a matrix factorization method based on Singular Value Decomposition (SVD) with two key changes: a) a neural-network decomposition instead of an eigenvalue decomposition, and b) a distillation loss on the word embedding while fine-tuning. Our method first compresses the vocabulary space to the desired size, then applies a non-linear activation function before recovering the original embedding dimension. Additionally, we introduce an embedding distillation method, which is similar to knowledge distillation (Hinton et al., 2015), but we apply it to distill knowledge from a pre-trained embedding matrix and use an L2 loss instead of a cross-entropy loss. To summarize, our contributions are as follows:
• We demonstrate that SVD, when fine-tuned till convergence, is comparable to recently proposed, difficult-to-tune methods.
• We demonstrate that at the same compression rate Distilled Embedding outperforms existing state-of-the-art methods on machine translation and SVD on language modeling.
• Our proposed method is much simpler than the current state-of-the-art methods, with only a single parameter controlling the compression rate.
• Unlike the current state-of-the-art systems, we compress the embedding matrix with parameter sharing between input and output embeddings. We perform an exhaustive comparison of different models in this setting.
• Our method is faster at inference speed than competing matrix factorization methods and only slightly slower than SVD.

Related Work
We can model the problem of compressing the embedding matrix as a matrix factorization problem. There is a considerable amount of work in this field; some of the popular methods include Singular Value Decomposition (SVD) (Srebro and Jaakkola, 2003; Mnih and Salakhutdinov, 2008), product quantization (Jegou et al., 2010) and tensor decomposition (De Lathauwer et al., 2000). A number of prior works in embedding compression are influenced by these methods and have been applied to various NLP problems. In this section, we discuss some of the significant works across different NLP problems.
Low-rank Factorization Low-rank approximation of weight matrices using SVD is a natural way to compress deep-learning-based NLP models. GroupReduce (Chen et al., 2018) applies weighted low-rank approximation to the embedding matrix of an LSTM. They first create a many-to-one mapping of all the words in the vocabulary into g groups based upon word frequency. For each group they apply weighted SVD to obtain a lower-rank estimation; the rank is determined by setting a minimum rank and linearly increasing it based upon average frequency. Finally, they update the groups by minimizing the reconstruction error from the weighted SVD approximation. They demonstrate strong results on language modeling and machine translation compared to plain SVD. In their models, they use different embedding matrices for the input and softmax layers and apply different compression ratios to each.
Product Quantization Jegou et al. (2010) introduced product quantization for compressing high-dimensional vectors by uniformly partitioning them into subvectors and quantizing each subvector using the k-means clustering technique. Essentially, product quantization assumes that the subvectors share some underlying properties which can be used to group similar ones together and unify their representation. This approach breaks the original matrix into a set of codebooks, formed from the cluster centers of the different partitions, together with a separate index matrix which holds the cluster index for each subvector. Shi and Yu (2018) applied product quantization to a language model and were able to show better perplexity scores. Shu and Nakayama (2018) extended this technique by first representing product quantization as a matrix factorization problem, and then learning the quantization matrix in an end-to-end trainable neural network. Another line of work implements product quantization through randomly sharing parameters in the embedding matrix, and shows good perplexity results for an LSTM-based language model.

Tensor Decomposition De Lathauwer et al. (2000) introduced multilinear SVD, which is a generalization of SVD to higher-order tensors. Oseledets (2011) introduced Tensor Train (TT), an efficient algorithm for multilinear SVD. Novikov et al. (2015) applied the Tensor Train decomposition to the fully connected layers of deep neural networks. Khrulkov et al. (2019) applied the Tensor Train algorithm to the input embedding layer for different NLP problems such as language modeling, machine translation and sentiment analysis. They demonstrate a high compression rate with little loss of performance. However, they compress only the input embedding and not the softmax layer for language modeling and machine translation.

Knowledge Distillation Knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) has been studied in model compression, where the knowledge of a large, cumbersome model is transferred to a small model for easy deployment. In this paper, we propose a factorization of the word-embedding matrix that uses knowledge distillation to mimic the pre-trained word-embedding representation.
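As a rough illustration of the product quantization idea described above, the following sketch partitions each embedding row into subvectors and runs plain k-means per partition. All sizes and names are illustrative; this is not the implementation of any of the cited works.

```python
# Minimal product quantization sketch in the spirit of Jegou et al. (2010).
import numpy as np

def product_quantize(E, n_parts=4, n_centroids=8, iters=10, seed=0):
    """Split each row of E into n_parts subvectors; k-means quantize each part.

    Returns per-part codebooks and an index matrix, which together replace E.
    """
    rng = np.random.default_rng(seed)
    n, d = E.shape
    sub = d // n_parts
    codebooks, codes = [], []
    for p in range(n_parts):
        X = E[:, p * sub:(p + 1) * sub]
        C = X[rng.choice(n, n_centroids, replace=False)]      # init centroids
        for _ in range(iters):                                # plain k-means
            idx = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for k in range(n_centroids):
                if (idx == k).any():
                    C[k] = X[idx == k].mean(axis=0)
        codebooks.append(C)
        codes.append(idx)
    return codebooks, np.stack(codes, axis=1)

def reconstruct(codebooks, codes):
    # Look up each subvector's centroid and concatenate back to full rows.
    return np.concatenate([cb[codes[:, p]] for p, cb in enumerate(codebooks)], axis=1)

E = np.random.default_rng(1).standard_normal((100, 16))
cb, codes = product_quantize(E)
E_hat = reconstruct(cb, codes)
```

Storage drops from n × d floats to n_parts small codebooks plus an n × n_parts integer index matrix.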

Funneling Decomposition and Embedding Distillation
We present an overview of our proposed method in Figure 1. Given an embedding matrix E ∈ R^(|V|×d), where |V| is the vocabulary size and d is the embedding dimension, we can decompose it into three matrices using the SVD algorithm:

E = U Σ V^T (1)

where Σ is a diagonal matrix containing the singular values, and the matrices U and V contain the left and right singular vectors of the embedding matrix respectively. We can obtain the reduced form of the embedding matrix, Ẽ, by keeping only the r (< d) largest singular values out of d:

Ẽ = U_r Σ_r V_r^T = Ũ V_r^T (2)

where Ũ = U_r Σ_r ∈ R^(|V|×r). The reduced form of the embedding matrix needs r × (|V| + d) parameters compared to |V| × d.
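The truncated-SVD baseline of Equations 1 and 2 can be sketched in a few lines of numpy (illustrative sizes; `full_matrices=False` gives the thin SVD):

```python
# Rank-r SVD factorization of a (random, illustrative) embedding matrix.
import numpy as np

V_size, d, r = 1000, 64, 16          # |V|, embedding dim, kept rank
rng = np.random.default_rng(0)
E = rng.standard_normal((V_size, d))

U, S, Vt = np.linalg.svd(E, full_matrices=False)
U_tilde = U[:, :r] * S[:r]           # fold singular values into U: |V| x r
V_r = Vt[:r].T                       # d x r
E_tilde = U_tilde @ V_r.T            # best rank-r approximation of E

full_params = V_size * d
reduced_params = r * (V_size + d)    # r(|V| + d)
print(reduced_params, "vs", full_params)
```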
Our proposed approach is to apply a non-linear transformation to the matrix Ũ before reconstructing the original embedding dimension using V (see Figure 1a):

Ẽ = f(Ũ_{|V|×r}) (V_{d×r})^T (3)

We use ReLU as the non-linear function f(.) throughout this paper. We postulate that this neural decomposition helps in end-to-end training during the fine-tuning stage, although we can only demonstrate empirical evidence for that. We train a sequence-to-sequence model (Sutskever et al., 2014; Vaswani et al., 2017) with tied input and output embedding, i.e. the output embedding is the transpose of the input embedding matrix: Ẽ_out = Ẽ^T = V_{d×r} [f(Ũ_{|V|×r})]^T. We train our model end-to-end by replacing the embedding function with Equation 3. The matrices Ũ and V are trainable parameters, and for the output layer we use Ẽ^T, with parameter sharing. We train on two losses. The standard cross-entropy loss is defined as:

L_ce = −(1/M) Σ_{i=1}^{M} y_i · log(p_i) (4)

where M is the sequence length, y_i is the one-hot representation of the i-th label and p_i is the softmax probability of the i-th term generated by the decoder.
In addition to the cross-entropy loss, we introduce a novel embedding reconstruction loss, which we refer to as embedding distillation, since we distill information from the pre-trained embedding into our model:

L_recon = (1/|V|) Σ_{i=1}^{|V|} ||e_i − ẽ_i||_2^2 (5)

where e_i and ẽ_i = f(u_i) V^T are the embedding vectors corresponding to the i-th word in the original embedding matrix E and the reconstructed embedding matrix Ẽ respectively, and u_i refers to the i-th row of the matrix Ũ. This helps with generalization: during fine-tuning, the words seen in the training corpus are given higher weight at the expense of low-frequency words, and this loss helps maintain a balance between the two.
We use the following as our final loss function:

L = α L_recon + (1 − α) L_ce (6)

where α ∈ [0, 1] is a hyper-parameter which controls the trade-off between the reconstruction and cross-entropy losses. L_recon acts as the knowledge distillation loss, through which we try to distill information from the original pre-trained embedding layer as a teacher into the funneling-decomposed embedding layer as a student. The training process of our Distilled Embedding method is summarized in Algorithm 1.
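A minimal numpy sketch of the forward pass and combined loss (Equations 3 to 6) may make the method concrete. Biases, batching and the actual optimizer are omitted, and the cross-entropy is computed on dummy targets, so this is an illustration of the loss structure, not the paper's training code:

```python
# Funneling decomposition forward pass plus the combined distillation loss.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

V_size, d, r = 1000, 64, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((V_size, d))          # pre-trained (teacher) embedding
U = rng.standard_normal((V_size, r)) * 0.1    # trainable bottleneck matrix
V = rng.standard_normal((d, r)) * 0.1         # trainable projection matrix

E_tilde = relu(U) @ V.T                       # Eq. 3: reconstructed embedding
logits = E_tilde[:5] @ E_tilde.T              # tied output layer uses E_tilde^T

# Eq. 5: embedding distillation (reconstruction) loss.
L_recon = np.mean(np.sum((E - E_tilde) ** 2, axis=1))

# Eq. 4: cross-entropy on dummy target ids for a length-5 "sequence".
targets = np.array([3, 1, 4, 1, 5])
log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
L_ce = -np.mean(log_p[np.arange(5), targets])

alpha = 0.01                                  # value used in our experiments
L = alpha * L_recon + (1 - alpha) * L_ce      # Eq. 6
```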

Algorithm 1 Distilled Embedding
Step 1) Pre-training the Embedding Matrix Pre-train the sequence to sequence model with the full embedding matrix for better initialization.
Step 2) Initializing the Weights of Funneling Decomposition Layer We extract the trained embedding matrix E from Step 1 and train our decomposed matrices U and V on reconstruction loss defined in Equation 5, as shown in Figure 1a.
Step 3) Embedding Distillation The pre-trained funneling decomposition layer is plugged into the model (replacing the original embedding matrix E) and the entire model is trained end-to-end using the loss in Equation 6.

Datasets and Evaluation
We test our proposed method on machine translation and language modeling, which are fundamental problems in NLP and are challenging for embedding compression, since we typically have both an input and an output embedding.
On machine translation, we present results on three language pairs: WMT English to French (En-Fr), WMT English to German (En-De) and IWSLT Portuguese to English (Pt-En). These pairs are good representatives of high-resource, medium-resource and low-resource language pairs.
WMT En-Fr is based on the WMT'14 training data, which contains 36M sentence pairs. We used SentencePiece (Kudo and Richardson, 2018) to extract a shared vocabulary of 32k subwords. We validate on newstest2013 and test on newstest2014. For WMT English to German (En-De), we use the same setup as Vaswani et al. (2017). The dataset is based on the WMT'16 training data and contains about 4.5M pairs. We use a shared vocabulary of 37k subwords extracted using SentencePiece.
For the IWSLT Portuguese to English (Pt-En) dataset, we replicate the setup of Tan et al. (2019) for training individual models. Specifically, the dataset contains about 167k training pairs. We used a shared vocabulary of 32k subwords extracted with SentencePiece.
For all language pairs, we measure case-sensitive BLEU score (Papineni et al., 2002) using SacreBLEU (Post, 2018). In addition, we save a checkpoint every hour for the WMT En-Fr and WMT En-De language pairs, and every 5 minutes for IWSLT Pt-En due to the smaller size of the dataset. We select the checkpoint with the highest validation BLEU and average the last five checkpoints up to and including it. We use beam search with a beam width of 4 for all language pairs. For language modeling, we use the WikiText-103 dataset (Merity et al., 2017), which contains 103M training tokens from 28K articles, with an average length of 3.6K tokens per article. We replicate the setup of Dai et al. (2019) for training the base and the compressed models.

Experiment Details
Hyper-Parameters For WMT En-Fr and WMT En-De, we use the same configuration as Transformer Base, proposed by Vaswani et al. (2017). Specifically, the model hidden size d_model is set to 512, the feed-forward hidden size d_ff is set to 2048 and the number of layers for both the encoder and the decoder is set to 6. For IWSLT Pt-En, we use the Transformer Small configuration. Specifically, the model hidden size d_model is set to 256, the feed-forward hidden size d_ff is set to 1024 and the number of layers for both the encoder and the decoder is set to 2. For Transformer Small, the dropout configuration is the same as for Transformer Base. All models are optimized using Adam (Kingma and Ba, 2015) with the same learning rate schedule as proposed by Vaswani et al. (2017). We use label smoothing with 0.1 weight for the uniform prior distribution over the vocabulary (Szegedy et al., 2016; Pereyra et al., 2017). Additionally, we set the value of α in Equation 6 to 0.01.
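The learning rate schedule of Vaswani et al. (2017) referenced above is, for completeness (the warmup value of 4000 steps is the common default from that paper, assumed rather than quoted here):

```python
# Transformer learning-rate schedule from Vaswani et al. (2017):
# lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
def transformer_lr(step, d_model=512, warmup=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# lr rises linearly during warmup, peaks at step == warmup, then decays as step^-0.5.
assert transformer_lr(2000) < transformer_lr(4000) > transformer_lr(8000)
```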
For WikiText-103, we use the same configuration as Transformer-XL Standard, proposed by Dai et al. (2019). Specifically, the model hidden size d_model is set to 410, the feed-forward hidden size d_ff is set to 2100 and the number of layers is set to 16.
Hardware Details We train the WMT models on 8 NVIDIA V100 GPUs and the IWSLT models on a single NVIDIA V100 GPU. Each training batch contained a set of sentence pairs with approximately 6000 source tokens and 6000 target tokens per GPU worker. All experiments were run using the TensorFlow framework.

Machine Translation
We present BLEU scores for our method and compare it with SVD, GroupReduce (Chen et al., 2018), Structured Embedding (Shi and Yu, 2018), Tensor Train (Khrulkov et al., 2019) and a smaller transformer network with the same number of parameters. We learn a decomposition for all the methods except Tensor Train, since Khrulkov et al. (2019) pointed out that there is no difference in performance between a random initialization and a learned Tensor Train initialization. Once initialized, we plug in the decomposed embedding and fine-tune till convergence. None of the weights are frozen during fine-tuning. Table 1 presents the results on translation. We see that on the English-French language pair our method, along with Structured Embedding, performs the best. GroupReduce is next, and SVD performs better than Tensor Train, showing that SVD is a strong baseline when fine-tuned till convergence. We also compare against end-to-end compression using a 2-layer neural network (NN) with the same parameterization as Distilled Embedding, but without offline initialization. The results show that initializing the neural decomposition with the embedding weights is important.
On English-German translation, our method outperforms all other methods. The smaller transformer network does well and is only surpassed by GroupReduce amongst the competing methods. SVD again performs better than Tensor Train. The Portuguese-English task presents a setting where the embedding matrix constitutes the majority of the parameters of the neural network. The embedding dimension is smaller (256) compared to the other two tasks, but embedding compression yields a BLEU score increase for all methods except Structured Embedding, due to a regularization effect from the compression. Our model again achieves the highest BLEU score. Across these three experiments, we demonstrate that our funneling decomposition method with embedding distillation consistently yields higher BLEU scores compared to existing methods.

Language Modeling
As a second task, we consider language modeling on the WikiText-103 dataset. We compare our method against SVD at two compression rates. The results are presented in Table 2. We demonstrate that our Distilled Embedding method consistently yields lower perplexity (PPL) compared to SVD.

Ablation Study
We present different experiments on machine translation to demonstrate the effect of 1) Model Initialization, 2) Embedding Distillation, 3) Fine-tuning strategies, 4) Compression capability, 5) Alpha Value Sensitivity and 6) Extension and generality of our method.
Initialization We perform an ablation study on all three language pairs defined in Section 4.1 to determine whether random initialization is better than model-based initialization. We conclude that model-based initialization consistently performs better (Table 3).

Embedding Distillation Table 4 presents different compression rates on the Pt-En task, and embedding distillation performs better across all of them. In Table 3, we see that across all language pairs, when we initialize our model using weights from the funneling decomposition, we improve further by using embedding distillation during fine-tuning. We performed embedding distillation with random initialization only on the smaller Pt-En dataset and observed that embedding distillation improves the BLEU score even with random initialization.
Compression Rate We demonstrate in Table 4 that it is possible to compress the embedding up to 15.86x with only a 2% drop in BLEU score for Pt-En.
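The 15.86x figure is consistent with a simple parameter count, assuming a bottleneck of r = 16 for the Pt-En setting (|V| = 32k, d = 256); the exact rank and the bias accounting below are our inference, not stated in this section:

```python
# Compression ratio of the factorized embedding: |V|d / (parameters kept).
def compression_ratio(vocab, d, r, with_bias=False):
    reduced = r * (vocab + d)
    if with_bias:
        reduced += r + d  # biases of the bottleneck and output projections
    return (vocab * d) / reduced

print(round(compression_ratio(32_000, 256, 16), 2))                  # 15.87
print(round(compression_ratio(32_000, 256, 16, with_bias=True), 2))  # 15.86
```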
Re-training Fine-tuning is an important component of our method, and we demonstrate through our experiments that at convergence most of the techniques are close in performance. Freezing the decomposed embedding weights leads to a sharp drop in BLEU score; thus, we need to retrain all the weights. The use of a non-linearity and the addition of embedding distillation also improve the BLEU score after fine-tuning.

Alpha (α) Value Sensitivity Analysis
We performed a sensitivity analysis on the α hyper-parameter introduced by our method. Table 6 presents our findings. We can see that the method is not very sensitive to changes in the α value. We did not tune α for our different experiments, but chose the value which gave us good validation results on the WMT En-De translation task. The results of this analysis suggest that we could gain a little performance by tuning α for every dataset.
Extension We experimented with applying two key lessons from our method, namely using a non-linear function and embedding distillation, to a model initialized with the group partitions of the GroupReduce method (Chen et al., 2018); we refer to this method as GroupFunneling. Table 7 shows that GroupFunneling achieves a higher BLEU score on Pt-En compared to GroupReduce.

Discussion
Importance of Non-linearity We postulate that, for most word vectors in the embedding matrix, only a subset of the dimensions explains most of the variance. Thus, using the ReLU activation might help by regularizing the less important dimensions of a given word vector.

Importance of Reconstruction Loss
We propose that the embedding reconstruction might suffer from the addition of the ReLU activation function. The consequence would be a loss of information on words not seen during training and a loss of generalization performance. Thus, adding a loss for embedding reconstruction helps ground the embedding so that it does not lose too much information. The amount of regularization is controlled by the hyper-parameter α. Our intuition is partly justified by the results shown in Table 5, as the reconstruction loss performs worse without the ReLU activation function.

Comparison of Inference Speed
We compare the number of floating-point operations used by different models. Table 8 presents these results.
As expected, our method is slightly slower than the plain SVD method due to the use of the non-linear activation function and the bias additions, but it is notably faster than the other, more complex methods. Structured Embedding does not use any additional floating-point operations, though for g groups it requires g − 1 additional embedding lookup and concatenation operations. Also, Structured Embedding requires the reconstruction of the entire embedding matrix at the output projection layer, making it ineffective for model compression.
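The gap between SVD and our method can be illustrated with a rough multiply-accumulate count for the output projection per decoding step. The cost model below is our simplification (it ignores the softmax and counts the ReLU as one op per element), not the accounting used for Table 8:

```python
# Per-step cost of the vocabulary projection h -> logits, for a hidden state
# h of size d and vocabulary |V|, with an optional rank-r factorization.
def projection_macs(vocab, d, r=None, nonlinear=False):
    if r is None:                  # full embedding: h (1 x d) times E^T (d x |V|)
        return d * vocab
    macs = d * r + r * vocab       # h V first, then against f(U)^T or U^T
    if nonlinear:
        macs += vocab * r          # elementwise ReLU over U at each step
    return macs

full = projection_macs(32_000, 256)
svd = projection_macs(32_000, 256, r=16)
ours = projection_macs(32_000, 256, r=16, nonlinear=True)
assert svd < ours < full           # factorized forms avoid the d*|V| product
```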
In addition, Table 9 shows the average inference time needed for each method to do a forward pass on the IWSLT Pt-En validation dataset, which has a size of 7590 examples. We used a single NVIDIA P100 GPU (12GB) with a batch size of 1024 and averaged the time over 30 runs. We did not perform experiments on GroupReduce and Tensor Train, but based on Table 8 we expect these methods to be even slower.

Conclusion and future work
In this paper, we proposed Distilled Embedding, a low-rank matrix decomposition with a non-linearity in the bottleneck layer for a shared word-embedding and vocabulary projection matrix. We also introduced knowledge distillation of the embedding during fine-tuning, using the full embedding matrix as the teacher and the decomposed embedding as the student. We compared our proposed approach with state-of-the-art methods for compressing the word-embedding matrix. We performed extensive experiments using three datasets of different sizes and showed that our approach outperforms the state-of-the-art methods on the challenging task of machine translation. Our method also generalizes well to the task of language modeling. For future work, we will apply our approach to compress the feed-forward and multi-head attention layers of the transformer network.

A Appendices
A.1 Additional Hyper-parameters WMT En-Fr Smaller Transformer Network denotes a network with the same configuration as Transformer Base but with hidden size d_model of 416. For GroupReduce, to match the same compression rate we used a number of clusters c equal to 10 and a minimum rank r_min of 22. For SVD, we set the rank to 64. For Tensor Train, we set the embedding shape to [25, 32, 40]×[8, 8, 8] and the Tensor Train rank to 90. For Structured Embedding, we use a group size of 32 and 2048 clusters; we then use the quantization matrix and learn the clusters from scratch.
WMT En-De Smaller Transformer Network denotes a network with the same configuration as Transformer Base but with hidden size d_model of 400. For GroupReduce, to match the same compression rate we used a number of clusters c equal to 10 and a minimum rank r_min of 23. For SVD, we set the rank to 64. For Tensor Train, we set the embedding shape to [25, 37, 40]×[8, 8, 8] and the Tensor Train rank to 90. For Structured Embedding, we use a group size of 32 and 2376 clusters; we then use the quantization matrix and learn the clusters from scratch.
IWSLT Pt-En Smaller Transformer Network denotes a network with the same configuration as Transformer Small but with hidden size d_model of 136. For GroupReduce, to match the same compression rate we used a number of clusters c equal to 15 and a minimum rank r_min of 30. For SVD, we set the rank to 64. For Tensor Train, we set the embedding shape to [25, 32, 40]×[8, 4, 8] and the Tensor Train rank to 125. For Structured Embedding, we use a group size of 32 and 4048 clusters; we then use the quantization matrix and learn the clusters from scratch.