Rank and run-time aware compression of NLP Applications

Sequence model based NLP applications can be large. Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HLF) that achieves this dual objective. HLF improves on low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structured matrix based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (Translation, Intent Detection, Language Modeling) and show that for similar accuracy values and compression factors, HLF can achieve more than 2.32× faster inference run-time than pruning and 16.77% better accuracy than LMF.


Introduction
Sequence based (LSTM/GRU) NLP applications are increasingly being run on mobile phones and smart watches. They are typically enabled by querying a cloud-based system to do most of the computation. The energy, latency, and privacy implications associated with running a query on the cloud are changing where users run a neural network application. We should, therefore, expect an increase in the number of NLP applications running on embedded devices. Due to the energy and power constraints of edge devices, embedded SoCs frequently use lower-bandwidth memory technologies and smaller caches compared to desktop and server processors. Thus, there is a need for good compression techniques to enable large NLP models to fit into a smaller edge device or to ensure that they run efficiently on devices with smaller caches (Thakker et al., 2019b). Additionally, compressing models should not negatively impact the inference run-time, as these tasks may have real-time deadlines to provide a good user experience.
In order to choose a compression scheme for a particular network, one needs to consider 3 different axes: the compression factor, the inference run-time speedup over the baseline, and the accuracy. Ideally, a good compression algorithm should not sacrifice improvement along one axis for improvement along another. For example, network pruning (Han et al., 2016) has been shown to be an effective compression technique, but pruning creates a sparse matrix representation that is inefficient to execute on most modern CPUs. Our analysis shows that pruned networks can achieve a faster run-time than the baseline only for significantly high compression factors. Low-rank matrix factorization (LMF) is another popular compression technique that can achieve speedup proportional to the compression factor. However, LMF has had mixed results in maintaining model accuracy (Grachev et al., 2017; Chen et al., 2018; Lu et al., 2016). This is because LMF reduces the rank of a matrix significantly, reducing its expressibility (Yang et al., 2018). Lastly, structured matrices can also be used to compress neural networks. While these techniques show a significant reduction in computation, this reduction only translates to a realized run-time improvement for large matrices (Thomas et al., 2018) or when using specialized hardware (Sindhwani et al., 2015). For the benchmarks evaluated in this paper, HLF achieves a 30× speed-up over a structured matrix based technique (Sindhwani et al., 2015). Given LMF's good run-time characteristics, it can potentially act as an alternative to pruning. However, LMF leads to an accuracy loss. To provide an alternative to pruning that preserves the run-time benefits of LMF's dense structures and the accuracy benefits of pruned networks, we introduce a new compression technique called Hybrid Matrix Factorization (HLF). HLF can act as an effective compression technique for NLP edge use cases on embedded CPUs.
The results are very promising: HLF achieves iso-accuracy for a large compression factor (2× to 4×), improves the CPU run-time over pruning by a factor of 2.32×, and can achieve 16.77% better model accuracy than LMF and 9% better accuracy than smaller baselines.

Related Work
Pruning (Han et al., 2016; Zhu and Gupta, 2017; Sanh et al., 2020) has been the most successful compression technique for all types of neural networks. The poor hardware characteristics of pruning have led to research in block based pruning techniques (Narang et al., 2017). However, block based pruning also requires a certain amount of block sparsity to achieve a faster run-time than the baseline. Requiring a specific compression factor before run-time improves is a stringent constraint that HLF manages to avoid.
Structured matrices have shown significant potential for compression of neural networks (Sindhwani et al., 2015; Ding et al., 2017; Cheng et al., 2015; Thakker et al., 2020). Block circular compression is an extension of structured matrix based compression that converts every block in a matrix into a structured matrix. We will show in this paper that HLF is superior to block circular decomposition.
Tensor decomposition based methods (CP decomposition, Kronecker, Tucker decomposition, etc.) have also shown significant reduction in parameters (Tjandra et al., 2017; Thakker et al., 2019c,a). Matrix factorization (Kuchaiev and Ginsburg, 2017; Chen et al., 2018; Grachev et al., 2017) can be categorized under this topic. We will show in this paper that HLF can lead to better accuracy than LMF compressed RNNs.
Quantization is another popular technique for compression (Hubara et al., 2016, 2017; Gope et al., 2020a; Sanh et al., 2019; Gope et al., 2020b). Networks compressed using HLF can be further compressed using quantization.
Dynamic techniques are used to improve the inference run-time of RNNs by skipping certain RNN state updates (Campos et al., 2018; Seo et al., 2018; Tao et al., 2019). These techniques are based on the assumption that not all inputs to an RNN are needed for the final classification task. Thus, a small and fast predictor can learn to skip certain inputs and their associated computation. HLF is orthogonal to these techniques, and networks compressed using HLF can be further optimized using them.
The design of efficient structures for LSTM/GRU cells, such as SRU (Lei et al., 2018), QRNN (Bradbury et al., 2016) and PRU (Mehta et al., 2018), has also led to networks with faster inference run-time or fewer parameters. These structures are different from structured matrices and are hand-crafted after better understanding the application domain. HLF can be further used to optimize the matrices in these architectures to make the resultant network more parameter and run-time efficient.
Finally, any technique used to reduce the parameter footprint of embedding matrices in NLP can further optimize RNN networks optimized using HLF (Acharya et al., 2019;Mehta et al., 2019). In this paper, we show that HLF can compress networks with compressed word embedding layers.

Hybrid Matrix Factorization
3.1 Why LMF can potentially lead to loss in accuracy
LMF (Kuchaiev and Ginsburg, 2017) expresses a larger matrix $A \in \mathbb{R}^{m \times n}$ as a product of two smaller matrices $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{r \times n}$. The parameter r controls the compression factor. Unlike pruning, matrix factorization is able to improve the run-time over the baseline for most compression factors. Unfortunately, compression via LMF can lead to loss in accuracy. We believe this is because of two closely related reasons:
• Rank-Loss: The rank of a matrix is a measure of the expressibility of a matrix. A lower rank matrix means less expressibility, limiting its learning capacity. This can potentially lead to some accuracy loss. LMF compression leads to a lower ranked matrix. While before compression the rank of matrix A can be as large as min(m, n), after compression it is at most min(m, n, r).
For example, if $A \in \mathbb{R}^{256 \times 256}$, compression using LMF by a factor of 2 leads to $U \in \mathbb{R}^{256 \times 64}$ and $V \in \mathbb{R}^{64 \times 256}$. The resultant compressed matrix A (= U × V) is a rank-64 matrix. Thus, in order to compress the matrix by a factor of 2, LMF reduces the rank of the matrix by a factor of 4.
• Less expressive output features: A closely related argument follows when we consider the impact of a low-rank matrix on the output features. Without loss of generality, an LSTM/GRU layer calculates a matrix-vector product during inference. If we assume the parameters of an LSTM/GRU layer are represented by a matrix $A \in \mathbb{R}^{m \times n}$ and the input to the matrix is $x \in \mathbb{R}^{n \times 1}$, then the output feature vector calculated is
$$ y = f(A \times x) $$
where f is a non-linear function. Thus, each element of y is a dot product of a row of A and the vector x, followed by a non-linearity. LMF expresses A in a lower dimensional space using the U and V matrices. If we rewrite the equation to calculate y when A is expressed using LMF, we get
$$ y = f(U \times (V \times x)) $$
Generally, for compression, r < m, n. Thus, $x \in \mathbb{R}^{n \times 1}$ is projected to a lower dimensional embedding in $\mathbb{R}^{r \times 1}$ and expanded again to $\mathbb{R}^{m \times 1}$ to create y. Thus, compressing A to a lower rank leads to output features calculated from a lower dimensional embedding vector.
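The rank-loss example and the projected matrix-vector product can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names are hypothetical, random weights stand in for trained ones, and tanh is used only as a placeholder for the non-linearity f.

```python
import numpy as np

def lmf_rank_for_compression(m, n, factor):
    """Largest r such that the LMF parameter count r*(m+n) fits in m*n/factor."""
    return int((m * n / factor) // (m + n))

def lmf_forward(U, V, x, f=np.tanh):
    """y = f(U (V x)): x is first projected to r dimensions, then expanded to m."""
    return f(U @ (V @ x))

m = n = 256
r = lmf_rank_for_compression(m, n, factor=2)  # 64 for this example

rng = np.random.default_rng(0)                # random stand-ins for trained weights
U = rng.standard_normal((m, r))
V = rng.standard_normal((r, n))
x = rng.standard_normal(n)

A_compressed = U @ V                          # 256 x 256, but rank limited by r
rank = np.linalg.matrix_rank(A_compressed)
y = lmf_forward(U, V, x)
```

In this sketch the product U × V is formed only to inspect its rank; at inference time the two smaller products in `lmf_forward` are what give LMF its run-time benefit.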

Hybrid Matrix Factorization
This paper introduces a new compression technique that uses a dense matrix representation to ensure fast run-time properties and avoids the strong assumptions made by LMF. This technique is based on three assumptions:
• A1: The rank of a matrix is important for creating an LSTM/GRU network with high task accuracy (Yang et al., 2018).
• A2: HLF is based on the assumption that a more relaxed constraint of having a hybrid output feature vector, where some elements are calculated from a lower dimensional embedding space and others from a higher dimensional embedding space, can lead to better accuracy.
• A3: Most LSTM/GRU networks are followed by a fully-connected softmax layer or another LSTM/GRU layer. Even if the order of the elements in the output of a particular RNN layer changes, the weights in the subsequent fully connected or LSTM/GRU layers can adjust to accommodate that. Thus, the order of the elements of the output vector of an LSTM/GRU layer is not strictly important.
These three intuitions about an LSTM/GRU layer can be used to create a more hardware-friendly compression scheme. This paper introduces one such scheme: Hybrid Matrix Factorization.
Hybrid Matrix Factorization (HLF) splits the input and recurrent matrices in an LSTM/GRU layer into two parts: a fully parameterized upper part and a low-rank lower part. Figure 1 shows the strategy we use to decompose the matrix: an unconstrained upper half A' and a lower half that is composed of k rank-1 blocks. With $A' \in \mathbb{R}^{j \times n}$ and the lower half expressed as the product of $B \in \mathbb{R}^{(m-j) \times k}$ and $C \in \mathbb{R}^{k \times n}$, decomposing the weight matrix using this technique gives a parameter reduction of:
$$ \frac{m \times n}{j \times n + k \times (m - j + n)} $$
Thus, the maximum rank of the matrix becomes j + k. Different values of j and k can be used to control the amount of compression and the rank of the matrix.

Algorithm 1: Matrix vector product when a matrix uses the HLF technique
Input: matrices A' of dimension j × n, B of dimension (m−j) × k and C of dimension k × n; vector I of dimension n × 1
Output: vector O of dimension m × 1
1: O_{1:j} ← A' × I
2: Temp1 ← C × I
3: O_{j+1:m} ← B × Temp1

Structuring a matrix as shown in Figure 1 can lead to a significant increase in the maximum rank of the compressed matrix. Table 1 shows the maximum possible rank of a 256 × 256 matrix compressed to the same number of parameters using the two compression techniques, LMF and HLF. As shown, HLF can effectively double the rank of the matrix for the same number of parameters. To compress a matrix by a given compression factor, HLF has 2 different parameters, j and k, to regulate the rank of the matrix. Hence, we see a range of rank values. The maximum rank is achieved when k=1. The value of j when k=1 can be calculated for different compression factors from the parameter reduction expression above.
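To make the rank comparison concrete, the sketch below solves the parameter budget for j at a given k and compares the resulting maximum HLF rank (j + k) with the LMF rank at the same budget. It assumes the upper part has shape j × n and the lower part is the product of an (m−j) × k and a k × n matrix; the helper names are hypothetical, and the printed values are an illustration rather than a reproduction of Table 1.

```python
def hlf_params(m, n, j, k):
    """Parameters stored by HLF: upper part (j*n) plus the two low-rank factors."""
    return j * n + k * (m - j) + k * n

def hlf_j_for_compression(m, n, factor, k=1):
    """Largest j whose HLF parameter count fits in the budget m*n/factor.

    Solves j*n + k*(m - j) + k*n <= budget for j.
    """
    budget = m * n / factor
    return int((budget - k * (m + n)) // (n - k))

def lmf_rank_for_compression(m, n, factor):
    """Largest LMF rank r with r*(m + n) <= m*n/factor."""
    return int((m * n / factor) // (m + n))

m = n = 256
for factor in (2.0, 2.5, 3.33, 5.0):
    j = hlf_j_for_compression(m, n, factor, k=1)
    print(f"{factor}x: HLF max rank {j + 1}, "
          f"LMF max rank {lmf_rank_for_compression(m, n, factor)}")
```

For a 256 × 256 matrix at 2× compression this gives j = 126 with k = 1, so the HLF rank bound is 127 against 64 for LMF, which is consistent with the claim that the rank roughly doubles.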
Apart from the storage reduction, HLF also leads to a reduction in the number of computations. Assuming a batch size of 1 during inference, HLF speeds up inference by using the associative property of matrix products to calculate the matrix-vector product. Algorithm 1 shows how to calculate the matrix-vector product when the matrix is represented using HLF. This algorithm avoids expanding the matrices A', B and C into A.
Algorithm 1 uses the associative property of matrix products to gain the computation speedup. For a matrix-vector product between a matrix of size m × n and a vector of size n × 1, the number of operations required to compute the product is m × n (Trefethen and Bau, 1997). Referring to Algorithm 1, the number of operations required to calculate O_{1:j} is j × n. The Temp1 variables need k × n operations, and calculating O_{j+1:m} needs k × (m−j) operations. Thus, the reduction in the number of operations when we use Algorithm 1 is:
$$ \frac{m \times n}{j \times n + k \times n + k \times (m - j)} $$
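The matrix-vector product described above can be sketched in a few lines of NumPy. This is an illustrative version under the assumption that the upper part has shape j × n and the lower part is B × C with B of shape (m−j) × k and C of shape k × n; the paper's own cells were written in C++ with Eigen, and the function and variable names here are hypothetical.

```python
import numpy as np

def hlf_matvec(A_upper, B, C, x):
    """Multiply the HLF-structured matrix by x without building the full m x n matrix."""
    out_upper = A_upper @ x   # j*n multiply-accumulates
    temp = C @ x              # k*n multiply-accumulates
    out_lower = B @ temp      # k*(m-j) multiply-accumulates
    return np.concatenate([out_upper, out_lower])

def hlf_ops(m, n, j, k):
    """Operation count of the three products above."""
    return j * n + k * n + k * (m - j)

m, n, j, k = 256, 256, 126, 1
rng = np.random.default_rng(0)             # random stand-ins for trained weights
A_upper = rng.standard_normal((j, n))
B = rng.standard_normal((m - j, k))
C = rng.standard_normal((k, n))
x = rng.standard_normal(n)

y = hlf_matvec(A_upper, B, C, x)

# Reference: expand to the full matrix (done here only to check correctness).
A_full = np.vstack([A_upper, B @ C])
matches_dense = np.allclose(y, A_full @ x)

op_reduction = (m * n) / hlf_ops(m, n, j, k)
```

Computing C × x before multiplying by B is what keeps the lower half cheap: forming B × C first would cost as much as a dense (m−j) × n product.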

Impact on output feature vector
Algorithm 1 shows that HLF divides the output into two stacked sub-vectors. One is the result of a fully-parameterized multiplication, A' × I (Line 1, Algorithm 1). The other is the result of the low-rank multiplication B × C × I (Lines 2-3, Algorithm 1). Thus, the upper sub-vector has "richer" features created from a higher dimensional embedding, while the lower sub-vector has "constrained" features created from a lower dimensional embedding. By incorporating the HLF structure during training, we force an RNN to learn "richer" features in the upper sub-vector and "constrained" features in the lower sub-vector. Because an RNN is followed by another RNN or a softmax layer, this restructuring should not impact the subsequent layers. Thus, the HLF structure combines the assumptions A2 and A3 that were discussed previously.
3.3 Why HLF leads to larger rank than LMF for same number of parameters?
HLF is an extension of LMF. To understand this, let us revisit Figure 1, where the matrix is
$$ A = \begin{bmatrix} A' \\ B \times C \end{bmatrix} $$
where $A \in \mathbb{R}^{m \times n}$, $A' \in \mathbb{R}^{j \times n}$, $B \in \mathbb{R}^{(m-j) \times k}$ and $C \in \mathbb{R}^{k \times n}$. Then we can rewrite the matrix as
$$ A = \begin{bmatrix} I & 0_1 \\ 0_2 & B \end{bmatrix} \times \begin{bmatrix} A' \\ C \end{bmatrix} $$
where $I \in \mathbb{R}^{j \times j}$ is the identity matrix, $0_1 \in \mathbb{R}^{j \times k}$ and $0_2 \in \mathbb{R}^{(m-j) \times j}$ are zero matrices. The above equation can be re-written as
$$ A = U' \times V' $$
where $U' \in \mathbb{R}^{m \times (j+k)}$ and $V' \in \mathbb{R}^{(j+k) \times n}$. Both U' and V' can have a maximum rank of j + k. The maximum value of this rank is achieved when k=1 and j is calculated as discussed in Table 1. Let this value of j be d, so that the maximum rank is d + 1. A standard LMF decomposition of A will also lead to a representation of the form U × V, but this representation will have the same number of parameters as HLF only if the rank of both U and V is at most (d + 1)/2. Thus, HLF can be regarded as a rank-(d + 1) LMF of A, with a sparsity forcing mask that significantly reduces the number of parameters needed to express the rank-(d + 1) matrix. This is why HLF can double the rank of the matrix when compared to an iso-parameter LMF matrix. Neural networks seldom learn structured sparsity unless they are forced to (Narang et al., 2017); thus, an RNN trained with the LMF structure will rarely end up learning the same structure as HLF. The pre-determined HLF structure effectively creates a sparsity forcing mask. Such a sparsity forcing mask also leads to the creation of the decoupled output feature vectors described in section 3.2.1.
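The view of HLF as a masked LMF can be verified numerically on a small example. The sketch below assumes the block structure described in this section (an identity block and zero blocks alongside B in the left factor, A' stacked on C in the right factor); the names `U_prime` and `V_prime` and the matrix sizes are chosen only for illustration.

```python
import numpy as np

m, n, j, k = 16, 16, 5, 2
rng = np.random.default_rng(0)            # random stand-ins for trained weights
A_upper = rng.standard_normal((j, n))
B = rng.standard_normal((m - j, k))
C = rng.standard_normal((k, n))

# Left factor: identity and zero blocks act as a fixed sparsity mask around B.
U_prime = np.block([
    [np.eye(j),            np.zeros((j, k))],
    [np.zeros((m - j, j)), B],
])
# Right factor: the unconstrained upper part stacked on C.
V_prime = np.vstack([A_upper, C])

A_hlf = np.vstack([A_upper, B @ C])       # the HLF-structured matrix itself

same_matrix = np.allclose(U_prime @ V_prime, A_hlf)
rank = np.linalg.matrix_rank(A_hlf)       # j + k for generic (random) entries

dense_lmf_params = (j + k) * (m + n)      # unmasked LMF at rank j + k
hlf_param_count = j * n + k * (m - j) + k * n
```

With these sizes the unmasked rank-7 factorization would need 224 parameters, while the HLF structure reaches the same rank bound with 134, because the identity and zero blocks are fixed and never stored.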

Results
We compare HLF with LMF and 3 other compression techniques: model pruning, a small baseline, and a structured matrix based technique called block circular decomposition (BCD). These techniques, and why they need to be considered, are discussed below:
• Pruning: Model pruning (Zhu and Gupta, 2017) induces sparsity in the matrices of a neural network, thereby reducing the number of non-zero valued parameters that need to be stored. Pruning creates sparse matrices which are stored in a specialized sparse data structure such as CSR. The overhead of traversing these data structures while performing the matrix-vector multiplication can lead to poorer inference run-time than when executing the baseline, non-sparse network. Thus, while pruning is an effective compression technique, its run-time performance on CPUs can make it a less appealing choice for compression. We use the magnitude pruning framework provided by Zhu and Gupta (2017). While there are other possible ways to prune, recent work (Gale et al., 2019) has suggested that magnitude pruning provides state-of-the-art or comparable performance when compared to other pruning techniques (Neklyudov et al., 2017; Louizos et al., 2018).
• Small Baseline: Additionally, we train a smaller baseline with the number of parameters equal to that of the compressed baseline. This serves as a useful point of comparison for two reasons.
- First, to check whether compressing a larger network leads to better accuracy than shrinking a network by reducing its dimensions (size of hidden layer or number of layers). This can help us verify whether the network was originally overparameterized.
- Second, to test the hypothesis that HLF's creation of a stacked output feature vector, as described in section 3.2.1, adds useful information to the network. The smaller baseline creates an output feature vector from a high-dimensional embedding only. HLF additionally concatenates the output features created from a lower dimensional embedding. Thus, comparing the accuracy of HLF with the smaller baseline helps evaluate the usefulness of the output features created using the lower dimensional embedding.
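The run-time concern raised for pruning comes from the indirection in sparse storage. The sketch below illustrates it with a one-shot magnitude threshold and a hand-written CSR matrix-vector product. It is a simplification made for illustration: the framework of Zhu and Gupta (2017) prunes gradually during training, and all function names here are hypothetical.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """One-shot pruning: zero the `sparsity` fraction of smallest-magnitude weights."""
    num_pruned = int(round(sparsity * W.size))
    if num_pruned == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[num_pruned - 1]
    return np.where(np.abs(W) > threshold, W, 0.0)

def to_csr(W):
    """Build CSR arrays: non-zero values, their column indices, and row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx, dtype=int), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    """y = W x using CSR; x is gathered through col_idx, an irregular access pattern."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = values[start:end] @ x[col_idx[start:end]]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
x = rng.standard_normal(64)

W_pruned = magnitude_prune(W, sparsity=0.6)
values, col_idx, row_ptr = to_csr(W_pruned)
y = csr_matvec(values, col_idx, row_ptr, x)
```

Besides the gathered loads of x, CSR also stores one column index per non-zero value plus the row offsets, so the stored size is larger than the non-zero count alone suggests.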
Given the significant slow-down of inference of BCD compressed networks, we do not discuss the results using BCD compression in the rest of the paper.

Experiment Setup
Measuring inference run-time: In order to compare the inference run-time of RNN cells compressed using pruning, LMF and HLF, we implemented these cells in C++ using the Eigen library. This paper focuses on inference on an edge device. As a result, we make the assumption that the batch size of the application will be 1 while measuring the run-time of an application. However, the observations regarding run-time should remain consistent for larger batch sizes as well. We ran our experiments on a single Cortex-A73 core of the Hikey 960 board. The size of L3 cache is 2MB.
Figure 2 (panel c, Language Modeling): For LM, lower values of perplexity are better. HLF provides a viable alternative to both pruning and LMF at compression factors of 2.5x, 3.0x and 5x. HLF can achieve 17% better perplexity than LMF, 9% better perplexity than small baseline and 2.32× better inference run-time than pruning.
Figure 2 (caption): For each compression factor, the compression scheme that is most to the top-right is the ideal choice. In case of perplexity, lower values are better. Thus, the graphs are plotted in a slightly different way so that the ideal choice of compression is still in the top-right corner. P = Pruning, LMF = Low rank matrix factorization, HLF = Hybrid Matrix Factorization, SB = Smaller baseline. The best way to view this figure is to either focus on a compression point and see how the Pareto curve of speed-up vs accuracy changes as we add HLF, or focus on an accuracy region and see which compression scheme provides the best run-time at the highest compression factor.
Training infrastructure: We use TensorFlow 1.14 to train our networks on a cluster of 2 NVIDIA RTX 2080 Ti GPUs with 11 GB memory. The training settings for the benchmarks evaluated can be found in the reference paper for each benchmark.
What do we compress? We compress the LSTM/GRU layers in each application. We do not compress the embedding layers using HLF.
The amount of compression determines the rank of the compressed matrices when we use LMF to compress an NLP application and the sparsity of the pruned matrix when we use pruning as our choice of compression technique. Similar to LMF, the amount of compression determines the rank of the compressed matrices when we use HLF as our choice of compression. However, for HLF two parameters j and k control the rank of the matrix. We use a sweep starting with k=1 to determine the exact values of j and k that help us achieve a good accuracy.
What do we compare? We compare the accuracy and inference run-time of the various compression techniques at iso-compression factors.

Comparison of compression techniques across different ML tasks
The impact of compression on accuracy is compared for 5 benchmarks: machine translation, natural language understanding (intent detection and slot filling), text classification and language modeling. These tasks are some of the most important NLP applications that run on edge and embedded devices like smart phones, smart watches and smart homes.

Language Translation
We use the English to Vietnamese translation model in (Luong et al., 2017). The model uses 2-layer LSTMs of size 512 units with a bidirectional encoder (i.e., 1 bidirectional layer for the encoder), an embedding of dimension 512 and an attention layer. We used the hyper-parameters in (Luong et al., 2017) to train the network while modifying the learning rate values used. We sweep the learning rate values from ×0.1 to 3.0 in multiples of 3. For HLF, we used the value of k=2. Figure 2a shows the results of compressing the LSTM layers in the NMT VIEN baseline by 2.5×, 3.33× and 5×. HLF improves the BLEU score achieved by LMF by 2.3% to 4.5% and that achieved by the Small Baseline by 2.8% to 4.1%. At the same time, HLF improves the inference run-time over pruning by 1.5× to 1.74×.

Language Modeling
We use the medium LM model from (Zaremba et al., 2014) as our baseline. The PTB (Medium) baseline has 2 LSTM layers, each with a hidden vector of size 650, and a vocabulary of 10,000 English words. We used the hyper-parameters in (Zaremba et al., 2014) and train the compressed networks for 50 more epochs than the baseline. The baseline network is trained for 39 epochs. For the first 6 epochs the learning rate is 1, and after that we decrease it by a factor of 1.2 after each epoch. We clip the norm of the gradients at 5 and use a dropout value of 0.35. For HLF, we used the value of k=4. Figure 2c shows the results of compressing the LSTM layers in the PTB (Medium) baseline by 2.5×, 3.33× and 5×. The lower the perplexity, the better the model. Pruning achieves the same (sometimes better) perplexity as the baseline and better perplexity than the other compression techniques. LMF leads to significantly worse perplexity for all compression factors, while HLF achieves better perplexity than LMF and faster inference run-time than the baseline and pruned networks for all compression factors. In fact, HLF can achieve 17% better perplexity than LMF, 9% better perplexity than the small baseline and 2.32× better inference run-time than pruning. The preferred choice of compression scheme for different compression factors will depend on whether a slight loss in perplexity can be accommodated for faster inference run-time. However, HLF still manages to serve as a more viable alternative to pruning than LMF for inference on edge CPUs.
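The learning rate schedule and gradient clipping described for the language model can be written down as a short sketch. Two details are assumptions rather than statements from the text: epochs are counted from 1, and the clipping is applied to the global norm over all gradient tensors. The function names are hypothetical.

```python
import numpy as np

def learning_rate(epoch, base_lr=1.0, flat_epochs=6, decay=1.2):
    """Learning rate for a 1-indexed epoch (assumed): constant, then divided by 1.2 per epoch."""
    return base_lr / decay ** max(epoch - flat_epochs, 0)

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients so that their joint L2 norm is at most max_norm (assumed variant)."""
    global_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

# Example: a single gradient tensor with norm 10 is scaled down to norm 5.
clipped, norm_before = clip_by_global_norm([np.array([6.0, 8.0])])
schedule = [learning_rate(e) for e in range(1, 10)]
```

Under these assumptions the rate stays at 1 through epoch 6 and is 1/1.2 at epoch 7.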

Text Classification
We use the text classification network in (Zhou et al., 2016) evaluated on the SemEval-2010 dataset. The baseline network has 1 bidirectional LSTM layer with a hidden vector of size 256. We used the hyper-parameters in (Zhou et al., 2016) to train the baseline and as the initial hyper-parameters explored for the compressed networks. We trained the compressed networks for an additional 20 epochs while exploring learning rates of 10× and 1/10× the baseline learning rate. The baseline model was trained using AdaDelta with a learning rate of 1.0. The model parameters were regularized with an L2 regularization strength of $10^{-5}$. Figure 2b shows the results for compressing the text classification network by 2.5× to 5×. HLF can improve the accuracy achieved by LMF by up to 1.2% and that achieved by SB by up to 1.3%, and can improve the run-time achieved by pruning by almost 1.20×.

Intent Detection and Slot Filling
We used the benchmark published in (Liu and Lane, 2016). This benchmark is trained on the ATIS dataset and jointly trains for intent detection and slot filling. The benchmark uses 1 LSTM layer of size 128 along with attention layers. For HLF, we used the value of k=1. Figure 2b shows the results for the slot filling task. HLF can improve the F1-accuracy achieved by LMF and SB by up to 1.2% and improve the run-time achieved by pruning by up to 1.26×. Figure 2b shows the result for the intent classification task. Due to joint training, the network used for slot filling and intent classification is the same. As a result, the run-time improvement of HLF over pruning is exactly the same as for the slot filling task. Additionally, HLF improves the intent classification accuracy by up to 1% over LMF and the small baseline.

Compressing Word Embedding layers using HLF
We compressed the input word-embedding layers in the PTB-LM model discussed in section 4.2.2, without compressing other layers in the network. However, even 3× compression using HLF led to 8% loss in perplexity score.

Orthogonality of word embedding compression methods and HLF
We ran experiments where we prune the word embedding layers in the PTB-LM model in section 4.2.2 by 2× while keeping the LSTM layers uncompressed, leading to 83.1 perplexity score. We were able to further compress this network by 2× using HLF with only 1 point loss in perplexity score, indicating that HLF is compatible with techniques used for compressing word-embedding layers.

Discussion
Effectively, HLF acts as an alternative to LMF whenever compression using pruning does not lead to the required run-time benefit and LMF leads to loss in accuracy. HLF has better accuracy than LMF for most evaluation points, validating the assumption in the paper that the rank of a matrix in an RNN is important for better task accuracy in NLP applications. Additionally, HLF has better accuracy than the smaller baseline. This validates the assumption that constrained features are useful in addition to the richer features of a Small Baseline network.

Limitations
While HLF provides significant benefits over LMF, there are two limitations associated with the technique:
• The unique nature of RNNs (assumptions A1 to A3) makes HLF a natural fit for LSTM/GRU layers. However, these assumptions are not valid for the final classification layer. In the classification layer, HLF will lead to more expressive output for certain classes in the dataset (the top part of the HLF matrix) and less expressive output for the rest of the classes (the bottom part of the HLF matrix).
• HLF is a training-aware compression technique and cannot be applied to a pre-trained network.

Conclusion
Choosing the right compression technique requires looking at three criteria: compression factor, accuracy, and run-time. Pruning is an effective compression technique, but can sacrifice speedup over the baseline for certain compression factors. LMF achieves better speedup than the baseline for all compression factors, but leads to accuracy degradation. This paper introduces a new compression scheme called HLF, which preserves the dense structures of LMF while effectively doubling the rank of the matrix through its structure. This leads to 2× faster inference run-time than pruning and up to 16% better accuracy than LMF.