Extreme Model Compression for On-device Natural Language Understanding

In this paper, we propose and experiment with techniques for extreme compression of neural natural language understanding (NLU) models, making them suitable for execution on resource-constrained devices. We propose a task-aware, end-to-end compression approach that performs word-embedding compression jointly with NLU task learning. We show our results on a large-scale, commercial NLU system trained on a varied set of intents with huge vocabulary sizes. Our approach outperforms a range of baselines and achieves a compression rate of 97.4% with less than 3.7% degradation in predictive performance. Our analysis indicates that the signal from the downstream task is important for effective compression with minimal degradation in performance.


Introduction
Spoken Language Understanding (SLU) is the task of extracting meaning from a spoken utterance. A typical approach to SLU consists of two modules: an automatic speech recognition (ASR) module that transcribes the audio into a text transcript, followed by a Natural Language Understanding (NLU) module that predicts the semantics (domain, intent and slots) from the ASR transcript. The last few years have seen an increasing application of deep learning approaches to both ASR (Mohamed et al., 2011; Hinton et al., 2012; Graves et al., 2013; Bahdanau et al., 2016) and NLU (Xu and Sarikaya, 2014; Yao et al., 2013; Ravuri and Stolcke, 2015), making them more reliable, accurate and efficient. This has led to an increasing popularity of feature-rich commercial voice assistants (VAs) such as Amazon Alexa, Google Assistant, Apple's Siri and Microsoft's Cortana. VAs were used in over 3 billion devices in the world in 2019, and are estimated to reach 8 billion devices by 2023. With a growing number of users relying on VAs for their day-to-day activities, voice interfaces have become ubiquitous, and are employed in a range of devices, including smart TVs, mobile phones, smart appliances, home assistants and wearable devices.
The SLU processing for VAs is often offloaded to the cloud, where high-performance, compute-rich hardware is used to serve complex machine learning models. However, on-device SLU is growing in popularity due to its wide applicability and attractive benefits (Coucke et al., 2018; McGraw et al., 2016; Saade et al., 2018). First, it enables VAs to work offline, without an active internet connection, allowing their use in remote areas and on devices with poor or intermittent internet connectivity, e.g., in automobiles. Second, on-device processing reduces latency by eliminating communication over the network, and results in an improved user experience. Third, processing utterances on the edge decreases the load on cloud services, resulting in reduced cloud hardware requirements and associated costs.
NLU is the task of extracting intents and semantics from user queries. NLU in VAs typically consists of the following sub-tasks: domain classification (DC), intent classification (IC) and named entity recognition (NER). Prior work has shown the effectiveness of recurrent neural models that jointly model these sub-tasks.
Neural Model Compression: Due to its many practical applications, research on neural model compression has received massive interest in recent years. Existing approaches for general neural model compression include low-precision computation (Vanhoucke et al., 2011; Hwang and Sung, 2014; Anwar et al., 2015), quantization (Chen et al., 2015; Zhou et al., 2017), network pruning (Wen et al., 2016; Han et al., 2015), SVD-based weight matrix decomposition (Xue et al., 2013) and knowledge distillation (Hinton et al., 2015). For neural NLP models, however, the larger focus has been on compressing huge word embedding matrices. Embedding compression approaches include quantization (Hubara et al., 2017), binarization (Tissier et al., 2019), and dimensionality reduction and matrix factorization methods such as PCA (Raunak, 2017) and SVD (Acharya et al., 2019). An alternative post-training compression approach using deep compositional code learning (DCCL) was proposed by Shu and Nakayama (2017). This approach learns compressed embedding representations based on additive quantization (Babenko and Lempitsky, 2014) and forms the basis of our work. In contrast to Shu and Nakayama (2017), we propose a task-aware compression approach, where embedding compression is performed during task model training instead of as a post-processing step.

Method
Problem Setup: NLU consists of three component tasks: Domain Classification (DC), Intent Classification (IC) and Named Entity Recognition (NER). DC and IC are sentence classification tasks and determine the domain (e.g. Music) and the intent (e.g. PlayMusic) of the input utterance. NER is a sequence tagging task, where each word in the utterance is assigned a slot tag (e.g. AlbumName, SongName). The combination of the domain, intent and slots represents the semantic interpretation for the given utterance and is passed on to the downstream application. Our goal is to compress the NLU models to fit within extreme disk space constraints with minimal degradation in predictive performance. Furthermore, low latency and support from on-device inference engines are desirable.

NLU Task Model Architecture
Model Architectural Constraints: Our choice of a suitable on-device NLU architecture is largely driven by hardware resource constraints. First, on-device systems come with a strict memory budget, restricting our choices to architectures with fewer parameters. Second, the architectures chosen should not only be amenable to model compression, but should also suffer minimal degradation in performance when compressed. Third, on-device models have rigorous latency targets, requiring fast inference. This restricts our choices to simpler, seasoned architectures, like LSTMs and GRUs, that require fewer layers and FLOPs as opposed to the newer, computationally intensive transformer-based architectures like BERT. Moreover, on-device inference engines often lack support for sophisticated layers such as self-attention layers. Driven by these constraints and relying on the considerable effectiveness of recurrent architectures (Hakkani-Tür et al., 2016; Liu and Lane, 2016a; Zhang and Wang, 2016), we use a multi-domain, multi-task RNN model (MT-RNN), built using bi-directional LSTMs (Figure 1), for performing NLU. We train a single neural model that can jointly perform DC, IC and NER for a given input utterance. Furthermore, in order to reduce inference latency, we use word-level LSTMs as opposed to character or sub-word based models.
Architecture Details: Our task model, which we call the MT-RNN model, is shown in Figure 1. It consists of a shared bi-directional LSTM (Bi-LSTM) to extract features shared by all tasks, and task-specific layers for the classification and tagging tasks. The inputs to the recurrent layers are pretrained embeddings, which are fine-tuned during training. The input to each of the classification components is a sentence representation, obtained by concatenating the final states of the forward and backward LSTMs. This is passed on to a fully-connected dense layer with a softmax to predict the domain and intent for the utterance.
The tagging layer produces a slot tag for each word in the utterance. The input at each time step consists of the forward and backward LSTM states for each word, and the output is the slot tag. We use the widely used Conditional Random Fields (CRF) layer for NER. The network is trained to minimize a joint NLU loss defined as the sum of the cross-entropy losses for IC and DC and the CRF loss for NER:

$$L_{NLU} = L_{DC} + L_{IC} + L_{NER}$$

where $L_{DC}$ and $L_{IC}$ are the cross-entropy losses for DC and IC, and $L_{NER}$ is the CRF loss for NER. In the following sections, we describe our approach for compressing the word embeddings and the recurrent components of our MT-RNN model.

Word Embedding Compression
Word embeddings have been shown to be the largest components in an NLP model, owing to large vocabulary sizes and floating point parameters, accounting for >90% of the model sizes (Shu and Nakayama, 2017). Hence, compressing embeddings is crucial for reducing NLP model sizes.
Our approach is based on additive quantization (Babenko and Lempitsky, 2014), which has shown great success in compressing word embeddings, achieving high compression rates (Shu and Nakayama, 2017).

Additive Quantization using Deep Compositional Code Learning
Additive quantization (Babenko and Lempitsky, 2014) aims to approximate vectors by representing them as a sum of basis vectors, called codewords. Originally proposed for image compression and approximate nearest neighbor search, this method has recently been used for post-processing word embedding compression (Chen et al., 2018;Shu and Nakayama, 2017) achieving high compression rates, upwards of 90%, on modest vocabulary sizes.
Let $W \in \mathbb{R}^{V \times D}$ be the original word embedding matrix, where $V$ denotes the vocabulary size and $D$ denotes the embedding size. Using additive quantization, the original word embedding matrix is compressed into a matrix of integer codes $W_c \in \mathbb{Z}_K^{V \times M}$, where $\mathbb{Z}_K$ denotes the set of integers from 1 to $K$, $\mathbb{Z}_K = \{1, 2, \ldots, K\}$. This is achieved using a set of $M$ codebooks, $C_1$ through $C_M$, $C_m \in \mathbb{R}^{K \times D}$, each containing $K$ codewords of size $D$. $C_m^k$ is the $k$-th codeword in the $m$-th codebook. For each word embedding $w_i$ in $W$, the compressed codes are $w_{c_i}$, where

$$w_{c_i} = [z^i_1, z^i_2, \ldots, z^i_M], \quad z^i_m \in \mathbb{Z}_K.$$

The original word embedding $w_i$ is approximated from the codes and codebooks as $\hat{w}_i$ by summing the $(z^i_m)$-th codeword in the $m$-th codebook over all codebooks:

$$\hat{w}_i = \sum_{m=1}^{M} C_m^{z^i_m} \qquad (1)$$

Shu and Nakayama (2017) propose the deep compositional code learning (DCCL) architecture to learn discrete codes and codebooks for a given word embedding matrix through an unsupervised autoencoding task. In this model, a continuous word vector input, $w_i \in \mathbb{R}^D$, is first projected into a lower dimensional space using a linear transformation. This is projected through a second linear layer into $M$ different $K$-dimensional vectors. Each of these $M$ vectors is passed through a gumbel-softmax activation to get $M$ one-hot vectors, $r^i_m \in \mathbb{R}^{1 \times K}$:

$$r^i_m = \sigma_G\big(f_L(w_i)\big)_m, \quad m = 1, \ldots, M,$$

where $f_L$ denotes the linear transformations and $\sigma_G$ denotes the gumbel-softmax activation. The gumbel-softmax activation allows the network to learn discrete codes via gumbel-sampling, while also making the network differentiable, enabling the backpropagation of gradients (Jang et al., 2016).
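The encoding step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden size `H`, the toy dimensions, the randomly initialized parameter dictionary `params`, and the function names are all hypothetical, and the two layers are kept purely linear to follow the description of $f_L$ in the text.

```python
import numpy as np


def gumbel_softmax(logits, temperature, rng):
    """Draw relaxed one-hot samples along the last axis (Jang et al., 2016)."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    scaled = (logits + gumbel_noise) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def encode(word_vectors, params, num_codebooks, num_codewords, temperature, rng):
    """Map word vectors (N, D) to M relaxed one-hot vectors each: (N, M, K)."""
    hidden = word_vectors @ params["w1"] + params["b1"]  # first linear layer
    logits = hidden @ params["w2"] + params["b2"]        # second linear layer
    logits = logits.reshape(-1, num_codebooks, num_codewords)
    return gumbel_softmax(logits, temperature, rng)


rng = np.random.default_rng(0)
D, H, M, K = 8, 4, 3, 5  # toy sizes, chosen only for illustration
params = {
    "w1": rng.normal(size=(D, H)), "b1": np.zeros(H),
    "w2": rng.normal(size=(H, M * K)), "b2": np.zeros(M * K),
}
word_vectors = rng.normal(size=(2, D))
r = encode(word_vectors, params, M, K, temperature=1.0, rng=rng)
codes = r.argmax(axis=-1) + 1  # integer codes in {1, ..., K}, one per codebook
```

During training the samples are soft (each row of `r` sums to one), which keeps the layer differentiable; taking the arg-max of each row yields the discrete code that is stored after training.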
These one-hot vectors are converted to integer codes corresponding to the input word embedding. In order to reconstruct the word embedding, the one-hot vectors select one codeword from each codebook, and the selected codewords are summed:

$$\hat{w}_i = \sum_{m=1}^{M} r^i_m C_m$$

Task-agnostic Post-Processing Compression
Shu and Nakayama (2017) propose to use the DCCL architecture to perform post-processing embedding compression, where embeddings are compressed after the downstream task model has been trained. The task model is first initialized with pretrained word embeddings that are fine-tuned during task model training to obtain task-specific embeddings. These are compressed using the DCCL architecture trained on an unsupervised autoencoding task. The input to the autoencoder is the embedding matrix $W \in \mathbb{R}^{V \times D}$ and the model is trained to minimize the average embedding reconstruction loss (denoted by $l(W, \hat{W})$) for words in the embedding matrix:

$$l(W, \hat{W}) = \frac{1}{V} \sum_{i=1}^{V} \lVert w_i - \hat{w}_i \rVert^2$$
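As a rough illustration of how embeddings are rebuilt from stored codes and how the autoencoder objective is evaluated, the sketch below reconstructs embeddings by summing one codeword per codebook and computes an average reconstruction loss. The squared Euclidean form of the loss, the toy sizes and the function names are assumptions made for this example.

```python
import numpy as np


def reconstruct(codes, codebooks):
    """Sum one codeword per codebook.

    codes:     (V, M) integers in {1, ..., K}
    codebooks: (M, K, D) floats
    returns:   (V, D) reconstructed embeddings
    """
    num_codebooks = codebooks.shape[0]
    # picked[i, m] is codeword codes[i, m] taken from codebook m
    picked = codebooks[np.arange(num_codebooks), codes - 1]  # (V, M, D)
    return picked.sum(axis=1)


def reconstruction_loss(W, W_hat):
    """Average squared Euclidean distance between rows of W and W_hat."""
    return float(np.mean(np.sum((W - W_hat) ** 2, axis=1)))


rng = np.random.default_rng(0)
V, D, M, K = 6, 4, 2, 3  # toy sizes for illustration only
codebooks = rng.normal(size=(M, K, D))
codes = rng.integers(1, K + 1, size=(V, M))
W_hat = reconstruct(codes, codebooks)

# The same result via explicit one-hot vectors multiplied with the codebooks.
one_hot = np.eye(K)[codes - 1]  # (V, M, K)
W_hat_one_hot = np.einsum("vmk,mkd->vd", one_hot, codebooks)

W = rng.normal(size=(V, D))  # stand-in for the original embedding matrix
loss = reconstruction_loss(W, W_hat)
```

Indexing by integer code and multiplying by a one-hot vector give the same reconstruction; the former is what a deployed model would use, since only the codes and codebooks need to be stored.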
DCCL is shown to outperform other approaches such as parameter pruning and product quantization on sentence classification and machine translation tasks. Since compression is performed as a post-processing step after the task model is trained, the compression algorithm has no information about the downstream task. This makes the compression task-agnostic and results in several drawbacks. First, unsupervised post-processing compression treats all words equally for compression. However, in practice, some words may be more important than others for the downstream task. Hence, better reconstructions of more important words may benefit the downstream task. Second, post-processing compression is typically lossy, resulting in a degradation in downstream performance since the task model is not adapted to the compression error. We propose a task-aware end-to-end compression approach which aims to address these issues.

Task-aware End-to-End Compression
Our algorithm improves on the above approach by training the DCCL model (the compression model) jointly with the downstream task model (Figure 3). End-to-end training allows the compression model to receive signals about the downstream task, thus adapting the compression to the downstream task. Intuitively, since the compression model now has information about how the words are used in the downstream task (via the downstream loss), it can spend more network capacity on achieving better reconstructions for more important words. At the same time, the downstream task model also adapts to the lossy reconstructions learned by the compression model, thus improving downstream performance. We call this task-aware end-to-end compression, where the compression algorithm takes the downstream task loss into account during embedding compression.
In order to perform task-aware compression with a DCCL model, we replace the original embedding lookup operations in the task model with layers from the DCCL model (the compression layers). The input to our model is now a sequence of L word embeddings corresponding to the words in the input text utterance. These are passed through the compression layers and are reconstructed, as shown in Equation 1, to obtain a sequence of D-dimensional word representations, one for each word in the input. The word representations are then fed to the recurrent layers in the task model, and the remaining network is unchanged. The entire setup is trained end-to-end to minimize the downstream task loss, and the gradients are back-propagated through the entire network, including the compression layers. Further, the compression layers can be initialized with pretrained model parameters from the task-agnostic DCCL model, and the NLU layers can be initialized from a trained NLU model.
Training an end-to-end DCCL model is tricky, especially when the number and size of codebooks is large. The stochasticity introduced by gumbel-sampling can easily derail training, leading to sub-optimal convergence. For these cases, we ground the training by adding the word embedding reconstruction loss to the downstream task loss as follows:

$$L = L_{NLU} + l(W, \hat{W})$$

Adding the embedding reconstruction loss not only stabilizes the training, but also provides stronger gradients to the compression layers. Note that unlike task-agnostic compression, where all words are treated equally for compression, the embedding reconstruction loss term in task-aware compression considers only the words appearing in the input batch. This ensures that the words that are more frequent in the training data have better reconstructions, resulting in better downstream performance.
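A minimal sketch of this combined objective is given below, assuming a squared Euclidean reconstruction loss averaged over the tokens of the batch, so that a word occurring several times is counted several times. The function names, the unweighted sum and the toy values are assumptions for illustration, and padding handling is omitted.

```python
import numpy as np


def batch_reconstruction_loss(W, W_hat, batch_word_ids):
    """Average squared reconstruction error over the tokens in a batch."""
    ids = np.asarray(batch_word_ids).ravel()
    diff = W[ids] - W_hat[ids]
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def task_aware_loss(nlu_loss, W, W_hat, batch_word_ids):
    """Downstream NLU loss plus the batch-restricted reconstruction loss."""
    return nlu_loss + batch_reconstruction_loss(W, W_hat, batch_word_ids)


# Toy example: three-word vocabulary, one utterance using words 0, 0 and 1.
W = np.zeros((3, 2))
W_hat = np.array([[1.0, 1.0], [0.0, 0.0], [10.0, 10.0]])
batch = [[0, 0, 1]]
total = task_aware_loss(nlu_loss=1.0, W=W, W_hat=W_hat, batch_word_ids=batch)
```

Word 2 is reconstructed poorly in this toy example but does not appear in the batch, so it contributes nothing to the loss, whereas the repeated word 0 contributes twice.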

Recurrent Layer Compression
Quantization (Hubara et al., 2017) is a simple and effective technique for model compression. Quantization maps each floating point model parameter to its closest representative from a pre-chosen set of floating-point values. More concretely, the model parameter range is divided into $B$ equally spaced bins (or buckets), and each parameter is assigned to its closest bin. The bins can be represented by integer indices and require at most $\log_2 B$ bits. For instance, with 256 bins, a 32-bit floating point parameter can be represented by an integer bin index occupying just 8 bits. We apply post-training 8-bit linear quantization to quantize the recurrent layers of the model. Since 32-bit floating point model parameters are now represented by 8-bit integers, this results in an instant 4× compression. Furthermore, quantization improves model latency, as all the floating point operations are performed using integers. While more sophisticated compression techniques exist for compressing recurrent layers, we found that quantization was extremely effective and resulted in no degradation in performance.
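The binning scheme described above can be sketched as follows. This is an illustrative per-tensor min/max quantizer written for this example; the function names and the random weight matrix are hypothetical, and a production inference engine may choose ranges and integer arithmetic differently.

```python
import numpy as np


def quantize_linear(weights, num_bins=256):
    """Map float weights to indices of equally spaced bins (8-bit for 256 bins)."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (num_bins - 1)
    if scale == 0.0:  # constant tensor: any positive scale works
        scale = 1.0
    indices = np.round((weights - w_min) / scale)
    indices = np.clip(indices, 0, num_bins - 1).astype(np.uint8)
    return indices, w_min, scale


def dequantize_linear(indices, w_min, scale):
    """Recover the representative float value of each bin index."""
    return w_min + indices.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 32)).astype(np.float32)  # stand-in LSTM weights
indices, w_min, scale = quantize_linear(weights)
restored = dequantize_linear(indices, w_min, scale)

size_ratio = weights.nbytes / indices.nbytes
max_error = float(np.abs(weights - restored).max())
```

Each weight moves by at most about half a bin width (`scale / 2`), and storing one byte per weight instead of four gives the 4× reduction mentioned above, ignoring the two floats kept per tensor for `w_min` and `scale`.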

Experiments
In this section we describe the datasets used and our experimental setup for model compression. While our approach is generically applicable to any NLP task that uses word embeddings, we show the effectiveness of our approach on the three NLU tasks: DC, IC, and NER. We show our results on a large-scale commercial NLU system trained across a large number of intents with huge vocabularies.
Dataset. We use annotated live traffic data of a large-scale, cloud based, commercial VA system to train our NLU models. Utterances from the live traffic are randomly sampled and anonymized to remove any customer specific information. They are then annotated by skilled annotators for the NLU labels corresponding to the domain, intent and slot labels for each utterance. The training set chosen for our experiments contains millions of utterances spanning 5 domains, and over 150 intents and slots. One of these domains is the 'Out of domain' (or OOD) domain, consisting of utterances not supported by the NLU system. The intent for these utterances is labeled as the 'OODIntent' and the words are given the 'Other' slot tag. Our held-out test set is prepared by randomly sampling 1 million utterances from the live-production traffic, following a similar process. In order to facilitate optimization and early stopping, we also use a validation set of a similar scale.
Evaluation Metrics. We use the following metrics for evaluating the performance on the NLU tasks:
Intent Recognition Error Rate (IRER): This is the ratio of number of incorrect interpretations to the total number of utterances. An interpretation is correct when the predicted domain, intent and all slots for an utterance are correct. We compute the IRER only on non-OOD utterances.
Intent Classification Error Rate (ICER): This is the ratio of number of incorrect intent predictions to the total number of utterances.
Domain Classification Error Rate (DCER): This is the ratio of number of incorrect domain predictions to the total number of utterances.
Slot Error Rate (SER): This is the ratio of number of incorrect slot predictions to the total number of slots.
False Accept Rate (FAR): This is the ratio of number of out-of-domain utterances falsely accepted as a supported utterance to the total number of out-of-domain utterances. This metric is mainly used to evaluate the effectiveness of the model in rejecting out-of-domain (or unsupported) utterances.
Along with the above metrics we also compute the sizes of the word embeddings and the MT-RNN task model. We only report relative changes in the above metrics compared to the baseline.
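The metric definitions above can be sketched in plain Python as follows. The data layout (dictionaries with `gold` and `pred` entries), the `"OOD"` domain label and the helper names are assumptions for this example. As a simplification, the slot error rate here compares per-word slot tags, treating every word's tag as one slot prediction.

```python
def make_example(gold, pred):
    """Build one example from (domain, intent, slot_tags) tuples."""
    keys = ("domain", "intent", "slots")
    return {"gold": dict(zip(keys, gold)), "pred": dict(zip(keys, pred))}


def nlu_error_rates(examples, ood_domain="OOD"):
    """Compute IRER, ICER, DCER, SER and FAR for a non-empty list of examples."""
    def wrong(ex, key):
        return ex["gold"][key] != ex["pred"][key]

    in_domain = [ex for ex in examples if ex["gold"]["domain"] != ood_domain]
    out_domain = [ex for ex in examples if ex["gold"]["domain"] == ood_domain]
    total = len(examples)

    slot_total = sum(len(ex["gold"]["slots"]) for ex in examples)
    slot_wrong = sum(
        g != p
        for ex in examples
        for g, p in zip(ex["gold"]["slots"], ex["pred"]["slots"])
    )
    return {
        # interpretation is wrong if domain, intent or any slot is wrong
        "IRER": sum(
            wrong(ex, "domain") or wrong(ex, "intent") or wrong(ex, "slots")
            for ex in in_domain
        ) / len(in_domain),
        "ICER": sum(wrong(ex, "intent") for ex in examples) / total,
        "DCER": sum(wrong(ex, "domain") for ex in examples) / total,
        "SER": slot_wrong / slot_total,
        # OOD utterances that the model assigned to a supported domain
        "FAR": sum(ex["pred"]["domain"] != ood_domain for ex in out_domain)
        / len(out_domain),
    }


examples = [
    make_example(("Music", "PlayMusic", ["Other", "SongName"]),
                 ("Music", "PlayMusic", ["Other", "SongName"])),
    make_example(("Music", "PlayMusic", ["Other", "AlbumName"]),
                 ("Music", "PlayMusic", ["Other", "SongName"])),
    make_example(("OOD", "OODIntent", ["Other", "Other"]),
                 ("Music", "PlayMusic", ["Other", "SongName"])),
    make_example(("OOD", "OODIntent", ["Other"]),
                 ("OOD", "OODIntent", ["Other"])),
]
rates = nlu_error_rates(examples)
```

In this toy set, one of the two in-domain utterances has a wrong slot and one of the two OOD utterances is falsely accepted, so IRER and FAR are both one half.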
NLU Model Training. We train the NLU task model (the MT-RNN model) described in Section 3.1 using the prepared training dataset (Section 4). We initialize the embeddings with FastText (Joulin et al., 2016) embeddings that have been pretrained on a large corpus of unannotated, anonymized, live utterances. The model is trained to minimize the NLU loss $L_{NLU}$ as described in Section 3.1, and the embeddings are fine-tuned during training. The models are trained for a total of 25 epochs, with early stopping on the validation loss, using the Adam optimizer with a learning rate of 0.0001. We further perform a grid search on a range of hyperparameter values for dropout and variational dropout and select the best performing model as our candidate model for compression. This model also serves as our uncompressed baseline.
Baselines. We compare our proposed approach with the following baselines. We use the abbreviations 'TAg.' for 'Task Agnostic' and 'TAw.' for 'Task Aware'.
TAg. SVD: In this approach, large embedding matrices are factorized into matrices of much smaller sizes to produce low-rank approximations of the original embedding matrix, using Singular Value Decomposition (SVD). This is applied as an offline compression method where the embedding matrices are compressed as a post-processing step.
TAw. SVD: Acharya et al. (2019) propose a task-aware SVD-based embedding compression approach, where the embedding matrix is first factorized into lower dimensional matrices using SVD. The factors are then used to initialize a smaller word embedding layer followed by a linear layer, and jointly finetuned with the downstream task model. Stochastic Gradient Descent (SGD) with a learning rate of 0.001 as presented in Acharya et al. (2019) is used for the optimizer.
TAg. DCCL: Task-agnostic compression method proposed by Shu and Nakayama (2017) where the code learning autoencoder described in Section 3.2 is used to compress word-embeddings from the trained NLU model. Since it does not perform joint training of the compression layers with the downstream task, this serves as an ablation test for our proposed task-aware compression approach.
TAg. DCCL + NLU Finetuning: This is another ablation test for our proposed task-aware compression approach. In this approach, task-agnostic compression is performed as in the previous baseline. Once compressed in a task-agnostic way, the embeddings are kept frozen and the downstream task model is fine-tuned to minimize the downstream NLU loss. NLU model fine-tuning is performed with a learning rate of 0.0001 for 5 epochs.
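For reference, the low-rank factorization behind the SVD baselines can be sketched as below. This is only an illustration of truncated SVD on a random stand-in matrix; the function names, the `keep_fraction` parameter and the toy sizes are assumptions for this example.

```python
import numpy as np


def svd_compress(W, keep_fraction):
    """Factorize W (V, D) into A (V, r) and B (r, D) keeping the top r components."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    rank = max(1, int(round(keep_fraction * len(s))))
    A = U[:, :rank] * s[:rank]  # scale the kept left singular vectors
    B = Vt[:rank]
    return A, B


def svd_compression_ratio(V, D, rank):
    """Parameters in the dense matrix divided by parameters in the two factors."""
    return (V * D) / (rank * (V + D))


rng = np.random.default_rng(0)
V, D = 1000, 50  # toy vocabulary and embedding sizes
W = rng.normal(size=(V, D))

A, B = svd_compress(W, keep_fraction=0.2)  # keeps 10 of 50 components
W_approx = A @ B
ratio = svd_compression_ratio(V, D, rank=A.shape[1])

A_full, B_full = svd_compress(W, keep_fraction=1.0)  # no truncation
```

In the task-aware variant, `A` would initialize the smaller embedding layer and `B` the linear layer that follows it, after which both are fine-tuned with the task model.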
For all SVD-based approaches, we run experiments over a range of values for n, where n is the fraction of components retained in the low-rank SVD approximation. This produces models of different sizes. For all DCCL-based baselines, we train the task-agnostic autoencoder model for 300 epochs (approximately 800k iterations) with a learning rate of 0.0001 using the Adam optimizer. We experiment with a range of values for hyperparameters M and K, where M is the number of codebooks and K is the number of basis vectors per codebook. Different values of M and K produce models of different sizes.
Implementation details. Our approach is essentially a task-aware version of DCCL (TAw. DCCL). In our method, the compression layers are initialized with the parameters from the trained autoencoder model, obtained as a result of task-agnostic post-processing compression. Similarly, the NLU specific layers are initialized from the trained NLU model. The entire compression model is then trained end-to-end to minimize the loss function as mentioned in Section 3.2.3. The model is trained with a learning rate of 0.0001 for 5 epochs. Similar to the above task-agnostic setups, we experiment with a range of values for M and K. We further explore the following additional setups:
Without pretraining: In this setup, the compression layers and the task model are jointly trained from scratch and are not initialized from pretrained components. The model is trained to minimize the joint NLU loss without the embedding reconstruction loss. We use the Adam optimizer with a learning rate
Without embedding reconstruction loss: In this approach, we do not add the embedding reconstruction loss to the downstream task loss. The models are, however, initialized from pretrained components, and trained end-to-end for 5 epochs.

Results and Analysis
Table 1 summarizes the impact of various word embedding compression approaches on the downstream IRER metric for a range of compression rates. Compression rate is determined by dividing the uncompressed embedding (or model) size by the compressed embedding (or model) size. We report percentage relative changes to the IRER when compared to the uncompressed baseline. The results presented are for 300-dimensional embeddings. However, similar trends were observed for 100-dimensional embeddings as well.
In general, we find that task-aware approaches perform better than task-agnostic post-processing approaches. This is because the task-aware end-to-end compression tunes the compression to the downstream task, while also adjusting the task model parameters to recover performance due to lossy reconstructions. From Table 1 we also find that for any given compression rate, our proposed task-aware DCCL approach has the least degradation in predictive performance when compared to other methods.
Task-aware DCCL outperforms even the best task-agnostic compression baseline (TAg. DCCL + NLU Fine-tuning) by 39-44% at each of the different compression rates. This shows that the loss signal from the downstream task helps performance by not only adapting the task model to the compression, but also by improving compression quality. Moreover, our model at 120× compression rate performs better than the best baseline even at 60× compression rate. In other words, our models are 2× smaller than even the best baseline for a similar performance. We also find that the embedding reconstruction loss added to the downstream task loss helps improve the downstream performance, especially when the compression rate is lower i.e. when the gumbel-sampling layers are larger or more in number.
In order to understand the importance of task-aware compression, we plot the word embedding reconstruction loss (Figure 4) for the top most frequent words in our dataset. As seen in Figure 4, the average reconstruction loss for task-agnostic DCCL remains approximately constant irrespective of frequency of the words, indicating that all words are treated equally. In contrast, task-aware compression reduces the average reconstruction loss for more frequent words indicating that the network capacity is spent to learn better reconstructions for words more important for the downstream task. Note that the model used for the graph is the task-aware DCCL model without the reconstruction loss term.
We also find that DCCL-based approaches consistently performed better than their SVD counterparts, in both task-aware and task-agnostic variants. SVD-based approaches do not perform well beyond a specific compression rate (+7.99% for 1.7× compression). On investigating, we found that word embeddings were full rank matrices, with high singular values for all components, indicating that these components captured high variance. Table 2 presents a summary of the performance of the best models for each of the approaches at around 60× embedding compression rate. 8-bit Bi-LSTM quantization helps reduce the size of the recurrent layers in the models, resulting in a net model compression ratio of 39.5× with a minimal performance degradation of 3.69% when compared to the uncompressed baseline.
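To relate the compression figures quoted as ratios (such as 60×) to those quoted as percentages, and to see how M and K drive the size of DCCL-compressed embeddings, a back-of-the-envelope sketch is given below. The storage model (32-bit codebook entries and ceil(log2 K) bits per code) and the example values of V, M and K are assumptions for illustration, not the configuration used in our experiments.

```python
import math


def ratio_to_percent_reduction(ratio):
    """Convert a compression ratio such as 39.5 (i.e. 39.5x) to a % size reduction."""
    return 100.0 * (1.0 - 1.0 / ratio)


def dense_embedding_bits(V, D, float_bits=32):
    """Size of an uncompressed V x D embedding matrix in bits."""
    return V * D * float_bits


def dccl_embedding_bits(V, D, M, K, float_bits=32):
    """Size of integer codes plus codebooks in bits."""
    code_bits = V * M * math.ceil(math.log2(K))  # M codes per word
    codebook_bits = M * K * D * float_bits       # M codebooks of K x D floats
    return code_bits + codebook_bits


# Hypothetical configuration, for illustration only.
V, D, M, K = 100_000, 300, 32, 16
dense_bits = dense_embedding_bits(V, D)
compressed_bits = dccl_embedding_bits(V, D, M, K)
ratio = dense_bits / compressed_bits
percent = ratio_to_percent_reduction(ratio)
```

Because the codebooks do not grow with the vocabulary, the per-word cost is only `M * ceil(log2 K)` bits, which is why the achievable ratio increases with vocabulary size under this storage model.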

Conclusion
In this paper, we present approaches for extreme model compression for performing natural language understanding on resource-constrained devices. We use a unified multi-domain, multi-task neural model that performs DC, IC and NER for all supported domains. We discuss model compression approaches to compress the bulkiest components of our models, the word embeddings, and propose a task-aware end-to-end compression method based on deep compositional code learning where we jointly train the compression layers with the downstream task. This approach reduced word embedding sizes to just a few MB, achieving a word-embedding compression rate of 98.4%, and outperformed all other task-agnostic and task-aware embedding compression baselines. We further apply post-training 8-bit linear quantization to compress the recurrent layers of the model. These approaches together result in a net model compression rate of 97.5%, with a minimal performance degradation of 3.64% when compared to the uncompressed model baseline.
DCCL approaches are complementary to other compression approaches such as knowledge distillation and model pruning. While our work demonstrates the effectiveness of task-aware DCCL on the classification and tagging tasks in NLU, the approach itself is generic and can be applied to other NLP tasks that rely on large word-embeddings. As part of future work, we would like to explore the effectiveness of task-aware DCCL on NLP tasks such as machine translation and language modeling. We would also like to explore compression of models with advanced architectures using contextual embeddings.