Compressing Transformer-Based Semantic Parsing Models using Compositional Code Embeddings

The current state-of-the-art task-oriented semantic parsing models use BERT or RoBERTa as pretrained encoders; these models have huge memory footprints. This poses a challenge to their deployment for voice assistants such as Amazon Alexa and Google Assistant on edge devices with limited memory budgets. We propose to learn compositional code embeddings to greatly reduce the sizes of BERT-base and RoBERTa-base. We also apply the technique to DistilBERT, ALBERT-base, and ALBERT-large, three already-compressed BERT variants that attain similar state-of-the-art performances on semantic parsing with much smaller model sizes. We observe 95.15% ∼ 98.46% embedding compression rates and 20.47% ∼ 34.22% encoder compression rates, while preserving >97.5% semantic parsing performances. We provide the recipe for training and analyze the trade-off between code embedding sizes and downstream performances.


Introduction
Conversational virtual assistants, such as Amazon Alexa, Google Home, and Apple Siri, have become increasingly popular in recent times. These systems can process queries from users and perform tasks such as playing music and finding locations. A core component in these systems is a task-oriented semantic parsing model that maps natural language expressions to structured representations containing intents and slots that describe the task to perform. For example, the expression Can you play some songs by Coldplay? may be converted to Intent: PlaySong, Artist: Coldplay, and the expression Turn off the bedroom light may be converted to Intent: TurnOffLight, Device: bedroom.
The current state-of-the-art models on the SNIPS (Coucke et al., 2018), ATIS (Price, 1990), and Facebook TOP (Gupta et al., 2018) datasets are all based on BERT-style (Devlin et al., 2018; Liu et al., 2019) encoders and transformer architectures (Castellucci et al., 2019; Rongali et al., 2020). Due to the limited memory budgets on edge devices, it is challenging to deploy these large models there and enable the voice assistants to operate locally instead of relying on central cloud services. Meanwhile, there has been a growing push towards the idea of TinyAI.
In this paper, we aim to build space-efficient task-oriented semantic parsing models that produce near state-of-the-art performances by compressing existing large models. We propose to learn compositional code embeddings to significantly compress BERT-base and RoBERTa-base encoders with little performance loss. We further use ALBERT-base/large (Lan et al., 2019) and DistilBERT to establish light baselines that achieve similar state-of-the-art performances, and apply the same code embedding technique. We show that our technique is complementary to the compression techniques used in ALBERT and DistilBERT. With all variants, we achieve 95.15% ∼ 98.46% embedding compression rates and 20.47% ∼ 34.22% encoder compression rates, with >97.5% semantic parsing performance preservation.

BERT Compression
Many techniques have been proposed to compress BERT (Devlin et al., 2018). Ganesh et al. (2020) provide a survey on these methods. Most existing methods focus on alternative architectures in transformer layers or learning strategies.
In our work, we use DistilBERT and ALBERT-base as light pretrained language model encoders for semantic parsing. DistilBERT uses distillation to pretrain a model that is 40% smaller and 60% faster than BERT-base, while retaining 97% of its downstream performances. ALBERT (Lan et al., 2019) factorizes the embedding and shares parameters among the transformer layers in BERT, resulting in better scalability than BERT. ALBERT-xxlarge outperforms BERT-large on GLUE (Wang et al., 2018), RACE (Lai et al., 2017), and SQuAD (Rajpurkar et al., 2016) while using fewer parameters.
We use compositional code learning (Shu and Nakayama, 2017) to compress the model embeddings, which account for a substantial share of the model parameters. ALBERT previously used factorization to compress the embeddings; we find that code embeddings allow further compression.

Embedding Compression
Various techniques have been proposed to learn compressed versions of non-contextualized word embeddings such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Subramanian et al. (2018) use denoising k-sparse autoencoders to achieve binary, sparse, interpretable word embeddings.  achieve sparsity by representing the embeddings of uncommon words as sparse linear combinations of common words. Lam (2018) achieves compression by quantizing the word embeddings to 1-2 bits per parameter. Faruqui et al. (2015) use sparse coding in a dictionary learning setting to obtain sparse, non-negative word embeddings. Raunak (2017) achieves dense compression of word embeddings using PCA combined with a post-processing algorithm. Shu and Nakayama (2017) propose to represent word embeddings using compositional codes learned directly in end-to-end fashion with neural networks. Essentially, a few common basis vectors are learned, and each embedding is reconstructed by composing them via a discrete code vector specific to that token. This results in a 98% compression rate in sentiment analysis and 94% ∼ 99% in machine translation tasks without performance loss with LSTM-based models. All of the above techniques are applied to embeddings such as Word2Vec and GloVe, or to LSTM models.
We aim to learn space-efficient embeddings for transformer-based models. We focus on compositional code embeddings (Shu and Nakayama, 2017) since they maintain the vector dimensions, do not require special kernels for computing in a sparse or quantized space, can be finetuned with transformer-based models end-to-end, and achieve extremely high compression rates.  explores a similar idea to Shu and Nakayama (2017) and experiments with more complex composition functions and guidance for training the discrete codes. Chen and Sun (2019) further show that end-to-end training from scratch of models with code embeddings is possible. Given various pretrained language models, we find that the method proposed by Shu and Nakayama (2017) is straightforward and performs well in our semantic parsing experiments.

Compositional Code Embeddings
Shu and Nakayama (2017) apply additive quantization (Babenko and Lempitsky, 2014) to learn compositional code embeddings that reconstruct pretrained word embeddings such as GloVe (Pennington et al., 2014), or task-specific model embeddings such as those from an LSTM neural machine translation model. Compositional code embeddings E_C for a vocabulary V consist of a set of M codebooks E_C^1, E_C^2, ..., E_C^M, each with K basis vectors of the same dimensionality D as the reference embeddings E, and a discrete code vector (C_w^1, C_w^2, ..., C_w^M) for each token w in the vocabulary. The final embedding for w is composed by summing the C_w^i-th vector from the i-th codebook:

E_C(w) = sum_{i=1}^{M} E_C^i(C_w^i)

Codebooks and discrete codes are jointly learned using the mean squared distance objective:

min (1/|V|) sum_{w in V} || E_C(w) - E(w) ||^2

For learning the compositional codes, the Gumbel-softmax reparameterization trick (Jang et al., 2016; Maddison et al., 2016) is applied to the one-hot vectors corresponding to each discrete code.

Table 1: Model compression with compositional code ("cc") embeddings. The embedding layers are compressed by more than 95% with compositional code embeddings in all of the BERT variants.
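The composition and reconstruction objective can be sketched in a few lines of NumPy. This is an illustrative toy (the array names and sizes are ours; the paper's defaults are M=32, K=16, with D matching the reference embeddings):

```python
import numpy as np

# Toy sizes: M codebooks, K basis vectors each, dimensionality D, vocab V.
M, K, D, V = 4, 8, 6, 10
rng = np.random.default_rng(0)

codebooks = rng.normal(size=(M, K, D))   # E_C^1, ..., E_C^M
codes = rng.integers(0, K, size=(V, M))  # (C_w^1, ..., C_w^M) per token w

def compose_embedding(w):
    """E_C(w): sum the C_w^i-th basis vector from each codebook i."""
    return codebooks[np.arange(M), codes[w]].sum(axis=0)

# Mean squared distance against a reference embedding table E.
E = rng.normal(size=(V, D))
recon = np.stack([compose_embedding(w) for w in range(V)])
mse = np.mean(np.sum((recon - E) ** 2, axis=1))
```

In the actual method the codes are relaxed with Gumbel-softmax so that both codebooks and codes receive gradients; here the codes are simply fixed integers for illustration.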

Transformer-Based Models with Compositional Code Embeddings
In this work, we learn compositional code embeddings to reduce the size of the embeddings in pretrained contextualized language models. We extract the embedding tables from pretrained RoBERTa-base (Liu et al., 2019), BERT-base (Devlin et al., 2018), DistilBERT-base, ALBERT-large-v2 and ALBERT-base-v2 (Lan et al., 2019) from the huggingface transformers library and follow the approach presented by Shu and Nakayama (2017) to learn the code embeddings. We then replace the embedding tables in the transformer models with the compositional code approximations and evaluate the compressed language models by finetuning on downstream tasks. When Shu and Nakayama (2017) feed compositional code embeddings into the LSTM neural machine translation model, they fix the embedding parameters and train the rest of the model from random initial values. In our experiments, we fix the discrete codes, initialize the transformer layers with those from the pretrained language models, initialize the task-specific output layers randomly, and finetune the codebook basis vectors together with the rest of the non-discrete parameters.

Size Advantage of Compositional Code Embeddings
An embedding matrix E ∈ R^{|V|×D} stored as 32-bit floating-point numbers, where |V| is the vocabulary size and D is the embedding dimension, requires 32|V|D bits. Its compositional code reconstruction requires 32MKD bits for the MK basis vectors, plus M log2 K bits for the code of each of the |V| tokens, since each discrete code takes an integer value in [1, K] and can be represented using log2 K bits. Table 1 illustrates the size advantage of compositional code embeddings for various pretrained transformer models.
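As a back-of-the-envelope check (ours, not the paper's exact accounting), plugging in BERT-base-like numbers (|V| = 30,522, D = 768) with the default M = 32, K = 16 reproduces an embedding compression rate in the reported 95% ∼ 98% range:

```python
import math

def embedding_bits(V, D):
    """Dense embedding table stored as 32-bit floats: 32|V|D bits."""
    return 32 * V * D

def code_embedding_bits(V, D, M, K):
    """32MKD bits for MK basis vectors + M*log2(K) bits per token's code."""
    return 32 * M * K * D + V * M * math.log2(K)

V, D, M, K = 30522, 768, 32, 16   # BERT-base-like vocab/dim, default M, K
dense = embedding_bits(V, D)
coded = code_embedding_bits(V, D, M, K)
compression = 1 - coded / dense   # fraction of embedding storage saved
```

With these numbers the codebooks dominate the code-embedding cost, and the compression rate comes out around 97.8%.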

Experiments and Analyses
For transformer model training, we base our implementation on the huggingface transformers library v2.6.0. We use the AdamW optimizer (Loshchilov and Hutter, 2017) with 10% warmup steps and linear learning rate decay to 0. For code embedding learning, we base our implementation on that of Shu and Nakayama (2017). By default we learn code embeddings with 32 codebooks and 16 basis vectors per codebook. Unless otherwise specified, hyperparameters are found according to validation performances from one random run. We conduct our experiments on a mixture of Tesla M40, TITAN X, 1080 Ti, and 2080 Ti GPUs. We use exact match (EM) and intent accuracy as evaluation metrics. Exact match requires correct predictions for all intents and slots in a query, and is our primary metric.

Model | EM | Intent
(Zhang et al., 2019) | 80.9 | 97.3
BERT-Seq2Seq-Ptr (Rongali et al., 2020) | 86.3 | 98.3
RoBERTa-Seq2Seq-Ptr (Rongali et al., 2020) | 87.1 | 98.0
BERT-Joint (Castellucci et al., 2019) | 91.6 | 99.0
Joint BERT | 92 |

Table 3: Results on SNIPS. "cc" indicates models with code embeddings. "epo" is the epoch number for offline code embedding learning. "lr" and "wd" are the peak learning rate and weight decay for whole-model finetuning. "EM-v", "EM", "Intent" indicate validation exact match, test exact match, and test intent accuracy.

SNIPS and ATIS
We implement a joint sequence-level and token-level classification layer for pretrained transformer models. The intent probabilities are predicted as y^i = softmax(W^i h_0 + b^i), where h_0 is the hidden state of the [CLS] token. The slot probabilities for each token j are predicted as y_j^s = softmax(W^s h_j + b^s). We use the cross entropy loss to maximize p(y^i|x) and p(y_j^s|x), where j ranges over the first word-piece token of each word in the query. We learn code embeddings for {500, 700, 900, 1100, 1300} epochs. We train transformer models with original and code embeddings for 40 epochs with batch size 16 and sequence length 128. Uncased BERT and DistilBERT perform better than the cased versions. We experiment with peak learning rates {2e-5, 3e-5, ..., 6e-5} and weight decays {0.01, 0.05, 0.1}. As shown in Tables 3 and 4, we use different transformer encoders to establish strong baselines which achieve EM values within 1.5% of the state-of-the-art.
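The joint head can be sketched as follows, in NumPy with toy shapes (W^i, b^i, W^s, b^s are the intent and slot projection parameters; the variable names are our own):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

T, D = 5, 8                     # sequence length, hidden size (toy values)
n_intents, n_slots = 3, 4
rng = np.random.default_rng(0)

H = rng.normal(size=(T, D))     # encoder hidden states; H[0] is the [CLS] state
W_i, b_i = rng.normal(size=(n_intents, D)), np.zeros(n_intents)
W_s, b_s = rng.normal(size=(n_slots, D)), np.zeros(n_slots)

y_intent = softmax(W_i @ H[0] + b_i)         # one distribution per query
y_slots = softmax(H @ W_s.T + b_s, axis=-1)  # one distribution per token
```

At training time, cross entropy would be taken over y_intent and over y_slots at the first word-piece position of each word, as described above.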
On both datasets, models based on our compressed ALBERT-large-v2 encoder (54MB) preserve >99.6% of the EM of the previous state-of-the-art model, which uses a BERT encoder (420MB). In all settings, our compressed encoders preserve >97.5% of the EM of the uncompressed counterparts under the same training settings. This shows that our technique is effective on a variety of pretrained transformer encoders.

Analysis for Code Convergence
We study the relationship among a few variables during code learning for the embeddings from pretrained ALBERT-base (Table 6). During the first 1000 epochs, the mean Euclidean distance between the original and reconstructed embeddings decreases at a decreasing rate. The average number of shared top-20 nearest neighbours between the two embeddings, according to cosine similarity and Euclidean distance, increases at a decreasing rate. We apply code embeddings trained for different numbers of epochs to ALBERT-base-v2 and finetune on semantic parsing. On SNIPS and ATIS, we find the best validation setting among learning rates {2, 3, 4, 5, 6}e-5 and weight decays {0.01, 0.05, 0.1}. We observe that test exact match plateaus for code embeddings trained for more than 400 epochs. On Facebook TOP, we use learning rate 2e-5 and weight decay 0.01, and observe a similar trend.

Effects of M and K
We use embeddings from pretrained ALBERT-base-v2 as the reference to learn code embeddings with M in {8, 16, 32, 64} and K in {16, 32, 64}. As shown in Table 7, after 700 epochs, the MSE loss for embeddings with larger M and K generally converges to smaller values. With M=64, more epochs are needed for convergence to smaller MSE losses compared to smaller M. We apply the embeddings to ALBERT-base-v2 and finetune on SNIPS. In general, larger M yields better performances. The effects of K are less clear when M is large.

Conclusion
Current state-of-the-art task-oriented semantic parsing models are based on pretrained RoBERTa-base (478MB) or BERT-base (420MB). We apply DistilBERT (256MB), ALBERT-large (68MB), and ALBERT-base (45MB), and observe near state-of-the-art performances. We learn compositional code embeddings to compress the model embeddings by 95.15% ∼ 98.46% and the pretrained encoders by 20.47% ∼ 34.22%, and observe >97.5% performance preservation on SNIPS, ATIS, and Facebook TOP. Our compressed ALBERT-large is 54MB and achieves >99.6% of the performance of the previous state-of-the-art models on SNIPS and ATIS. Our technique has the potential to be applied to more tasks, including machine translation, in the future.