Accelerating Natural Language Understanding in Task-Oriented Dialog

Task-oriented dialog models typically leverage complex neural architectures and large-scale, pre-trained Transformers to achieve state-of-the-art performance on popular natural language understanding benchmarks. However, these models frequently have in excess of tens of millions of parameters, making them difficult to deploy on-device, where resource efficiency is a major concern. In this work, we show that a simple convolutional model compressed with structured pruning achieves largely comparable results to BERT on ATIS and Snips, with under 100K parameters. Moreover, we perform acceleration experiments on CPUs, where we observe that our multi-task model predicts intents and slots nearly 63x faster than even DistilBERT.


Introduction
The advent of smart devices like Amazon Alexa, Facebook Portal, and Google Assistant has increased the necessity of resource-efficient task-oriented systems (Coucke et al., 2018; Zhang et al., 2020; Desai et al., 2020). These systems chiefly perform two natural language understanding tasks, intent detection and slot filling, where the goals are to understand what the user is trying to accomplish and the metadata associated with the request, respectively (Gupta et al., 2018). However, there remains a disconnect between state-of-the-art task-oriented systems and their deployment in real-world applications. Recent top-performing systems have largely saturated performance on ATIS (Hemphill et al., 1990) and Snips (Coucke et al., 2018) by leveraging complex neural architectures and large-scale, pre-trained language models (Devlin et al., 2019), but their usability in on-device settings remains suspect (Qin et al., 2019; Cheng et al., 2017). Mobile phones, for example, have sharp hardware constraints and limited memory capacities, implying systems must optimize for both accuracy and resource efficiency to run in these types of environments (Lin et al., 2010; McIntosh et al., 2018).
In this work, we present a vastly simplified, single-layer convolutional model (Kim, 2014; Bai et al., 2018) that is highly compressible but nonetheless achieves competitive results on task-oriented natural language understanding benchmarks. To compress the model, we use structured magnitude-based pruning (Anwar et al., 2017), a two-step approach where (1) entire convolutional filters are deleted according to their ℓ2 norms; and (2) the remaining portions of the underlying weight matrix are spliced together. The successive reduction in the number of convolutional output connections permits downstream weight matrices to reduce their number of input connections as well, collectively resulting in a smaller model. Structured pruning and re-training steps are then interleaved to ensure the model is able to reconstruct lost filters that may contribute valuable information. At test time, however, we use the pruned model as-is without further fine-tuning.
Our simple convolutional model with structured pruning obtains strong results despite having fewer than 100K parameters. On ATIS, our multi-task model achieves 95% intent accuracy and 94% slot F1, only about 2% lower than BERT (Devlin et al., 2019). Structured pruning also admits significantly faster inference: on CPUs, we show our model is 63× faster than DistilBERT. Unlike compression methods based on unstructured pruning (Frankle and Carbin, 2019), our model enjoys speedups without having to rely on a sparse tensor library at test time (Han et al., 2016), demonstrating its potential for use in resource-constrained, on-device settings. Our code is publicly available at https://github.com/oja/pruned-nlu.

Related Work
Task-Oriented Dialog. Dialog systems perform a range of tasks, including language understanding, dialog state tracking, content planning, and text generation (Bobrow et al., 1977; Henderson, 2015; Yu et al., 2016; Yan et al., 2017). For smart devices, specifically, intent detection and slot filling form the backbone of natural language understanding (NLU) modules, which can be used in either single-turn or multi-turn conversations (Coucke et al., 2018; Rastogi et al., 2020). We contribute a single-turn, multi-task NLU system especially tailored for on-device settings, as demonstrated through acceleration experiments.
Model Compression. In natural language processing, numerous works have used compression techniques like quantization (Wróbel et al., 2018; Zafrir et al., 2019), distillation (Sanh et al., 2019; Tang et al., 2019; Jiao et al., 2020), pruning (Yoon et al., 2018; Gordon et al., 2020), and smaller representations (Desai et al., 2020). Concurrently, Desai et al. (2020) develop lightweight convolutional representations for on-device task-oriented systems, related to our goals, but they do not compare with other compression methods and solely evaluate on a proprietary dataset. In contrast, we compare the efficacy of structured pruning against strong baselines, including BERT (Devlin et al., 2019), on the open-source ATIS and Snips datasets.

Convolutional Model
Convolutions for On-Device Modeling. State-of-the-art task-oriented models are largely based on recurrent neural networks (RNNs) (Wang et al., 2018) or Transformers (Qin et al., 2019). However, these models are often impractical to deploy in low-resource settings. Recurrent models must sequentially unroll sequences during inference, and self-attention mechanisms in Transformers process sequences with quadratic complexity (Vaswani et al., 2017). High-performing, pre-trained Transformers, in particular, also have upwards of tens of millions of parameters, even when distilled (Tang et al., 2019; Sanh et al., 2019).
Convolutional neural networks (CNNs), in contrast, are highly parallelizable and can be significantly compressed with structured pruning, while still achieving competitive performance on a variety of NLP tasks (Kim, 2014; Gehring et al., 2017). Furthermore, the core convolution operation has enjoyed speedups with dedicated digital signal processors (DSPs) and field-programmable gate arrays (FPGAs) (Ahmad and Pasha, 2020). Model compatibility with on-device hardware is one of the most important considerations for practitioners: even if a model works well on high-throughput GPUs, its components may saturate valuable resources like memory and power (Lin et al., 2010).
Model Description. Model inputs are encoded as a sequence of integers w = (w_1, · · · , w_n) and right-padded up to a maximum sequence length. The embedding layer replaces each token w_i with a corresponding d-dimensional vector e_i ∈ R^d sourced from pre-trained GloVe embeddings (Pennington et al., 2014). A feature map c ∈ R^(n−h+1) is then calculated by applying a convolutional filter of height h over the embedded input sequence. We apply max-over-time pooling ĉ = max(c) (Collobert et al., 2011) to simultaneously reduce the dimensionality and extract the most salient features. The pooled features are then concatenated and fed through a linear layer with dropout (Srivastava et al., 2014). The objective is to maximize the log likelihood of intents, slots, or both (under a multi-task setup), and is optimized with Adam (Kingma and Ba, 2015).
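To make the forward pass concrete, the pipeline above can be sketched in a few lines of numpy: embedding lookup, a filter bank of height h, max-over-time pooling, and a linear intent head. All dimensions, weights, and names here are illustrative toys, not the paper's actual hyperparameters, and dropout and training are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
vocab, d, h, num_filters, num_intents = 50, 8, 3, 4, 5

E = rng.normal(size=(vocab, d))                  # embedding table (GloVe in the paper)
F = rng.normal(size=(num_filters, h, d))         # convolutional filters of height h
W = rng.normal(size=(num_filters, num_intents))  # intent classification head

def forward(tokens):
    e = E[tokens]                                # (n, d) embedded sequence
    n = len(tokens)
    # Each filter yields a feature map c ∈ R^(n-h+1) by sliding over the sequence.
    c = np.array([[np.sum(F[f] * e[i:i + h]) for i in range(n - h + 1)]
                  for f in range(num_filters)])
    c_hat = c.max(axis=1)                        # max-over-time pooling, one value per filter
    return c_hat @ W                             # linear layer (dropout omitted)

logits = forward(np.array([3, 14, 15, 9, 2, 6]))
print(logits.shape)  # (5,) — one logit per intent class
```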
To ensure broad applicability, our model emphasizes simplicity, and therefore minimizes the number of extraneous architectural decisions: there is only a single convolutional block, no residual connections, and no normalization layers.
Temporal Padding. The model described above is capable of predicting an intent that encompasses the entire input sequence, but cannot be used for sequence labeling tasks, namely slot filling. To create a one-to-one correspondence between input tokens and output slots, Bai et al. (2018) left-pad the input sequence by k − 1, where k is the kernel size. We modify this by loosening the causality constraint and instead padding each side by (k − 1)/2. Visually, this results in a "centered" convolution that incorporates bidirectional context when computing a feature map. Note that this padding is unnecessary for intent detection, so we skip it when training a single-task intent model.
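A quick sanity check of the padding arithmetic, assuming an odd kernel size k: padding each side by (k − 1)/2 keeps the output length equal to the input length, which is exactly what the one-to-one token-to-slot correspondence requires. The helper name below is ours, purely for illustration.

```python
def centered_conv_length(n, k):
    # With odd k, pad each side by (k - 1) // 2 so each output position sees
    # bidirectional context and output length matches input length.
    pad = (k - 1) // 2
    return (n + 2 * pad) - k + 1

for k in (3, 5, 7):
    print(centered_conv_length(10, k))  # 10 for every odd kernel size
```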
Multi-Task Training. Intent detection and slot filling can either be disjointly learned with dedicated single-task models or jointly learned with a unified multi-task model (Liu and Lane, 2016). In the latter model, we introduce task-specific heads on top of the common representation layer and simultaneously optimize both objectives: L = α L_intent + (1 − α) L_slot. Empirically, we observe that weighting L_slot more than L_intent results in higher performance (α ≈ 0.2). Our hypothesis is that, because of the comparative difficulty of slot filling, the model is required to learn a more robust representation of each utterance, which is nonetheless useful for intent detection.

Structured Pruning
Structured vs. Unstructured Pruning. Pruning is a compression technique that removes weights from an over-parameterized model (LeCun et al., 1990), often relying on a heuristic function that ranks weights (or groups of weights) by their importance. Methods for pruning are broadly categorized as unstructured and structured: unstructured pruning allows weights to be removed anywhere without geometric constraints, whereas structured pruning induces well-defined sparsity patterns, for example, dropping entire filters in a convolutional layer according to their norm (Molchanov et al., 2016; Anwar et al., 2017). Critically, the model's true size is not diminished by unstructured pruning: without a sparse tensor library, weight matrices with scattered zero elements must still be stored in full (Han et al., 2016). In contrast, structurally pruned models do not rely on such libraries at test time since non-zero units can simply be spliced together.
Pruning Methodology. The structured pruning process is depicted in Figure 1. In each pruning iteration, we rank each filter by its ℓ2 norm, greedily remove the filters with the smallest magnitudes, and splice together the remaining filters in the underlying weight matrix. The deletion of a single filter results in one fewer output channel, implying we can also remove the corresponding input channel of the subsequent linear layer with a similar splicing operation. Repetitions of this process result in an objectively smaller model because of reductions in the convolutional and linear layer weight matrices. Furthermore, this process does not lead to irregular sparsity patterns, resulting in a general speedup on all hardware platforms.

Figure 1: Structured pruning of convolutional models by (1) ranking filters by their ℓ2 norm, then (2) splicing out the lowest-norm filter, resulting in a successively smaller weight matrix. Because each filter convolves c_in input channels into one output channel, removing a single filter results in c_out − 1 output channels.
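The ranking-and-splicing step can be sketched in a few lines of numpy; the shapes and function name here are illustrative, not the actual implementation. One iteration deletes the lowest-ℓ2-norm filter and the matching input channel of the downstream linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
num_filters, h, d, num_labels = 6, 3, 8, 5

conv = rng.normal(size=(num_filters, h, d))          # convolutional filter bank
linear = rng.normal(size=(num_filters, num_labels))  # downstream linear layer

def prune_one_filter(conv, linear):
    # (1) Rank filters by their ℓ2 norm.
    norms = np.linalg.norm(conv.reshape(conv.shape[0], -1), axis=1)
    victim = int(np.argmin(norms))
    # (2) Splice out the lowest-norm filter; the corresponding input channel
    #     of the linear layer is removed as well, so the model truly shrinks.
    keep = [i for i in range(conv.shape[0]) if i != victim]
    return conv[keep], linear[keep]

conv, linear = prune_one_filter(conv, linear)
print(conv.shape, linear.shape)  # (5, 3, 8) (5, 5)
```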
The heuristic function for ranking filters and whether to re-train the model after a pruning step are important hyperparameters. We experimented with both ℓ1 and ℓ2 norms for selecting filters, and found that ℓ2 slightly outperforms ℓ1. More complicated heuristic functions, such as deriving filter importance from gradient saliency (Persand et al., 2020), can also be dropped into our pipeline without modification.
One-Shot vs. Iterative Pruning. Furthermore, when deciding to re-train the model, we experiment with one-shot and iterative pruning (Frankle and Carbin, 2019). One-shot pruning involves repeatedly deleting filters until reaching a desired sparsity level without re-training, whereas iterative pruning interleaves pruning and re-training, such that the model is re-trained to convergence after each pruning step. These re-training steps increase overall training time, but implicitly help the model "reconstruct" deleted filter(s), resulting in significantly better performance. During test-time, however, the pruned model uses significantly fewer resources, as we demonstrate in our acceleration experiments.
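Schematically, iterative pruning is just a prune/re-train loop. The toy Python below tracks only a bare filter count with a no-op re-training placeholder, purely to show the control flow; one-shot pruning would skip the retrain call.

```python
def iterative_prune(n_filters, target_sparsity, retrain=lambda n: n):
    # Toy stand-ins: the "model" is a filter count, one pruning step removes
    # the lowest-norm filter, and "re-training" is a placeholder callback.
    kept = n_filters
    while 1 - kept / n_filters < target_sparsity:
        kept -= 1            # prune the lowest-norm filter and splice
        kept = retrain(kept)  # re-train to convergence (one-shot skips this)
    return kept

print(iterative_prune(100, 0.5))  # 50 filters remain at 50% sparsity
```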

Tasks and Datasets
We build convolutional models for intent detection and slot filling, two popular natural language understanding tasks in the task-oriented dialog stack. Intent detection is a multi-class classification problem, whereas slot filling is a sequence labeling problem. Formally, given utterance tokens w = (w_1, · · · , w_n), models induce a joint distribution P(y*_intent, y*_slot | w) over an intent label y*_intent and slot labels y*_slot = (y*_slot^(1), · · · , y*_slot^(n)). These models are typically multi-task: intent and slot predictions are derived with task-specific heads but share a common representation (Liu and Lane, 2016). However, since the intent and slots of an utterance are independent, we can also learn single-task models, where an intent model optimizes P(y*_intent | w) and a slot model optimizes P(y*_slot | w). We experiment with both approaches, although our ultimate compressed model is multi-task, in line with on-device use cases.
Following previous work, we evaluate on ATIS (Hemphill et al., 1990) and Snips (Coucke et al., 2018), both of which are single-turn dialog benchmarks with intent detection and slot filling. ATIS has 4,478/500/893 train/validation/test samples, respectively, with 21 intents and 120 slots. Snips has 13,084/700/700 samples with 7 intents and 72 slots. Our setup follows the same preparation as Zhang et al. (2019).

Experiments and Results
We evaluate the performance, compression, and acceleration of our structured pruning approach against several baselines. Note that we do not employ post-hoc compression methods like quantization (Guo, 2018), as they are orthogonal to our core method and can be utilized at no additional cost to further improve performance on-device.

Table 2: ATIS performance of multi-task models compressed with structured pruning (ours) and knowledge distillation (Hinton et al., 2015) as the compression rate (CR; %) increases. We report intent accuracy and slot F1. Darker shades of red indicate higher absolute performance drops with respect to 100%.

Benchmark Results
We experiment with both single-task and multi-task models, with and without structured pruning, on ATIS and Snips. The results are displayed in Table 1. Our multi-task model with structured pruning, even with over a 50% reduction in parameters, performs on par with our NO COMPRESSION baselines.
On ATIS, our model is comparable to SLOT-GATED RNN (Goo et al., 2018) and is only about 2% worse in accuracy/F1 than BERT. However, we note that our model's slot F1 drops off severely on Snips, possibly because it is a much larger dataset spanning a myriad of domains. Whether our pre-trained embeddings have sufficient explanatory power to scale past common utterances is an open question. Furthermore, to approximate what information is lost after compression, we analyze which samples' predictions flip from correct to incorrect after structured pruning. We observe that sparser models tend to prefer larger classes; for example, in slot filling, tags are often mislabeled as "outside" in IOB labeling (Tjong and Sang, 2000). This demonstrates a trade-off between preserving non-salient features that work on average for all classes or salient features that accurately discriminate between the most prominent classes. Our model falls on the right end of this spectrum, in that it greedily de-prioritizes representations for inputs that do not contribute as much to aggregate dataset log likelihood.

Comparison with Distillation
In addition, we compare structured pruning with knowledge distillation, a popular compression technique where a small student model learns from a large teacher model by minimizing the KL divergence between their output distributions (Hinton et al., 2015). Using a multi-task model on ATIS, we compress it with structured pruning and distillation, then examine its performance at varying levels of compression. The results are shown in Table 2. Distillation achieves similar results as structured pruning at 0-50% sparsity, but its performance largely drops off after 80%. Surprisingly, even with extreme compression (99%), structured pruning is about 10% and 20% better on intents and slots, respectively.

Figure 2: Performance-compression tradeoff curves on ATIS and Snips, comparing multi-task models compressed with structured pruning (ours) and knowledge distillation (Hinton et al., 2015). Pruning curves denote the mean of five compression runs with random restarts. Note that the y-axis ticks are not uniform across graphs.
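For reference, the distillation objective compared against here can be sketched as a KL divergence between temperature-softened teacher and student output distributions. This numpy snippet is a generic sketch of the Hinton et al. (2015) loss, not our exact training code; the temperature value and function names are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax; higher T flattens the distribution.
    z = np.asarray(z, dtype=float) / T
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) over softened output distributions.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

print(distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0 when the student matches
```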
Our results show that, in this setting, the iterative refinement of a sparse topology admits an easier optimization problem; learning a smaller model directly is not advantageous, even when it is supervised by a larger model. Furthermore, the iterative nature of structured pruning makes it possible to select a model that optimizes a particular performance-compression trade-off along a sparsity curve, as shown in Figure 2. Doing the same with distillation requires re-training for each target compression level, which is intractable with a large set of hyperparameters.

Acceleration Experiments
Lastly, to understand how our multi-task model with structured pruning performs without significant computational resources, we benchmark its test-time performance on a CPU and GPU. Specifically, we measure several models' inference times on ATIS and Snips (normalized by the total number of test samples) using an Intel Xeon E3-1270 v3 CPU and an NVIDIA GTX 1080-TI GPU. Results are shown in Table 3. Empirically, we see that our pruned model results in significant speedups without a GPU compared to both a distilled model and BERT. DistilBERT, which is a strong approximation of BERT, is still 63× slower than our model. We expect that latency disparities on weaker CPUs will be even more extreme; therefore, selecting a model that maximizes both task performance and resource efficiency will be an important consideration for practitioners.

Table 3: Average CPU and GPU inference time (in milliseconds) of baselines (Sanh et al., 2019; Devlin et al., 2019) and our multi-task models on ATIS and Snips.

Conclusion
In this work, we show that structurally pruned convolutional models achieve competitive performance on intent detection and slot filling at only a fraction of the size of state-of-the-art models. Our method outperforms popular compression methods, such as knowledge distillation, and results in large CPU speedups compared to BERT and DistilBERT.