Non-Autoregressive Semantic Parsing for Compositional Task-Oriented Dialog

Semantic parsing using sequence-to-sequence models allows parsing of deeper representations compared to traditional word-tagging-based models. In spite of these advantages, widespread adoption of these models for real-time conversational use cases has been stymied by higher compute requirements and thus higher latency. In this work, we propose a non-autoregressive approach to predict semantic parse trees with an efficient seq2seq model architecture. By combining non-autoregressive prediction with convolutional neural networks, we achieve significant latency gains and parameter size reduction compared to traditional RNN models. Our novel architecture achieves up to an 81% reduction in latency on the TOP dataset and remains competitive with non-pretrained models on three different semantic parsing datasets.


Introduction
Advances in conversational assistants have helped to improve the usability of smart speakers and consumer wearables for different tasks. Semantic parsing is one of the fundamental components of these assistants: it converts the user input in natural language to a structured representation that can be understood by downstream systems. The majority of semantic parsing systems deployed on various devices rely on server-side inference because of the lower compute and memory available on these edge devices. This poses a few drawbacks, such as a flaky user experience under spotty internet connectivity and compromised user data privacy due to the dependence on a centralized server to which all user interactions are sent. Thus, semantic parsing on-device has numerous advantages. For the semantic parsing task, the meaning representation used decides the capabilities of the system built. Limitations of the representation with one intent and slot labels were studied in the context of nested queries and multi-turn utterances in Aghajanyan et al. (2020). New representations were proposed to overcome these limitations, and sequence-to-sequence models were proposed as the solution to model these complex forms. But using these new models in real-time conversational assistants remains a challenge because of their higher latency. In our work, we propose a novel architecture and generation scheme to significantly improve the end-to-end latency of sequence-to-sequence models for the semantic parsing task.
Due to the autoregressive nature of generation in sequence-to-sequence semantic parsing models, the recurrence relationship between target tokens creates a limitation that decoding cannot be parallelized.
There are multiple works in machine translation which try to solve this problem. These approaches relax the decoder's token-by-token generation by allowing multiple target tokens to be generated at once. Fully non-autoregressive models (Gu et al., 2017; Ma et al., 2019; Ghazvininejad et al., 2020a; Saharia et al., 2020) and conditional masked language models with iterative decoding (Gu et al., 2019; Ghazvininejad et al., 2020b) are some of them.
To enable non-autoregressive generation in semantic parsing, we modify the objective of the standard seq2seq model to predict the entire target structure at once. We build upon the Conditional Masked Language Model (CMLM) and condition the generation of the full target structure on the encoder representation. By eliminating the recurrent relationship between individual target tokens, the decoding process can be parallelized. While this drastically improves latency, the representation of each token is still dependent on previous tokens if we continue to use an RNN architecture. Thus, we propose a novel model architecture for semantic parsing based on convolutional networks (Wu et al., 2019b) to solve this issue.
Our non-autoregressive model achieves up to an 81% reduction in latency on the TOP dataset, while achieving 80.23% exact match accuracy. We also achieve 88.16% exact match accuracy on DSTC2 (Henderson et al., 2014) and 80.86% on SNIPS (Coucke et al., 2018), which is competitive with prior work without pretraining.
To summarize, our two main contributions are:
• We propose a novel alternative to the traditional autoregressive generation scheme for semantic parsing using sequence-to-sequence models. With a new model training strategy and generation approach, the semantic parse structure is predicted in one step, improving parallelization and thus leading to a significant reduction in model latency with minimal accuracy impact. We also study the limitations of the original CMLM when applied to the conversational semantic parsing task and provide motivations for our simple yet critical modifications.
• We propose LightConv Pointer, a model architecture for non-autoregressive semantic parsing using convolutional neural networks, which provides significant latency and model size improvements over RNN models. Our novel model architecture is particularly suitable for limited-compute use cases like on-device conversational assistants.

Method
In this section, we propose a novel, convolutional, non-autoregressive architecture for semantic parsing. While non-autoregressive decoding has been previously explored in machine translation, we describe how it can be applied to semantic parsing with several critical modifications to retain performance. We then describe our convolutional architecture. By incorporating these advances together, our approach achieves both high accuracy and efficient decoding.
The task is to predict the semantic parse tree given the raw text. We use the decoupled representation (Aghajanyan et al., 2020), an extension of the compositional form for task-oriented semantic parsing. The decoupled representation is obtained by removing all text in the compositional form that does not appear in a leaf slot. Efficient models require compact representations with the fewest possible tokens, to reduce the number of floating point operations during inference. The decoupled representation was found to be suitable for this reason. Figure 1 shows the semantic parse for a sample utterance. Our model predicts the serialized representation of this tree which is

Non-Autoregressive Decoding
While autoregressive models (Figure 2), which predict a sequence token by token, have achieved strong results in various tasks including semantic parsing, they have a large downside. The main challenge in practical applications is the slow decoding time. We investigate how to incorporate recent advances in non-autoregressive decoding for efficient semantic parsing models.
We build upon the Conditional Masked Language Model (CMLM) by applying it to the structured prediction task of semantic parsing for task-oriented dialog. The original CMLM first predicts a token-level representation for each source token and a target sequence length; the model then predicts and iterates on the target sequence in a non-autoregressive fashion. We describe our changes and the motivations for these changes below.
One of the main differences between our work and the original CMLM is that target length prediction plays a more important role in semantic parsing. For the translation task, if the target length is off by one or more, the model can slightly rephrase the sentence to still return a high-quality translation. In our case, if the length prediction is off by even one, it will lead to an incorrect semantic parse.
To resolve this important challenge, we propose a specialized length prediction module that more accurately predicts the target sequence length. While the original CMLM uses a special CLS token in the source sequence to predict the target length, we have a separate module of multiple layers of CNNs with gated linear units to predict the target sequence length (Wu et al., 2019b). We also use label smoothing and differently weighted losses, as explained in section 2.3, to avoid the overfitting that occurs more easily in semantic parsing than in translation.
As shown in Aghajanyan et al. (2020), transformers without pre-training perform poorly on the TOP dataset. The architectural changes that we propose to address data efficiency can be found in section 2.2.1.
Further, we find that the random masking strategy of the original CMLM works poorly for semantic parsing. When we use the same strategy for the semantic parsing task, where the output has a structure, the model is highly likely to see invalid trees during training, as masking random tokens in the linearized representation of a tree mostly gives invalid tree representations. This makes it hard for the model to learn the structure, especially when the structure is complicated (in the case of trees, deep trees were harder to learn). To remedy this problem, we propose a different strategy for model training where all the tokens in the target sequence are masked during training.
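The difference between the two masking strategies can be illustrated with a small sketch. This is not the authors' code: the function names, the example target sequence, and the choice to sample the number of masked tokens uniformly between one and the sequence length are assumptions made for illustration.

```python
import random

MASK = "<MASK>"

def mask_random(target, rng):
    """Random masking: hide a random subset of the target tokens.

    The number of hidden tokens is drawn uniformly from 1..len(target),
    which is an assumption of this sketch.
    """
    n = rng.randint(1, len(target))
    hidden = set(rng.sample(range(len(target)), n))
    return [MASK if i in hidden else tok for i, tok in enumerate(target)]

def mask_all(target):
    """Mask-everything strategy: the decoder input carries only the length."""
    return [MASK] * len(target)

# Hypothetical linearized parse, used only to show the two decoder inputs.
target = ["[IN:GET_DIRECTIONS", "[SL:DESTINATION", "work", "]", "]"]
partial = mask_random(target, random.Random(0))
full = mask_all(target)
```

With random masking, the visible tokens in `partial` can form an unbalanced bracket sequence, which is the invalid-tree situation described above. With `mask_all`, the decoder never conditions on a partial tree.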
Ablation experiments for all the above changes can be found in section 4.3.

Model Architecture
Our model architecture (Figure 3) is based on the classical seq2seq model (Sutskever et al., 2014) and follows the encoder-decoder architecture. In order to optimize for efficient encoding and decoding, we look to leverage a fully parallel model architecture. While transformer models are fully parallel and popular in machine translation (Vaswani et al., 2017), they are known to perform poorly in low-resource settings and require careful tuning using techniques like Neural Architecture Search to get good performance (van Biljon et al., 2020; Murray et al., 2019). Similarly, randomly initialized transformers performed poorly on the TOP dataset, achieving only 64.5% accuracy when SOTA was above 80% (Aghajanyan et al., 2020). We overcome this limitation by augmenting Transformers with Convolutional Neural Networks. Details of our architecture are explained below.
For token representations, we use word embeddings concatenated with the sinusoidal positional embeddings (Vaswani et al., 2017). Encoder and decoder consist of multiple layers with residual connections as shown in Figure 4.
The first sub-block in each layer consists of multi-head attention (MHA) (Vaswani et al., 2017). In the decoder, we do not mask future tokens during model training. This is needed for non-autoregressive generation of target tokens during inference.
The second sub-block consists of multiple convolutional layers. We use depthwise convolutions with weight sharing (Wu et al., 2019b). Each convolution layer helps in learning token representations over a fixed context size, and stacking multiple layers gives a bigger receptive field. We use non-causal convolutions for both the encoder and the decoder.
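A minimal, framework-free sketch of a non-causal depthwise convolution with shared weights is given below. It assumes the lightweight-convolution formulation in which each kernel is softmax-normalized over its width and all channels in a head share one kernel. The function names and the zero padding at the sequence boundaries are illustrative choices, not details taken from this paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def lightweight_conv(x, kernels):
    """Non-causal depthwise convolution with weights shared per head.

    x:       list of T token vectors, each with C channels
    kernels: list of H raw kernels of odd width K; C must be divisible by H
    """
    seq_len, channels = len(x), len(x[0])
    heads, width = len(kernels), len(kernels[0])
    assert channels % heads == 0 and width % 2 == 1
    weights = [softmax(k) for k in kernels]  # normalize each kernel over its width
    half = width // 2
    per_head = channels // heads
    out = [[0.0] * channels for _ in range(seq_len)]
    for t in range(seq_len):
        for c in range(channels):
            w = weights[c // per_head]       # channels in a head share one kernel
            for j in range(width):
                src = t + j - half           # centered window: left and right context
                if 0 <= src < seq_len:       # zero padding outside the sequence
                    out[t][c] += w[j] * x[src][c]
    return out

# Three identical tokens with two channels, one head, a uniform width-3 kernel.
x = [[1.0, 2.0] for _ in range(3)]
y = lightweight_conv(x, [[0.0, 0.0, 0.0]])
```

Because the window is centered on position `t`, each output depends on tokens to both sides, which is what makes the convolution non-causal. Stacking layers with growing kernel widths enlarges the receptive field.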
The third sub-block is the feed-forward network (FFN) (Vaswani et al., 2017; Wu et al., 2019b), which consists of two linear layers and a ReLU. The decoder has source-target attention after the convolution layer.
Pointer-Generator Projection layer The decoder has a final projection layer which generates the target tokens from the decoder/encoder representations. Rongali et al. (2020) propose an idea based on the Pointer Generator Network (See et al., 2017) to convert the decoder representation to target tokens using the encoder output. Similarly, we use a pointer-based projection head, which decides whether to copy tokens from the source sequence or generate from the pre-defined ontology at every decoding step (Aghajanyan et al., 2020).
Figure 3: Sequence-to-sequence model architecture which uses a non-autoregressive strategy for generation.
Length Prediction Module The length prediction module receives token-level representations from the encoder as input. It uses stacked CNNs with gated linear units and mean pooling to generate the length prediction.

Inference
Suppose the source sequence is of length $L$ and the source tokens in the raw text are $s_1, s_2, s_3, \ldots, s_L$. The encoder generates a representation for each token in the source sequence.
The length prediction module predicts the target sequence length using the token level encoder representation.
Using the predicted length T, we create a target sequence of length T consisting of identical MASK tokens. This sequence is passed through possibly multiple decoder layers and generates a representation for each token in the masked target sequence.
We make the strong assumption that the tokens in the target sentence are conditionally independent of each other given the source and the target length. Thus, the individual probability for each token is $P(y_i \mid X, T)$, where $X$ is the input sequence and $T$ is the length of the target sequence.
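Written out, this assumption means that the probability of a whole target sequence factorizes over positions. The expression below restates the assumption and adds nothing to it; $Y = (y_1, \ldots, y_T)$ denotes the predicted target sequence.

```latex
P(Y \mid X, T) = \prod_{i=1}^{T} P(y_i \mid X, T)
```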
Beam Search During inference, the length prediction module explained in 2.2.1 predicts the top k lengths. For each predicted length, we create a decoder input sequence consisting entirely of masked tokens. This is similar to beam search with beam size k in autoregressive systems. The main difference in our model architecture is that we expect only one candidate for each predicted length. These fully masked sequences are given as input to the model, and the model predicts a target token for each masked token. Once we have predicted target sequences for k different lengths, they are ranked based on the ranking algorithm described in (5), where $X$ is the input sequence and $Y$ is the predicted output sequence. Note that the predicted token $y_i$ is conditioned on both the input sequence $X$ and the predicted target length $T$.
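The length-beam procedure can be sketched as follows. This is a minimal illustration and not the authors' implementation: `decode_with_length_beam` and `toy_decoder` are hypothetical names, and the candidate score (log-probability of the length plus the summed token log-probabilities) is an assumption that may differ from the paper's ranking function.

```python
import math

MASK = "<MASK>"

def decode_with_length_beam(length_log_probs, predict_tokens, k):
    """Return (score, tokens) candidates sorted best-first.

    length_log_probs: dict mapping a candidate length T to log P(T | X)
    predict_tokens:   callable taking a list of MASK tokens and returning one
                      (token, log_prob) pair per position in a single pass
    """
    top_lengths = sorted(length_log_probs, key=length_log_probs.get, reverse=True)[:k]
    candidates = []
    for length in top_lengths:
        predictions = predict_tokens([MASK] * length)  # one candidate per length
        tokens = [tok for tok, _ in predictions]
        # Assumed score: log P(T | X) + sum_i log P(y_i | X, T).
        score = length_log_probs[length] + sum(lp for _, lp in predictions)
        candidates.append((score, tokens))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates

def toy_decoder(masked):
    # Hypothetical stand-in for the decoder: confident only at length 3.
    lp = math.log(0.9) if len(masked) == 3 else math.log(0.5)
    return [("tok%d" % i, lp) for i in range(len(masked))]

length_scores = {2: math.log(0.3), 3: math.log(0.6), 4: math.log(0.1)}
ranked = decode_with_length_beam(length_scores, toy_decoder, k=2)
best_score, best_tokens = ranked[0]
```

Each length yields exactly one candidate, so the k decoder passes are independent and could be batched together.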

Training
During training, we jointly optimize for two weighted losses. The first loss is calculated for the predicted target tokens against the real target and the second loss is calculated for predicted target length against real target length.
During the forward pass, we replace all the tokens in the target sequence with a special <MASK> token and give this as input to the decoder. The decoder predicts a token for each masked position, and the cross-entropy loss is calculated for each predicted token.
The length prediction module in the model predicts the target length using the encoder representation. As in the original CMLM, length prediction is modeled as a classification task with a class label for each possible length. Cross-entropy loss is calculated for length prediction. For our semantic parsing task, label smoothing (Szegedy et al., 2015) was found to be critical, as the length prediction module tends to overfit easily and strong regularization methods are needed. This is because length prediction is a much more well-defined task than predicting all the tokens in the sequence.
Total loss was calculated by taking a weighted sum of cross entropy loss for labels and length, with lower weight for length loss.
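The joint objective can be sketched in plain Python. The default values mirror those reported in the Model Training Details section (β = 0.1 for labels, β = 0.5 for length, λ = 0.25). The function names and the averaging of the token loss over positions are assumptions of this sketch.

```python
import math

def smoothed_cross_entropy(probs, gold, beta):
    """Cross entropy against (1 - beta) * one_hot(gold) + beta * uniform."""
    n = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        q = beta / n + (1.0 - beta) * (1.0 if c == gold else 0.0)
        loss -= q * math.log(p)
    return loss

def joint_loss(token_probs, gold_tokens, length_probs, gold_length,
               beta_label=0.1, beta_length=0.5, lam=0.25):
    """Token loss (averaged over positions, an assumption) plus the
    down-weighted length loss."""
    label = sum(smoothed_cross_entropy(p, g, beta_label)
                for p, g in zip(token_probs, gold_tokens)) / len(gold_tokens)
    length = smoothed_cross_entropy(length_probs, gold_length, beta_length)
    return label + lam * length

# One target position over a two-token vocabulary and two length classes.
total = joint_loss([[0.5, 0.5]], [0], [0.5, 0.5], 1)
plain = smoothed_cross_entropy([0.25, 0.75], 1, 0.0)
```

The larger smoothing weight on the length loss spreads more target mass over neighboring classes, which is one way to apply the stronger regularization that the length module is said to need.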
As training progresses through different epochs, the best model is picked by comparing the exact match (EM) accuracy of different snapshots on the validation set.

Datasets
We use 3 datasets across various domains to evaluate our semantic parsing approach. Length distribution of each dataset is described using median, 90th percentile and 99th percentile lengths.
TOP Dataset Task Oriented Parsing (TOP) is a dataset for compositional utterances in the navigation and events domains. The training set consists of 31,279 instances and the test set consists of 9,042. The test set has a median target length of 15, P90 27 and P99 39.
SNIPS The SNIPS (Coucke et al., 2018) dataset is a public dataset used for benchmarking semantic parsing intent-slot models. This dataset is considered flat, since it does not contain compositional queries and can be solved with word-tagging models. Recently, however, seq2seq models have started to outperform word-tagging models (Rongali et al., 2020; Aghajanyan et al., 2020). The training set consists of 13,084 instances and the test set consists of 700 instances. The test set has a median target length of 11, P90 17 and P99 21.
DSTC2 Dialogue State Tracking Challenge 2 (Henderson et al., 2014) is a dataset for conversational understanding. The dataset involves users searching for restaurants by specifying constraints such as cuisine type and price range. We encode these constraints as slots and use them to formulate the decoupled representation. The training set consists of 12,611 instances and the test set consists of 9,890. The test set has a median target length of 6, P90 9 and P99 10.

Evaluation
Semantic Parsing Performance For all our datasets, we convert the representation of either the compositional form or the flat intent-slot form to the decoupled representation (Aghajanyan et al., 2020). We compare the model prediction with the serialized structure representation and look for exact match (EM).
Benchmarking Latency We perform latency analysis on the models trained from scratch: AR LightConv Pointer, NAR LightConv Pointer, and BiLSTM. We chose these three architectures to compare the NAR and AR variants of LightConv Pointer, as well as the best-performing baseline, Pointer BiLSTM (Aghajanyan et al., 2020). We use a Samsung Galaxy S8 with Android OS and an octa-core processor. We chose to benchmark latency to be consistent with prior work on on-device modeling (Wu et al., 2019a; Howard et al., 2019). All models are trained in PyTorch (Paszke et al., 2019) and exported using TorchScript. We measure wall-clock time, as it relates more closely to real-world inference than other options. Latency results can be found in section 4.2.

Baselines
For each of our datasets, we report accuracy metrics on the following models:
• AR LightConv Pointer: an autoregressive (AR) LightConv Pointer model, to establish an autoregressive baseline of our proposed architecture.
• NAR LightConv Pointer: a non-autoregressive (NAR) variant of the above model, to allow for parallel decoding.
We compare against the best reported numbers across datasets where the models don't use pretraining.

Model Training Details
During training we use the same base model across all datasets and sweep over hyperparameters for the length module, the batch size, and the learning rate; an equivalent sweep was done for the AR variant as well. The base NAR LightConv Pointer model uses 5 encoder layers with convolutional kernel sizes [3, 7, 15, 21, 27], where each encoder layer has embedding and convolutional dimensions of 160 and 1 self-attention head, and 2 decoder layers with kernel sizes [7, 27], an embedding dimension of 160, 1 self-attention head, and 2 encoder-attention heads.
Our length prediction module uses two convolution layers of 512 embedding dimensions with kernel sizes of 3 and 9, and a hidden dimension in [128, 256, 512] determined by hyperparameter sweeps. We also use 8 attention heads for the decoupled projection head. For the convolutional layer, we use lightweight convolutions (Wu et al., 2019b) with the number of heads set to 2. We train with the Adam optimizer (Kingma and Ba, 2014); the learning rate is selected from the range [0.00007, 0.0004]. If our evaluation accuracy has not increased in 10 epochs, we reduce our learning rate by a factor of 10, and we employ early stopping if the accuracy has not changed in 20 epochs. We train with a fixed batch size of 8. We optimize a joint loss for label prediction and length prediction. Both losses are label-smoothed cross-entropy losses, where β is the weight of the uniform distribution (Pereyra et al., 2017): our label loss has β = 0.1 and our length loss has β = 0.5. We also weight our length loss lower, with λ = 0.25. For inference, we use a length beam size of k = 5. Our AR variant follows the same parameters; however, it does not have length prediction or self-attention in the encoder and decoder.
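The learning-rate reduction and early-stopping rule described above can be sketched as a small replay over validation accuracies. The function name is hypothetical, and two readings are assumed: "has not changed" is treated as "has not improved", and the learning rate is reduced once per plateau.

```python
def run_schedule(accuracies, lr, reduce_patience=10, stop_patience=20, factor=10.0):
    """Replay per-epoch validation accuracies; return (epochs_run, final_lr)."""
    best = float("-inf")
    since_best = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best == reduce_patience:
                lr /= factor                  # reduce once when the plateau hits 10 epochs
            if since_best >= stop_patience:
                return epoch, lr              # early stopping after 20 stale epochs
    return len(accuracies), lr

# One good epoch followed by a long plateau.
epochs_run, final_lr = run_schedule([0.5] + [0.4] * 25, lr=0.0004)
```

In this toy trace the learning rate drops at epoch 11 and training stops at epoch 21.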

Results
We show that our proposed non-autoregressive convolutional architecture for semantic parsing is competitive with autoregressive baselines and word-tagging baselines without pre-training on three different benchmarks, and reduces latency by up to 81% on the TOP dataset. We first compare accuracy and latency, then discuss model performance by analyzing errors by length, and the importance of knowledge distillation. We do our analysis on the TOP dataset due to its inherent compositional nature; however, we expect our analysis to hold for other datasets as well. Non-compositional datasets like DSTC2 and SNIPS can be modeled by word-tagging models, making seq2seq models more relevant in the case of compositional datasets.

Accuracy
In table 5a we show that our NAR and AR variants of LightConv Pointer perform quite similarly across all datasets. We can see that our proposed NAR LightConv Pointer is also competitive with state-of-the-art models without pre-training: -0.66% TOP, -0.17% DSTC2, -4.57% SNIPS (-0.04% compared to word-tagging models). Following prior work on non-autoregressive models, we also report our experiments with sequence-level knowledge distillation in subsection Knowledge Distillation under section 4.3.

Latency
In figure 5b we show the latency of our model with different generation approaches (NAR vs AR) over increasing target sequence lengths on the TOP dataset. First, we show that our LightConv Pointer is significantly faster than the BiLSTM baseline (Aghajanyan et al., 2020), achieving up to a 54% reduction in median latency. BiLSTM was used as the baseline because it was the SOTA without pretraining for TOP, while Transformers performed poorly. Comparing our model under the AR and NAR generation strategies, the increase in latency with target length is much smaller for NAR due to better parallelization of the decoder, resulting in up to an 81% reduction in

Analysis
Ablation experiments In table 1 we compare the modifications proposed by this work (LightConv, the Conv length prediction module and the mask-everything strategy) with the original CMLM. The motivations for each modification were already discussed in sub-section 2.1. Our mean EM accuracy results based on 3 trials show the significance of the techniques proposed in this paper, especially for longer target sequences.
Errors by length It is known that non-autoregressive models have difficulty at larger sequence lengths. In table 2, we show our model's accuracy in each respective length bucket on the TOP dataset. We see that the AR and NAR models follow a similar distribution of errors; however, the NAR model seems to err at a higher rate for the longer lengths.
Knowledge Distillation Following prior work (Zhou et al., 2020), we train our model with sequence-level knowledge distillation (Kim and Rush, 2016). We train our system on data generated by the current SOTA autoregressive model, BART (Lewis et al., 2019; Aghajanyan et al., 2020). In table 3 we show the impact of knowledge distillation in our task on both the non-autoregressive and autoregressive variants of LightConv Pointer. These results support prior work in machine translation on distilling autoregressive teachers into non-autoregressive models: distillation improves our models on TOP and SNIPS, although we notice minimal changes on DSTC2.
The importance of length prediction An important part of our non-autoregressive model is length prediction. In figure 6, we report exact match accuracy @ top k beams and length accuracy @ top k beams (where top k refers to whether the correct answer was in the top k predictions) for the TOP dataset. We can see a tight correlation between our length accuracy and exact match accuracy, showing how our model is bottlenecked by the length prediction. Providing the gold length as a feature led to an exact match accuracy of 88.20% (shown in red in figure 6), an absolute 7.31 point improvement over our best result with our non-autoregressive LightConv Pointer.
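The "accuracy @ top k" metric used here can be computed with a few lines of Python. The function name and the toy length predictions below are illustrative and are not data from the paper.

```python
def accuracy_at_k(ranked_predictions, golds, k):
    """Fraction of examples whose gold answer appears among the top-k predictions."""
    hits = sum(1 for preds, gold in zip(ranked_predictions, golds) if gold in preds[:k])
    return hits / len(golds)

# Toy example: ranked length predictions for three utterances.
ranked_lengths = [[5, 6, 4], [9, 8, 7], [3, 2, 4]]
gold_lengths = [6, 7, 3]
at_1 = accuracy_at_k(ranked_lengths, gold_lengths, 1)
at_2 = accuracy_at_k(ranked_lengths, gold_lengths, 2)
at_3 = accuracy_at_k(ranked_lengths, gold_lengths, 3)
```

The same function applies to exact match if each prediction is a serialized parse (for example a tuple of tokens) instead of a length. Comparing the two curves at each k gives the kind of correlation discussed above.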

Related Work
Non-autoregressive Decoding Recent work in machine translation has made a lot of progress in fully non-autoregressive models (Gu et al., 2017; Ma et al., 2019; Ghazvininejad et al., 2020a; Saharia et al., 2020) and parallel decoding (Lee et al., 2018; Gu et al., 2019; Ghazvininejad et al., 2020b; Kasai et al., 2020). While many advancements have been made in machine translation, we believe we are the first to explore the non-autoregressive semantic parsing setting. In our work, we extend the CMLM to work for semantic parsing. We make two important adjustments: first, we use a different masking approach where we mask everything and do one-step generation. Second, we note the importance of the length prediction task for parsing and improve the length prediction module in the CMLM.
Seq2Seq For Semantic Parsing Recent advances in language understanding have led to increased reliance on seq2seq architectures. Recent work by Rongali et al. (2020) and Aghajanyan et al. (2020) showed the advantages of using a pointer generator architecture for resolving complex queries (e.g. composition and cross-domain queries) that could not be handled by word-tagging models. Since we target the same task, we adapt their pointer decoder into our proposed architecture. However, to optimize for latency and compression, we train CNN-based architectures (Desai et al., 2020; Wu et al., 2019b) to leverage the inherent model parallelism compared to the BiLSTM model proposed in Aghajanyan et al. (2020), and to obtain more compression compared to the transformer seq2seq baseline proposed in Rongali et al. (2020). To further improve latency, we look at parallel decoding through non-autoregressive decoding, in contrast to prior work leveraging autoregressive models.

Conclusion
This work introduces a novel alternative to autoregressive decoding and an efficient encoder-decoder architecture for semantic parsing. We show that on 3 semantic parsing datasets we are able to speed up decoding significantly while minimizing accuracy regression. Our model is able to generate parse trees competitive with state-of-the-art autoregressive models with significant latency savings, allowing complex NLU systems to be delivered on edge devices.
There are a few limitations of our proposed model that naturally lend themselves to future work. Primarily, we cannot support true beam decoding: we decode a single prediction for each predicted length, whereas multiple beams may exist for each length. Also, for longer parse trees and more complex semantic parsing systems such as session-based understanding, our NAR decoding scheme could benefit from multiple iterations. Lastly, though we explored models without pre-training in this work, recent developments show the power of leveraging pre-trained models such as RoBERTa and BART. We leave it to future work to extend our non-autoregressive decoding to pre-trained models.