SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling

Slot filling and intent detection are two main tasks in spoken language understanding (SLU) systems. In this paper, we propose a novel non-autoregressive model named SlotRefine for joint intent detection and slot filling. In addition, we design a novel two-pass iteration mechanism to handle the uncoordinated slots problem caused by the conditional independence of the non-autoregressive model. Experiments demonstrate that our model significantly outperforms previous models on the slot filling task, while considerably speeding up decoding (up to ×10.77). In-depth analyses show that 1) pretraining schemes can further enhance our model; 2) the two-pass mechanism indeed remedies the uncoordinated slots problem.


Introduction
Slot filling (SF) and intent detection (ID) play important roles in spoken language understanding, especially for task-oriented dialogue systems. For example, for an utterance like "Buy an air ticket from Beijing to Seattle", intent detection works at the sentence level to indicate that the task is about purchasing an air ticket, while slot filling works at the word level to figure out that the departure and destination of that ticket are "Beijing" and "Seattle".
In early studies, ID and SF were often modeled separately, where ID was treated as a classification task while SF was regarded as a sequence labeling task. Due to the correlation between the two tasks, training them jointly allows each to enhance the other. Zhang and Wang (2016) propose a joint model using a bidirectional gated recurrent unit to learn the representation at each time step, with a max-pooling layer employed to capture the global features of a sentence for intent classification. Liu and Lane (2016) cast slot filling as a tag generation problem and introduce a recurrent neural network based encoder-decoder framework with an attention mechanism to model it, meanwhile using the encoded vector to predict the intent. Goo et al. (2018) and Haihong et al. (2019) dig deeper into the correlation between ID and SF and model the relationship between them explicitly. Qin et al. (2019) propagate token-level intent results to the SF task, achieving significant performance improvements.
Briefly summarized, most previous works rely heavily on autoregressive approaches, e.g., RNN-based models or seq2seq architectures, to capture the grammatical structure of an utterance. A conditional random field (CRF) is a popular auxiliary module for the SF task, as it considers the correlations between tags. Thus, several state-of-the-art works combine an autoregressive model with a CRF to achieve competitive performance, and we set these as our baseline methods. However, for the SF task, we argue that identifying token dependencies within each slot chunk is enough; it is unnecessary to model the entire sequence dependency in an autoregressive fashion, which leads to redundant computation and inevitably high latency.
In this study, we cast the two tasks jointly as a non-autoregressive tag generation problem to get rid of unnecessary temporal dependencies. In particular, a Transformer (Vaswani et al., 2017) based architecture is adopted to learn the representations of an utterance at both the sentence and word level simultaneously (Sec. §2.1). The slot and intent labels are predicted independently and simultaneously, achieving better decoding efficiency. We further introduce a two-pass refine mechanism (Sec. §2.2) to explicitly model the boundary prediction of each slot, which also handles the uncoordinated slots problem (e.g., I-song following B-singer) caused by the conditional independence assumption.
Experiments on two commonly-cited datasets show that our approach is significantly and consistently superior to existing models in both SF performance and efficiency (Sec. §3).
Figure 1: Illustration of SlotRefine, where the left and right parts indicate the first and second iteration processes, respectively. In the first pass, wrong slot tagging results are predicted, as shown in the pink dotted box, and the "B-tags" (beginning tags of slots) are fed together with the utterance as additional information for the second iteration. The slot results in the green dotted box are the refined results of the second pass. Note that the initial tag embedding "O" added to each input position is designed for the two-pass mechanism (Sec. §2.2).
Our contributions are as follows:
• We propose a fast non-autoregressive approach to model the ID and SF tasks jointly, named SlotRefine, achieving the state-of-the-art on the ATIS dataset.
• We design a two-pass refine mechanism to handle the uncoordinated slots problem. Our analyses confirm it is a better alternative to CRF in this task.
• Our model infers nearly ×11 faster than existing models (×13 for long sentences), indicating great potential for industry and academia.

Proposed Approaches
In this section, we first describe how we model the slot filling and intent detection tasks jointly with a non-autoregressive model, and then describe the details of the two-pass refine mechanism. A brief scheme of our model is shown in Figure 1; details can be found in the corresponding caption. Note that we follow the common practice (Ramshaw and Marcus, 1995; Zhang and Wang, 2016; Haihong et al., 2019) of using the "Inside-Outside-Beginning (IOB)" tagging format.
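To make the IOB convention concrete, the following sketch tags the introduction's example utterance and groups tags back into slot chunks. The slot names (fromloc, toloc) are illustrative placeholders, not the exact labels of any dataset.

```python
# IOB tagging for "Buy an air ticket from Beijing to Seattle" (illustrative
# slot names; the actual label inventory depends on the dataset).
tokens = ["Buy", "an", "air", "ticket", "from", "Beijing", "to", "Seattle"]
tags   = ["O",   "O",  "O",   "O",      "O",    "B-fromloc", "O", "B-toloc"]

def extract_slots(tokens, tags):
    """Group IOB tags into (slot_type, words) chunks."""
    slots, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # a new chunk begins
            current = (tag[2:], [tok])
            slots.append(current)
        elif tag.startswith("I-") and current is not None:
            current[1].append(tok)         # continue the current chunk
        else:                              # "O" ends any open chunk
            current = None
    return [(t, " ".join(ws)) for t, ws in slots]

print(extract_slots(tokens, tags))  # [('fromloc', 'Beijing'), ('toloc', 'Seattle')]
```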

Non-Autoregressively Joint Model
We extend the original multi-head Transformer encoder of Vaswani et al. (2017) to construct the model architecture of SlotRefine; please refer to Vaswani et al. (2017) for the details of the Transformer. The main difference from the original Transformer is that we model sequential information with relative position representations (Shaw et al., 2018) instead of absolute position encodings. For a given utterance, a special token CLS is inserted at the first input position, akin to the operation in BERT (Devlin et al., 2019). Unlike BERT, where the corresponding output vector is used for next-sentence prediction, we use it to predict the intent label in SlotRefine. We denote the input sequence as x = (x_cls, x_1, ..., x_l), where l is the utterance length. Each word x_i is embedded into an h-dimensional vector to perform the multi-head self-attention computation. The output of each model stack can then be formulated as H = (h_cls, h_1, ..., h_l).
To jointly model the representations of the ID and SF tasks, we directly concatenate the representations h_cls and h_i before the feed-forward computation, and then feed them into the softmax classifiers. Specifically, the intent detection and slot filling results are predicted as follows, respectively:

$$y^i = \mathrm{softmax}(W^I \cdot h_{cls} + b^I)$$
$$y^s_i = \mathrm{softmax}(W^S \cdot [h_{cls}, h_i] + b^S)$$

where $y^i$ and $y^s_i$ denote the intent label of the utterance and the slot label for each token $i$, respectively, $[h_{cls}, h_i]$ is the concatenated vector, and the $W$ and $b$ are the corresponding trainable parameters.
The objective of our joint model can be formulated as:

$$p(y^i, y^s \mid x) = p(y^i \mid x) \prod_{t=1}^{l} p(y^s_t \mid x)$$

The learning objective is to maximize the conditional probability $p(y^i, y^s \mid x)$, which is optimized by minimizing its cross-entropy loss. Unlike autoregressive methods, the likelihood of each slot in our approach can be optimized in parallel.
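The two classifier heads and the factorized loss can be sketched numerically as follows. All dimensions, weights, and hidden states are toy values invented for illustration, not the model's actual parameters; the point is that the intent head reads h_cls alone, each slot head reads the concatenation [h_cls, h_i], and the cross-entropy sums over positions with no term depending on another position's prediction.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def linear(W, b, x):
    # W: one row of weights per output class.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Toy hidden states (h = 2) for [CLS] and two word positions.
h_cls = [0.5, -0.2]
H = [[0.1, 0.9], [0.7, 0.3]]

# Intent head reads h_cls; slot head reads the concatenation [h_cls, h_i].
W_int, b_int = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]            # 2 intents
W_slot = [[1.0, 0.0, 0.5, 0.0],
          [0.0, 1.0, 0.0, 0.5],
          [0.2, 0.2, 0.2, 0.2]]                                # 3 slot tags
b_slot = [0.0, 0.0, 0.0]

p_intent = softmax(linear(W_int, b_int, h_cls))
p_slots = [softmax(linear(W_slot, b_slot, h_cls + h_i)) for h_i in H]

# Joint loss: intent cross-entropy plus slot cross-entropy summed over
# positions -- each slot term is independent, so it parallelizes.
gold_intent, gold_slots = 0, [1, 2]
loss = -math.log(p_intent[gold_intent]) - sum(
    math.log(p_slots[i][t]) for i, t in enumerate(gold_slots))
```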

Two-pass Refine Mechanism
Due to the conditional independence between slot labels, it is difficult for our proposed non-autoregressive model to capture the sequential dependencies within each slot chunk, which leads to some uncoordinated slot labels. We name this the uncoordinated slots problem. Take the false tagging in Figure 2 for example: the slot label "I-song" uncoordinately follows "B-singer", which does not satisfy the Inside-Outside-Beginning tagging format.
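A minimal checker makes the violation precise: an I-tag is uncoordinated whenever the preceding tag is not a B- or I-tag of the same slot type. This is a sketch of the definition used above, not the authors' evaluation code.

```python
def find_uncoordinated(tags):
    """Return positions where an I-tag does not continue a matching
    B-/I-tag of the same type, violating the IOB scheme."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            ok = prev[:2] in ("B-", "I-") and prev[2:] == tag[2:]
            if not ok:
                bad.append(i)
        prev = tag
    return bad

# "I-song" after "B-singer" is uncoordinated, as in the paper's example.
print(find_uncoordinated(["O", "B-singer", "I-song", "O"]))  # [2]
```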
To address this problem, we introduce a two-pass refine mechanism. As depicted in Figure 1, in addition to each token embedding in the utterance, we also element-wise add a slot tag embedding into the model. In the first pass, the initial slot tags are all set to "O", while in the second pass, the "B-tags" predicted in the first pass are used as the corresponding slot tag inputs. The two iterations share the same model and optimization goal, thus bringing no extra parameters.
Intuitively, in doing so, the model generates a draft in the first pass and tries to find the beginning of each slot chunk. In the second pass, by propagating the utterance again with the predicted "B-tags", the model is forced to learn how many identical "I-tags" follow them. Through this process, the predicted slot labels become more consistent, and the boundaries are identified more accurately. From a more general perspective, we can view this two-pass process as a trade-off between autoregression and non-autoregression, where the complete Markov chain process can be simplified as follows:

$$p(y^i, y^s \mid x) = p(y^i \mid x) \cdot p(\tilde{y}^s \mid y^i, x) \cdot p(y^s \mid \tilde{y}^s, y^i, x)$$

where $\tilde{y}^s$ denotes the tagging results from the first pass. The two-pass refine mechanism is similar to the multi-round iterative mechanisms in non-autoregressive machine translation (Gu et al., 2018; Ding et al., 2020; Kasai et al., 2020), such as Mask-Predict (Ghazvininejad et al., 2019). However, we argue that our method is more suitable for this task. The label dependency of a tagging task (e.g., slot filling) is simple: we only need to ensure that the tagging labels of a slot are consistent from beginning to end. Therefore, two iterations forcing the model to focus on the slot boundaries are, intuitively, enough for our task. Mask-Predict can also alleviate the problem caused by conditional independence; however, it is designed for a more complex goal and usually requires more iterations (e.g., 10) to achieve competitive performance, which largely reduces the inference speed.
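The control flow of the two passes can be sketched with a stub in place of the Transformer. The stub tagger below is entirely hypothetical (a real model would be learned); it only illustrates the mechanism: pass one runs with all-"O" tag inputs, only the predicted B-tags are kept as hints, and the same model runs once more.

```python
def two_pass_decode(tokens, tagger):
    """Two-pass refine: tag once with all-'O' tag inputs, keep only the
    predicted B-tags, and tag again with those B-tags as inputs.
    The same tagger is reused, so no extra parameters are introduced."""
    draft = tagger(tokens, ["O"] * len(tokens))
    b_hints = [t if t.startswith("B-") else "O" for t in draft]
    return tagger(tokens, b_hints)

def stub_tagger(tokens, tag_hints):
    # Stand-in for the shared Transformer. With no hints it emits a canned
    # draft containing an uncoordinated slot ("I-song" after "B-singer");
    # with B-tag hints it extends each hint consistently over its chunk.
    if all(h == "O" for h in tag_hints):
        return ["O", "B-singer", "I-song", "O"]   # draft with a bad I-tag
    out = []
    for i, h in enumerate(tag_hints):
        if h.startswith("B-"):
            out.append(h)
        elif i > 0 and tag_hints[i - 1].startswith("B-"):
            out.append("I-" + tag_hints[i - 1][2:])
        else:
            out.append("O")
    return out

print(two_pass_decode(["play", "some", "song", "now"], stub_tagger))
# ['O', 'B-singer', 'I-singer', 'O']  -- the I-tag is now coordinated
```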

Experiment
Datasets We choose two widely-used datasets: ATIS (Airline Travel Information Systems, Tur et al. (2010)) and Snips (collected by Snips personal voice assistant, Coucke et al. (2018)). Compared with ATIS, the Snips dataset is more complex due to its large vocabulary size, cross-domain intents and more out-of-vocabulary words.
Metrics Three evaluation metrics are used in our experiments. F1-score and accuracy are applied to the slot filling and intent detection tasks, respectively. Besides, we use sentence accuracy to indicate the proportion of utterances in the corpus whose slots and intent are both correctly predicted.
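The slot F1 and sentence accuracy metrics can be sketched as below. Slot F1 is computed at the chunk level in the usual CoNLL style; the exact scoring script the authors used may differ in details, so this is an illustrative implementation.

```python
def slot_chunks(tags):
    """Extract (type, start, end) chunks from an IOB tag sequence."""
    chunks, start = [], None
    for i, t in enumerate(tags + ["O"]):          # sentinel closes last chunk
        if start is not None and not (t.startswith("I-")
                                      and t[2:] == tags[start][2:]):
            chunks.append((tags[start][2:], start, i))
            start = None
        if t.startswith("B-"):
            start = i
    return chunks

def slot_f1(gold_seqs, pred_seqs):
    """Chunk-level F1: a slot counts only if type and span both match."""
    tp = fp = fn = 0
    for g, p in zip(gold_seqs, pred_seqs):
        gc, pc = set(slot_chunks(g)), set(slot_chunks(p))
        tp += len(gc & pc)
        fp += len(pc - gc)
        fn += len(gc - pc)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def sentence_accuracy(gold_intents, pred_intents, gold_slots, pred_slots):
    """Fraction of utterances with intent AND all slots correct."""
    correct = sum(gi == pi and gs == ps
                  for gi, pi, gs, ps in zip(gold_intents, pred_intents,
                                            gold_slots, pred_slots))
    return correct / len(gold_intents)
```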
Setup All embeddings are initialized with the Xavier method (Glorot and Bengio, 2010). The batch size is set to 32 and the learning rate to 0.001. We set the number of Transformer layers, attention heads, and hidden sizes to {2, 8, 64} and {4, 16, 96} for the ATIS and Snips datasets, respectively. In addition, we report the results of previous studies (Hakkani-Tür et al., 2016; Liu and Lane, 2016; Goo et al., 2018; Haihong et al., 2019; Qin et al., 2019) and conduct speed evaluations based on their open-source code.

Main Results
Table 1 summarizes the model performance on the ATIS and Snips corpora. SlotRefine consistently outperforms the other baselines on all three metrics. Compared with our basic non-autoregressive joint model in Section §2.1, SlotRefine achieves +1.18 and +1.55 sentence-level accuracy improvements on ATIS and Snips, respectively. It is worth noting that SlotRefine significantly improves the slot filling task (F1-score↑); we attribute the improvement to our two-pass mechanism successfully making the model learn better slot boundaries.

Speedup As each slot tagging result can be calculated in parallel with our approach, inference latency can be significantly reduced. As shown in Table 2, on the ATIS test set, our non-autoregressive model achieves a ×8.80 speedup over the existing state-of-the-art model (Haihong et al., 2019). After introducing the two-pass mechanism, SlotRefine still achieves a competitive inference speedup (×4.31). Decoding is conducted on a single Tesla P40 GPU. It is worth noting that for long sentences (length ≥ 12), the speedup reaches ×13 (not reported in the table).
Table 2: "Latency" is the average time to decode an utterance without minibatching. "Speedup" is measured against the existing SOTA model (Haihong et al., 2019).

Two-Pass Mechanism vs. CRF
In the SF task, a CRF is usually used to learn the dependencies between slot labels. The two most important dependency rules a CRF learns can be summarized as: tag O can only be followed by O or B, and tag B-* can only be followed by a same-type label I-* or by O; both can be perfectly addressed by our proposed two-pass mechanism. Experiments with +CRF can be found in Tables 1 & 2 ("Our Joint Model +CRF"): SlotRefine, equipped with the two-pass mechanism, outperforms +CRF by +0.89 on average while preserving a ×2.8 speedup, demonstrating that the two-pass mechanism can be a better substitute for CRF in this task in terms of both performance and efficiency.

Remedy Uncoordinated Slots in Training
We visualize the decrease in the number of uncoordinated slots during training on the ATIS dataset. As depicted in Figure 3, the uncoordinated errors of both the "One-Pass" and "Two-Pass" models decrease as training proceeds. Notably, the number of uncoordinated slots of the Two-Pass model drops significantly faster than that of the One-Pass model, achieving better convergence than +CRF after 50 epochs. This indicates that our proposed two-pass mechanism indeed remedies the uncoordinated slots problem, making slot filling more accurate.

SlotRefine with Pretraining
Recently, there have also been works based on the large-scale pretrained model BERT, where billions of tokens of external corpora are used and a tremendous number of model parameters are introduced. The number of parameters of BERT is orders of magnitude larger than ours, so it is unfair to compare the performance of SlotRefine with these models directly. To highlight the effectiveness of SlotRefine, we conduct experiments with two pretraining schemes, GloVe and BERT, for comparison. We find that both GloVe and BERT can further enhance SlotRefine, and it is worth noting that "SlotRefine w/ BERT" outperforms existing pretraining-based models. The detailed comparison can be found in Table 3.
For the BERT pretraining scheme, we follow the settings of prior work and equip the two-pass mechanism in the fine-tuning stage, where the CLS token is used for intent detection. For the GloVe pretraining scheme, we fix the pretrained word vectors and compress them into the same dimension as the input hidden size of SlotRefine with a dense network. It is worth noting that, through such a simple pretraining method, SlotRefine can achieve results very close to the BERT-based method. We conjecture that the benefits of pretraining on this task mainly come from alleviating the out-of-vocabulary (OOV) problem. One piece of evidence is that for Snips, whose test set has a large number of OOV words, the benefits of pretraining are very obvious; for ATIS, whose test set has few OOV words, only a small sentence accuracy gain (0.61 and 1.68 for GloVe and BERT, respectively) is obtained.

Conclusion
In this paper, we first reveal an uncoordinated slots problem in a classical language understanding task, i.e., slot filling. To address this problem, we present a novel non-autoregressive joint model for slot filling and intent detection with a two-pass refine mechanism (a non-autoregressive refiner), which significantly improves performance while substantially speeding up decoding. Further analyses show that our proposed non-autoregressive refiner has great potential to replace CRF, at least for the slot filling task.