Simple, Fast, Accurate Intent Classification and Slot Labeling for Goal-Oriented Dialogue Systems

With the advent of conversational assistants like Amazon Alexa and Google Now, dialogue systems are gaining traction, especially in industrial settings. These systems typically contain a spoken language understanding component that, in turn, performs two tasks: Intent Classification (IC) and Slot Labeling (SL). The two tasks are generally modeled jointly to achieve the best performance; however, joint modeling obscures the model's inner workings. In this work, we first design a framework that modularizes the joint IC-SL task to enhance architecture transparency. We then explore a number of self-attention, convolutional, and recurrent models, contributing a large-scale analysis of modeling paradigms for IC+SL across two datasets. Finally, using this framework, we propose a class of 'label-recurrent' models that are otherwise non-recurrent but maintain a 10-dimensional representation of the label history. We show that our proposed systems are easy to interpret, highly accurate (achieving over 30% error reduction in SL over the state-of-the-art on the Snips dataset), and fast, running inference at 2x the speed of comparable recurrent models while training in 1/2 to 2/3 of the time, thus giving them an edge in latency-critical real-world systems.


Introduction
At the core of task-oriented dialogue systems are spoken language understanding (SLU) models, tasked with determining the intent of users' utterances and labeling semantically relevant words at each turn of the conversation. Performance on these tasks, known as intent classification (IC) and slot labeling (SL), upper-bounds the utility of such dialogue systems. A large body of recent research has improved these models through the use of recurrent neural networks, encoder-decoder architectures, and attention mechanisms. However, for production dialogue systems in particular, system speed is at a premium, both during training and in real-time inference.
In this work, we propose fully non-recurrent and label-recurrent model paradigms, including self-attention and convolution, for comparison to state-of-the-art recurrent models in terms of accuracy and speed. To achieve this, we design a framework for joint IC-SL models that is modularized into distinct components and makes the task agnostic to the type of neural network used. This, in turn, makes the model architecture simpler and easier to understand, and renders the task network-agnostic, allowing for easier plug-and-play of existing components, such as pre-trained contextual word embeddings (Devlin et al., 2018). This is essential for easier model debugging and quicker experimentation, especially in industrial settings.
Using this framework, we identify three distinct model families of interest: fully recurrent, non-recurrent, and label-recurrent. Recent state-of-the-art models fall into the first category, as encoder-decoder architectures have recurrent encoders that perform word context encoding and predict slot label sequences using recurrent decoders that use both word and label information as they decode (Hakkani-Tür et al., 2016; Liu and Lane, 2016; Li et al., 2018). In the second category, we have 'non-recurrent' networks: fully feed-forward, attention-based, or convolutional models, for example. Lastly, we have a class of label-recurrent models, inspired by structured sequential models like conditional random fields on top of non-recurrent word contextualization components. In this class of models, slot label decoding proceeds such that label sequences are encoded by a recurrent component, but word sequences are not.
Our contributions are:
• A class of label-recurrent convolutional models that achieve state-of-the-art performance on Snips and competitive performance on ATIS while maintaining faster training and inference speeds than fully-recurrent models.
• A new modular framework for joint IC-SL models that decomposes these joint models into separate components for word context encoding, summarization of the sentence into a single vector for intent classification, and modeling of dependencies in the output space of slot label sequences, permitting the analysis of individual modeling components.
• An in-depth analysis of different word contextualizations for the spoken language understanding task (for instance, providing evidence for the intuition that explicitly focusing on local context is a useful architectural inductive prior for slot labeling).

Prior Work
There is a large body of research applying recurrent modeling advances to intent classification and slot labeling (frequently called spoken language understanding). Traditionally, word n-grams were used for intent classification with SVM classifiers (Haffner et al., 2003) and Adaboost (Schapire and Singer, 2000). For the SL task, CRFs (Gorin et al., 1997) have been used in the past.
Recently, a larger focus has been on joint modeling of the IC and SL tasks. Long short-term memory recurrent neural networks (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit models (Cho et al., 2014) were proposed for slot labeling by Yao et al. (2014) and Zhang and Wang (2016) respectively, while Guo et al. (2014) used recursive neural networks. Subsequent improvements to recurrent neural modeling techniques, like bidirectionality and attention (Bahdanau et al., 2014), were incorporated into IC+SL in recent years as well (Hakkani-Tür et al., 2016; Liu and Lane, 2016). Li et al. (2018) introduced a self-attention-based joint model, using self-attention and LSTM layers along with a gating mechanism.
Non-recurrent modeling for language has been revisited recently, even as recurrent techniques remain dominant. Dilated CNNs (Yu and Koltun, 2015) with CRF label modeling were applied to named entity recognition by Strubell et al. (2017), and CNNs were applied to SL earlier by Xu and Sarikaya (2013). Convolutional and attention-based sentence encoders have been applied to complex tasks, including machine translation, natural language inference, and parsing (Gehring et al., 2017; Vaswani et al., 2017; Shen et al., 2017; Kitaev and Klein, 2018). We draw from both of these bodies of work to propose a simple yet highly effective family of IC+SL models.

A general framework of joint IC+SL
Intent classification and slot labeling take as input an utterance x_{1:T} = {x_1, x_2, ..., x_T}, composed of words x_i and of length T. Models construct a distribution over intents and slot label sequences given the utterance, p(c, l_{1:T} | x_{1:T}), where c ∈ I, a fixed set of intents, and l_i ∈ L, a fixed set of slot labels. One intent is assigned per utterance and one slot label is assigned per word. Models are trained to minimize the cross-entropy loss between the assigned distribution and the training data. To construct this distribution, our framework explicitly separates the following components, which are explicitly or implicitly present in all joint IC+SL systems (Figure 1):

Word contextualization
We first assume words are encoded through an embedding layer, providing context-independent word vectors. Overloading notation, we denote the embedded sequence x_{1:T}, with x_i ∈ R^{d_x}.
In this component, word representations are enriched with sentential context. Each word x_i is assigned a contextualized representation h_i. To ease layering these components, we keep the dimensionality the same as the word embeddings: h_i ∈ R^{d_x}. Our study consists mainly of varying this component across models, which are described in detail in Section 4. In all models, we assume independence of intent classification and slot labeling given the learned representations:

p(c, l_{1:T} | h_{1:T}) = p(c | h_{1:T}) p(l_{1:T} | h_{1:T}).

Sentence representation
In this component, the output of the word contextualization component is summarized in a single vector s ∈ R^{d_x}. For all our experiments, we keep this component constant, using a simple attention-like pooling: a weighted sum of the contextualized word representations at each position in the sentence, with the weights computed as a softmax over per-position scores. While simple, this model gives word contextualization components freedom in how they encode sentential information; for example, self-attention models may spread full-sentence information across all words, whereas unidirectional LSTMs may concentrate full-sentence information in the last word's vector.
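As a concrete illustration, the pooling step can be sketched as follows; the per-position score is shown here as a dot product with a fixed vector `w`, which stands in for whatever learned scorer the model uses (an assumption made for illustration):

```python
import math

def attention_pool(h):
    """Attention-like pooling: score each contextualized word vector,
    softmax-normalize the scores over positions, and return the
    weighted sum s of the word vectors.

    h: list of T contextualized word vectors (each a list of floats).
    """
    d = len(h[0])
    w = [1.0 / d] * d  # placeholder for a learned scoring vector
    scores = [sum(wi * hi for wi, hi in zip(w, vec)) for vec in h]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]  # attention weights over positions
    return [sum(a * vec[k] for a, vec in zip(alphas, h)) for k in range(d)]
```

Because the weights sum to one, pooling a sequence of identical vectors returns that vector unchanged, while positions with higher scores dominate the summary otherwise.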

Intent prediction
In this component, the sentence representation is used as features to predict the intent of the utterance. For all experiments, we keep this component fixed as well, using a simple two-layer feed-forward block on top of s.

Slot label prediction
In this component, the output of the word contextualization component is used to construct a distribution over slot label sequences for the utterance. We decompose the joint probability of the label sequence given the contextualized word representations into a left-to-right labeling:

p(l_{1:T} | h_{1:T}) = ∏_{i=1}^{T} p(l_i | l_{1:i-1}, h_{1:T}).

In our experiments, we explore two models for slot prediction: one fully parallelizable because of strong independence assumptions, the other permitting a constrained dependence between labeling decisions that we call 'label-recurrent'.

Independent slot prediction
The first is a non-recurrent model, which assumes independence between all labeling decisions once given h_{1:T}, as well as independence from all word representations except that of the word being labeled:

p(l_i | l_{1:i-1}, h_{1:T}) = p(l_i | h_i).

This model is fully parallelizable on GPU architectures, and the probability of each labeling decision is modeled according to the predictor features p_{i,SL} = h_i (Eqn. 7); hence, SL prediction features are learned using each contextualized word independently.
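A minimal sketch of this independent prediction, with a single linear layer standing in for the learned feed-forward predictor (an assumption made for brevity):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def independent_slot_predict(h, W):
    """Each position is labeled from its own contextualized vector h_i
    alone: W maps h_i to per-label scores, which are softmaxed into a
    distribution. No position depends on any other, so all T positions
    can be scored in one parallel batch on a GPU."""
    return [softmax([sum(wi * hi for wi, hi in zip(row, vec)) for row in W])
            for vec in h]
```

Here `W` is one score row per slot label; in the actual model the predictor is learned jointly with the rest of the network.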

Label-recurrent slot prediction
The second class of slot prediction models we consider leads to our classification 'label-recurrent'. These models permit dependence of labeling decisions on the sequence of decisions made so far, but keep the independence assumption on the word representations:

p(l_i | l_{1:i-1}, h_{1:T}) = p(l_i | l_{1:i-1}, h_i).

Notably, this family of models excludes traditional encoder-decoder models, since the decoder component uses labeling decisions l_{1:i-1} and earlier word representations h_{1:i-1} to influence the predictor features p_{i,SL}. However, it includes models such as CNN-CRF.
Because the space of label sequences in slot labeling is much smaller than the space of word sequences, recurrence over labels adds minimal computational burden while still permitting the model to benefit from GPU parallelism during the computation of h_{1:T}.
For our experiments, we propose a single label-recurrent model, which encodes labeling histories l_{1:i-1} using only a 10-dimensional LSTM. First, slot labels are embedded, such that for each l ∈ L, we have an embedding l ∈ R^{d_l}. An initial tag history state, h^tag_0, is randomly initialized. Each tag decision is fed along with the previous tag history state to the LSTM, which returns the next tag history state:

h^tag_i = LSTM(l_i, h^tag_{i-1}).

We omit a precise description of the LSTM model for space, referring the reader to Hochreiter and Schmidhuber (1997).
The tag history is used at each prediction step as an additional input to construct the predictor features p_{i,SL}, replacing Eqn. 7 with

p_{i,SL} = [h_i; h^tag_i],

where [a; b] denotes concatenation. This model and other label-recurrent models are not only more parallelizable than fully-recurrent models, but also provide an architectural inductive bias, separating the modeling of tag sequences from the modeling of word sequences. In our experiments, we perform greedy decoding to maintain a high decoding speed.
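The greedy label-recurrent decoding loop can be sketched as follows; `score` and `tag_rnn` are placeholders for the learned predictor and the 10-dimensional tag LSTM, which we do not implement here:

```python
def greedy_label_recurrent_decode(h, score, tag_rnn, h_tag0):
    """Greedy label-recurrent decoding: word representations h_i are
    fixed (computed in parallel beforehand), while a small recurrent
    state h_tag summarizes the labels decoded so far.

    score:   maps the concatenated features [h_i; h_tag] to per-label scores
    tag_rnn: folds the chosen label into the tag history state
    """
    labels, h_tag = [], h_tag0
    for h_i in h:
        scores = score(h_i + h_tag)  # predictor features [h_i; h_tag]
        l_i = max(range(len(scores)), key=scores.__getitem__)  # greedy argmax
        labels.append(l_i)
        h_tag = tag_rnn(l_i, h_tag)  # update the label-history state
    return labels
```

Only the tag history is recurrent here; the expensive word contextualization stays outside the loop, which is what preserves the speed advantage over fully-recurrent decoders.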

Word contextualization models
In this section, we describe the word contextualization models, with the goal of identifying non-recurrent architectures that achieve high accuracy at greater speed than recurrent models.

Feed-forward model
In this model, we set h_{1:T} = x_{1:T} + a_{1:T}, where a_{1:T} is a learned absolute position representation, with one vector learned per absolute position, as used in Gehring et al. (2017). While extremely simple, this model provides a useful baseline as a totally context-free model. It also permits us to analyze the contribution of a label-recurrent component in such a context-deprived scenario.

Self-attention models
Recent work in non-recurrent modeling has surfaced a number of variants of attention-based word context modeling. The simplest constructs each h_i by incorporating a weighted average of the rest of the sequence, x_{1:T} \ x_i. We use a general bilinear attention mechanism with a residual connection, masking out the identity in the attention weights.
In this and all subsequent models, we optionally stack multiple layers, feeding the word representations from each layer into the next; in this case we denote the models ATTN-1L, ATTN-2L, etc.
We also analyze multi-head attention models, drawing from Vaswani et al. (2017). For a model with k heads, we construct one projection matrix per head and use it to transform each word vector into a head-specific representation in R^{d_x/k}. These are passed into the attention equations above, generating context vectors c_i^1, ..., c_i^k ∈ R^{d_x/k}, which are then concatenated to form a vector in R^{d_x}. These context vectors are usually sent through a linear transformation to combine features between the heads, the output of which is c_i, but we found that omitting this combining transformation leads to significantly improved results, so we omit it in all experiments. We denote these models K-HEAD ATTN.
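A minimal sketch of the single-head self-attention described above, with the identity masked out and a residual connection; the learned bilinear matrix is taken as the identity here for brevity, so the score is a plain dot product:

```python
import math

def self_attend(x):
    """One self-attention layer over a sequence of word vectors x.
    Each position attends to every other position (the identity is
    masked with -inf so a word never attends to itself), and the
    resulting context vector is added back residually."""
    T, d = len(x), len(x[0])
    h = []
    for i in range(T):
        scores = []
        for j in range(T):
            if j == i:
                scores.append(float("-inf"))  # mask out the identity
            else:
                scores.append(sum(a * b for a, b in zip(x[i], x[j])))
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        c = [sum((ei / z) * x[j][k] for j, ei in enumerate(e))
             for k in range(d)]              # attention-weighted context
        h.append([xi + ci for xi, ci in zip(x[i], c)])  # residual connection
    return h
```

With two positions, each word's output is its own vector plus the other word's vector, since all attention weight goes to the single unmasked position.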

Relative position representations
We found in early experiments that the absolute position embeddings in self-attention models are insufficient for representing order. Hence, in all attention models except when explicitly noted, we use relative position representations. We follow Shaw et al. (2018), who improved on the absolute position representations of the Transformer model (Vaswani et al., 2017) by learning vector representations of relative positions and incorporating them into the self-attention mechanism, where v_{f(i,j)} is a learned vector representing how the relative positions i and j should be incorporated, and b_{f(i,j)} is a learned bias that determines how the relative position should affect the weight given to position j when contextualizing position i. The function f determines which relative positions to group together under a single relative position vector. Given the generally small datasets in IC+SL, we use a relative position function that buckets relative positions together in exponentially larger groups as distance increases, following the finding of Khandelwal et al. (2018) that LSTMs represent position only fuzzily at long relative distances.
This is similar to the very recent preprint of Bilan and Roth (2018), who use linearly increasing bucket sizes; we found exponentially increasing sizes to work well compared to the constant bucket sizes of Shaw et al. (2018).
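One plausible instantiation of such an exponential bucketing function f is sketched below; the exact grouping used in the model may differ, but the idea is that nearby offsets get their own bucket while distant offsets share ever-larger buckets:

```python
import math

def relative_position_bucket(i, j):
    """Map a pair of positions to a signed bucket index. Distances are
    grouped into exponentially larger buckets as |j - i| grows
    (1 -> bucket 1, 2-3 -> bucket 2, 4-7 -> bucket 3, ...), and the
    sign of the offset is preserved so left and right context are
    distinguished."""
    offset = j - i
    if offset == 0:
        return 0
    sign = 1 if offset > 0 else -1
    bucket = int(math.log2(abs(offset))) + 1  # exponentially wider groups
    return sign * bucket
```

Each bucket index would then select one learned vector v and bias b, so the number of relative-position parameters grows logarithmically rather than linearly with the maximum distance.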

Convolutional models
Convolution incorporates local word context into word representations, where the kernel width parameter specifies the total size (in words) of the local context considered. Each convolutional layer produces a vector representation of each word and includes a residual connection and variance normalization, following Gehring et al. (2017).
To maintain the dimensionality of h_i as R^{d_x}, we use a filter count of d_x. We vary the number of CNN layers as well as the kernel width, and for all models use a variant known as dilated CNNs. These CNNs incorporate distant context into word representations by skipping an increasing number of nearby words in each subsequent convolutional pass. We use an exponentially increasing dilation size: in the first layer, words at distance 1 are incorporated; at layer two, words at distance 2, then 4, etc. This permits large contexts to be incorporated into word representations while keeping kernel sizes and the number of layers low.

Recurrent models
We also construct a recurrent word contextualization model, largely identical to the encoders of recent state-of-the-art models. We use a bidirectional LSTM to encode word contexts, h_{1:T} = BiLSTM(x_{1:T}). As with all other models, we report the performance of this model with feed-forward slot label prediction as well as with label-recurrent slot label prediction. Though similar to earlier work, both models are new; though the latter is recurrent both in word contextualization and in slot label prediction, it is distinct from past models in that the two recurrent components are completely decoupled until the prediction step.

Datasets
We evaluate our framework and models on the ATIS dataset (Hemphill et al., 1990) of spoken airline reservation requests and the Snips NLU Benchmark set (Coucke et al., 2018). The ATIS training set contains 4978 utterances from the ATIS-2 and ATIS-3 corpora; the test set consists of 893 utterances from the ATIS-3 NOV93 and DEC94 datasets. The number of slot labels is 127, and the number of intent classes is 18. Only the words themselves are used as input; no additional tags are used. The Snips 2017 dataset is a collection of 16K crowdsourced queries, with about 2400 utterances for each of 7 intents. These intents range from 'Play Music' to 'Get Weather'. The training data contains 13784 utterances and the test data consists of 700 utterances. The utterance tokens are mixed-case, unlike the ATIS dataset, where all tokens are lowercased. The total number of slot labels is 72. We use IOB tagging, and split 10% of the training set off to form a development set. Utterances in Snips are, on average, short, at 9.15 words per utterance compared to ATIS' 11.2. However, slot label spans are longer in Snips, averaging 1.8 tokens per span to ATIS' 1.2, making span-level slot labeling more difficult. For our development experiments, we use the casing and tokenization provided by Snips.Co, but to compare to prior work, in one test experiment we use the lowercased, tokenized version of Goo et al. (2018).

Experiments
We evaluate multiple models from each of our model paradigms to help determine what modeling structures are necessary for SLU, and where the best accuracy-speed tradeoffs are. First, we report extensive evaluation across the Snips and ATIS development sets, tracking inference speed and time to convergence along with the usual IC accuracy and SL F1. Second, we pick a small number of our best-performing models to evaluate on the ATIS and Snips test sets, to compare against prior work.
For each experiment below, we train until convergence, where convergence is defined by an early stopping criterion with a patience of 30 epochs, using the average of development-set IC accuracy and token-level SL F1 as the performance metric.
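The convergence criterion can be sketched as follows; `evaluate_dev` is a hypothetical callback, not part of the paper, that runs one training epoch and returns the development-set IC accuracy and token-level SL F1:

```python
def train_with_early_stopping(evaluate_dev, max_epochs=1000, patience=30):
    """Early stopping on the average of dev IC accuracy and SL F1:
    training halts once `patience` epochs pass without improving on
    the best average seen so far. Returns the best epoch and metric."""
    best, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        ic_acc, sl_f1 = evaluate_dev(epoch)
        metric = (ic_acc + sl_f1) / 2.0  # combined early-stopping metric
        if metric > best:
            best, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best
```

Averaging the two metrics prevents the criterion from stopping on a model that trades away all its SL accuracy for IC accuracy, or vice versa.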

Modeling study experiments
In our first category of experiments, we evaluate variants of each word contextualization paradigm introduced.
We evaluate one feed-forward word contextualization module (labeled FEED-FORWARD) to provide baseline performance. As with all subsequent models, we evaluate this word contextualization module with and without our proposed label-recurrent decoder. This baseline should help us determine the extent to which each dataset requires the modeling of context.
We evaluate 3 convolutional word contextualization modules. The first has 1 layer with a kernel size of 5, and is intended to provide intuition as to whether a relatively large local context can sufficiently model SL behavior. We label this model CNN, 5KERNEL, 1L, and name all other CNN models similarly. The next model has 3 layers with kernel size 5, and is dilated. This model incorporates long-distance context hierarchically, and is shallower and wider-per-layer than the otherwise-similar third CNN model, which has 4 layers and kernel size 3.
We evaluate 4 attention-based word contextualization modules. The first is simple, with 1 attention head and 1 layer. Unlike all others we analyze, it does not use relative position embeddings; thus, this model is word order-invariant except for a simple absolute position embedding. If it improves over FEED-FORWARD, it provides strong evidence that semantic information from context words, irrespective of order, is useful in making tagging decisions. We label this model with the flag NO-POS. To evaluate the utility of relative position embeddings, we also compare a model with 1 head and 1 layer, labeled ATTN, 1HEAD, 1L. We then test two increasingly complex models, the first with 3 layers and 1 head, the second with 3 layers and 2 heads per layer.
We evaluate 2 LSTM-based word contextualization modules; one uses a single LSTM layer, whereas the other stacks a second on top of the first. As with all other models, we test these two modules both with independent slot prediction and with label-recurrent slot prediction.

Comparison to prior work
For our second category of experiments, we take a few high-performing models from our analysis and evaluate them on the Snips and ATIS test sets for comparison to prior work.For these models, we report not only the average IC accuracy and SL F1 across random initializations, but also the standard deviation and best model, as most work has not reported average values.We keep all hyperparameters fixed across all experiments, potentially hindering performance but providing a stronger analysis of robustness.
Note on pre-trained contextual word embeddings: Although our framework permits easy integration of pre-trained contextual embeddings such as BERT (Devlin et al., 2018) and ELMo (Gardner et al., 2017) by replacing the word contextualization component, we exclude them from our experiments in order to keep the models transparent and to ensure a fair comparison against baselines.

Results and discussion
In this section, we draw from results reported in Table 1, on the development sets of Snips and ATIS. It is easy to see that very little in the way of modeling is necessary for the IC task, so we focus our analysis on the SL task. We emphasize that ATIS has shorter spans than Snips, averaging 1.2 and 1.8 tokens respectively, leading to differing modeling requirements.

Minimal modeling for SLU
By analyzing three simple models (FEED-FORWARD, ATTN-1HEAD-1L-NO-POS, and CNN-5KERNEL-1L), we conclude that explicitly incorporating local features is a useful inductive bias for high SL accuracy. The purely feed-forward model achieves 53.59 SL F1 on Snips, whereas one layer of convolution improves that number to 85.88. The story is similar for ATIS SL. However, a single layer of attention without position information fails to improve over the feed-forward model whatsoever, which we believe is due to the order-invariant nature of self-attention. This further emphasizes that focusing on local context is a useful inductive prior for the SL task.
For each of these simple models, switching from independent slot label prediction to label-recurrent prediction provides large gains on both datasets. We find an approximate 1.3 ms/utterance slowdown from using label recurrence across all models. Thus, in terms of accuracy-for-speed, very simple models can achieve much of the performance of more expensive models as long as they are label-recurrent and incorporate local context.

High-performing convolutional models
The larger convolutional models provide very high accuracy while maintaining fast inference and training speeds. In particular, our best CNN model, CNN-5KERNEL-3L, achieves 94.22 SL F1 on Snips, compared to 93.88 for the two-layer LSTM with label recurrence. The CNN achieves this modest improvement at over 2x the inference speed, trains in under 1/2 the time, and demonstrates even stronger results on the test sets, discussed below.
On ATIS, where utterances are longer but slot label spans are shorter, LSTMs outperform CNNs on the development sets.

Issues with self-attention
Our strongest self-attention model underperforms CNNs and LSTMs on both Snips and ATIS, with a maximum SL F1 of 89.31 and 95.86 on the two datasets, respectively. Though self-attention models have seen success in complex tasks with large training sets, our results suggest that they lack the inductive biases to perform well on these small datasets.
Relative position embeddings go a long way in improving self-attention models; adding them to a 1-layer attentional encoder improves ATIS and Snips SL by approximately 24 and 22 points, respectively. We find that adding attention heads does not add considerably to the computational complexity of attention models while increasing accuracy; thus, in a speed-accuracy tradeoff, it is likely better to add heads rather than layers, as each layer adds O(n^2 · d_x) additional computation.

Word and label recurrence in LSTMs
Our LSTM word contextualization modules show that with recurrent word context modeling, label recurrence is less important. For instance, the 2-layer LSTM achieves only a .78-point increase in SL F1 with label recurrence over independent prediction.

Best models compared to prior work
We report test set results on Snips and ATIS in Tables 2 and 3. Our best models from our validation study, CNN-5KERNEL-3L and LSTM-2L, outperform the state-of-the-art on the Snips dataset, with label recurrence proving crucial, especially for Snips. In particular, CNN-5KERNEL-3L with label recurrence achieves an average SL F1 of 92.30, improving over the previous state-of-the-art of 88.8 (a 30% error reduction), along with a .57-point improvement on IC.
On ATIS, our label-recurrent models outperform the slot-gated LSTM model of Goo et al. (2018) on both the IC and SL tasks. Wang et al. (2018) attribute their result to using IC- and SL-specific LSTMs, with 300-dimensional word embeddings and 200-dimensional LSTMs; but with an ATIS vocabulary of 867 words (suggesting a relatively simple sequence space), we are unable to determine the source of the improvement from a modeling standpoint. A similar observation holds for Li et al. (2018), who use 264-dimensional embeddings. It is quite possible that such models are severely overfitted.
We hypothesize that our models perform better on Snips because much of Snips slot labeling depends on consistency within long spans, whereas ATIS slot labels have longer-distance dependencies, for example between 'to city' and 'from city' tags.

Attention Visualization
We note that, anecdotally, few words in each utterance are useful in indicating the intent. In the example given in Figure 2, the presence of possible departure and arrival cities may be distracting, but the attention mechanism correctly learns to focus on words that indicate the atis_aircraft intent.

Conclusion
We presented a general family of joint IC+SL neural architectures that decomposes the task into modules for analysis. Using this framework, we conducted an extensive study of word contextualization methods (including the utility of recurrence in the representation and output spaces) and determined that label-recurrent models, with a non-recurrent word representation and a recurrent model of slot label dependencies, are a good fit for achieving high performance in both accuracy and speed.
With the results of this study, we proposed a convolution-based joint IC+SL model for SLU that achieves new state-of-the-art results on the Snips dataset while maintaining a simple design, shorter training, and faster inference than comparable recurrent methods.

Implementation details
For all models, we randomly initialize word embeddings and use d_x = 70. We optimize using the Adadelta algorithm (Zeiler, 2012) with an initial learning rate of .01. We clip and pad all training and development sentences to length 30, with clipping affecting a small number of utterances. A dropout (Srivastava et al., 2014) probability of .3 is used in all models. We train using a batch size of 128 split across 4 GPUs on a p3.8xlarge EC2 instance, and perform inference using CPUs on the same machine.

Figure 1 :
Figure 1: A general framework of joint IC+SL, decoupling modeling tasks to permit the analysis of each component independently.

Figure 2 :
Figure 2: Visualization of the weight given to each token representation by the attention-based pooling for sentence representation. Lighter colors indicate greater attention.

Table 1 :
Development results on the Snips 2017 and ATIS datasets, comparing models from the feed-forward, convolutional, self-attention, and recurrent paradigms, as well as comparing non-recurrent, label-recurrent, and fully recurrent architectures, on IC, SL, inference speed, and training time. Inference speed, convergence time, and parameter count are drawn from the Snips experiments, but the trends hold on ATIS. The best IC and SL for each dataset are bolded within each model paradigm to help compare between paradigms.

Table 3 :
Test set results on the ATIS dataset, compared to recent recurrent models.