Learning to Classify Intents and Slot Labels Given a Handful of Examples

Intent classification (IC) and slot filling (SF) are core components in most goal-oriented dialogue systems. Current IC/SF models perform poorly when the number of training examples per class is small. We propose a new few-shot learning task, few-shot IC/SF, to study and improve the performance of IC and SF models on classes not seen at training time in ultra low resource scenarios. We establish a few-shot IC/SF benchmark by defining few-shot splits for three public IC/SF datasets, ATIS, TOP, and Snips. We show that two popular few-shot learning algorithms, model agnostic meta learning (MAML) and prototypical networks, outperform a fine-tuning baseline on this benchmark. Prototypical networks achieves significant gains in IC performance on the ATIS and TOP datasets, while both prototypical networks and MAML outperform the baseline with respect to SF on all three datasets. In addition, we demonstrate that joint training as well as the use of pre-trained language models, ELMo and BERT in our case, are complementary to these few-shot learning methods and yield further gains.


Introduction
In the context of goal-oriented dialogue systems, intent classification (IC) is the process of classifying a user's utterance into an intent, such as Book-Flight or AddToPlaylist, that refers to the user's goal. Slot filling (SF), in turn, is the process of identifying and classifying certain tokens in the utterance into their corresponding labels, in a manner akin to named entity recognition (NER). In contrast to NER, however, typical slots are particular to the domain of the dialogue, such as music or travel. As a reference point, we list intent and slot label annotations for an example utterance from the SNIPS dataset with the AddToPlaylist IC in Figure 1. Most recent state-of-the-art IC/SF models are based on feed-forward, convolutional, or recurrent neural networks (Hakkani-Tür et al., 2016; Goo et al., 2018; Gupta et al., 2019). These neural models offer substantial gains in performance, but they often require a large number of labeled examples (on the order of hundreds) per intent class and slot-label to achieve these gains. The relative scarcity of large-scale datasets annotated with intents and slots prohibits the use of neural IC/SF models in many promising domains, such as medical consultation, where it is difficult to obtain large quantities of annotated dialogues.
Accordingly, we propose the task of few-shot IC/SF, catering to domain adaptation in low resource scenarios, where there are only a handful of annotated examples available per intent and slot in the target domain. To the best of our knowledge, this work is the first to apply the few-shot learning framework to a joint sentence classification and sequence labeling task. In the NLP literature, few-shot learning often refers to a low resource, cross-lingual setting where there is limited data available in the target language. We emphasize that our definition of few-shot IC/SF is distinct in that we limit the amount of data available per target class rather than per target language.
Few-shot IC/SF builds on a large body of existing few-shot classification work. Drawing inspiration from computer vision, we experiment with two prominent few-shot image classification approaches, prototypical networks and model agnostic meta learning (MAML). Both methods seek to decrease over-fitting and improve generalization on small datasets, albeit via different mechanisms. Prototypical networks learns class specific representations, called prototypes, and performs inference by assigning the class label associated with the prototype closest to an input embedding. MAML, in contrast, modifies the learning objective to optimize for pre-training representations that transfer well when fine-tuned on a small number of labeled examples.
For benchmarking purposes, we establish few-shot splits for three publicly available IC/SF datasets: ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018), and TOP. Empirically, prototypical networks yields substantial improvements on this benchmark over the popular "fine-tuning" approach (Goyal et al., 2018; Schuster et al., 2018), where representations are pre-trained on a large, "source" dataset and then fine-tuned on a smaller, "target" dataset. Despite performing worse on intent classification, MAML also achieves gains over "fine-tuning" on the slot filling task. Orthogonally, we experiment with the use of two pre-trained language models, BERT and ELMO, as well as joint training on multiple datasets. These experiments show that the use of pre-trained, contextual representations is complementary to both methods, while prototypical networks is uniquely able to leverage joint training to consistently boost slot filling performance.
In summary, our primary contributions are fourfold:
1. Formulating IC/SF as a few-shot learning task;
2. Establishing few-shot splits for the ATIS, SNIPS, and TOP datasets (few-shot split intent assignments are given in section A.1);
3. Showing that MAML and prototypical networks can outperform the popular "fine-tuning" domain adaptation framework;
4. Evaluating the complementarity of contextual embeddings and joint training with MAML and prototypical networks.
Related Work

Few-shot Learning
Early adoption of few-shot learning in the field of computer vision has yielded promising results. Neural approaches to few-shot learning in computer vision fall mainly into three categories: optimization-, metric-, or memory-based. Optimization-based methods typically learn an initialization or fine-tuning procedure for a neural network. For instance, MAML (Finn et al., 2017) directly optimizes for representations that generalize well to unseen classes given a few labeled examples. Using an LSTM-based meta-learner, Ravi and Larochelle (2016) learn both the initialization and the fine-tuning procedure. In contrast, metric-based approaches learn an embedding space or distance metric under which examples belonging to the same class have high similarity. Prototypical networks (Snell et al., 2017), siamese neural networks (Koch, 2015), and matching networks (Vinyals et al., 2016) all belong to this category. Alternatively, memory-based approaches apply memory modules or recurrent networks with memory, such as an LSTM, to few-shot learning. These approaches include differentiable extensions to k-nearest-neighbors (Kaiser et al., 2017) and applications of Neural Turing Machines (Graves et al., 2014; Santoro et al., 2016).

Few-shot Learning for Text Classification
To date, applications of few-shot learning to natural language processing focus primarily on text classification tasks. One line of work identifies "clusters" of source classification tasks that transfer well to a given target task and meta-learns a linear combination of similarity metrics across "clusters". The source tasks with the highest likelihood of transfer are used to pre-train a convolutional network that is subsequently fine-tuned on the target task. Han et al. (2018) propose FewRel, a few-shot relation classification dataset, and use this data to benchmark the performance of few-shot models, such as prototypical networks and SNAIL (Mishra et al., 2017). ATAML (Jiang et al., 2018), one of the few optimization-based approaches to few-shot sentence classification, extends MAML to learn task-specific as well as task-agnostic representations using feed-forward attention mechanisms. Dou et al. (2019) show that further pre-training of contextual representations using optimization-based methods benefits downstream performance.

Few-shot Learning for Sequence Labeling
In one of the first works on few-shot sequence labeling, Fritzler et al. (2019) apply prototypical networks to few-shot named entity recognition by training a separate prototypical network for each named entity type. This design choice makes their extension of prototypical networks more restrictive than ours, which trains a single model to classify all sequence tags. Hou et al. (2019) apply a CRF-based approach, which learns emission scores using pre-trained, contextualized embeddings, to few-shot SF (on SNIPS) and few-shot NER.

Few-shot Classification
The goal of few-shot classification is to adapt a classifier f_φ to a set of new classes L not seen at training time, given a few labeled examples per class l ∈ L. In this setting, train and test splits are defined by disjoint class label sets, L_train and L_test, respectively. The classes in L_train are made available for pre-training and those in L_test are held out for low resource adaptation at test time. Few-shot evaluation is done episodically, i.e. over a number of mini adaptation datasets, called episodes. Each episode consists of a support set S and a query set Q. The support set contains k_l labeled examples S_l = {(x_l^i, y_l) | i ∈ (1...k_l)} per held out class l ∈ L; we define S = ∪_{l∈L} S_l. Similarly, the query set Q contains k_q labeled instances per held out class. The support set provides a few labeled examples of new classes not seen at training time that f_φ must adapt to, i.e. learn to classify, whereas the query set is used for evaluation. The definition of few-shot classification requires only that evaluation is done on episodes; however, most few-shot learning methods train as well as evaluate on episodes. Consistent with prior work, we train both MAML and prototypical networks on episodes, as opposed to mini-batches.
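To make the episodic setup concrete, the following sketch builds one episode from a pool of labeled examples by drawing disjoint support and query sets for each held-out class. The helper name and data layout are illustrative, not the authors' code:

```python
import random

def make_episode(examples_by_class, support_shot, query_shot, seed=0):
    """Build one few-shot episode: `support_shot` support examples and
    `query_shot` query examples per held-out class, disjoint by construction."""
    rng = random.Random(seed)
    support, query = [], []
    for label, examples in examples_by_class.items():
        assert len(examples) >= support_shot + query_shot
        sampled = rng.sample(examples, support_shot + query_shot)
        # The support set is used for adaptation, the query set for evaluation.
        support += [(x, label) for x in sampled[:support_shot]]
        query += [(x, label) for x in sampled[support_shot:]]
    return support, query

pool = {
    "BookFlight": [f"flight-{i}" for i in range(10)],
    "PlayMusic": [f"music-{i}" for i in range(10)],
    "GetWeather": [f"weather-{i}" for i in range(10)],
}
support, query = make_episode(pool, support_shot=2, query_shot=3)
```

In this balanced illustration every class shares one shot k; the benchmark's actual episodes sample a per-class shot k_l, as described later.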

Few-shot IC/SF
Few-shot IC/SF extends the prior definition of fewshot classification to include both IC and SF tasks.
As Geng et al. (2019) showed, it is straightforward to formulate IC as a few-shot classification task: simply let the class labels y_l in the definition above correspond to IC labels and partition the set of ICs into the train and test splits, L_train and L_test. Building on this few-shot IC formulation, we re-define the support and query sets to include the slots t_l, in addition to the intent y_l, assigned to each example x_l; the support and query instances thus become triples (x_l, t_l, y_l). To construct an episode, we sample a total of k_l + k_q labeled examples per IC l ∈ L to form the support and query sets. Since utterances can exhibit many unique slot-label combinations, it is possible to sample an episode such that a slot-label appears in only the query or support set. Therefore, to ensure fair evaluation, we "mask" any slot-label that appears in only the query or support set by replacing it with the Other slot label, which is ignored by our SF evaluation metric.
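A minimal sketch of this masking step, assuming simple (non-BIO) slot labels for clarity:

```python
def mask_unshared_slots(support_tags, query_tags, other="Other"):
    """Replace any slot label that appears in only the support set or only
    the query set with `Other`, which the SF evaluation metric ignores."""
    support_labels = {t for tags in support_tags for t in tags if t != other}
    query_labels = {t for tags in query_tags for t in tags if t != other}
    shared = support_labels & query_labels
    def mask(tag_seqs):
        return [[t if t in shared or t == other else other for t in tags]
                for tags in tag_seqs]
    return mask(support_tags), mask(query_tags)

sup = [["Other", "artist", "playlist"]]
qry = [["artist", "album", "Other"]]
sup_masked, qry_masked = mask_unshared_slots(sup, qry)
# "playlist" occurs only in the support set and "album" only in the query
# set, so both are masked; "artist" occurs in both and is kept.
```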

Prototypical Networks for Joint Intent Classification and Slot Filling
The original formulation of prototypical networks (Snell et al., 2017) is not directly applicable to sequence labeling. Accordingly, we extend prototypical networks to perform joint sentence classification and sequence labeling. Our extension computes "prototypes" c_l and c_a for each intent class l and slot-label a, respectively. Each prototype c ∈ R^D is the mean vector of the embeddings belonging to a given intent class or slot-label class. These embeddings are output by a sequence encoder f_φ : x → R^D, which takes a variable-length utterance x of m tokens as input and outputs the final hidden state h ∈ R^D of the encoder. For ease of notation, let S_l = {(x_l^i, t_l^i, y_l)} be the support set instances with intent class y_l.
Similarly, let S_a = {x_j^i | t_j^i = a} be the support set tokens x_j^i whose slot-label is a. Using this notation, we calculate slot-label and intent class prototypes as the mean embedding over the relevant support instances:

c_l = (1 / |S_l|) Σ_{(x, t, y) ∈ S_l} f_φ(x)        c_a = (1 / |S_a|) Σ_{x_j ∈ S_a} f_φ(x)_j

where f_φ(x)_j denotes the encoder hidden state at token position j.

Figure 2: Three model architectures, each consisting of an embedding layer, comprised of either GloVe word embeddings (GloVe), GloVe word embeddings concatenated with ELMo embeddings (ELMo), or BERT embeddings (BERT), that feed into a bi-directional LSTM, which is followed by fully connected intent and slot output layers.
Given an example (x*, t*, y*) ∈ Q, we compute the conditional probability p(y = l | x*, S) that the utterance x* has intent class l as a softmax over the negative squared Euclidean distances between f_φ(x*) and the intent prototypes:

p(y = l | x*, S) = exp(−||f_φ(x*) − c_l||²) / Σ_{l'} exp(−||f_φ(x*) − c_{l'}||²)

Similarly, we compute the conditional probability p(t_j = a | x*, S) that the j-th token of x* has slot-label a as a softmax over the negative squared Euclidean distances between the token embedding f_φ(x*)_j and the slot-label prototypes. We define the joint IC and SF prototypical loss function L_proto as the sum of the IC and SF negative log-likelihoods averaged over the query set instances given the support set:

L_proto = −(1 / |Q|) Σ_{(x*, t*, y*) ∈ Q} [ log p(y = y* | x*, S) + Σ_j log p(t_j = t_j* | x*, S) ]
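As an illustration of the intent-classification half of this computation, the NumPy sketch below computes class prototypes and the softmax over negative squared Euclidean distances. The 2-D "embeddings" stand in for encoder outputs; this is not the authors' implementation:

```python
import numpy as np

def prototypes(embeddings, labels):
    """Mean support embedding per class: c_l = (1/|S_l|) * sum of f(x)."""
    classes = sorted(set(labels))
    return classes, np.stack([
        embeddings[np.array(labels) == c].mean(axis=0) for c in classes
    ])

def intent_probs(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    d2 = ((protos - query_emb) ** 2).sum(axis=1)
    logits = -d2
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

# Toy support set with two intent classes.
emb = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
labels = ["BookFlight", "BookFlight", "PlayMusic", "PlayMusic"]
classes, protos = prototypes(emb, labels)
p = intent_probs(np.array([0.1, 0.1]), protos)  # query near BookFlight
```

Slot-label prototypes follow the same recipe, with one embedding per token rather than per utterance.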

Model Agnostic Meta Learning (MAML)
MAML optimizes the parameters φ of the encoder f_φ such that, when φ is fine-tuned on the support set S for d steps, φ' ← Finetune(φ, d | S), the fine-tuned model f_φ' generalizes well to new class instances in the query set Q. This is achieved by updating φ to minimize the loss L(f_φ', Q) of the fine-tuned model on the query set Q. The update takes the form

φ ← φ − η ∇_φ L(f_φ', Q)

where L is the sum of the IC and SF softmax cross entropy loss functions and η is a learning rate. Concretely, given a support and query set (S, Q), MAML performs the following two-step optimization procedure:
1. φ' ← Finetune(φ, d | S)
2. φ ← φ − η ∇_φ L(f_φ', Q)
Although the initial formulation of MAML, which we outline here, utilizes stochastic gradient descent (SGD) to update the initial parameters φ, in practice an alternate gradient-based update rule can be used in place of SGD; empirically, we find it beneficial to use Adam. A drawback of MAML is that computing the "meta-gradient" ∇_φ L(f_φ', Q) requires calculating a second derivative, since the gradient must backpropagate through the sequence of updates made by Finetune(φ, d | S). Fortunately, in the same work in which Finn et al. (2017) introduce MAML, they propose a first order approximation, foMAML, which ignores these second derivative terms and performs nearly as well as the original method. We utilize foMAML in our experiments to avoid the memory issues associated with MAML.
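The two-step procedure can be sketched on a toy problem. Below, foMAML is applied to a single scalar parameter with quadratic support and query losses; the first-order approximation amounts to evaluating the query-loss gradient at the fine-tuned parameter φ' and applying it directly to the initial parameter φ. This is an illustration of the update rule, not the paper's model:

```python
def grad(loss_center, phi):
    # Gradient of the quadratic loss (phi - loss_center)^2.
    return 2.0 * (phi - loss_center)

def finetune(phi, d, support_center, inner_lr=0.1):
    # Inner loop: d SGD steps on the support loss.
    for _ in range(d):
        phi = phi - inner_lr * grad(support_center, phi)
    return phi

def fomaml_step(phi, support_center, query_center, d=5, outer_lr=0.1):
    phi_prime = finetune(phi, d, support_center)
    # First-order MAML: use the query gradient at phi_prime directly,
    # ignoring second-derivative terms from the inner loop.
    return phi - outer_lr * grad(query_center, phi_prime)

phi = 0.0
for _ in range(200):
    # A task whose support optimum (0.9) and query optimum (1.1) differ.
    phi = fomaml_step(phi, support_center=0.9, query_center=1.1)
```

The learned initialization is not itself a minimizer of either loss; rather, it is the point from which d steps of fine-tuning on the support loss land near the query optimum, which is exactly the objective MAML encodes.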

Few-shot IC/SF Benchmark
As there is no existing benchmark for few-shot IC/SF, we propose few-shot splits for the Air Travel Information System (ATIS, Hemphill et al. (1990)), SNIPS (Coucke et al., 2018), and Task Oriented Parsing (TOP) datasets. A few-shot IC/SF benchmark is beneficial for two reasons. Firstly, the benchmark evaluates generalization across multiple domains. Secondly, researchers can combine these datasets in the future to experiment with larger settings of n-way during training and evaluation.

Datasets
ATIS is a well-known dataset for dialog system research, which comprises conversations from the airline domain. SNIPS, on the other hand, is a public benchmark dataset developed by the Snips corporation to evaluate the quality of IC and SF services. The SNIPS dataset comprises multiple domains including music, media, and weather. TOP, which pertains to navigation and event search, is unique in that 35% of the utterances contain multiple, nested intent labels. These hierarchical intents require the use of specialized models. Therefore, we utilize only the remaining, non-hierarchical 65% of utterances in TOP. To put the size and diversity of these datasets in context, we provide utterance, intent, slot-label, and slot value counts for each dataset in table 1.

Few-shot Splits
We target train, development, and test split sizes of 70%, 15%, and 15%, respectively. However, the ICs in these datasets are highly imbalanced, which prevents us from hitting these targets exactly. We therefore manually select the ICs to include in each split. For the SNIPS dataset, we choose not to form a development split because there are only 7 ICs in the SNIPS dataset, and we require a minimum of 3 ICs per split. During preprocessing, we modify slot label names by adding the associated IC as a prefix to each slot. This preprocessing step ensures that the slot labels are no longer pure named entities, but specific semantic roles in the context of particular intents. In table 1, we provide statistics on the few-shot splits for each dataset.
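A sketch of the prefixing step; the label spellings and separator are illustrative:

```python
def prefix_slot_labels(intent, tags, sep="."):
    """Prefix each non-Other slot label with its utterance's intent, so that,
    e.g., `artist` under AddToPlaylist and `artist` under PlayMusic become
    distinct labels tied to the semantics of their intent."""
    return [t if t == "Other" else f"{intent}{sep}{t}" for t in tags]

tags = prefix_slot_labels("AddToPlaylist", ["Other", "artist", "playlist"])
```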

Episode Construction
For train and test episodes, we sample both the number of classes in each episode, the "way" n, and the number of examples to include for each sampled class l, the class "shot" k_l, using the procedure put forward by Triantafillou et al. (2019). By sampling the shot and way, we allow for unbalanced support sets and a variable number of classes per episode. These allowances accommodate the large degree of class imbalance present in our benchmark, which would make it difficult to apply a fixed shot and way for all intents.
To construct an episode given a few-shot class split L_split, we first sample the way n uniformly from the range [3, |L_split|]. We then sample n intent classes uniformly at random from L_split to form L. Next, we sample the query shot k_q for the episode as:

k_q = min(10, min_{l∈L} ⌊0.5 · |X_l|⌋)

where X_l is the set of examples with class label l. Given the query shot k_q, we compute the target support set size for the episode as:

|S| = min(K_max, ⌈β · Σ_{l∈L} (|X_l| − k_q)⌉)

where β is sampled uniformly from the range (0, 1] and K_max is the maximum episode size. Lastly, we sample the support shot k_l for each class as:

k_l = min(⌊R_l · (|S| − n)⌋ + 1, |X_l| − k_q)

where R_l is a noisy estimate of the normalized proportion of the dataset made up by class l, which we compute as follows:

R_l = exp(α_l) · |X_l| / Σ_{l'∈L} exp(α_{l'}) · |X_{l'}|

The noise in our estimate of the proportion R_l is introduced by sampling the value of α_l uniformly from the interval [log(0.5), log(2)).
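The shot/way sampling of Triantafillou et al. (2019) can be sketched as follows. The exact caps and clamping below are illustrative assumptions (the original procedure has additional details), but the structure — sampled way, rarest-class-capped query shot, β-scaled support budget, and noisy proportional per-class shots — follows the description above:

```python
import math
import random

def sample_episode_shots(class_sizes, k_max, seed=0):
    """Sample the way n, query shot k_q, and per-class support shots k_l
    for one episode, allowing class-imbalanced support sets."""
    rng = random.Random(seed)
    labels = list(class_sizes)
    n = rng.randint(3, len(labels))               # the "way"
    episode_classes = rng.sample(labels, n)
    # Query shot: shared across classes, capped by the rarest sampled class.
    k_q = min(10, min(class_sizes[l] // 2 for l in episode_classes))
    # Target support set size, scaled by beta ~ U(0, 1] and capped at k_max;
    # clamped to at least n so every class gets one support example.
    beta = rng.uniform(1e-9, 1.0)
    target = max(n, min(k_max, math.ceil(
        beta * sum(class_sizes[l] - k_q for l in episode_classes))))
    # Noisy class proportions R_l, with alpha_l ~ U[log 0.5, log 2).
    noisy = {l: math.exp(rng.uniform(math.log(0.5), math.log(2.0)))
             * class_sizes[l] for l in episode_classes}
    total = sum(noisy.values())
    shots = {}
    for l in episode_classes:
        r_l = noisy[l] / total
        shots[l] = min(int(r_l * (target - n)) + 1, class_sizes[l] - k_q)
    return shots, k_q

sizes = {"a": 40, "b": 25, "c": 60, "d": 8}
shots, k_q = sample_episode_shots(sizes, k_max=20)
```

Larger classes receive proportionally larger support shots on average, while every sampled class keeps at least one support example and retains k_q examples for the query set.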

Episode Sizes
We present IC/SF results for two settings of the maximum episode size, K_max = 20 and K_max = 100, in tables 2/4 and 3/5, respectively. When the maximum episode size K_max = 20, the average support set shot k_l is 3.58 for ATIS, 3.78 for TOP, and 5.22 for SNIPS. In contrast, setting the maximum episode size to K_max = 100 increases the average support set shot k_l to 9.15 for ATIS, 9.81 for TOP, and 10.83 for SNIPS.

Training Settings
In our experiments, we consider two training settings: one in which we train on episodes (or batches, in the case of our baseline) from a single dataset, and a joint training setting that randomly selects the dataset from which to sample each episode or batch. After sampling an episode, we remove its contents from a buffer of available examples. Once there are no longer enough examples left in the buffer to create an episode, we refresh the buffer to contain all examples.
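A sketch of this episode buffer (illustrative; in the joint setting one such buffer would be kept per dataset):

```python
import random

class EpisodeBuffer:
    """Draw episode contents without replacement; refresh once the buffer
    can no longer fill an episode of the requested size."""
    def __init__(self, examples, seed=0):
        self.examples = list(examples)
        self.rng = random.Random(seed)
        self.buffer = list(self.examples)

    def sample_episode(self, episode_size):
        if len(self.buffer) < episode_size:
            # Not enough examples left: refresh to contain all examples.
            self.buffer = list(self.examples)
        episode = self.rng.sample(self.buffer, episode_size)
        for ex in episode:
            self.buffer.remove(ex)
        return episode

buf = EpisodeBuffer(range(10))
first = buf.sample_episode(6)   # leaves 4 examples in the buffer
second = buf.sample_episode(6)  # triggers a refresh before sampling
```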

Network Architecture
We evaluate the network architectures depicted in Figure 2. These networks consist of an embedding layer, a sequence encoder, and two output layers for slots and intents, respectively. We greedily predict the slot-label for each token in the input sequence, according to the maximum output logit at each position. We plan to explore alternate search algorithms (e.g., beam search) in future work.
Each architecture uses a different type of pre-trained embedding layer, either non-contextual or contextual. We experiment with one non-contextual embedding, GLOVE word vectors (Pennington et al., 2014), as well as two contextual embeddings: GLOVE concatenated with ELMO embeddings (Peters et al., 2018), and BERT embeddings (Devlin et al., 2018). The sequence encoder is a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) with a 512-dimensional hidden state. The output layers are fully connected and take concatenated forward and backward LSTM hidden states as input. Pre-trained embeddings are kept frozen during training and adaptation; attempts to fine-tune BERT led to inferior results. We refer to each architecture by its embedding type, namely GLOVE, ELMO, or BERT.

Baseline
We compare the performance of our approach against a FINE-TUNE baseline, which implements the domain adaptation framework commonly applied to low resource IC/SF (Goyal et al., 2018). We pre-train the FINE-TUNE baseline, either jointly or individually, on the classes in our training split(s). Then at evaluation time, we freeze the pre-trained encoder and "fine-tune" new output layers for the slots and intents included in the support set. This fine-tuned model is then used to predict the intent and slots for each held out example in the query set.
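The adaptation step of this baseline can be sketched as follows: with the encoder frozen, the support-set embeddings are fixed features, and only a fresh softmax output layer is trained on them. The NumPy sketch below uses toy 2-D "encoder outputs" and is a hypothetical illustration, not the authors' code:

```python
import numpy as np

def train_new_head(features, labels, n_classes, steps=100, lr=0.5):
    """Train a fresh softmax output layer on frozen support features."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(features.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)
        # Gradient of mean cross-entropy w.r.t. the head parameters only;
        # no gradient flows into the (frozen) encoder features.
        g = (probs - onehot) / len(labels)
        W -= lr * features.T @ g
        b -= lr * g.sum(axis=0)
    return W, b

# Frozen "encoder" outputs for a 3-shot, 2-way support set.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.1, -0.1],
                  [0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
W, b = train_new_head(feats, labels, n_classes=2)
preds = np.argmax(feats @ W + b, axis=1)
```

In the actual baseline, two such heads are trained per episode, one over the support set's intents and one over its slot labels, before predicting on the query set.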

Hyper-parameters
We train all models using the Adam optimizer (Kingma and Ba, 2014). We use the default learning rate of 0.001 for the baseline and prototypical networks. For foMAML, we set the outer learning rate to 0.0029 and fine-tune for d = 8 steps with an inner learning rate of 0.01. We pre-train the FINE-TUNE baseline with a batch size of 512. At test time, we fine-tune the baseline for 10 steps on the support set. We train the models without contextual embeddings (GloVe alone) for 50 epochs and those with contextual ELMo or BERT embeddings for 30 epochs, as the latter exhibit faster convergence.

Evaluation Metrics
To assess the performance of our models, we report the average IC accuracy and slot F1 score over 100 episodes sampled from the test split of an individual dataset. We use the AllenNLP (Gardner et al., 2017) CategoricalAccuracy implementation to compute IC accuracy, and the seqeval library's span-based F1 score implementation to compute slot F1. The span-based F1 score is a relatively harsh metric in the sense that a slot label prediction is only considered correct if the slot label and span exactly match the ground truth annotation.
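To illustrate what "span-based" means, the pure-Python sketch below scores BIO-tagged predictions the way such a metric does: a predicted span counts as correct only if its label and its exact token boundaries both match the gold annotation. This is our own simplified illustration; the experiments use seqeval, whose handling of malformed BIO sequences differs:

```python
def bio_spans(tags):
    """Extract (label, start, end) spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (
                tag.startswith("I-") and tag[2:] != label):
            if label is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def span_f1(gold, pred):
    """F1 over exact-match (label, start, end) spans."""
    g, p = set(bio_spans(gold)), set(bio_spans(pred))
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ["O", "B-artist", "I-artist", "B-playlist", "O"]
pred = ["O", "B-artist", "I-artist", "B-playlist", "I-playlist"]
# The playlist span's end boundary is wrong, so only the artist span counts.
```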

Few-shot Learning Algorithms
Prototypical networks Considering both IC and SF tasks, prototypical networks is the best performing algorithm. The most successful variant of prototypical networks, Proto ELMO + joint training, obtains absolute improvements over the FINE-TUNE baseline.

Contextual Pretrained Embeddings

A priori, it is reasonable to suspect that the performance gain obtained by our few-shot learning algorithms could be dwarfed by the benefit of using a large, pre-trained model like ELMO or BERT. However, our experimental results suggest that the use of pre-trained language models is complementary to our approach in most cases. For example, ELMO increases the slot F1 score of foMAML from 14.07 to 33.81 and boosts the slot F1 of prototypical networks from 31.57 to 62.71 on the SNIPS dataset for K_max = 100. Similarly, when K_max = 20, BERT improves foMAML and prototypical networks TOP IC accuracy from 33.75% to 38.50% and from 43.20% to 52.76%, respectively. In aggregate, we find that ELMO outperforms BERT. We quantify this via the average absolute improvement ELMO obtains over BERT when both models use the winning algorithm for a given dataset and training setting. On average, ELMO improves IC accuracy over BERT by 2% for K_max = 20 and 1% for K_max = 100. With respect to slot F1 score, ELMO produces an average gain over BERT of 5 F1 points for K_max = 20 and 3 F1 points for K_max = 100. This is consistent with previous findings of Peters et al. (2019) that ELMO can outperform BERT on certain tasks when the models are kept frozen and not fine-tuned.

Joint Training
Few-shot learning algorithms are, in essence, learning to learn new classes. Therefore, these algorithms should be adept at leveraging a diverse training dataset to improve generalization. We test this hypothesis by jointly training each approach on all three datasets. Our results demonstrate that joint training has little effect on IC accuracy; however, it improves the SF performance of prototypical networks, particularly on ATIS and TOP. Joint training increases prototypical networks' average slot F1 score, computed over datasets and model variants, by 4.41 points, from 31.77 to 36.18, for K_max = 20 and by 5.20 points, from 32.99 to 38.19, when K_max = 100. In comparison, FINE-TUNE obtains much smaller average absolute improvements: 0.55 F1 points and 1.29 F1 points for K_max = 20 and K_max = 100, respectively.

Conclusion
We show that few-shot learning techniques can substantially improve IC/SF performance in ultra low resource scenarios. Specifically, our extension of prototypical networks for joint IC and SF consistently outperforms a fine-tuning baseline with respect to both IC accuracy and slot F1 score. Moreover, we establish a benchmark for few-shot IC/SF to support future work on this important topic. Our contribution is a step toward the creation of more sample efficient IC/SF models, yet there is still considerable work to be done in pursuit of this goal.
In particular, we encourage the creation of larger few-shot IC/SF benchmarks to test how few-shot learning algorithms scale with larger episode sizes.