Few-shot Intent Classification and Slot Filling with Retrieved Examples

Few-shot learning arises in important practical scenarios, such as when a natural language understanding system needs to learn new semantic labels for an emerging, resource-scarce domain. In this paper, we explore retrieval-based methods for intent classification and slot filling tasks in few-shot settings. Retrieval-based methods make predictions based on labeled examples in the retrieval index that are similar to the input, and thus can adapt to new domains simply by changing the index without having to retrain the model. However, it is non-trivial to apply such methods on tasks with a complex label space like slot filling. To this end, we propose a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective. At inference time, we use the labels of the retrieved spans to construct the final structure with the highest aggregated score. Our method outperforms previous systems in various few-shot settings on the CLINC and SNIPS benchmarks.


Introduction
Few-shot learning is a crucial problem for practical language understanding applications. In the few-shot setting, the model (typically trained on source domains with abundant data) needs to adapt to a set of unseen labels in the target domain with only a few examples. For instance, when developers introduce a new product feature, a query understanding model has to learn new semantic labels from a small dataset they manage to collect.
Few-shot learning is challenging due to the imbalance in the amount of data between the source and target domains. Traditional classification methods, even with the recent advancement of pretrained language models (Peters et al., 2018; Devlin et al., 2019), can suffer from over-fitting (Snell et al., 2017; Triantafillou et al., 2019) or catastrophic forgetting (Wu et al., 2019) when incorporating the data-scarce target domain. On the other hand, metric learning methods (Weinberger et al., 2006; Vinyals et al., 2016; Snell et al., 2017) have been shown to work well in few-shot scenarios. These methods are based on modeling similarity between inputs, effectively allowing the model to be decoupled from the semantics of the output space. For example, a model would learn that the utterance "I'd like to book a table at black horse tavern at 7 pm" (from Figure 1) is similar to "make me a reservation at 8", and that the two are thus likely to have similar semantic representations, even without knowing the semantic schema in use. Unlike learning output labels, which is difficult when examples are scarce, learning a similarity model can be done on the abundant source domain data, making such models data-efficient even in few-shot settings.
While there are many instantiations of metric learning methods (see Section 3), we focus on retrieval-based methods, which maintain an explicit retrieval index of labeled examples. The most basic setting of retrieval-based model for few-shot learning is: after training a similarity model and encoding target domain data into the index, we can retrieve examples most similar to the given input, and then make a prediction based on their labels. Compared to methods that do not maintain an index, such as Prototypical Networks (Snell et al., 2017), retrieval-based methods are less sensitive to outliers with few data points, and are powerful when we have abundant data in the source domain (Triantafillou et al., 2019).
However, applying retrieval-based models on tasks with a structured output space is non-trivial. For example, even if we know that the utterance in Figure 1 is similar to "make me a reservation at 8", we cannot directly use its slot values (e.g., the time slot has value "8", which is not in the input), and not all slots in the input (e.g., "black horse tavern") have counterparts in the retrieved utterance. While previous work has exploited token-level similarity methods in a BIO-tagging framework, it had to separately simulate the label transition probabilities, which might still suffer from domain shift in few-shot settings (Wiseman and Stratos, 2019; Hou et al., 2020).

Figure 1: Illustration of span-level retrieval for slot filling. For each span (including spans that are not valid slots, such as "book a table") in the input utterance, we retrieve its most similar span from the retrieval index, and then assign its slot name as the prediction, with a similarity score. We use modified beam search to decode an output that maximizes the average similarity score. The gold slots are "black horse tavern" and "7 pm" in this example.
In this work, we propose Retriever, a retrieval-based framework that tackles both classification and span-level prediction tasks. The core idea is to match token spans in an input to the most similar labeled spans in a retrieval index. For example, for the span "7 pm" in the input utterance, the model retrieves "8" as a similar span (given their surrounding contexts), thus predicting that "7 pm" has the same slot name time as "8". During training, we fine-tune a two-tower model with BERT (Devlin et al., 2019) encoders, along with a novel batch softmax objective, to encourage high similarity between contextualized span representations sharing the same label. At inference time, we retrieve the most similar span from the few-shot examples for every potential input span, and then decode a structured output that has the highest average span similarity score.
We show that our proposed method is effective on both few-shot intent classification and slot-filling tasks, when evaluated on the CLINC (Larson et al., 2019) and SNIPS (Coucke et al., 2018) datasets, respectively. Experimental results show that Retriever achieves high accuracy on few-shot target domains without retraining on the target data. For example, it outperforms the strongest baseline by 4.45% on SNIPS for the slot-filling task.

Benefits of Retriever
In addition to being more robust against overfitting and catastrophic forgetting, which are essential concerns in few-shot learning settings, our proposed method has multiple advantages over strong baselines. For instance, if the label schema is changed or some prediction bugs need to be fixed, minimal retraining is required. More importantly, compared to classification models or Prototypical Networks, which require adding an arbitrary number of instances to the training data and hoping that the model will predict as expected (Liang et al., 2020), Retriever can guarantee the prediction when a similar query is encountered. At the same time, Retriever is more interpretable, as the retrieved examples can serve as explanations. In addition, unlike approaches built on the simplifying assumption that an utterance may only have one intent (Bunt, 2009), Retriever can be used to predict multiple labels. Lastly, because Retriever does not need to model transition probabilities, the decoding procedure can be parallelized and potentially made non-autoregressive for speedup. We can also tune the threshold (explained in Section 5.2) to trade off precision and recall according to use-case requirements.

Related Work
Few-shot metric learning Metric learning methods aim to learn representations through distance functions. Koch et al. (2015) proposed Siamese Networks, which differentiate input examples with contrastive and triplet loss functions (Schroff et al., 2015) on positive and negative pairs. While they are more data-efficient on new classes than linear classifiers, Siamese Networks are hard to train due to weak pairs sampled from the training batch (Gillick et al., 2019). In comparison, Prototypical Networks (Snell et al., 2017) compute class representations by averaging the embeddings of support examples for each class. These methods have mostly been explored in computer vision and text classification (Geng et al., 2019; Yu et al., 2018), and consistently outperform Siamese Networks and retrieval-based methods such as k-nearest-neighbors, especially when there are more classes and fewer annotated examples (Triantafillou et al., 2019). However, newly added examples that are outliers may change the prototypical representations dramatically, which can harm all predictions for the class. In addition, these methods do not perform well when more annotated data is available per class (Triantafillou et al., 2019).
Recently, Wang et al. (2019) showed that a simple nearest neighbor model with feature transformations can achieve competitive results with the state-of-the-art methods on image classification. Inspired by their work, we train our retrieval-based model with a novel batch softmax objective.

Metric learning in language understanding
Utilizing relevant examples to boost model performance has been applied to language modeling (Khandelwal et al., 2020), question answering (Guu et al., 2020; Lewis et al., 2020), machine translation (Zhang et al., 2018), and text generation (Peng et al., 2019). Recently, metric learning has been applied to intent classification (Krone et al., 2020). Ren and Xue (2020) trained Siamese Networks before learning a linear layer for intent classification and showed competitive results with traditional methods in the full-data setting. Similar ideas have also been extended to sequence labeling tasks such as named entity recognition (NER; Wiseman and Stratos, 2019; Fritzler et al., 2019) by maximizing the similarity scores between contextualized token representations sharing the same label. Krone et al. (2020) utilized Prototypical Networks to learn intent and slot name prototype representations and classified each token to its closest prototype. They showed better results than meta-learning, another prevalent few-shot learning method (Finn et al., 2017; Mishra et al., 2018). In order to consider label dependencies, which are essential in slot tagging tasks (Huang et al., 2015), Hou et al. (2020) proposed a collapsed dependency transfer (CDT) mechanism that simulates transition scores for the target domain from transition probabilities among BIO labels in the source domain, outperforming previous methods on slot filling by a large margin. Yang and Katiyar (2020) further explored the transition probability by evenly distributing the collapsed transition scores to the target domain to maintain a valid distribution. However, this simulation is noisy, and the difference between the source and target domains can result in biased transition probabilities.
The most similar approach to ours is a concurrent work by Ziyadi et al. (2020), which learns span boundaries and sentence similarities before retrieving the most similar span, inspired by question-answering models. Even though this approach predicts spans before retrieving at the span level, and thus bypasses the transition-probability problem of previous research, it achieves unsatisfactory results. Unlike these studies, we propose to learn span representations using a batch softmax objective without having to explicitly learn span boundaries. Our method achieves more accurate slot and intent prediction than previous methods in the few-shot setting.

Setup
We consider two tasks where the input is an utterance x with tokens x_1, ..., x_n and the output is some structure y. For the slot filling task, the output y is a set of non-overlapping labeled spans (r_i, ℓ_i), where r_i is a span of x (e.g., "7 pm") and ℓ_i is a slot name (e.g., time). For the intent classification task, the output y is simply an intent label ℓ for the whole utterance x. For notational consistency, we view intent classification as predicting a labeled span (r, ℓ) where r = x_1:n.
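The labeled-span view of both tasks can be sketched as follows (a minimal illustration; the LabeledSpan type and the example labels are our own, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledSpan:
    """A labeled span (r, l): token offsets into the utterance plus a label."""
    start: int   # inclusive token index
    end: int     # exclusive token index
    label: str

tokens = ["book", "a", "table", "at", "7", "pm"]

# Slot filling: y is a set of non-overlapping labeled spans.
slots = {LabeledSpan(4, 6, "time")}

# Intent classification as a special case: one labeled span covering x_1:n.
intent = LabeledSpan(0, len(tokens), "book_restaurant")
assert " ".join(tokens[4:6]) == "7 pm"
```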
In the few-shot setup, examples (x, y) are divided into source and target domains. Examples in the target domain may contain some labels that are unseen in the source domain. The model will be given ample training data from the source domain, but only a few training examples from the target domain. For instance, the model receives only K = 5 examples for each unseen label. The model can be evaluated on test data from both domains.

Model
We propose a retrieval-based model, Retriever, for intent classification and slot filling in the few-shot setting. Figure 1 illustrates our approach. At a high level, from examples (x, y) in the target training data (and optionally the source training data), we construct a retrieval index consisting of labeled spans (r, ℓ) from y. Given a test utterance x, for each span of interest in x (all spans x_i:j for slot filling; only x_1:n for intent classification), we retrieve the most similar labeled spans (r, ℓ) from the index, and then use them to decode an output y that maximizes the average span similarity score.
The use of retrieval provides several benefits. For instance, we empirically show in Section 7.1 that the model does not suffer from catastrophic forgetting because both source and target data are present in the retrieval index. Class imbalance can also be directly mitigated in the retrieval index. Additionally, since the trained model is nonparametric, we could replace the retrieval index to handle different target domains without having to retrain the model. This also means that the model does not need access to target data during training, unlike traditional classification methods.

Retriever
The retriever is the only trainable component in our model. Given a query span r = x_i:j from the input x, the retriever returns a set of labeled spans (r′, ℓ′) with the highest similarity scores s(z, z′), where z = E(r) and z′ = E(r′) are the contextualized embedding vectors of r and r′, respectively.
Similarity score To compute the contextualized embeddings z and z′ of spans r and r′, we first apply a Transformer model initialized with pretrained BERT on the utterances from which r and r′ come. For slot filling, we follow Toshniwal et al. (2020) and define the span embedding as the concatenated embeddings of its first and last wordpieces. For intent classification, we use the embedding of the [CLS] token. We then define s(z, z′) as the dot product between z and z′.
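As a sketch (assuming precomputed contextual token embeddings, here stubbed with random vectors; the function names are illustrative), the span representation and similarity look like:

```python
import numpy as np

def span_embedding(token_embs: np.ndarray, i: int, j: int) -> np.ndarray:
    """Span embedding for x_i:j: concatenation of the contextual
    embeddings of the span's first and last wordpieces."""
    return np.concatenate([token_embs[i], token_embs[j]])

def similarity(z: np.ndarray, z_prime: np.ndarray) -> float:
    """Similarity s(z, z') is a plain dot product."""
    return float(np.dot(z, z_prime))

# Toy check: 5 tokens with 4-dim "contextual" embeddings.
rng = np.random.default_rng(0)
token_embs = rng.normal(size=(5, 4))
z = span_embedding(token_embs, 1, 3)   # span x_1:3
assert z.shape == (8,)                 # 2 x hidden size
```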
Training with batch softmax We use examples from the source domain to train Retriever. Let ℓ_1, ..., ℓ_N be the N class labels (slot or intent labels) in the source domain. To construct a training batch, for each class label ℓ_i, we sample B spans r_i^1, ..., r_i^B from the training data with that label, and compute their embeddings z_i^1, ..., z_i^B.
Then, for each query span r_i^j, we compute similarity scores against all other spans in the batch to form a B × N similarity matrix:

S_i^j[b, k] = s(z_i^j, z_k^b)    (1)

We now summarize the score between r_i^j and each label ℓ_k by applying a reduction function (defined shortly) along each column to get a 1 × N vector:

Ŝ_i^j[k] = reduce_b S_i^j[b, k]    (2)

We use the softmax of Ŝ_i^j as the model's probability distribution over the label of r_i^j. The model is then trained to optimize the cross-entropy loss of this distribution against the gold label ℓ_i.
We experiment with three reduction functions, mean (Eq. 3), max (Eq. 4), and min-max (Eq. 5):

Ŝ_i^j[k] = (1/B) Σ_b S_i^j[b, k]    (3)

Ŝ_i^j[k] = max_{b : (k, b) ≠ (i, j)} S_i^j[b, k]    (4)

Ŝ_i^j[k] = min_b S_i^j[b, k] if k = i; max_{b : (k, b) ≠ (i, j)} S_i^j[b, k] otherwise    (5)

The mean reduction averages embeddings of the spans with the same label and is equivalent to Prototypical Networks. Similar to hard negative sampling, which increases margins among classes (Schroff et al., 2015; Roth et al., 2020), max takes the most similar span to the query (excluding the query itself) as the label representation, while min-max takes the least similar span when considering spans with the same label as the query.
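A minimal numpy sketch of the batch-softmax objective with the three reductions (the paper trains with BERT encoders; the masking details for excluding the query itself are our reading of the text):

```python
import numpy as np

def reduce_scores(S: np.ndarray, query_pos: tuple, mode: str) -> np.ndarray:
    """Reduce a B x N similarity matrix S (rows: in-batch examples b,
    columns: labels k) to an N-vector of per-label scores.
    query_pos = (b, k) locates the query span itself, which is
    excluded from the max-style reductions."""
    B, N = S.shape
    qb, qk = query_pos
    if mode == "mean":                    # Eq. 3, Prototypical-style
        return S.mean(axis=0)
    masked = S.copy()
    masked[qb, qk] = -np.inf              # never match the query to itself
    if mode == "max":                     # Eq. 4
        return masked.max(axis=0)
    if mode == "min-max":                 # Eq. 5: min over the query's own
        out = masked.max(axis=0)          # label column, max elsewhere
        same_label = np.delete(S[:, qk], qb)
        out[qk] = same_label.min()
        return out
    raise ValueError(mode)

def batch_softmax_loss(S: np.ndarray, query_pos: tuple, gold: int,
                       mode: str = "max") -> float:
    """Cross-entropy of softmax(reduced scores) against the gold label."""
    s = reduce_scores(S, query_pos, gold and gold or gold, mode) if False else \
        reduce_scores(S, query_pos, mode)
    s = s - s.max()                       # numerical stability
    return float(np.log(np.exp(s).sum()) - s[gold])
```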

Inference
After training, we build a dense retrieval index where each entry (r, ) is indexed by z = E(r). The entries (r, ) come from examples (x, y) in the support set which, depending on the setting, could be just the target training data or a mixture of source and target data. For each query span r of the input utterance x, we embed the span and compute the similarity scores against all index entries.
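A minimal dense-index sketch (exact brute-force nearest neighbor under dot-product similarity; the paper does not specify a search backend, and a production system would likely use an approximate nearest-neighbor library instead):

```python
import numpy as np

class SpanIndex:
    """Retrieval index of labeled span embeddings; query() returns the
    entry with the highest dot-product similarity to the query span."""

    def __init__(self):
        self._embs = []
        self._labels = []

    def add(self, z, label: str):
        self._embs.append(np.asarray(z, dtype=float))
        self._labels.append(label)

    def query(self, z):
        scores = np.stack(self._embs) @ np.asarray(z, dtype=float)
        best = int(scores.argmax())
        return self._labels[best], float(scores[best])

# Swapping the index contents adapts the model to a new domain
# without retraining the encoder.
index = SpanIndex()
index.add([1.0, 0.0], "time")
index.add([0.0, 1.0], "restaurant_name")
label, score = index.query([0.9, 0.1])
assert label == "time"
```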
Intent classification For intent classification, both index entries and query spans are restricted to whole utterances. The entire process thus boils down to retrieving the most similar utterance based on the [CLS] token embedding. We simply output the intent label of the retrieved utterance.
Slot filling In contrast to BIO decoding for token-level similarity models (Hou et al., 2020), decoding with span retrieval results poses unique challenges, as gold span boundaries are not known a priori. Hence, we use a modified beam search procedure with simple heuristics to compose the spans.
Specifically, for each of the n × m spans in an utterance of length n (where the hyperparameter m is the maximum span length), we retrieve the most similar span from the retrieval index. We then normalize the similarity scores by their L2-norm so that they lie in the range [0, 1]. Since we do not explicitly predict span boundaries, all n × m spans, including non-meaningful ones (e.g., "book a"), will have a retrieved span. Such non-meaningful spans should be dissimilar to any labeled span in the retrieval index, so we filter the spans with a score threshold to get a smaller set of candidate spans. In addition, we adjust the threshold dynamically (by reducing it a few times) if no span is above the current threshold.
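The candidate enumeration with the dynamic-threshold heuristic can be sketched as follows (the function names and the scoring stub are illustrative assumptions, not the paper's code):

```python
def candidate_spans(n, max_len, score_of, threshold, step=0.05, tries=10):
    """Enumerate all spans up to max_len tokens, keep those whose
    normalized retrieval score clears the threshold; if none do,
    lower the threshold a few times (the dynamic-threshold heuristic)."""
    spans = [(i, j) for i in range(n)
             for j in range(i + 1, min(i + max_len, n) + 1)]
    for _ in range(tries + 1):
        kept = [(i, j) for (i, j) in spans if score_of(i, j) >= threshold]
        if kept:
            return kept, threshold
        threshold -= step
    return [], threshold

# Toy scorer: only the span (0, 2) looks like a real slot.
score_of = lambda i, j: 0.8 if (i, j) == (0, 2) else 0.3

# No span clears 0.95, so the threshold is lowered until (0, 2) survives.
kept, final_t = candidate_spans(4, 3, score_of, threshold=0.95)
assert kept == [(0, 2)]
```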
Once we get candidate spans with similarity scores, we use beam search to decode a set of spans with the maximum average score. We go through the list of candidate spans in descending order of their similarity scores. For each candidate span, we expand beam states if the span does not overlap with the existing spans in the beam. The search beams are pruned based on the average similarity score of the spans included so far. Lastly, we add spans from the filtered set that do not overlap with the final beam.
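The beam search over non-overlapping spans can be sketched as follows (a simplified version under our assumptions: spans are half-open token intervals, and the final step of adding leftover non-overlapping spans is omitted):

```python
def avg_score(beam):
    return sum(s for _, _, s in beam) / len(beam) if beam else 0.0

def overlaps(a, b):
    # spans are half-open token intervals (start, end)
    return not (a[1] <= b[0] or b[1] <= a[0])

def beam_decode(candidates, beam_size=4):
    """Process candidate spans in descending score order; extend beam
    states with spans that do not overlap spans already in the state;
    keep the beam_size states with the highest average span score."""
    candidates = sorted(candidates, key=lambda c: -c[2])
    beams = [[]]
    for span, label, score in candidates:
        extended = [state + [(span, label, score)] for state in beams
                    if all(not overlaps(span, s) for s, _, _ in state)]
        beams = sorted(beams + extended, key=avg_score, reverse=True)[:beam_size]
    return max(beams, key=avg_score)

# "black horse tavern" (0, 3) vs. its partial spans (0, 1) and (2, 3):
cands = [((0, 3), "restaurant_name", 0.9),
         ((0, 1), "restaurant_name", 0.7),
         ((2, 3), "restaurant_name", 0.6)]
best = beam_decode(cands)
assert [s for s, _, _ in best] == [(0, 3)]  # the whole span wins on average
```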
Beam search can avoid suboptimal decisions that a greedy algorithm would make. For instance, if we greedily process the example in Figure 1, "black" and "tavern" would become two independent spans, even though their average similarity score is lower than that of the correct span "black horse tavern". (We use beam search for simplicity; other search methods such as the Viterbi algorithm (Forney, 1973) could also be used.) Nevertheless, beam search is prone to mixing up span boundaries and occasionally predicts consecutive partial spans such as "black horse" and "tavern" as individual slots. Since consecutive spans with the same slot label are rare in slot filling datasets, we merge two consecutive spans r_i:j and r_j:k that share the same label into a single span r_i:k if their retrieval scores are within a certain range:

merge(r_i:j, r_j:k) = r_i:k if |s(z_i:j, z′) − s(z_j:k, z″)| < λ, and r_i:j, r_j:k otherwise,

where z′ and z″ are the embeddings of their respective retrieved spans, and λ is the merge threshold (since scores are normalized to [0, 1], λ = 1 means always merge and λ = 0 means never merge).
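The merge rule amounts to a small check over pairs of adjacent predicted spans (a sketch; the function name and the same_label flag are illustrative):

```python
def maybe_merge(span_a, span_b, score_a, score_b, same_label, lam=0.99):
    """Merge consecutive spans r_i:j and r_j:k that retrieved the same
    slot label into r_i:k when their retrieval scores are close.
    Scores are assumed normalized to [0, 1], so lam near 1 merges
    almost always and lam = 0 never merges."""
    (i, j), (j2, k) = span_a, span_b
    if same_label and j == j2 and abs(score_a - score_b) < lam:
        return (i, k)
    return None

# "black horse" (0, 2) and "tavern" (2, 3), both retrieved as
# restaurant_name with similar scores, are merged into (0, 3).
assert maybe_merge((0, 2), (2, 3), 0.81, 0.78, same_label=True) == (0, 3)
assert maybe_merge((0, 2), (2, 3), 0.81, 0.78, same_label=False) is None
```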

Experiments and Results
We evaluate our proposed approach on two datasets: CLINC (Larson et al., 2019) for intent classification and SNIPS (Coucke et al., 2018) for slot filling. Note that we use max (Eq. 4) as the reduction function for both tasks since it empirically yields the best results. The effect of reduction functions will be analyzed later in Section 7.1.

Intent Classification
The CLINC intent classification dataset (Larson et al., 2019) contains utterances from 10 intent categories (e.g., "travel"), each containing 15 intents (e.g., "flight_status", "book_flight"). To simulate the few-shot scenario where new domains and intents are introduced, we designate n_c categories and n_i intents per category as the source domain (with all 100 training examples per intent), and use the remaining 150 − n_c × n_i intents as the target domain. We experiment with (n_c, n_i) = (10, 10), (8, 10), and (5, 15). The target training data contains either 2 or 5 examples per target intent.
We compare our proposed method Retriever with a classification model, BERT fine-tune, and a Prototypical Network model, Proto. The former learns a linear classifier on top of BERT embeddings (Devlin et al., 2019), and the latter learns class representations based on Prototypical Networks. We also show results with the initial BERT checkpoint without training (Proto_frz, Retriever_frz). We use the same batch size for all models, and tune other hyperparameters on the development set before testing.

Evaluation We sample domains and intents three times for each (n_c, n_i) setting, and report average prediction accuracy. We report accuracy on intents from the target domain (tgt), the source domain (src), and the macro average across all intents (avg). In addition to applying the model to the target domain after pre-training on the source domain without re-training (Pre-train on src domain), we also evaluate the model performance with fine-tuning. We re-train the model with either target domain data only (Fine-tune on tgt domain) or a combination of source and target domain data (Fine-tune on tgt domain with src data).
Moreover, we evaluate the models with the following support set variations: with target domain data and all data in the source domain (sup-port_set=all), with equal number of examples (same as the few-shot number) per intent (sup-port_set=balance), and with only examples from the target domain (support_set=tgt). The last one serves as an upper-bound for the target domain accuracy.
Results Table 1 shows the results for (n_c, n_i) = (10, 10) and 5 examples per target intent; results in other settings exhibit the same patterns (see Appendix A.3). We observe that Retriever performs the best on the source domain (97.08%) before fine-tuning. Retriever also achieves the highest accuracy on the target domain (84.95%) after fine-tuning, while maintaining competitive performance on the source domain (95.41%) among all the methods.

Slot Filling

Following Hou et al. (2020), we train models on five source domains, use a sixth one for development, and test on the remaining domain. We directly use the K-shot split provided by Hou et al. (2020), where the support set consists of the minimum number of utterances such that at least K instances exist for each slot name. We set K = 5 in our experiments. Appendix A.2 contains further details about the setup.
We compare against two baselines and three models from previous work. BERT Tagging is a BERT-based BIO tagging model (Devlin et al., 2019) after training on the source domains, while SimilarToken_frz uses BERT embeddings to retrieve the most similar token based on cosine similarity without any training. MatchingToken and ProtoToken are two token-level methods that leverage Matching Networks (Vinyals et al., 2016) and Prototypical Networks (Snell et al., 2017), respectively. L-TapNet+CDT+proto (Hou et al., 2020) is an adaptation of TapNet (Yoon et al., 2019) with label semantics, CDT transition probabilities, and Prototypical Networks. We experiment with several variants of our proposed method. Proto trains Prototypical Networks to compute span class representations. Retriever retrieves the most similar slot example for each span. Both methods use the same decoding method. Similar to SimilarToken_frz, Proto_frz and Retriever_frz use the original BERT embeddings without any training. All models are trained on the source domains and early stopped based on performance on the development domains.

Table 2: Results on SNIPS test data with 5-shot support sets. Our span-based retrieval model outperforms previous classification-based and token-level retrieval models even without label semantics. Classification-based and token-level results are reported in Hou et al. (2020). *Pair-wise embeddings (marked with pw) are expensive at inference time, so we do not compare our method with these directly.
Evaluation We report F1 scores for each testing domain in a cross-validation episodic fashion. Following Hou et al. (2020), we evaluate each testing domain by sampling 100 different support sets and ten exclusive query utterances for each support set. We calculate F1 scores for each episode and report average F1 scores across 100 episodes.
Results Table 2 summarizes the experimental results on the SNIPS dataset. Our span-level method (Retriever) achieves higher average F1 than all five baselines, outperforming the strongest token-level method (L-TapNet+CDT+proto) by 4.45%. This shows that our model is effective at span-level predictions. More importantly, the better performance suggests that our span-level Retriever model captures span structures more effectively than simulated label dependencies, as our method does not suffer from the potential discrepancy in transition probabilities between the target and source domains.
Although Hou et al. (2020) showed that adding pairwise embeddings with cross-attention yielded much better performance, this method is expensive in both memory and computation at inference time, especially when the support set is large (Humeau et al., 2019). For a fair comparison, we do not directly compare with methods using pairwise embeddings (marked with pw in Table 2). Note that our method, with pre-computed support example embeddings, even outperforms L-Proto+CDT_pw with less memory and computation cost.

Analysis

Intent Classification
Models without re-training The pre-train on src domain section in Table 1 shows the results of models that are only pre-trained on the source domains but not fine-tuned on the target domains. Classification models such as BERT fine-tune cannot make predictions on target domains in this setting. In contrast, even without seeing any target domain examples during training, retrieval-based models can still make predictions on new domains by simply including new examples in the support sets. With support_set=all, Retriever achieves 97.08% on the source domain while Proto performs worse than BERT fine-tune, consistent with previous findings (Triantafillou et al., 2019). Retriever achieves the best accuracy (75.93%) on target domains with a balanced support set on all intents (support_set=balance). More importantly, Retriever also achieves competitive accuracy on source domains (95.44%), demonstrating that our proposed model achieves the best of both worlds even without re-training on new domains.
Varying the support set at inference time The construction of the support set is critical to retrieval-based methods. In Table 1, we present model performance under different support settings (all, balance, tgt). The support_set=tgt setting serves as an upper bound for the target domain accuracy for both Retriever and Proto. In general, Retriever achieves the best performance on source domain intents when we use full support sets (support_set=all). In comparison, if we use a balanced support set (support_set=balance), we can achieve much higher accuracy on the target domain at the cost of a slight degradation on source domain intent prediction. This is because full support sets contain more source domain examples, which increases confusion with target domain intents.
Data for fine-tuning The Fine-tune on tgt domain section in Table 1 shows different model behaviors when fine-tuned on the target domain data directly. While BERT fine-tune achieves high accuracy (78.89%) on the target domain, it suffers from catastrophic forgetting on the source domain (43.91%). On the other hand, Proto and Retriever can get high accuracy on the target domain (80.44% and 79.20%) while maintaining high performance on the source domain.
When we combine data from the source domain, we observe performance gains for all models under the Fine-tune on tgt domain with src data section. Specifically, we add few-shot source domain examples as contrastive examples for Retriever and Proto to learn better utterance/class representations. Results show that accuracy on the target domain increases by over 3% compared to using target domain data only. This suggests that unlike other retrieval-based methods such as kNN, Retriever does not require a large support set to guarantee prediction accuracy.

                    tgt      src     avg
Proto               +12.89   -0.51   +5.18
Retriever           +14.60   -0.14   +6.11
Retriever_min-max   +10.79   -0.20   +4.47

Table 3: Improvement (%) over BERT fine-tune on target (tgt), source (src), and average (avg) intents, after fine-tuning on the 5-shot support sets. Numbers are averaged over different (n_c, n_i) data samples.

Impact of reduction functions
We compare the reduction functions proposed in Section 5.1 and find that max performs the best. Since mean is equivalent to Prototypical Networks, we compare with Proto directly in the experiments. min-max is intuitively more appealing than max in that it contrasts with the least similar examples within the same class; however, its performance is worse. We speculate that this is because we retrieve the example with the maximum score at inference time, so the boundary margin may not be utilized. Table 3 shows the average improvement of our methods over the BERT fine-tune baseline, where all models are fine-tuned on the target domain with a balanced few-shot dataset after training on the source domain (same as the Fine-tune on tgt domain with src data section in Table 1). Both Proto and Retriever outperform the baseline on the target domains by a large margin, and Retriever has the best improvement on all intents on average.

Slot Filling
We note that Retriever outperforms the strongest baselines overall but reaches a low score on the SCW domain. This may be due to the bigger difference between the test domain (SCW) and the development domain (GW), including the size of the support set and their respective slot names. We also found that among all the correctly predicted slot spans, 96.73% have the correct slot names. This shows that the majority of the errors come from querying with invalid spans. We believe that span-based pretraining such as SpanBERT (Joshi et al., 2020) could further improve our proposed method.

Table 4: Ablation study on beam size and merge condition. A merge threshold of 0 means never merge and 1 means always merge. Using a larger beam and merging consecutive spans improve F1 scores.
Analyzing Proto From Table 2, Retriever outperforms Proto by 17% when training the span representations. We conjecture that this is caused by Proto learning noisy prototypes. Compared to Retriever, the similarity scores between spans and their corresponding class representations are low, indicating that the span-level prototypes may not be clearly separated.
Ablation on decoding method Table 4 compares beam search to greedy search. The results suggest that beam search with larger beam sizes achieves better F1 scores. As discussed in Section 5.2, we merge same-label spans during inference based on a score threshold. As shown in Table 4, merging spans results in a 1.67% F1 gain (70.43% vs. 72.10%) under the same beam size.
Error Analysis We find that the main problem with our proposed model is that tokens surrounding the gold span may carry excessive contextual information, so these surrounding invalid spans retrieve corresponding spans with high similarities. For instance, in the query "add my track to old school metal playlist", the token "playlist" retrieves an actual playlist span with a high similarity score. Another major issue is that the similarity score retrieved by a part of the gold span is sometimes higher than that retrieved by the whole span. Our ablation results on the merge threshold in Table 4 also suggest that partial spans may individually retrieve complete spans, so that merging consecutive spans with the same slot name achieves higher F1 scores.

Conclusion
In this paper, we propose a retrieval-based method, Retriever, for few-shot intent classification and slot filling. We conduct extensive experiments to compare different model variants and baselines, and show that our proposed approach is effective in the few-shot learning scenario. We believe that our method can also work on open domain dialog tasks where annotations may be more scarce and other text classification tasks. In the future, we plan to extend our method to predict more complex structures with span-based retrieval.

Ethical Considerations
Our intended use case is few-shot domain adaptation to new classes. Our experiments are done on English data, but the method is not English-specific. We use 8 Cloud TPU v2 cores for training and one V100 GPU for inference. Since our model does not have to be retrained for new domains, it can reduce the resources needed when applying such systems. We claim that our proposed method outperforms baselines on few-shot slot filling and intent classification. Our experiments mainly focus on the 5-shot and 2-shot settings, which are typical testing scenarios used by previous work making the same claim.

A Appendices
A.1 Implementation details We use the public uncased BERT-base model from https://github.com/google-research/bert for embedding spans. Our implementation is adapted from https://github.com/google-research/bert/blob/master/run_classifier.py. Since the span embedder in the retriever is the only trainable component in our model, the number of parameters is the same as the initial BERT model.
On SNIPS, we set the initial learning rate to 2 × 10^-5 with 10% of the data for warmup. We set the per-class batch size to 5 for 5-shot experiments. We use the F1 score on the development domain as the metric for early stopping. For decoding, we set the maximum span length m = 7 and the merging threshold λ = 0.99. For the dynamic threshold, we decrease the threshold by 0.05 each time, up to 10 times, until at least one span is above the current threshold. We also use the development domain results to choose an individual threshold for each target domain to filter invalid spans, using grid search over [0.85, 0.97] with a step of 0.05 to find the best threshold on the development domain. Our span-level evaluation is modified from the conlleval script: https://www.clips.uantwerpen.be/conll2000/chunking/conlleval.txt.
On CLINC, we set the initial learning rate to 5 × 10^-5, and 1 × 10^-5 for fine-tuning on the target domain. We set the per-class batch size to 8 for training on the source domain, and 5 and 2 for 5-shot and 2-shot fine-tuning, respectively.

Table 11: Intent accuracy on CLINC for (n_c, n_i) = (5, 15) with 2 shots.