Augmented Natural Language for Generative Sequence Labeling

We propose a generative framework for joint sequence labeling and sentence-level classification. Our model performs multiple sequence labeling tasks at once using a single, shared natural language output space. Unlike prior discriminative methods, our model naturally incorporates label semantics and shares knowledge across tasks. Our framework is general purpose, performing well on few-shot, low-resource, and high-resource tasks. We demonstrate these advantages on popular named entity recognition, slot labeling, and intent classification benchmarks. We set a new state-of-the-art for few-shot slot labeling, improving substantially upon the previous 5-shot ($75.0\% \rightarrow 90.9\%$) and 1-shot ($70.4\% \rightarrow 81.0\%$) state-of-the-art results. Furthermore, our model generates large improvements ($46.27\% \rightarrow 63.83\%$) in low-resource slot labeling over a BERT baseline by incorporating label semantics. We also maintain competitive results on high-resource tasks, performing within two points of the state-of-the-art on all tasks and setting a new state-of-the-art on the SNIPS dataset.


Introduction
Transfer learning has driven many recent successes in natural language processing. Large pre-trained language models are powerful backbones that can be fine-tuned for different tasks to achieve state-of-the-art performance in wide-ranging applications (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Lewis et al., 2019; Yang et al., 2019; Liu et al., 2019).
While these models can be adapted to perform many tasks, each task is often associated with its own output space, which limits the ability to perform multiple tasks at the same time. For instance, a sentiment analysis model is typically a binary classifier that decides between the class labels "positive" and "negative", while a multi-class entailment system classifies each input as "entail", "contradict", or "neither". This setup makes knowledge sharing among tasks difficult: to train the model for a new task, the top-layer classifier is replaced with a new one that corresponds to the novel classes. The class types are specified implicitly through indices in the new classifier, which carry no prior information about the label meanings. This discriminative approach does not incorporate label name semantics and often requires a non-trivial number of examples to train (Lee et al., 2020). While this transfer learning approach has been immensely successful, a more efficient approach should incorporate prior knowledge when possible.
Conditional generative modeling is a natural way to incorporate prior information and encode the output of multiple tasks in a shared predictive space. Recent work by Raffel et al. (2019) built a model called T5 that performs multiple tasks at once using natural language as its output. The model differentiates tasks by using prefixes in its input, such as "classify sentiment:", "summarize:", or "translate from English to German:", and classifies each input by generating natural words such as "positive" for sentiment classification or "This article describes ..." for summarization.
However, the appropriate output format for important sequence labeling applications in NLP, such as named entity recognition (NER) and slot labeling (SL), is not immediately clear. In this work, we propose an augmented natural language format for sequence labeling tasks. Our format locally tags words within the sentence (Figure 1) and is easily extensible to sentence-level classification tasks, such as intent classification (IC).
Our highlighted contributions and main findings are as follows: 1) We propose an effective new output format to perform joint sequence labeling and sentence classification through a generation framework. 2) We demonstrate the ability to perform multiple tasks such as named entity recognition, slot labeling, and intent classification within a single model. 3) Our approach is highly effective in low-resource settings. Even without incorporating label type semantics as priors, the generative framework learns more efficiently than a token-level classification baseline. The model improves further given natural word labels, indicating the benefits of rich semantic information. 4) We show that supervised training on related sequence labeling tasks acts as an effective meta-learner that prepares the model to generate the appropriate output format. Learning each new task becomes much easier and results in significant performance gains. 5) We set a new state-of-the-art for few-shot slot labeling, outperforming the prior state-of-the-art by a large margin. 6) We plan to open source our implementation and will update the paper with a link to our repository.

Model

Sequence Labeling as Generation
Most work on sequence labeling uses token-level classification frameworks. That is, given a list of tokens $x = \{x_i\}_{i=1}^{n}$, we perform a prediction on every token $x_i$ to obtain $\hat{y} = \{\hat{y}_i = f(x_i)\}_{i=1}^{n}$, where $f(\cdot)$ is a token-level prediction function. The prediction is accurate if it matches the original sequence label $y = \{y_i\}_{i=1}^{n}$. In contrast to this convention, we frame sequence labeling as a conditional sequence generation problem where, given the token list $x$, we generate an output list $o = g(x)$, where $g$ is a sequence-to-sequence model. A "naive" formulation for this task would be to directly generate $o = y$ given $x$. However, this approach is prone to errors such as word misalignment and length mismatch (see supplementary materials Section A.2 for discussion).
We propose a new formulation for this generation task such that, given the input sequence $x$, our method generates output $o$ in augmented natural language. The augmented output $o$ repeats the original input sequence with additional markers that indicate the token spans and their associated labels. More specifically, we use the format [ $x_j, \ldots, x_{j+t}$ | L ] to indicate that the token sequence $x_j, \ldots, x_{j+t}$ is labeled as L. Fig. 1 depicts the proposed format and its equivalent canonical BIO format for the same input sentence. The conversion between the BIO format and our augmented natural language format is invertible without any information loss. This is crucial so that the generated output from model prediction can be converted back for comparison without ambiguity.
There are other formats that can encapsulate all the tagging information but are not invertible. For instance, outputting only the token spans of interest with tagging patterns [ $x_j, \ldots, x_{j+t}$ | L ], without repeating the entire sentence, breaks invertibility when there are duplicate token spans with different labels. We discuss this further in the appendix, Section A.3.
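For concreteness, the conversion described above can be sketched as follows (a minimal illustration with helper names of our own; it assumes well-formed inputs whose tokens do not themselves contain the "[", "|", "]" markers):

```python
def bio_to_augmented(tokens, tags):
    """Serialize (tokens, BIO tags) into the augmented natural language output."""
    parts, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            label, span = tags[i][2:], [tokens[i]]
            i += 1
            while i < len(tokens) and tags[i] == "I-" + label:
                span.append(tokens[i])
                i += 1
            parts.append("[ " + " ".join(span) + " | " + label + " ]")
        else:  # "O": copy the token through unchanged
            parts.append(tokens[i])
            i += 1
    return " ".join(parts)


def augmented_to_bio(augmented):
    """Invert the augmented format back to (tokens, BIO tags)."""
    words, tokens, tags, i = augmented.split(), [], [], 0
    while i < len(words):
        if words[i] == "[":
            sep = words.index("|", i)           # span is words[i+1:sep]
            close = words.index("]", sep)       # label (possibly multi-word) follows "|"
            span, label = words[i + 1 : sep], " ".join(words[sep + 1 : close])
            tokens += span
            tags += ["B-" + label] + ["I-" + label] * (len(span) - 1)
            i = close + 1
        else:
            tokens.append(words[i])
            tags.append("O")
            i += 1
    return tokens, tags
```

Because the sentence is repeated verbatim, the round trip recovers the original tokens and tags exactly, which is the invertibility property discussed above.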

Joint Sequence Classification and Labeling
Our sequence-to-sequence approach also supports joint sentence classification and sequence labeling by incorporating the sentence-level label in the augmented natural language format. In practice, we place the pattern (( sentence-level label )) at the beginning of the generated sentence, as shown in Fig. 1. The use of double parentheses prevents confusion with single parentheses that can occur in the original word sequence $x$.
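A sketch of how the sentence-level label could be peeled off a generated string (the helper name and regex are ours, not from the paper):

```python
import re


def split_sentence_label(output):
    """Return (sentence_label, rest) for outputs of the form "(( label )) ...";
    returns (None, output) when no sentence-level label is present."""
    m = re.match(r"\(\(\s*(.*?)\s*\)\)\s*(.*)", output)
    if m:
        return m.group(1), m.group(2)
    return None, output
```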

Training and Evaluation
We train our model by adapting the pre-trained T5 with the sequence-to-sequence framework. Additionally, we prefix the input with task descriptors in order to simultaneously perform multiple classification and labeling tasks, similar to the approach of Raffel et al. (2019). This results in a seamless multi-task framework, as illustrated in the top part of Fig. 2. To evaluate, we convert the generated output back to the canonical format and calculate the F1 score for sequence labeling or accuracy for sentence classification.
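As a minimal illustration of the prefixing scheme (the helper is ours; the prefix strings follow the task descriptors discussed in the appendix, and the example target is illustrative):

```python
def make_example(dataset, tokens, target_augmented):
    """Prefix the input with a task descriptor so one model can serve many datasets."""
    source = dataset + ": " + " ".join(tokens)
    return source, target_augmented


# A hypothetical SNIPS training pair in the augmented natural language format.
src, tgt = make_example(
    "SNIPS",
    ["Onto", "jerrys", "Classical", "Moments", "in", "Movies"],
    "(( add to playlist )) Onto [ jerrys | playlist owner ] "
    "[ Classical Moments in Movies | playlist ]",
)
```

The same model then sees "Ontonotes: ..." or "CONLL: ..." inputs for the NER tasks, differing only in the descriptor.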

Natural Labels
Labels are associated with real-world concepts that can be described through natural words. These words carry rich information, but are often ignored in traditional discriminative approaches.
In contrast, our model naturally incorporates label semantics directly through the generation-as-classification approach. We perform label mapping in order to match the labels to their natural descriptions and use the natural labels in the augmented natural language output. Our motivation is as follows: (1) The pre-trained conditional generation models that we adapt have richer semantics embedded in natural words than in dataset-specific label names. For instance, "country city state" contains more semantic information than "GPE", which is an original label in named entity recognition tasks. Using natural labels should allow the model to learn the association between word tokens and labels more efficiently, without requiring many examples. (2) Label knowledge can be shared among different tasks. For instance, after learning how to label names as "person", given a new task in another domain that requires labeling "artist", the model can more easily associate names with "artist" due to the proximity of "person" and "artist" in the embedding space. This is not the case if the concept of "person" were learned with uninformative label words.

Related Work
Sequence-to-sequence learning has various applications, including machine translation (Sutskever et al., 2014; Bahdanau et al., 2015), language modeling (Radford et al., 2018; Raffel et al., 2019), abstractive summarization (Rush et al., 2015), and generative question answering (Dong et al., 2019), to name a few. However, the sequence-to-sequence framework is often not the method of choice when it comes to sequence labeling. Most models for sequence labeling use the token-level classification framework, where the model predicts a label for each element in the input sequence (Baevski et al., 2019; Li et al., 2019b; Chen et al., 2019). While select prior work adopts the sequence-to-sequence method for sequence labeling (Chen and Moschitti, 2018), this approach is not widely in use due to the difficulty of fixing the output length, the output space, and the alignment with the original sequence.
Multi-task and multi-domain learning often benefit sequence labeling performance (Changpinyo et al., 2018). The archetypal multi-task setup jointly trains on a target dataset and one or more auxiliary datasets. In the cross-lingual setting, these auxiliary datasets typically represent high-resource languages (Schuster et al., 2018; Cotterell and Duh, 2017). In a monolingual scenario, the auxiliary datasets commonly represent similar, high-resource tasks. Examples of similar multi-task pairs include NER and slot labeling (Louvan and Magnini, 2019) as well as dialogue state tracking and language understanding (Rastogi et al., 2018).
A recent series of works frames natural language processing tasks, such as translation, question answering, and sentence classification, as conditional sequence generation problems (Raffel et al., 2019; Radford et al., 2019; Brown et al., 2020). By unifying the model output space across tasks to consist of natural language symbols, these approaches reduce the gap between language model pre-training tasks and downstream tasks. Moreover, this framework allows acquisition of new tasks without any architectural change. The GPT-3 model (Brown et al., 2020) demonstrates the promise of this framework for few-shot learning. Among other successes, GPT-3 outperforms BERT-Large on the SuperGLUE benchmark using only 32 examples per task. To the best of our knowledge, we are the first to apply this multi-task conditional sequence generation framework to sequence labeling.

Table 1: Results of our models trained on combinations of datasets. Results for Ours: individual are from models trained on a single respective dataset. We underline scores of our models that exceed previous state-of-the-art results in each domain. Scores in boldface are the best overall scores among our models, or among the baselines. We use this boldface and underline notation for the rest of the paper.

The conditional sequence generation framework makes it easy to incorporate label semantics, in the form of label names such as departure city, example values such as San Francisco, and descriptions such as "the city from which the user would like to depart on the airline". Label semantics provide contextual signals that can improve model performance in multi-task and low-resource scenarios. Multiple works show that conditioning input representations on slot description embeddings improves multi-domain slot labeling performance (Bapna et al., 2017; Lee and Jha, 2019). Embedding example slot values in addition to slot descriptions yields further improvements in zero-shot slot labeling (Shah et al., 2019).
In contrast to our work, these approaches train slot description and slot value embedding matrices, whereas our framework can incorporate these signals as natural language input without changing the network architecture.

Data
Datasets We use the popular benchmark datasets SNIPS (Coucke et al., 2018) and ATIS (Hemphill et al., 1990) for slot labeling and intent classification. SNIPS is an SLU benchmark with 7 intents and 39 distinct slot types, while ATIS is a benchmark for the air travel domain (see appendix A.4 for details). We also evaluate our approach on two named entity recognition datasets, Ontonotes (Pradhan et al., 2013) and CoNLL-2003 (Sang and Meulder, 2003).

Construction of Natural Labels
We preprocess the original labels into natural words as follows. For the Ontonotes and CoNLL datasets, we transform the original labels via the mappings detailed in Tables 9 and 5 in the appendix. For instance, we map "PER" to "person" and "GPE" to "country city state". For SNIPS and ATIS, we convert intent and slot labels by splitting words based on ".", "_", "/", and capitalized letters. For instance, we convert "object_type" to "object type" and "AddToPlaylist" to "add to playlist". These rules result in better tokenization and enrich the label semantics. We refer to this as the natural label setting and use it as our default.
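These rules can be sketched roughly as follows (our own approximation; the exact separator set and edge-case handling in the paper may differ):

```python
import re


def naturalize(label):
    """Split a dataset label into lowercase natural words:
    on ".", "_", "/" separators and on camelCase boundaries."""
    words = []
    for part in re.split(r"[._/]", label):
        # Break camelCase: "AddToPlaylist" -> ["Add", "To", "Playlist"]
        words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    return " ".join(w.lower() for w in words)
```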

Multi-Task Sequence Classification and Slot Labeling
We first demonstrate that our model can perform multiple tasks in our generative framework and achieve highly competitive or state-of-the-art performance. We consider 4 sequence labeling tasks and 2 classification tasks: NER on the Ontonotes and CoNLL datasets, and slot labeling (SL) and intent classification (IC) on the SNIPS and ATIS dialog datasets. For comparison, Table 1 reports a summary of the results for our method alongside the baselines. Our proposed model achieves highly competitive results on the ATIS, Ontonotes, and CoNLL datasets, as well as state-of-the-art slot labeling and intent classification performance on the SNIPS dataset. Unlike the baseline models, each of which performs a single task on a specific dataset, our model can perform all the considered tasks at once (last row of Table 1). For the multi-task models, our results show that different sequence labeling tasks can mutually benefit each other: the ATIS slot labeling result improves from 96.13 to 96.65 and CoNLL improves from 90.70 to 91.48. While other approaches perform better than our models on some tasks, we highlight the simplicity of our generation framework, which performs multiple tasks seamlessly. This ability helps the model transfer knowledge among tasks with limited data, as demonstrated throughout the rest of the paper.

Limited Resource Scenarios and Importance of Label Semantics
In this section, we show that our model can use the semantics of labels to learn efficiently, which is crucial for scenarios with limited labeled data. To demonstrate this effect, we use our model with the following variants of labels, which differ in semantic quality: (1) natural labels, (2) original labels, and (3) numeric labels.
The natural label version is our default setting, where we use labels expressed in natural words. The original label case uses the labels provided by the datasets, and the numeric label case uses numbers 0, 1, 2, ... as label types. In the numeric version, the model has no pre-trained semantics for the label types and has to learn the associations between the labels and the relevant words from scratch. We also compare with the BERT token-level classification model. Similar to the numeric label case, the label types for BERT initially have no associated semantics and are implicit through indices in the classifier weights. We use the SNIPS dataset to conduct our experiments due to its balanced domains (see Table 7 in the appendix). We experiment with very limited resource scenarios, using as little as 0.25% of the training data, corresponding to roughly one training sentence per label type on average. Figure 3a shows the sequence labeling performance for varying amounts of training data (see Table 10 in the appendix for numeric results). We observe that label semantics play a crucial role in the model's ability to learn effectively in limited resource scenarios. Our model with natural labels outperforms all other models, achieving an F1 score of 60.4 ± 2.7% with 0.25% of the training data and giving a slight boost over using original labels (57.5 ± 2.4%). We believe that the improvement can be more dramatic on other datasets where the original labels have no meaning (as in the numeric case), are heavily abbreviated, or contain rare words. With the numeric model, performance suffers significantly in low-resource settings, achieving only 50.1 ± 5.3% with 0.25% of the data, or 10.3% lower than the natural label model. This result further supports the importance of label semantics in our generation approach.
Interestingly, we also observe that the numeric model still outperforms BERT token-level classification (44.7 ± 6.4%), where neither model contains prior label semantics. This result indicates that even in the absence of label meanings, the generation approach seems more suitable than the token-level framework.

Teaching Model to Generate via Supervised Transfer Learning
While we train our model in limited data scenarios, we are asking the model to generate a new output format given a small amount of data. This is challenging, since sequence generation frameworks typically require large amounts of training data (Sutskever et al., 2014). Despite this challenge, our model outperforms the classical token-level framework with ease. This section explores a clear untapped potential: by teaching our model how to generate the augmented natural language format before adapting to new tasks, we show that performance in limited data settings improves significantly. This result contrasts with the BERT token-level model, where supervised transfer learning hurts overall performance compared to BERT's initial pre-training, possibly due to overfitting.
To conduct this experiment, we train our model on the Ontonotes NER task in order to teach it the expected output format. Then, we adapt it to another task (SNIPS) with limited data, as in Section 4.3. We compare the results with the token-level BERT model, which likewise uses a BERT model trained on Ontonotes for supervised pre-training. We present the results in Figure 3b and highlight the improvement due to supervised pre-training in Figure 3c. We also provide full numeric results in appendix Table 11 for reference.
Our model demonstrates consistent improvement, achieving an F1 score of 63.8 ± 2.6% using 0.25% of the training dataset, compared to 60.4 ± 2.7% without supervised transfer learning. The improvement trend also continues for other data settings, as shown in Figure 3c. The benefit from transfer learning is particularly strong for the numeric label model, which achieves 57.4 ± 2.9% compared to 50.1 ± 5.3% with 0.25% of the data. This result suggests that the initial knowledge from supervised pre-training helps the model associate its labels (without prior semantics) with the relevant words more easily.
The supervised transfer learning can also be seen as a meta-learner that teaches the model how to perform sequence labeling in the generative style. In fact, when we investigate the model output without adapting to the SNIPS dataset, the output not only has the correct format but already contains relevant tagging information for the new task.
For instance, the phrase "Onto jerrys Classical Moments in Movies" from the SNIPS dataset results in the model output "Onto jerrys [ Classical Moments in Movies | work of art ]". This prediction closely matches the true label "Onto [ jerrys | playlist owner ] [ Classical Moments in Movies | playlist ]", where the true class of "Classical Moments in Movies" is playlist. Intuitively, the classification as work of art agrees with the true label playlist, and simply needs to be refined to match the allowed labels for the new task.
In contrast to our framework, where supervised transfer learning helps teach the model an output style, transfer learning for the token-level classification model simply adapts its weights and retains the same token-level structure (albeit with a new classifier). We observe no significant improvement from supervised pre-training for the BERT token-level model, which obtains an F1 score of 46.3 ± 3.6% compared to 44.7 ± 6.4% without supervised pre-training (with 0.25% SNIPS data). The improvements are also close to zero or negative for higher data settings (Figure 3c), suggesting that pre-training of the token-level classifier might overfit to the supervised data, resulting in lower generalization on other downstream tasks. Overall, the final result for the BERT model lags far behind our framework, performing 17.5% lower than our model's score with 0.25% training data.
In addition, our model with numeric labels performs much better than the BERT token-level model, further highlighting the suitability of our generative output format for sequence labeling, regardless of the label semantics. A possible explanation is that the sequence-to-sequence model is less prone to overfitting than the classification framework. It could also be the case that locally tagging words with labels in the word sequence improves attention within the transformer model and improves robustness to limited data.

Few-Shot Learning
In few-shot learning, we seek to train models such that, given a new task, they are able to learn efficiently from few labels. Different tasks are sampled from various data domains, which differ in their allowed labels and other nuances such as input styles.
We define a data domain $\mathcal{D}$ as a set of labeled sentences with its own set of allowed label types $\mathcal{Y}_{\mathcal{D}} \ni y_i$. Few-shot learning approaches are evaluated over many episodes of data, which represent a variety of novel tasks. Each episode $(S, Q)$ consists of a support set $S$ containing K-shot labeled samples, as well as a query set $Q$ used for evaluation. Data from the evaluation episodes are drawn from the target domains $\{\mathcal{D}^T_1, \mathcal{D}^T_2, \ldots\}$, which the model has not previously seen.
To learn such models, we typically have access to another set of domains called the source domains $\{\mathcal{D}^S_1, \mathcal{D}^S_2, \ldots\}$, which can be used as training resources. In order to train the model to learn multiple tasks well, many few-shot learning approaches use meta-learning, or a learning-to-learn approach, where the model is trained on many episodes drawn from the source domains in order to mimic the evaluation (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Finn et al., 2017). We refer to this as episodic training.
Another approach, called fine-tuning, trains the model on a regular training set pooled from the source domains: $\cup_m \mathcal{D}^S_m$. Given an episode $(S, Q)$ at evaluation time, the model fine-tunes on the support set $S$, typically with a new classifier constructed for the new task, and evaluates on $Q$. One baseline builds on prototypical networks (Snell et al., 2017), which classify by comparing a word $x_i$ to each class centroid rather than to individual sample embeddings. L-TapNet+CDT (Hou et al., 2020) uses a CRF framework and leverages label semantics in representing labels to calculate emission scores, together with a collapsed dependency transfer method to calculate transition scores. We note that all baselines except TransferBERT use episodic meta-training, whereas TransferBERT uses fine-tuning. All baseline results are taken from Hou et al. (2020).
Our model performs fine-tuning with the generation framework. The major difference between our model and a token-level classification model such as TransferBERT is that we do not require a new classifier for every novel task during fine-tuning on the support set. The sequence generation approach allows us to use the entire model and adapt it to new tasks, where the initial embeddings contain high-quality semantics and help the model transfer knowledge efficiently.

K-shot Episode Construction
Traditionally, the support set $S$ is constructed in a K-shot format where we use only K instances of each label type. In sequence labeling problems, this definition is challenging due to the presence of multiple occurrences or multiple label types in a single sentence. We follow Hou et al. (2020) and use the following definition of a K-shot setting: all labels within the task appear at least K times in $S$, and would appear fewer than K times if any sentence were removed. We sample 100 episodes from each domain according to this definition. Note that Hou et al. (2020)'s episodes are similar to ours, but preprocess the sentences by lowercasing and removing extra tokens such as commas (see details in Section A.6). Our model is flexible and can handle raw sentences; we therefore use episodes from the original SNIPS dataset without any modifications. For meta-training, when evaluating on a SNIPS domain $\mathcal{D}_i$, we train on the remaining domains $\cup_{j \neq i} \mathcal{D}_j$. We refer to these as the leave-one-out meta-training sets. All other baselines also use this meta-training data setup.
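The K-shot definition above can be sketched as a greedy construction (our own sketch; Hou et al. (2020) describe their exact sampling procedure):

```python
from collections import Counter


def build_support(sentences, K):
    """sentences: list of (tokens, labels) pairs, where labels lists the label
    types occurring in the sentence (with multiplicity).
    Greedily add sentences until every label type appears >= K times, then
    drop any sentence whose removal keeps all counts >= K (minimality)."""
    support, counts = [], Counter()
    all_labels = {l for _, labels in sentences for l in labels}
    for sent in sentences:
        if all(counts[l] >= K for l in all_labels):
            break
        _, labels = sent
        if any(counts[l] < K for l in labels):
            support.append(sent)
            counts.update(labels)
    # Removal pass: enforce "would appear fewer than K times if removed".
    for sent in list(support):
        _, labels = sent
        if all(counts[l] - labels.count(l) >= K for l in set(labels)):
            support.remove(sent)
            counts.subtract(labels)
    return support
```

One sentence can cover several label types at once, which is why the resulting support sizes vary across domains, as noted in the appendix.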

Data
We note that the leave-one-out meta-training set closely matches the data distribution of the evaluation domain $\mathcal{D}_i$, since both are drawn from the SNIPS dataset. We investigate more challenging scenarios where we use an alternative source as a meta-training set, as well as no meta-training at all. In particular, we choose the Ontonotes NER task as the alternative source domain. The benefit of this setup is that it establishes a single meta-trained model that works across all evaluation domains, which we offer as a challenging benchmark for future research. Table 2 presents the results of our few-shot experiments. Our model outperforms the previous state-of-the-art on every domain evaluated. In the 5-shot case, our model achieves an average F1 score of 90.9%, exceeding the strongest baseline by 15.9%. Even without meta-training, the model performs on par with state-of-the-art models, achieving an F1 score of 77.3% versus 75.0% for the baseline. Training on an alternative source (the NER task) also proves to be an effective meta-learning strategy, performing better than the best baseline by 7.7%. These results indicate that our model is robust in its ability to learn sequence tagging on target domains that differ from the sources. In the 1-shot case, our model achieves an average F1 score of 81.0%, outperforming the best baseline significantly (a 10.6% improvement).

Few-Shot Results
We note that the average support sizes are around 5 to 40 sentences for the 5-shot case, and 1 to 8 sentences for the 1-shot case (see Tables 12 and 13 for details). The results are particularly impressive given that we adapt a large transformer model with such a limited number of samples. Compared to other fine-tuning approaches such as TransferBERT, our model performs substantially better, indicating that our generative framework is a more data-efficient approach for sequence labeling.

Discussion and Future Work
Our experiments consistently show that the generation framework is suitable for sequence labeling and sets a new record for few-shot learning. Our model adapts to new tasks efficiently with limited samples, while incorporating the label semantics expressed in natural words. This is akin to how humans learn. For instance, we do not learn the concept of "person" from scratch in a new task, but have prior knowledge that "person" likely corresponds to names, and refine this concept through observations. The natural language output space allows us to retain the knowledge from previous tasks through shared embeddings, unlike the token-level model, which needs new classifiers for novel tasks, resulting in a broken chain of knowledge.
Our approach naturally lends itself to life-long learning. The unified input-output format allows the model to incorporate new data from any domain. Moreover, it has the characteristics of a single, lifelong learning model that works well on many levels of data, unlike other approaches that only perform well on few-shot or high-resource tasks. Our simple yet effective approach is also easily extensible to other applications such as multi-label classification, or structured prediction via nested tagging patterns.

A.1 Experiment Setup
We describe the experiment setup for reproducibility in this section. We use Huggingface's T5-BASE conditional generation model as well as their trainer (transformers.Trainer) with its default hyperparameters to train all our models. The trainer uses AdamW (Kingma and Ba, 2015; Loshchilov and Hutter, 2019) and linear learning rate decay.
• We use 8 V100 GPUs for all our experiments.
• Maximum sequence length = 128 for all tasks except Ontonotes where we use 175.
• The number of epochs in multi-task experiments is 50.
• The number of epochs for limited-resource experiments is scaled to have the same number of optimization steps as that of when we use the entire training set. For instance, if we use 10% of the entire training set, we use 500 epochs on that limited set.
• We perform no preprocessing except for replacing the labels with the natural labels described in Section 4.1.
• We use the seqeval library for F1 evaluation, which supports many tagging formats such as BIO, BIOES, etc.
• We use bert-base-multilingual-cased for our BERT model.
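For reference, the entity-level F1 that seqeval reports can be illustrated with a small pure-Python sketch (ours, not seqeval's actual implementation; stray "I-" tags without a preceding "B-" are simply ignored here):

```python
def bio_spans(tags):
    """Extract (label, start, end) spans from a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("I-") and label == tag[2:]:
            continue  # span continues
        if label is not None:
            spans.add((label, start, i - 1))
            label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level F1: a span counts only if label and boundaries both match."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```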

Task Descriptor Prefixes
A task descriptor helps encode information about the allowed set of label types. For instance, CoNLL allows only 4 label types, whereas the 18 label types in Ontonotes are more fine-grained. A task descriptor also helps encode the nuances among labels; for example, the CoNLL dataset has only one tag type "LOC" for location, whereas Ontonotes differentiates locations with "GPE" (countries, cities, states) and a general "LOC" (non-GPE locations, mountain ranges, bodies of water). By specifying a task descriptor, we allow the model to learn the implicit constraints in the data and distinguish which task it should perform given an input sentence. We use the corresponding prefixes "SNIPS: ", "ATIS: ", "Ontonotes: " or "CONLL: ". We ensure that these prefixes can be tokenized properly by the pretrained tokenizer without the model having to use the unknown token. We demonstrate in Section 4.2 that the prefix tags allow us to perform slot labeling (and intent classification) for different datasets using a single model. For a model trained on only a single dataset, such a prefix can be omitted and does not affect performance.

Table 3: Example of a sentence and its tagging label from Ontonotes. The prediction is generated from training a sequence-to-sequence model with a raw BIO format.
A.2 A Naive Approach for Sequence Labeling as Sequence-to-Sequence

We consider training a sequence-to-sequence model where the output is the sequence of BIO tags. Table 3 demonstrates an example with a model prediction. We find that the model often outputs predictions that are misaligned with the original sentence slots. This is due to the complex relationship with the tokenizer. For instance, the tokenized version of this sentence (using the T5-BASE tokenizer) has length 25, whereas the original sentence length is 14. Learning to map the slot labels to the correct tokens can be challenging.

A.3 Shortened Generative Format
We show a failure case for the shortened generative format discussed in Section 2, where we repeat only the tagged pattern [ $x_j, \ldots, x_{j+t}$ | L ]. Consider an input $s$ whose label $y$ tags the "two" in "two dollars" as money, while the sentence also contains the phrase "two men". If we repeat only the tagged pattern, then the output $s_g$ will be [ two | money ].
Given $s_g$ and $s$, it is ambiguous whether the canonical label should be associated with the "two" in "two dollars" or the one in "two men".
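This ambiguity is easy to make concrete (a toy sketch with an illustrative sentence of our own):

```python
def candidate_positions(tokens, span):
    """All positions where a tagged span could anchor in the sentence."""
    n = len(span)
    return [i for i in range(len(tokens) - n + 1) if tokens[i : i + n] == span]


tokens = "two men paid two dollars".split()
# Two candidate anchors, so the shortened output "[ two | money ]" is ambiguous.
positions = candidate_positions(tokens, ["two"])
```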

A.4 Dataset Details
We provide details on the statistics of the datasets used in Table 4. We provide details on the intent and slot label types in Table 7 for the SNIPS dataset and Table 8 for the ATIS dataset. The tagging label types and their label mappings are listed in Table 9 for Ontonotes and Table 5 for CoNLL.

A.5 Low-Resource Results
We provide full numeric results for low-resource experiments from Section 4.3 and 4.4 in Table 10 and Table 11 respectively.

A.6 Few-Shot Experiments
We provide details on the data used for our few-shot experiments and the full results in this Section.

A.6.1 Episode Data
We use two data constructions: the original episodes from Hou et al. (2020) and our own episodes constructed with Hou et al. (2020)'s definition. We provide the episode statistics (Avg $|S|$) in Table 12 for both constructions, which demonstrates that the support sizes are comparable for each domain. The major difference is that Hou et al. (2020) preprocess the data by lowercasing all letters and removing extra tokens such as commas and apostrophes. In addition, Hou et al. (2020) modify the BIO prefixes in cases where the tokenization splits a token with the "B-" prefix into two or more units. For instance, the token ["lora's"] with tag [B-playlist owner] becomes ["lora", "s"] with tags [B-playlist owner, I-playlist owner]. This treatment considerably increases the number of tokens with "I-" tags in the episodes created by Hou et al. (2020). Both constructions have 100 episodes with 20 query sentences. We provide results for our episodes with lowercased words as well as Hou et al. (2020)'s episodes, which show the similarity between the two settings.
We note that the support size for some domains can be smaller than in others under Hou et al. (2020)'s K-shot definition. For instance, domain CR has around 5 sentences on average, whereas domain RE has more than 30 sentences. This is because, for some domains, there can be many tags of the same type in a single sentence.

Table 2 caption (fragment): ... results trained on the Ontonotes NER task (see label types in Table 7), which is from a different domain than the SNIPS dataset. Ours w/o meta involves no additional meta-training and fine-tunes our backbone model directly.

A.7 1-Shot Results
Our models tend to perform well when there are sufficiently many sentences to fine-tune on. For domain 'Cr' (SearchCreativeWork), where there is a highly limited number of sentences (5 on average), our model does not perform well compared to other baselines. This observation is consistent with the 1-shot results, which we include in Table 13, where we typically have fewer than 10 sentences in the support set. In this case, our model performs comparably to Warm Proto Zero with BERT on average but is outperformed by L-TapNet+CDT. Other techniques to improve the 1-shot learning results include bootstrapping more sentences from an unlabeled corpus using the labels from the support set for better optimization.

Table 12: Our 5-shot slot tagging results on 7 domains of the SNIPS dataset. We provide the average and the standard deviation of F1 scores over 100 episodes. Ours-l indicates that we use lowercased words for input sentences.