Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network

In this paper, we explore slot tagging with only a few labeled support sentences (a.k.a. few-shot). Few-shot slot tagging faces a unique challenge compared to other few-shot classification problems, as it calls for modeling the dependencies between labels. But it is hard to apply previously learned label dependencies to an unseen domain, due to the discrepancy of label sets. To tackle this, we introduce a collapsed dependency transfer mechanism into the conditional random field (CRF) to transfer abstract label dependency patterns as transition scores. In the few-shot setting, the emission score of the CRF can be calculated as a word's similarity to the representation of each label. To calculate such similarity, we propose a Label-enhanced Task-Adaptive Projection Network (L-TapNet) based on the state-of-the-art few-shot classification model, TapNet, leveraging label name semantics in representing labels. Experimental results show that our model significantly outperforms the strongest few-shot learning baseline by 14.64 F1 points in the one-shot setting.


Introduction
Slot tagging (Tur and De Mori, 2011), a key module in task-oriented dialogue systems (Young et al., 2013), is usually formulated as a sequence labeling problem (Sarikaya et al., 2016). Slot tagging faces rapidly changing domains, and labeled data is usually scarce for new domains. Few-shot learning techniques (Miller et al., 2000; Lake et al., 2015; Vinyals et al., 2016) are appealing in this scenario, since they learn a model that borrows prior experience from old domains and adapts to new domains quickly with only very few examples (usually one or two examples for each class).

Previous few-shot learning studies mainly focused on classification problems, which have been widely explored with similarity-based methods (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Yu et al., 2018). The basic idea of these methods is to classify a (query) item in a new domain according to its similarity with the representation of each class. The similarity function is usually learned in prior rich-resource domains, and each class representation is obtained from a few labeled samples (the support set).

It is straightforward to decompose few-shot sequence labeling into a series of independent few-shot classifications and apply the similarity-based methods. However, sequence labeling benefits from taking the dependencies between labels into account (Huang et al., 2015; Ma and Hovy, 2016). To consider both item similarity and label dependency, we propose to leverage conditional random fields (Lafferty et al., 2001, CRFs) in few-shot sequence labeling (see Figure 1). In this paper, we translate the emission score of the CRF into the output of the similarity-based method and calculate the transition score with a specially designed transfer mechanism.
The few-shot scenario poses unique challenges in learning the emission and transition scores of the CRF. It is infeasible to learn the transitions from the few labeled data, and prior label dependencies in the source domain cannot be directly transferred due to the discrepancy in label sets. To tackle the label discrepancy problem, we introduce the collapsed dependency transfer mechanism. It transfers label dependency information from source domains to target domains by abstracting domain-specific labels into abstract domain-independent labels and modeling the label dependencies between these abstract labels.
It is also challenging to compute the emission scores (word-label similarity in our case). Popular few-shot models, such as Prototypical Network (Snell et al., 2017), average the embeddings of each label's support examples as label representations, which often distribute closely in the embedding space and thus cause misclassification. To remedy this, Yoon et al. (2019) propose TapNet, which learns to project embeddings to a space where words of different labels are well-separated. We introduce this idea to slot tagging and further propose to improve label representations by leveraging the semantics of label names. We argue that label names are often semantically related to slot words and can help word-label similarity modeling. For example in Figure 1, the word rain and the label name weather are highly related. To use label name semantics and achieve good separation in label representations, we propose Label-enhanced TapNet (L-TapNet), which constructs an embedding projection space using label name semantics, where label representations are well-separated and aligned with embeddings of both label names and slot words. Then we calculate similarities in the projected embedding space. Also, we introduce a pair-wise embedding mechanism to represent words with domain-specific context.
One-shot and five-shot experiments on slot tagging and named entity recognition show that our model achieves significant improvement over strong few-shot learning baselines. Ablation tests demonstrate that improvements come from both L-TapNet and collapsed dependency transfer. Further analysis of label dependencies shows that the transferred dependencies capture non-trivial information and outperform rule-based transitions.
Our contributions are summarized as follows: (1) We propose a few-shot CRF framework for slot tagging that computes the emission score as word-label similarity and estimates the transition score by transferring previously learned label dependencies.
(2) We introduce the collapsed dependency transfer mechanism to transfer label dependencies across domains with different label sets. (3) We propose L-TapNet, which leverages the semantics of label names to enhance label representations, helping to model word-label similarity.

Problem Definition
We define a sentence x = (x_1, x_2, . . . , x_n) as a sequence of words and the label sequence of the sentence as y = (y_1, y_2, . . . , y_n). A domain is a set of (x, y) pairs. For each domain, there is a corresponding domain-specific label set L_D. To simplify the description, we assume that the number of labels N is the same for all domains.
As shown in Figure 2, few-shot models are usually first trained on a set of source domains {D_1, D_2, . . .}, then directly applied to another set of unseen target domains {D′_1, D′_2, . . .} without fine-tuning. A target domain D′_j contains only a few labeled samples, which is called the support set S. The K-shot sequence labeling task is defined as follows: given a K-shot support set S and an input query sequence x = (x_1, x_2, . . . , x_n), find x's best label sequence y*:

y* = arg max_y p(y | x, S).

Model
In this section, we first give an overview of the proposed CRF framework (§3.1). Then we discuss how to compute the label transition score with collapsed dependency transfer (§3.2) and the emission score with L-TapNet (§3.3).

Framework Overview
Conditional Random Field (CRF) considers both the transition score and the emission score to find the global optimal label sequence for each input. Following the same idea, we build our few-shot slot tagging framework with two components: Transition Scorer and Emission Scorer.
We apply the linear-CRF to the few-shot setting by modeling the probability of label sequence y given query sentence x and a K-shot support set S:

p(y | x, S) = (1 / Z) exp(TRANS(y) + λ · EMIT(y, x, S)),
where Z = Σ_{y′ ∈ Y} exp(TRANS(y′) + λ · EMIT(y′, x, S)),

TRANS(y) is the Transition Scorer output and EMIT(y, x, S) is the Emission Scorer output. λ is a scaling parameter that balances the weights of the two scores. We take L_CRF = −log p(y | x, S) as the loss function and minimize it on data from the source domains. After the model is trained, we employ the Viterbi algorithm (Forney, 1973) to find the best label sequence for each input.

[Figure 3: We learn a collapsed label transition table T̄ and obtain the domain-specific label transition matrix T by filling each position of T with the value from T̄ in the same color.]
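The decoding step can be sketched as a standard Viterbi pass over the two scores. The following is a minimal illustration; the score matrices here are stand-ins for the learned scorers, not the actual model outputs:

```python
import numpy as np

def viterbi_decode(emit, trans, lam=1.0):
    """Find the best label sequence under a linear-chain CRF score.

    emit:  (n, N) emission scores EMIT(y_i, x_i, S) for n words, N labels.
    trans: (N, N) transition scores, trans[p, q] = score of label p -> q.
    lam:   scaling factor balancing emission against transition.
    """
    n, N = emit.shape
    score = lam * emit[0].copy()           # best score ending in each label at step 0
    back = np.zeros((n, N), dtype=int)     # backpointers
    for i in range(1, n):
        # candidate score of ending at step i with label q, coming from label p
        cand = score[:, None] + trans + lam * emit[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow backpointers from the best final label
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i][best[-1]]))
    return best[::-1]
```

With all-zero transitions the decoder simply picks the per-word argmax of the emission scores; a strongly negative transition score (e.g., for an illegal bigram) can push it to a different path.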

Transition Scorer
The transition scorer component captures the dependencies between labels.2 We model the label dependency as the transition probability between two labels. Conventionally, such probabilities are learned from training data and stored in a transition matrix T of size N × N, where N is the number of labels. For example, T_{B-loc,B-team} corresponds to p(B-loc | B-team). But in the few-shot setting, a model faces different label sets in the source domains (train) and the target domains (test). This mismatch between label sets blocks the trained transition scorer from directly working on a target domain.

Collapsed Dependency Transfer Mechanism
We overcome the above issue by directly modeling the transition probabilities between abstract labels. Intuitively, we collapse specific labels into three abstract labels: O, B and I. To distinguish whether two labels share the same semantics, we model transitions from B and I to the same B (sB), a different B (dB), the same I (sI) and a different I (dI). We record such abstract label transitions in a table T̄ of size 3 × 5 (see Figure 3). For example, T̄_{B,sB} = p(B_m | B_m) is the transition probability between two B labels of the same type, and T̄_{B,dI} = p(I_n | B_m) is the transition probability from a B label to an I label of a different type, where m ≠ n. T̄_{O,sB} and T̄_{O,sI} respectively stand for the probability of transition from O to any B or I label.

2 Here, we ignore Start and End labels for simplicity. In practice, Start and End are included as two additional abstract labels.
To calculate the label transition probabilities for a new domain, we construct the transition matrix T by filling it with values from T̄. Figure 3 shows the filling process, where positions of the same color are filled with the same value. For example, we fill T_{B-loc,B-team} with the value in T̄_{B,dB}.
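The filling process can be sketched as follows. The target-domain label set and the collapsed table T̄ below are hypothetical stand-ins (in the model, T̄ is learned on the source domains):

```python
import numpy as np

# Hypothetical target-domain label set.
labels = ["O", "B-loc", "I-loc", "B-team", "I-team"]

# Collapsed transition table T̄ (3 x 5): rows O/B/I, columns O/sB/dB/sI/dI.
# Deterministic stand-in for the table learned on the source domains.
rows, cols = ["O", "B", "I"], ["O", "sB", "dB", "sI", "dI"]
T_bar = np.arange(15, dtype=float).reshape(3, 5)

def collapsed_entry(prev, cur):
    """Look up the T̄ cell for the concrete label bigram prev -> cur."""
    if cur == "O":
        col = "O"
    else:
        # transitions out of O use the "same" columns (sB / sI), per the text
        same = prev == "O" or prev[2:] == cur[2:]
        col = ("s" if same else "d") + cur[0]
    return T_bar[rows.index(prev[0]), cols.index(col)]

# Fill the full N x N transition matrix for the new domain.
T = np.array([[collapsed_entry(p, c) for c in labels] for p in labels])
```

Note that the same 15 collapsed values are reused for any target label set, which is what makes the learned dependencies transferable across domains.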

Emission Scorer
As shown in Figure 4, the emission scorer independently assigns each word an emission score with regard to each label:

EMIT(y, x, S) = Σ_{i=1}^{n} f_E(y_i, x_i, S),

where f_E scores word x_i against label y_i. In the few-shot setting, a word's emission score is calculated according to its similarity to the representation of each label. To compute such emission scores, we propose L-TapNet, which improves TapNet (Yoon et al., 2019) with label semantics and prototypes.

Task-Adaptive Projection Network
TapNet is a state-of-the-art few-shot image classification model. Previous few-shot models, such as Prototypical Network, average the embeddings of each label's support examples as label representations and directly compute word-label similarity in the word embedding space. Different from them, TapNet calculates word-label similarity in a projected embedding space, where words of different labels are well-separated, which allows TapNet to reduce misclassification. To achieve this, TapNet leverages a set of per-label reference vectors Φ = [φ_1; · · · ; φ_N] as label representations and constructs a projection space based on these references. Then, a word x's emission score for label j is calculated as its similarity to reference φ_j:

f_E(j, x, S) = SIM(M(E(x)), M(φ_j)),

where M is a projection function, E is an embedder and SIM is a similarity function. TapNet shares the references Φ across different domains and constructs M for each specific domain by randomly associating the references with the specific labels.

Task-Adaptive Projection Space Construction
Here, we present a brief introduction to the construction of the projection space. Let c_j be the average of the embedded features of words with label j in the support set S. Given Φ = [φ_1; · · · ; φ_N] and support set S, TapNet constructs the projector M such that (1) each c_j and its corresponding reference vector φ_j align closely when projected by M, and (2) words of different labels are well-separated when projected by M.
To achieve these, TapNet first computes the alignment bias between c_j and φ_j in the original embedding space, then finds a projection M that eliminates this alignment bias while effectively separating different labels. Specifically, TapNet takes the matrix solution of a linear error nulling process as the embedding projector M. For the detailed process, refer to the original paper.
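As a rough illustration of the idea only: the sketch below covers just the alignment part, ignoring the normalization and label-separation steps of TapNet's actual construction. A null-space basis of the per-label alignment errors already yields a projector under which prototypes and references coincide:

```python
import numpy as np

def build_projector(protos, refs):
    """Simplified sketch of TapNet-style linear error nulling.

    protos: (N, d) per-label prototypes c_j from the support set.
    refs:   (N, d) per-label reference vectors phi_j.
    Returns M of shape (d, d - rank) whose columns span the null space of
    the alignment errors c_j - phi_j, so that the projections coincide:
    protos @ M == refs @ M.
    """
    errors = protos - refs                      # (N, d) alignment errors
    # right-singular vectors beyond the rank span the null space of `errors`
    _, _, vt = np.linalg.svd(errors, full_matrices=True)
    rank = np.linalg.matrix_rank(errors)
    return vt[rank:].T                          # null-space basis as columns
```

Since every column m of M satisfies errors @ m = 0, each projected prototype lands exactly on its projected reference, which is the alignment property (1) above.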

Label-enhanced TapNet
As mentioned in the introduction, we argue that label names often semantically relate to slot words and can help word-label similarity modeling. To enhance TapNet with such information, we use label semantics in both the label representation and the construction of the projection space.
[Figure 4: Emission Scorer with L-TapNet. It first constructs a projection space M by linear error nulling for a given domain, and then predicts a word's emission score by its distance to the label representations Ω in the projection space.]

Projection Space with Label Semantics Let prototype c_j be the average of embeddings of words with label j in the support set, and let s_j be the semantic representation of label j (Section 3.3.3 introduces how to obtain it in detail). Intuitively, slot values (c_j) and the corresponding label name (s_j) often have related semantics, so they should be close
in embedding space. So, we find a projector M that aligns c_j to both φ_j and s_j. The difference from TapNet is that TapNet only aligns c_j to the references φ_j, while we also require alignment with the label representation. The label-enhanced reference is calculated as:

ψ_j = α · s_j + (1 − α) · φ_j,

where α is a balance factor. The label semantics s_j make M specific to each domain, and the reference φ_j provides cross-domain generalization. Then we construct M by linear error nulling of the alignment error between the label-enhanced reference ψ_j and c_j, following the same steps as TapNet.
Emission Score with Label Semantics For the emission score calculation, compared to TapNet, which only uses the domain-agnostic reference φ_j as the label representation, we also consider label semantics and use the label-enhanced reference ψ_j in the label representation.
Besides, we further incorporate the idea of Prototypical Network and represent a label using a prototype reference c_j as Ω_j = (1 − β) · c_j + β · ψ_j. Finally, the emission score of x is calculated as its similarity to the label representation Ω:

f_E(j, x, S) = SIM(M(E(x)), M(Ω_j)),

where SIM is the dot-product similarity function and E is a word embedding function, introduced in the next section.
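Putting the pieces together, a minimal sketch of the emission computation for one query word might look as follows. The mixing form of ψ_j and the shapes are illustrative assumptions; α and β values follow the hyperparameters section:

```python
import numpy as np

def emission_scores(query_emb, protos, refs, label_sems, M, alpha=0.5, beta=0.7):
    """Emission scores of one query word for every label (L-TapNet sketch).

    query_emb:  (d,)   contextual embedding E(x) of the query word.
    protos:     (N, d) prototypes c_j (mean support embedding per label).
    refs:       (N, d) shared reference vectors phi_j.
    label_sems: (N, d) label-name embeddings s_j.
    M:          (d, m) projection matrix.
    """
    # label-enhanced references (mixing form assumed for illustration)
    psi = alpha * label_sems + (1 - alpha) * refs
    # final label representations, mixing in the prototype
    omega = (1 - beta) * protos + beta * psi
    # dot-product similarity in the projected space
    return (query_emb @ M) @ (omega @ M).T      # shape (N,)
```

Setting β = 1 recovers a purely reference-based scorer, while β = 0 falls back to a Prototypical-Network-style prototype comparison in the projected space.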

Embeddings for Word and Label Name
For the word embedding function E, we propose a pair-wise embedding mechanism. As shown in Figure 5, a word tends to mean different things in different contexts. To tackle this representation challenge for similarity computation, we consider the special query-support setting in few-shot learning and embed query and support words pair-wisely. Such pair-wise embedding makes use of the domain-related context in support sentences and provides domain-adaptive embeddings for the query words, which further helps to model a query word's similarity to domain-specific labels. To achieve this, we represent each word with self-attention over both query and support words. We first copy the query sentence x N_S = |S| times and pair the copies with all support sentences. Then the N_S pairs are passed to BERT (Devlin et al., 2019) to get N_S embeddings for each query word, and we represent each word as the average of its N_S embeddings. Now, the representations of query words are conditioned on domain-specific context. We use BERT because it naturally captures the relation between sentence pairs.

[Figure 5: When embedding query and support sentences separately (left), it is hard to tag blackbird according to its similarity to labels. But if we embed the query by pairing it with different support sentences (right), the domain-specific context gives blackbird meanings close to pet and song respectively.]

[Table 1 caption: "Ave. |S|" corresponds to the average support set size of each domain. "Sample" stands for the number of few-shot samples we build from each domain.]

To get the label representation s, we first concatenate the abstract label name (e.g., begin or inner) and the label name (e.g., weather). Then, we insert a [CLS] token at the first position and input the sequence into BERT. Finally, the representation of [CLS] is used as the label semantic embedding.
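The pair-wise embedding step described above can be sketched as follows, with an abstract `encode_pair` standing in for BERT run over a concatenated sentence pair:

```python
import numpy as np

def pairwise_embed(query_tokens, support_sents, encode_pair):
    """Pair-wise embedding sketch.

    encode_pair(query_tokens, support_tokens) -> (len(query_tokens), d)
    stands in for BERT over the concatenated sentence pair. Each query
    word's embeddings across all support pairings are averaged, so the
    result is conditioned on domain-specific context.
    """
    embs = [encode_pair(query_tokens, sent) for sent in support_sents]
    return np.mean(embs, axis=0)    # (len(query_tokens), d)
```

In the model the N_S forward passes go through the same BERT; here any pair encoder with the stated signature can be plugged in.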

Experiment
We evaluate the proposed method on slot tagging and test its generalization ability on a similar sequence labeling task: named entity recognition (NER). Due to space limitations, we only present the detailed results for 1-shot/5-shot slot tagging, which transfers the learned knowledge from source domains (training) to an unseen target domain (testing) containing only a 1-shot/5-shot support set. The results of NER are consistent and we present them in the supplementary Appendix B.

Settings
Dataset For slot tagging, we exploit the SNIPS dataset (Coucke et al., 2018), because it contains 7 domains with different label sets and makes it easy to simulate the few-shot situation. The domains are Weather (We), Music (Mu), PlayList (Pl), Book (Bo), Search Screen (Se), Restaurant (Re) and Creative Work (Cr). Information about the original dataset is shown in Appendix A.
To simulate the few-shot situation, we construct the few-shot datasets from original datasets, where each sample is the combination of a query data (x q , y q ) and corresponding K-shot support set S. Table 1 shows the overview of the experiment data.
Few-shot Data Construction Different from the simple classification of single words, slot tagging is a structural prediction problem over the entire sentence. So we construct support sets with sentences rather than single words under each tag.
As a result, the standard N-way K-shot few-shot definition is inapplicable to few-shot slot tagging. We cannot guarantee that each label appears exactly K times while sampling support sentences, because different slot labels randomly co-occur in one sentence. For example in Figure 1, in the 1-shot support set, label [B-weather] occurs twice to ensure all labels appear at least once. So we approximately construct a K-shot support set S following two criteria: (1) all labels within the domain appear at least K times in S, and (2) at least one label will appear fewer than K times in S if any (x, y) pair is removed from it. Algorithm 1 shows the detailed process. 3 Here, we take 1-shot slot tagging as an example to illustrate the data construction procedure. For each domain, we sample 100 different 1-shot support sets. Then, for each support set, we sample 20 unincluded utterances as queries (query set). Each support-query-set pair forms one few-shot episode.
Algorithm 1: Minimum-including
Input: # of shots K, domain D, label set L_D
1: Initialize support set S = {}, Count_j = 0 (∀j ∈ L_D)
2: for ℓ in L_D do
3:     while Count_ℓ < K do
4:         From D \ S, randomly sample a (x^(i), y^(i)) pair such that y^(i) includes ℓ
5:         Add (x^(i), y^(i)) to S
6:         Update all Count_j (∀j ∈ L_D)
7: Return S
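A possible Python rendering of this sampling procedure is sketched below. The pruning pass that enforces criterion (2) is our reading of the algorithm, not copied from the paper:

```python
import random
from collections import Counter

def minimum_including(domain, label_set, k, rng=None):
    """Sketch of Algorithm 1 (Minimum-including).

    domain: list of (tokens, labels) pairs; label_set: the domain's labels.
    Greedily samples until every label occurs at least k times, then prunes
    pairs whose removal would not break criterion (1), so that every
    remaining pair is necessary (criterion (2))."""
    rng = rng or random.Random()
    support, counts = [], Counter()
    for lab in label_set:
        while counts[lab] < k:
            # candidates from D \ S whose label sequence includes lab
            cands = [p for p in domain if p not in support and lab in p[1]]
            pair = rng.choice(cands)
            support.append(pair)
            counts.update(l for l in pair[1] if l in label_set)
    # prune redundant pairs: keep only pairs some label still needs
    for pair in list(support):
        c = Counter(l for l in pair[1] if l in label_set)
        if all(counts[l] - c[l] >= k for l in c):
            support.remove(pair)
            counts.subtract(c)
    return support
```

Because removals only ever decrease the counts, a single pruning pass suffices: any pair kept at its check remains necessary afterwards.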
Eventually, we get 100 episodes and 100 × 20 samples (1 query utterance with a support set) for each domain.
Evaluation To test the robustness of our framework, we cross-validate the models on different domains. Each time, we pick one target domain for testing, one domain for development, and use the remaining domains as source domains for training. So for slot tagging, all models are trained on 10,000 samples, and validated as well as tested on 2,000 samples each. When testing a model on a target domain, we evaluate F1 scores within each few-shot episode. 4 Then we average the F1 scores from all 100 episodes as the final result, to counter the randomness from support sets. All models are evaluated on the same support-query-set pairs for fairness.
To control for the nondeterminism of neural network training (Reimers and Gurevych, 2017), we report the average score over 10 random seeds.
Hyperparameters We use uncased BERT-Base (Devlin et al., 2019) to calculate contextual embeddings for all models. We use ADAM (Kingma and Ba, 2015) to train the models with batch size 4 and a learning rate of 1e-5. For the CRF framework, we learn the scaling parameter λ during training, which is important for stable results. For L-TapNet, we set α to 0.5 and β to 0.7. We fine-tune BERT with the Gradual Unfreezing trick (Howard and Ruder, 2018). For both the proposed and baseline models, we apply early stopping in training and fine-tuning when there is no loss decay within a fixed number of steps.

Baselines
Bi-LSTM is a bidirectional LSTM (Schuster and Paliwal, 1997) with GloVe (Pennington et al., 2014) embeddings for slot tagging. It is trained on the support set and tested on the query samples.
SimBERT is a model that predicts labels according to the cosine similarity of word embeddings from a non-fine-tuned BERT. For each word x_j, SimBERT finds its most similar word x_k in the support set, and the label of x_j is predicted to be the label of x_k.
TransferBERT is a domain-transfer model using the NER setting of BERT (Devlin et al., 2019). We pretrain it on the source domains and select the best model on the same dev set as our model. We deal with the label mismatch by transferring only the bottleneck features. Before testing, we fine-tune it on the target-domain support set. The learning rate is set to 1e-5 in both training and fine-tuning.
WarmProtoZero (WPZ) (Fritzler et al., 2019) is a few-shot sequence labeling model that regards sequence labeling as the classification of each single word. It pre-trains a prototypical network (Snell et al., 2017) on the source domains and uses it for word-level classification on the target domains without further training. Fritzler et al. (2019) use randomly initialized word embeddings. To eliminate the influence of different embedding methods, we further implement WPZ with the pre-trained embeddings of GloVe (Pennington et al., 2014) and BERT.
Matching Network (MN) is similar to WPZ; the only difference is that we employ the matching network (Vinyals et al., 2016) with BERT embeddings for classification.

Table 2 shows the 1-shot slot tagging results. Each column shows the F1 scores when taking a certain domain as the target domain (test) and using the others as source domains (train & dev). As shown in the table, our L-TapNet+CDT achieves the best performance, outperforming the strongest few-shot learning baseline, WPZ+BERT, by an average of 14.64 F1 points.

Results of 1-shot Setting
Our model significantly outperforms Bi-LSTM and TransferBERT, indicating that the amount of labeled data under the few-shot setting is too scarce for both conventional machine learning and transfer learning models. Moreover, the performance of SimBERT demonstrates the superiority of metric-based methods over conventional machine learning models in the few-shot setting.
The original WarmProtoZero (WPZ) model suffers from the weak representation ability of its word embeddings. When we enhance it with GloVe and BERT word embeddings, its performance improves significantly. This shows the importance of embeddings in the few-shot setting. Matching Network (MN) performs poorly in both settings. This is largely due to the fact that MN pays attention to all support words equally, which makes it vulnerable to the unbalanced amount of O labels.
More specifically, the models that are fine-tuned on the support set, such as Bi-LSTM and TransferBERT, tend to predict tags randomly. These systems can only handle cases that are easy to generalize from support examples, such as tags for proper-noun tokens (e.g., city names and times). This shows that fine-tuning on extremely limited examples leads to poor generalization ability and an undertrained classifier. For the metric-based methods, such as WPZ and MN, label prediction is much more reasonable. However, these models are easily confused by similar labels, such as current location and geographic poi. This indicates the necessity of well-separated label representations. Illegal label transitions are also very common, which can be well tackled by the proposed collapsed dependency transfer.
To eliminate unfair comparisons caused by the additional information in label names, we propose L-WPZ+CDT, which enhances the WarmProtoZero (WPZ) model with the same label name representation as L-TapNet and incorporates it into the proposed CRF framework. It combines label name embedding and prototype as each label representation. Its improvements over WPZ mainly come from label semantics, collapsed dependency transfer and pair-wise embedding. L-TapNet+CDT outperforms L-WPZ+CDT by 4.79 F1 points, demonstrating the effectiveness of embedding projection. When compared with TapNet+CDT, L-TapNet+CDT achieves an improvement of 4.54 F1 points on average, which shows that considering label semantics and prototypes helps improve the emission score calculation. Table 3 shows the results of the 5-shot experiments, which verify the proposed model's generalization ability with more shots. The results are consistent with the 1-shot setting in general trend.

Analysis
Ablation Test To gain a further understanding of each component in our method (L-TapNet+CDT), we conduct an ablation analysis in both 1-shot and 5-shot settings in Table 4. Each component of our method is removed respectively, including: collapsed dependency transfer, pair-wise embedding, label semantics, and prototype reference.
When collapsed dependency transfer is removed, we directly predict labels with the emission score, and large F1 drops are observed in all settings. This ablation demonstrates the great necessity of considering label dependency.
For our method without pair-wise embedding, we represent query and support sentences independently. We attribute the drop to the fact that support sentences can provide domain-related context, and pair-wise embedding can leverage such context to provide domain-adaptive representations for words in query sentences. This helps a lot when computing a word's similarity to domain-specific labels.
When we remove the label semantics from L-TapNet, the model degenerates into TapNet+CDT enhanced with a prototype in the emission score. The drops in results show that considering label names provides better label representations and helps to model word-label similarity. Further, we also tried removing the inner and beginning words from the label representation and observed a 0.97 F1-point drop on 1-shot SNIPS. This shows that distinguishing B-I labels in the label semantics helps tagging.
And if we calculate the emission score without the prototype reference, the model loses more performance in the 5-shot setting. This matches the intuition that prototypes allow the model to benefit more from the increase in support shots, as prototypes are directly derived from the support set.

Analysis of Collapsed Dependency Transfer
While collapsed dependency transfer (CDT) brings significant improvements, two natural questions arise: whether CDT just learns simple transition rules, and why it works. To answer the first question, we replace CDT with transition rules in Table 5, 5 which shows that CDT brings more improvement than transition rules.
To gain a deeper insight into the effectiveness of CDT, we conduct an accuracy analysis. We assess the label prediction accuracy on different types of label bi-grams; the results are shown in Table 6. We further group the bi-grams into 2 categories: Border includes the bi-grams across the border of a slot span; Inner is the bi-grams within a slot span. We argue that the improvements on Inner show a successful reduction of illegal label transitions by CDT. Interestingly, we observe that CDT also brings improvements by correctly predicting the first and last tokens of a slot span. The results on Border verify that CDT helps to decide the boundaries of slot spans more accurately, which is hard to achieve by adding transition rules.

Related Works
Traditional few-shot learning methods depend highly on hand-crafted features (Fei-Fei, 2006; Fink, 2005). Classical methods primarily focus on metric learning (Snell et al., 2017; Vinyals et al., 2016), which classifies an item according to its similarity to each class's representation. Recent efforts (Lu et al., 2018; Schwartz et al., 2019) propose to leverage the semantics of class names to enhance class representations. However, different from us, these methods focus on image classification, where the effect of name semantics is implicit and label dependency is not required.
Few-shot learning in natural language processing has been explored for classification tasks, including text classification (Geng et al., 2019; Yu et al., 2018), entity relation classification (Gao et al., 2019; Ye and Ling, 2019), and dialog act prediction (Vlasov et al., 2018). Some of these works share a similar idea with ours in using label name semantics, but have a different setting, as our few-shot methods are additionally supported by a few labeled sentences. Chen et al. (2016) investigate using label names in intent detection. In addition to learning directly from limited examples, another research line for solving the data scarcity problem in NLP is data augmentation (Fader et al., 2013; Zhang et al., 2015; Liu et al., 2017). For data augmentation of slot tagging, sentence-generation-based methods have been explored to create additional labeled samples (Hou et al., 2018).

Conclusion
In this paper, we propose a few-shot CRF framework for slot tagging in task-oriented dialogue. To compute the transition score under the few-shot setting, we propose the collapsed dependency transfer mechanism, which transfers prior knowledge of label dependencies across domains with different label sets. And we propose L-TapNet to calculate the emission score, which improves label representations with label name semantics. Experimental results validate that both collapsed dependency transfer and L-TapNet improve tagging accuracy.

Appendices

A Details of the Datasets
Experiment Data for Few-shot NER For named entity recognition, we utilize 4 different datasets: CoNLL-2003 (Sang and Meulder, 2003), GUM (Zeldes, 2017), WNUT-2017 (Derczynski et al., 2017) and OntoNotes (Pradhan et al., 2013), each of which contains data from only 1 domain. The 4 domains are News, Wiki, Social and Mixed. Details of the original datasets are shown in Table 7 and statistics of the constructed few-shot data are shown in Table 8.

1-shot and 5-shot Results for NER Table 9 and Table 10 respectively show the 1-shot and 5-shot named entity recognition results. Our best model outperforms all baselines in both settings. The trend of the results is consistent with the slot tagging results, but the overall scores are much lower; this is because the NER domains come from different datasets and the domain gap is much larger.
Our improvement margin narrows in the 5-shot setting. This is because the NER domains have different genres and vocabularies, so compared to SNIPS, it is harder to transfer knowledge but more beneficial to rely on domain-specific support examples. This trend is even more pronounced with more shots. In the 5-shot setting, the strongest baseline, WPZ, benefits more from the increased shots because it only uses the support set for prediction, while the benefit of more shots is weaker for our model because it relies more on prior knowledge.
Ablation Analysis on NER We investigate the effectiveness of collapsed dependency transfer and label semantics on the NER task. We perform ablations on the two proposed components and observe performance drops in both 1-shot and 5-shot settings, which demonstrates the generalization ability of the two proposed mechanisms.

Table 12 and Table 13 show the complete results with standard deviations for the slot tagging task.