Coach: A Coarse-to-Fine Approach for Cross-domain Slot Filling

As an essential task in task-oriented dialog systems, slot filling requires extensive training data in a certain domain. However, such data are not always available. Hence, cross-domain slot filling has naturally arisen to cope with this data scarcity problem. In this paper, we propose a Coarse-to-fine approach (Coach) for cross-domain slot filling. Our model first learns the general pattern of slot entities by detecting whether the tokens are slot entities or not. It then predicts the specific types for the slot entities. In addition, we propose a template regularization approach to improve the adaptation robustness by regularizing the representation of utterances based on utterance templates. Experimental results show that our model significantly outperforms state-of-the-art approaches in slot filling. Furthermore, our model can also be applied to the cross-domain named entity recognition task, and it achieves better adaptation performance than other existing baselines. The code is available at https://github.com/zliucr/coach.


Introduction
Slot filling models identify task-related slot types in certain domains for user utterances, and are an indispensable part of task-oriented dialog systems. Supervised approaches have made great achievements in the slot filling task (Goo et al., 2018), where substantial labeled training samples are needed. However, collecting large numbers of training samples is not only expensive but also time-consuming. To cope with the data scarcity issue, we are motivated to investigate cross-domain slot filling methods, which leverage knowledge learned in the source domains and adapt the models to the target domain with a minimum number of target domain labeled training samples.
A challenge in cross-domain slot filling is to handle unseen slot types, which prevents general classification models from adapting to the target domain without any target domain supervision signals. Recently, Bapna et al. (2017) proposed a cross-domain slot filling framework, which enables zero-shot adaptation. As illustrated in Figure 1a, their model conducts slot filling individually for each slot type. It first generates word-level representations, which are then concatenated with the representation of each slot type description, and the predictions are based on the concatenated features for each slot type. Due to the inherent variance of slot entities across different domains, it is difficult for this framework to capture the whole slot entity (e.g., "latin dance cardio" in Figure 1a) in the target domain. There also exists a multiple prediction problem. For example, "tune" in Figure 1a could be predicted as "B" for both "music item" and "playlist", which would cause additional trouble for the final prediction. We emphasize that in order to capture the whole slot entity, it is pivotal for the model to share its parameters for all slot types in the source domains and learn the general pattern of slot entities. Therefore, as depicted in Figure 1b, we propose Coach, a coarse-to-fine approach. It first coarsely learns the slot entity pattern by predicting whether the tokens are slot entities or not. Then, it combines the features for each slot entity and predicts the specific (fine) slot type based on the similarity with the representation of each slot type description. In this way, our framework is able to avoid the multiple prediction problem. Additionally, we introduce a template regularization method that delexicalizes slot entity tokens in utterances into different slot labels and produces both correct and incorrect utterance templates to regularize the utterance representations. By doing so, the model learns to cluster the representations of semantically similar utterances (i.e., in the same or similar templates) into a similar vector space, which further improves the adaptation robustness. Experimental results show that our model surpasses the state-of-the-art methods by a large margin in both zero-shot and few-shot scenarios. In addition, further experiments show that our framework can be applied to cross-domain named entity recognition, and achieves better adaptation performance than other existing frameworks.

Figure 1: (b) Our proposed framework, Coach.

Figure 2: Illustration of our framework, Coach, and the template regularization approach.

Related Work
Coarse-to-fine methods in NLP are best known for syntactic parsing (Charniak et al., 2006; Petrov, 2011). Zhang et al. (2017) reduced the search space of semantic parsers by using coarse macro grammars. Different from the previous work, we apply the coarse-to-fine idea to cross-domain slot filling to handle unseen slot types by separating the slot filling task into two steps (Zhai et al., 2017; Guerini et al., 2018).
Coping with low-resource problems where there are zero or few existing training samples has always been an interesting and challenging task (Kingma et al., 2014; Lample et al., 2018; Liu et al., 2019a,b). Cross-domain adaptation addresses the data scarcity problem in low-resource target domains (Pan et al., 2010; Jaech et al., 2016; Guo et al., 2018; Jia et al., 2019). However, most research studying the cross-domain aspect has not focused on predicting unseen label types in the target domain since both source and target domains have the same label types in the considered tasks (Guo et al., 2018). In another line of work, to bypass unseen label types, Ruder and Plank (2018)

Coach Framework
As depicted in Figure 2, the slot filling process in our Coach framework consists of two steps. In the first step, we utilize a BiLSTM-CRF structure (Lample et al., 2016) to learn the general pattern of slot entities by having our model predict whether tokens are slot entities or not (i.e., 3-way classification for each token). In the second step, our model further predicts a specific type for each slot entity based on the similarities with the description representations of all possible slot types. To generate representations of slot entities, we leverage another encoder, BiLSTM (Hochreiter and Schmidhuber, 1997), to encode the hidden states of slot entity tokens and produce representations for each slot entity.
We represent the user utterance with $n$ tokens as $w = [w_1, w_2, ..., w_n]$, and $E$ denotes the embedding layer for utterances. The whole process can be formulated as follows:

$$[h_1, h_2, ..., h_n] = \mathrm{BiLSTM}(E(w)), \tag{1}$$
$$[p_1, p_2, ..., p_n] = \mathrm{CRF}([h_1, h_2, ..., h_n]), \tag{2}$$

where $[p_1, p_2, ..., p_n]$ are the logits for the 3-way classification. Then, for each slot entity, we take its hidden states to calculate its representation:

$$r_k = \mathrm{BiLSTM}([h_i, h_{i+1}, ..., h_j]), \tag{3}$$
$$s_k = M^{desc} \cdot r_k, \tag{4}$$

where $r_k$ denotes the representation of the $k$-th slot entity, $[h_i, h_{i+1}, ..., h_j]$ denotes the BiLSTM hidden states for the $k$-th slot entity, $M^{desc} \in R^{n_s \times d_s}$ is the representation matrix of the slot description ($n_s$ is the number of possible slot types and $d_s$ is the dimension of slot descriptions), and $s_k$ is the specific slot type prediction for this $k$-th slot entity. We obtain the slot description representation $r^{desc} \in R^{d_s}$ by summing the embeddings of the $N$ slot description tokens (similar to Shah et al. (2019)):

$$r^{desc} = \sum_{i=1}^{N} E(t_i), \tag{5}$$

where $t_i$ is the $i$-th token and $E$ is the same embedding layer as that for utterances.
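The second step can be illustrated with a minimal pure-Python sketch. It assumes the first step has already produced B/I/O tags and per-token hidden states; all vectors, names, and slot descriptions below are toy values invented for illustration. To stay dependency-free, the sketch sums the hidden states of an entity instead of running a BiLSTM over them, so it is a simplified stand-in for the entity encoder rather than the framework's actual implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def extract_entity_spans(tags: Sequence[str]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) spans of entities in a B/I/O tag sequence."""
    spans: List[Tuple[int, int]] = []
    start = None
    for idx, tag in enumerate(tags):
        # A "B" always opens a span; an "I" with no open span is tolerated.
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.append((start, idx - 1))
            start = idx
        elif tag == "O" and start is not None:
            spans.append((start, idx - 1))
            start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


def sum_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise sum, used for slot descriptions and (here) for entities."""
    return [sum(column) for column in zip(*vectors)]


def score_slot_types(entity_repr: Vector, desc_matrix: Sequence[Vector]) -> Vector:
    """One dot-product score per slot type (one row of the description matrix)."""
    return [sum(m * r for m, r in zip(row, entity_repr)) for row in desc_matrix]


# Toy embeddings and slot descriptions (hypothetical values).
embedding: Dict[str, Vector] = {
    "music": [0.8, 0.1], "item": [0.2, -0.1], "playlist": [0.0, 1.0],
}
slot_descriptions = {"music_item": ["music", "item"], "playlist": ["playlist"]}
slot_names = list(slot_descriptions)
desc_matrix = [
    sum_vectors([embedding[tok] for tok in toks])
    for toks in slot_descriptions.values()
]

# Assumed output of step one for "add this tune to latin dance cardio".
tags = ["O", "O", "B", "O", "B", "I", "I"]
hidden_states = [
    [0.0, 0.0], [0.0, 0.0], [0.9, 0.1], [0.0, 0.0],
    [0.1, 0.5], [0.0, 0.7], [0.2, 0.6],
]

predictions = []
for start, end in extract_entity_spans(tags):
    entity_repr = sum_vectors(hidden_states[start:end + 1])
    scores = score_slot_types(entity_repr, desc_matrix)
    best = max(range(len(scores)), key=scores.__getitem__)
    predictions.append(((start, end), slot_names[best]))
```

Because each entity span receives exactly one label (the highest-scoring slot type), a token cannot be claimed by two slot types at once, which is how the two-step design sidesteps the multiple prediction problem.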

Template Regularization
In many cases, similar or the same slot types in the target domain can also be found in the source domains. Nevertheless, it is still challenging for the model to recognize the slot types in the target domain owing to the variance between the source domains and the target domain. To improve the adaptation ability, we introduce a template regularization method. As shown in Figure 2, we first replace the slot entity tokens in the utterance with different slot labels to generate correct and incorrect utterance templates. Then, we use BiLSTM and an attention layer (Felbo et al., 2017) to generate the utterance and template representations:

$$e_t = h_t w_a, \quad \alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{n} \exp(e_j)}, \quad R = \sum_{t=1}^{n} \alpha_t h_t, \tag{6}$$

where $h_t$ is the BiLSTM hidden state in the $t$-th step, $w_a$ is the weight vector in the attention layer and $R$ is the representation for the input utterance or template.
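The attention layer that turns a sequence of hidden states into a single utterance or template representation can be sketched as below. This is a minimal pure-Python illustration of softmax-weighted pooling with a single weight vector; the function name `attention_pool` and the toy inputs are assumptions made for the example, not part of the released code.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def attention_pool(hidden_states: Sequence[Vector],
                   w_a: Vector) -> Tuple[Vector, List[float]]:
    """Pool hidden states into one vector using softmax attention weights."""
    # Unnormalized score per time step: dot product of h_t with w_a.
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w_a)) for h in hidden_states]
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    alphas = [e / total for e in exp_scores]
    # Weighted sum of the hidden states.
    dim = len(hidden_states[0])
    pooled = [
        sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)
    ]
    return pooled, alphas


# Toy example: with a zero weight vector, attention reduces to a plain average.
toy_states = [[1.0, 0.0], [3.0, 2.0]]
uniform_repr, uniform_alphas = attention_pool(toy_states, [0.0, 0.0])
# A weight vector that favours the second dimension shifts mass to step two.
skewed_repr, skewed_alphas = attention_pool(toy_states, [0.0, 1.0])
```

The same pooling would be applied to both the utterance encoder and the template encoder, each with its own parameters.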
We minimize the regularization loss functions for the right and wrong templates, which can be formulated as follows:

$$L^r = \mathrm{MSE}(R^u, R^r), \tag{7}$$
$$L^w = -\beta \times \mathrm{MSE}(R^u, R^w), \tag{8}$$

where $R^u$ is the representation for the user utterance, $R^r$ and $R^w$ are the representations of right and wrong templates, we set $\beta$ to one, and MSE denotes mean squared error. Hence, in the training phase, we minimize the distance between $R^u$ and $R^r$ and maximize the distance between $R^u$ and $R^w$. To generate a wrong template, we replace the correct slot entity with another random slot entity, and we generate two wrong templates for each utterance. To ensure the representations of the templates are meaningful (i.e., similar templates have similar representations) for training $R^u$, in the first several epochs, the regularization loss is used only to optimize the template representations, and in the following epochs, we optimize both template representations and utterance representations. By doing so, the model learns to cluster the representations in the same or similar templates into a similar vector space. Hence, the hidden states of tokens that belong to the same slot type tend to be similar, which boosts the robustness of these slot types in the target domain.
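A small sketch may help make template generation and the two loss terms concrete. It assumes utterances are annotated with typed BIO tags such as `B-playlist`. It replaces every slot label in the template with a different random label (following the incorrect template shown in Figure 2), and it sums the loss over the wrong templates. Those last two choices, the helper names, and the toy slot vocabulary are illustrative assumptions rather than details confirmed by the text.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def delexicalize(tokens: Sequence[str],
                 bio_tags: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Replace each slot entity by its slot label; return template and slot positions."""
    template: List[str] = []
    slot_positions: List[int] = []
    for token, tag in zip(tokens, bio_tags):
        if tag == "O":
            template.append(token)
        elif tag.startswith("B-"):
            slot_positions.append(len(template))
            template.append(tag[2:])
        # "I-" tokens are dropped: the whole entity collapses into one label.
    return template, slot_positions


def make_wrong_template(template: Sequence[str], slot_positions: Sequence[int],
                        slot_vocab: Sequence[str], rng: random.Random) -> List[str]:
    """Swap every slot label for a different, randomly chosen label."""
    wrong = list(template)
    for pos in slot_positions:
        candidates = [s for s in slot_vocab if s != template[pos]]
        wrong[pos] = rng.choice(candidates)
    return wrong


def mse(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def regularization_losses(r_u: Vector, r_r: Vector, r_ws: Sequence[Vector],
                          beta: float = 1.0) -> Tuple[float, float]:
    """Pull the utterance towards the right template, push it from wrong ones."""
    loss_right = mse(r_u, r_r)
    loss_wrong = -beta * sum(mse(r_u, r_w) for r_w in r_ws)
    return loss_right, loss_wrong


tokens = "can you put this tune onto latin dance cardio".split()
bio_tags = ["O", "O", "O", "O", "B-music_item", "O",
            "B-playlist", "I-playlist", "I-playlist"]
slot_vocab = ["music_item", "playlist", "object_name", "city"]  # toy vocabulary

template, slot_positions = delexicalize(tokens, bio_tags)
rng = random.Random(0)
wrong_templates = [
    make_wrong_template(template, slot_positions, slot_vocab, rng)
    for _ in range(2)  # two wrong templates per utterance
]
```

In training, the representations passed to `regularization_losses` would come from the utterance and template encoders; the loss on wrong templates is negative because the distance to them is being maximized.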

Dataset
We evaluate our framework on SNIPS (Coucke et al., 2018), a public spoken language understanding dataset that contains 39 slot types across seven domains (intents) and ∼2000 training samples per domain. To test our framework, each time, we choose one domain as the target domain and the other six domains as the source domains. Moreover, we also study another adaptation case where there is no unseen label in the target domain. We utilize the CoNLL-2003 English named entity recognition (NER) dataset as the source domain (Tjong Kim Sang and De Meulder, 2003), and the CBS SciTech News NER dataset from Jia et al. (2019) as the target domain. These two datasets have the same four types of entities, namely, PER (person), LOC (location), ORG (organization), and MISC (miscellaneous).
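The leave-one-domain-out protocol on SNIPS can be sketched as follows. The domain names are the seven SNIPS intent names as the dataset is commonly distributed (only "RateBook" and "SearchCreativeWork" are named in this paper), and the helper `leave_one_domain_out` is a hypothetical name introduced for this example.

```python
from typing import Iterator, List, Sequence, Tuple

# The seven SNIPS intents, each treated as one domain.
SNIPS_DOMAINS = [
    "AddToPlaylist", "BookRestaurant", "GetWeather", "PlayMusic",
    "RateBook", "SearchCreativeWork", "SearchScreeningEvent",
]


def leave_one_domain_out(
    domains: Sequence[str],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (target, sources) pairs: each domain is the target exactly once."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield target, sources


# One experiment per target domain, trained on the remaining six.
experiments = list(leave_one_domain_out(SNIPS_DOMAINS))
```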

Baselines
We use word-level (Bojanowski et al., 2017) and character-level (Hashimoto et al., 2017) embeddings for our model as well as all the following baselines.
Concept Tagger (CT) Bapna et al. (2017) proposed a slot filling framework that utilizes slot descriptions to cope with the unseen slot types in the target domain.
Robust Zero-shot Tagger (RZT) Based on CT, Shah et al. (2019) leveraged example values of slots to improve robustness of cross-domain adaptation.
BiLSTM-CRF This baseline is only for the cross-domain NER. Since there is no unseen label in the NER target domain, the BiLSTM-CRF (Lample et al., 2016) uses the same label set for the source and target domains and casts it as an entity classification task for each token, which is applicable in both zero-shot and few-shot scenarios.

Training Details
We use a 2-layer BiLSTM with a hidden size of 200 and a dropout rate of 0.3 for both the template encoder and utterance encoder. Note that the parameters in these two encoders are not shared. The BiLSTM for encoding the hidden states of slot entity tokens has one layer with a hidden size of 200, so that its output has the same dimension as the concatenated word-level and char-level embeddings. We use the Adam optimizer with a learning rate of 0.0005. Cross-entropy loss is leveraged to train the 3-way classification in the first step, and the specific slot type predictions are used in the second step. We split 500 data samples in the target domain as the validation set for choosing the best model, and the remainder are used as the test set. We re-implement the CT and RZT models and use the same settings as for our model for a fair comparison.

Cross-domain Slot Filling
Quantitative Analysis As illustrated in Table 1, our models achieve significantly better performance than the current state-of-the-art approach (RZT). The CT framework suffers from the difficulty of capturing the whole slot entity, while our framework is able to recognize the slot entity tokens by sharing its parameters across all slot types. Because RZT is built on the CT framework, its performance is still limited, and Coach outperforms RZT by ∼3% F1-score in the zero-shot setting. Additionally, template regularization further improves the adaptation robustness by helping the model cluster the utterance representations into a similar vector space based on their corresponding template representations.
Interestingly, our models achieve impressive performance in the few-shot scenario. In terms of the averaged performance, our best model (Coach+TR) outperforms RZT by ∼8% and ∼9% F1-scores on the 20-shot and 50-shot settings, respectively. We conjecture that our model is able to better recognize the whole slot entity in the target domain and map the representation of the slot entity belonging to the same slot type into a similar vector space to the representation of this slot type based on Eq. (4). This enables the model to quickly adapt to the target domain slots.

Analysis on Seen and Unseen Slots
We take a further step to test the models on seen and unseen slots in target domains to analyze the effectiveness of our approaches. To test the performance, we split the test set into "unseen" and "seen" parts. An utterance is categorized into the "unseen" part as long as there is an unseen slot (i.e., the slot does not exist in the remaining six source domains) in it. Otherwise we categorize it into the "seen" part. The results for the "seen" and "unseen" categories are shown in Table 2. We observe that our approaches generally improve on both unseen and seen slot types compared to the baseline models. For the improvements in the unseen slots, our models are better able to capture the unseen slots since they explicitly learn the general pattern of slot entities. Interestingly, our models also bring large improvements in the seen slot types. We conjecture that it is also challenging to adapt models to seen slots due to the large variance between the source and target domains. For example, slot entities belonging to the "object type" in the "RateBook" domain are different from those in the "SearchCreativeWork" domain. Hence, the baseline models might fail to recognize these seen slots in the target domain, while our approaches can adapt to the seen slot types more quickly in comparison. In addition, we observe that template regularization improves performance in both seen and unseen slots, which illustrates that clustering representations based on templates can boost the adaptation ability.
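The seen/unseen partition of the test set described above can be sketched as below. It assumes each test example is a pair of tokens and typed BIO tags; the helper names and the toy examples are hypothetical. Note that under this rule an utterance without any slot falls into the "seen" part.

```python
from typing import List, Sequence, Set, Tuple

Example = Tuple[List[str], List[str]]  # (tokens, typed BIO tags)


def slot_types_in(tags: Sequence[str]) -> Set[str]:
    """Collect slot types from tags such as 'B-playlist' or 'I-playlist'."""
    return {tag[2:] for tag in tags if tag != "O"}


def split_seen_unseen(
    test_set: Sequence[Example], source_slot_types: Set[str]
) -> Tuple[List[Example], List[Example]]:
    """An utterance is 'unseen' if at least one of its slots is absent from all sources."""
    seen: List[Example] = []
    unseen: List[Example] = []
    for tokens, tags in test_set:
        if slot_types_in(tags) - source_slot_types:
            unseen.append((tokens, tags))
        else:
            seen.append((tokens, tags))
    return seen, unseen


# Toy data: "object_type" occurs in a source domain, "rating_value" does not.
source_slot_types = {"object_type", "city"}
test_set = [
    (["find", "the", "novel"], ["O", "O", "B-object_type"]),
    (["rate", "this", "novel", "five"],
     ["O", "O", "B-object_type", "B-rating_value"]),
]
seen_part, unseen_part = split_seen_unseen(test_set, source_slot_types)
```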

Cross-domain NER
From Table 3, we see that the Coach framework is also suitable for the case where there are no unseen labels in the target domain in both the zero-shot and few-shot scenarios, while CT and RZT are not as effective as BiLSTM-CRF. However, we observe that template regularization loses its effectiveness in this task, since the text in NER is relatively more open, which makes it hard to capture the templates for each label type.

Ablation Study
We conduct an ablation study in terms of the methods to encode the entity tokens (described in Eq. (3)) to investigate how they affect the performance. Instead of using BiLSTM, we try two alternatives. One is to use the encoder of Transformer (trs) (Vaswani et al., 2017), and the other is to simply sum the hidden states of slot entity tokens. From Table 4, we can see that there is no significant performance difference among different methods, and we observe that using BiLSTM to encode the entity tokens generally achieves better results.

Conclusion
We introduce a new cross-domain slot filling framework to handle the unseen slot type issue. Our model shares its parameters across all slot types and learns to predict whether input tokens are slot entities or not. Then, it detects concrete slot types for these slot entity tokens based on the slot type descriptions. Moreover, template regularization is proposed to improve the adaptation robustness further. Experiments show that our model significantly outperforms existing cross-domain slot filling approaches, and it also achieves better performance for the cross-domain NER task, where there is no unseen label type in the target domain.