Robust Zero-Shot Cross-Domain Slot Filling with Example Values

Task-oriented dialog systems increasingly rely on deep learning-based slot filling models, usually needing extensive labeled training data for target domains. Often, however, little to no target domain training data may be available, or the training and target domain schemas may be misaligned, as is common for web forms on similar websites. Prior zero-shot slot filling models use slot descriptions to learn concepts, but are not robust to misaligned schemas. We propose utilizing both the slot description and a small number of examples of slot values, which may be easily available, to learn semantic representations of slots which are transferable across domains and robust to misaligned schemas. Our approach outperforms state-of-the-art models on two multi-domain datasets, especially in the low-data setting.


Introduction
Goal-oriented dialog systems assist users with tasks such as finding flights, booking restaurants and, more recently, navigating user interfaces, through natural language interactions. Slot filling models, which identify task-specific parameters/slots (e.g. flight date, cuisine) from user utterances, are key to the underlying spoken language understanding (SLU) systems. Advances in SLU have enabled virtual assistants such as Siri, Alexa and Google Assistant. There is also significant interest in adding third-party functionality to these assistants. However, supervised slot fillers (Young, 2002; Bellegarda, 2014) require abundant labeled training data, all the more so since deep learning improves accuracy at the cost of being data intensive (Mesnil et al., 2015; Kurata et al., 2016).
Two key challenges with scaling slot fillers to new domains are adaptation and misaligned schemas (here, slot name mismatches). The extent of supervision may vary across domains: there may be ample data for Flights but none for Hotels, requiring models to leverage the former to learn the semantics of reusable slots (e.g. time, destination). In addition, schemas for overlapping domains may be incompatible, using different names for the same slot or the same name for different slots. This is common with web form filling: two sites in the same domain may have misaligned schemas, as in Figure 1, precluding approaches that rely on schema alignment.
Zero-shot slot filling typically either relies on slot names to bootstrap to new slots, which may be insufficient for cases like the one in Figure 1, or uses hard-to-build domain ontologies/gazetteers. We counter this by supplying a small number of example values, in addition to the slot description, to condition the slot filler. This avoids negative transfer from misaligned schemas and further helps identify unseen slots while retaining cross-domain transfer ability. Moreover, example values for slots can either be crawled easily from existing web forms or specified along with the slots, with little overhead.
Given as few as 2 example values per slot, our model surpasses prior work in the zero/few-shot setting on the SNIPS dataset by an absolute 2.9% slot F1, and is robust to misaligned schemas, as experiments on another multi-domain dataset show.

Related Work
Settings with resource-poor domains are typically addressed by adapting from resource-rich domains (Blitzer et al., 2006; Pan et al., 2010; Chen et al., 2018; Guo et al., 2018; Shah et al., 2018). To this end, approaches such as domain adversarial learning (Liu and Lane, 2017) and multi-task learning (Jaech et al., 2016; Goyal et al., 2018; Siddhant et al., 2018) have been adapted to SLU and related tasks (Henderson et al., 2014). Work targeting domain adaptation specifically for this area includes modeling slots as hierarchical concepts (Zhu and Yu, 2018) and using ensembles of models trained on data-rich domains (Gašić et al., 2015; Kim et al., 2017; Jha et al., 2018).
The availability of task descriptions has made zero-shot learning (Norouzi et al., 2013; Socher et al., 2013) popular. In particular, work on zero-shot utterance intent detection has relied on varied resources such as click logs (Dauphin et al., 2013) and manually defined domain ontologies (Kumar et al., 2017), as well as models such as deep structured semantic models (Chen et al., 2016) and capsule networks (Xia et al., 2018). Zero-shot semantic parsing is addressed in Krishnamurthy et al. (2017) and Herzig and Berant (2018) and, specifically for SLU, using external resources such as label ontologies in Ferreira et al. (2015a,b) and handwritten intent attributes in Yazdani and Henderson (2015) and Chen et al. (2015). Our work is closest in spirit to Bapna et al. (2017) and Lee and Jha (2018), who employ textual slot descriptions to scale to unseen intents/slots. Since slots tend to take semantically similar values across utterances, we augment our model with example values, which are easier for developers to define than manual alignments across schemas (Li et al., 2011).

Problem Statement
We frame our conditional sequence tagging task as follows: given a user utterance with $T$ tokens and a slot type, we predict inside-outside-begin (IOB) tags $\{y_1, y_2, \ldots, y_T\}$ using 3-way classification per token, based on whether and where the provided slot type occurs in the utterance. Figure 3 shows IOB tag sequences for one positive instance (slot service, present in the utterance) and one negative instance (slot timeRange, not present in the utterance).
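
The tagging scheme can be illustrated with a minimal sketch. The utterance, slot names and token span below are made up for illustration and are not taken from Figure 3; the helper name `iob_tags` is likewise hypothetical.

```python
from typing import List, Optional, Tuple


def iob_tags(num_tokens: int, span: Optional[Tuple[int, int]]) -> List[str]:
    """Return per-token IOB tags for one (utterance, slot) pair.

    `span` is a half-open (start, end) token range covering the slot value,
    or None when the slot does not occur in the utterance (negative instance).
    """
    tags = ["O"] * num_tokens
    if span is not None:
        start, end = span
        tags[start] = "B"              # first token of the slot value
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the slot value
    return tags


# Illustrative utterance (an assumption, not from the paper's data).
utterance = "play some jazz on spotify".split()

# Positive instance: the queried slot (e.g. "service") covers "spotify".
positive = iob_tags(len(utterance), (4, 5))   # ['O', 'O', 'O', 'O', 'B']

# Negative instance: the queried slot (e.g. "timeRange") is absent.
negative = iob_tags(len(utterance), None)     # ['O', 'O', 'O', 'O', 'O']
```

The same utterance therefore yields one tag sequence per queried slot, which is what makes the tagger conditional on the slot input.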

Model Architecture
Figure 2 illustrates our model architecture where a user utterance is tagged for a provided slot. To represent the input slot, along with a textual slot description as in Bapna et al. (2017), we supply a small set of example values for this slot, to provide a more complete semantic representation. Detailed descriptions of each component follow.

Inputs: We use as input $d_{wc}$-dimensional embeddings for 3 input types: $T$ user utterance tokens $\{u_i \in \mathbb{R}^{d_{wc}}, 1 \le i \le T\}$, $S$ input slot description tokens $\{d_i \in \mathbb{R}^{d_{wc}}, 1 \le i \le S\}$, and $K$ example values for the slot, with the $N_k$ token embeddings for the $k$-th example denoted by $\{e^k_i \in \mathbb{R}^{d_{wc}}, 1 \le i \le N_k\}$.
Utterance encoder: We encode the user utterance using a $d_{en}$-dimensional bidirectional GRU recurrent neural network.
Slot description encoder: We obtain an encoding $d^s \in \mathbb{R}^{d_{wc}}$ of the slot description by mean-pooling the embeddings for the $S$ slot description tokens.
Slot example encoder: We first obtain encodings $\{e^x_k \in \mathbb{R}^{d_{wc}}, 1 \le k \le K\}$ for each slot example value by mean-pooling the $N_k$ token embeddings. Then, we compute an attention-weighted encoding of all $K$ slot examples $\{e^a_i \in \mathbb{R}^{d_{wc}}, 1 \le i \le T\}$ for each utterance token, with the utterance token encoding as attention context. Here, $\alpha^x_i \in \mathbb{R}^K$ denotes attention weights over all $K$ slot examples corresponding to the $i$-th utterance token, obtained with general cosine similarity (Luong et al., 2015).
Tagger: We feed the concatenated utterance, slot description and example encodings to a $d_{en}$-dimensional bidirectional LSTM. The output hidden states $X = \{x_i \in \mathbb{R}^{d_{en}}, 1 \le i \le T\}$ are used for a 3-way IOB tag classification per token.
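
The slot example encoder can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the score is taken to be the bilinear "general" form of Luong et al. (2015), the weight matrix `W` and the softmax normalization are assumptions, and all function names and toy vectors are hypothetical rather than taken from the authors' implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length token embeddings into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    return [dot(row, vec) for row in matrix]


def attend_examples(h_i: Vector, examples: List[Vector],
                    W: Matrix) -> Tuple[List[float], Vector]:
    """Attention over K pooled example encodings for one utterance token.

    h_i:      utterance token encoding (length d_en), the attention context
    examples: K pooled example encodings (each of length d_wc)
    W:        assumed trainable d_en x d_wc matrix for the bilinear score
    """
    # Assumed score for example k: h_i^T W e_k.
    scores = [dot(h_i, matvec(W, e_k)) for e_k in examples]
    alpha = softmax(scores)
    # Attention-weighted sum of the example encodings.
    dim = len(examples[0])
    pooled = [sum(a * e[j] for a, e in zip(alpha, examples))
              for j in range(dim)]
    return alpha, pooled


# Toy usage with 2-dimensional vectors (values are made up).
example_a = mean_pool([[1.0, 0.0], [1.0, 0.0]])   # pooled tokens of example 1
example_b = mean_pool([[0.0, 1.0]])               # pooled tokens of example 2
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, encoding = attend_examples([1.0, 0.0], [example_a, example_b], identity)
```

In this toy case the utterance token aligns with the first example, so that example receives the larger attention weight and dominates the resulting encoding.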
Parameters: We use fixed pretrained word embeddings of dimension $d_w = 128$ for all tokens. We also train per-character embeddings, fed to a 2-layer convolutional neural network (Kim, 2014) to get a token embedding of dimension $d_c = 32$. For all inputs, the final embedding of dimension $d_{wc} = 160$ is the concatenation of the word and char-CNN embeddings. The RNN encoders have hidden state size $d_{en} = 128$.
All trainable weights are shared across intents and slots. The model relies largely on fixed word embeddings to generalize to new intents/slots.

Datasets and Experiments
In this section we describe the datasets used for evaluation, baselines compared against, and more details on the experimental setup.
Datasets: In order to evaluate cross-domain transfer learning ability and robustness to misaligned schemas, respectively, we use the following two SLU datasets to evaluate all models.
• SNIPS: This is a public SLU dataset (Coucke et al., 2018) of crowdsourced user utterances with 39 slots across 7 intents and ∼2000 training instances per intent. Since 11 of these slots are shared (see Table 1), we use this dataset to evaluate cross-domain transfer learning.
• XSchema: This is an in-house crowdsourced dataset with 3 intents (500 training instances each). Training and evaluation utterances are annotated with different schemas (Table 1) from real web forms to simulate misaligned schemas.
Baselines: We compare with two strong zero-shot baselines: Zero-shot Adaptive Transfer (ZAT) (Lee and Jha, 2018) and Concept Tagger (CT) (Bapna et al., 2017).
We sample positive and negative instances (Figure 3) in a ratio of 1:3. Slot values input during training and evaluation are randomly picked from values taken by the input slot in the relevant domain's training set, excluding ones that are also present in the evaluation set. In practice, it is usually easy to obtain such example values for each slot, either using automated methods (such as crawling from existing web forms) or by having them provided as part of the slot definition, with negligible extra effort.
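
The selection of example values described above can be sketched as follows. The function name `pick_example_values`, the fixed seed and the toy values are assumptions for illustration; only the rule itself (random picks from training-set values, excluding values seen in the evaluation set) comes from the text.

```python
import random
from typing import Iterable, List


def pick_example_values(train_values: Iterable[str],
                        eval_values: Iterable[str],
                        k: int,
                        rng: random.Random) -> List[str]:
    """Randomly pick up to k example values for a slot.

    Candidates are values the slot takes in the training set, minus any
    value that also appears in the evaluation set.
    """
    # Sort so that sampling is reproducible for a given seed.
    candidates = sorted(set(train_values) - set(eval_values))
    return rng.sample(candidates, min(k, len(candidates)))


# Toy usage (values are made up).
rng = random.Random(0)
picked = pick_example_values(
    train_values=["jazz", "rock", "pop", "blues"],
    eval_values=["pop"],
    k=2,
    rng=rng,
)
```

Excluding evaluation values keeps the tagger from succeeding through exact string matches against its own inputs.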
To improve performance on out-of-vocabulary entity names, we randomly replace slot value tokens in utterances and slot examples with a special token, and raise the replacement rate linearly from 0 to 0.3 during training (Rastogi et al., 2018).
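
A minimal sketch of this slot replacement scheme is given below. It assumes the rate ramps linearly over the whole training run and that replacement is applied independently per slot value token; the placeholder string and the function names are hypothetical.

```python
import random
from typing import List


def replacement_rate(step: int, total_steps: int,
                     max_rate: float = 0.3) -> float:
    """Linearly increase the replacement rate from 0 to max_rate."""
    return max_rate * min(step, total_steps) / total_steps


def replace_slot_tokens(tokens: List[str], tags: List[str], rate: float,
                        rng: random.Random,
                        placeholder: str = "<SLOT_UNK>") -> List[str]:
    """Replace each slot value token (tag B or I) with a special token
    with probability `rate`; tokens tagged O are left untouched."""
    return [
        placeholder if tag != "O" and rng.random() < rate else token
        for token, tag in zip(tokens, tags)
    ]


# Toy usage: halfway through a 150K-step run (utterance is made up).
rng = random.Random(0)
tokens = "book a table at luigi s bistro".split()
tags = ["O", "O", "O", "O", "B", "I", "I"]
rate = replacement_rate(step=75_000, total_steps=150_000)
augmented = replace_slot_tokens(tokens, tags, rate, rng)
```

Because only tokens inside slot values are replaced, the surrounding utterance pattern is preserved, which is presumably what lets the model rely on context for unseen entity names.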
The final cross-entropy loss, averaged over all utterance tokens, is optimized using ADAM (Kingma and Ba, 2014) for 150K training steps.  Slot F1 score (Sang and Buchholz, 2000) is our final metric, reported after 3-fold cross-validation.
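
For reference, span-level slot F1 over single-slot IOB sequences can be sketched as below. This is a simplified illustration rather than the evaluation script used for the reported numbers: it scores one slot per sequence, counts a prediction as correct only on an exact span match, and treats a stray I tag as opening a span, which is an assumption.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return half-open (start, end) spans encoded by an IOB tag sequence."""
    spans = set()
    start = None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag != "I" and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
        elif tag == "I" and start is None:
            start = i  # assumption: a stray I opens a new span
    return spans


def slot_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching spans."""
    true_pos = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy usage: one span found exactly, one gold span missed.
gold = [["O", "B", "I", "O"], ["B", "O"]]
pred = [["O", "B", "I", "O"], ["O", "O"]]
score = slot_f1(gold, pred)   # precision 1.0, recall 0.5
```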

Results
For the SNIPS dataset, Table 2 shows slot F1 scores for our model trained with randomly picked slot value examples in addition to slot descriptions, compared with the baselines. Our best model consistently outperforms the zero-shot baselines CT and ZAT, which use only slot descriptions, both overall and individually for 5 of 7 intents. The average gain over CT and ZAT is ∼3% in the zero-shot case.
In the low-data setting, all zero-shot models gain ≥5% over the multi-domain LSTM baseline, with our model using 10 example values gaining a further ∼2% over CT/ZAT. All models are comparable when all target data is used for training, with F1 scores of 87.8% for the LSTM, and 86.9% and 87.2% for CT and our model with 10 examples respectively.
Table 3 shows slot F1 scores for XSchema data. Our model trained with 10 example values is robust to varying schemas, with gains of ∼3% on BookBus, and ∼10% on FindFlights and BookRoom in the zero-shot setting.
For both datasets, as more training data for the target domain is added, the baselines and our approach perform more similarly. For instance, our approach improves upon the baseline by ∼0.2% on SNIPS when 2000 training examples are used for the target domain, affirming that adding example values does not hurt in the regular setting.
Results by slot type: Example values help the most with limited-vocabulary slots not encountered during training: our approach gains ≥20% on slots such as conditionDescription, bestRating and service (present in intents GetWeather, RateBook and PlayMusic respectively). Intents PlayMusic and GetWeather, with several limited-vocabulary slots, see significant gains in the zero-shot setting. For compositional open-vocabulary slots (city, cuisine), our model also compares favorably (e.g. 53% vs 27% slot F1 for unseen slot cuisine in intent BookRestaurant), since the semantic similarity between entity and possible values is easier to capture than between entity and description.
Slots with open, non-compositional vocabularies (such as objectName, entityName) are hard to infer from slot descriptions or examples, even if these are seen during training but in other contexts, since utterance patterns are lost across intents. All models are within 5% slot F1 of each other for such slots. This is also observed for unseen open-vocabulary slots in the XSchema dataset (such as promoCode and hotelName).
For XSchema experiments, our model does significantly better on slots which are confusing across schemas (evidenced by gains of >20% on depart in FindFlights and roomType in BookRoom).
Effect of number of examples: Figure 4 shows the number of slot value examples used versus performance on SNIPS. For the zero-shot case, using 2 example values per slot works best, possibly due to the model attending to perfect matches during training, impeding generalization when more example values are used. In the few-shot and normal-data settings, using more example values helps accuracy, but the gain drops with more target training data. For XSchema, in contrast, adding more example values consistently improves performance, possibly due to more slot name mismatches in the dataset. We avoid using over 10 example values, in contrast to prior work (Krishnamurthy et al., 2017; Naik et al., 2018), since it may be infeasible to easily provide or extract a large number of values for unseen slots.
Ablation: Slot replacement offsets overfitting in our model, yielding gains of 2−5% for all models, including the baselines. Fine-tuning the pretrained word embeddings and removing character embeddings yielded losses of ∼1%. We tried more complex phrase embeddings for the slot description and example values, but since both occur as short phrases in our data, a bag-of-words approach worked well.
Comparison with string matching: A training and evaluation setup including example values for slots may lend itself well to adding string matching-based slot fillers for suitable slots (for example, slots taking numeric values or having a small set of possible values). However, this is not applicable to our exact setting since we ensure that the slot values to be tagged during evaluation are never provided as input during training or evaluation. In addition, it is difficult to distinguish two slots with the same expected semantic type using such an approach, such as for slots ratingValue and bestRating from SNIPS intent RateBook.

Conclusions and Future Work
We show that extending zero-shot slot filling models to use a small number of easily obtained example values for slots, in addition to textual slot descriptions, is a scalable solution for zero/few-shot slot filling tasks on similar and heterogeneous domains, while remaining resistant to misaligned overlapping schemas. Our approach surpasses prior state-of-the-art models on two multi-domain datasets.
The approach can, however, be inefficient for intents with many slots, and may sacrifice accuracy when predictions for different slots overlap. Jointly modeling multiple slots for the task is an interesting future direction. Another direction would be to incorporate zero-shot entity recognition (Guerini et al., 2018), thus eliminating the need for example values during inference.
In addition, since high-quality datasets for downstream tasks in dialogue systems (such as dialogue state tracking and dialogue management) are even more scarce, exploring zero-shot learning approaches to these problems is of immense value in building generalizable dialogue systems.