Contrastive Zero-Shot Learning for Cross-Domain Slot Filling with Adversarial Attack

Zero-shot slot filling has attracted wide attention as a way to cope with data scarcity in target domains. However, previous approaches often ignore constraints between slot value representations and the related slot description representations in the latent space, and lack sufficient model robustness. In this paper, we propose a Contrastive Zero-Shot Learning with Adversarial Attack (CZSL-Adv) method for cross-domain slot filling. The contrastive loss maps slot value contextual representations to the corresponding slot description representations, and we introduce an adversarial attack training strategy to improve model robustness. Experimental results show that our model significantly outperforms state-of-the-art baselines under both zero-shot and few-shot settings.


Introduction
The slot filling task in goal-oriented dialog systems aims to identify task-related slot types in certain domains for understanding user utterances. Traditional supervised slot filling models (Liu and Lane, 2015; Liu and Lane, 2016; Goo et al., 2018; Haihong et al., 2019; He et al., 2020a; He et al., 2020b) have achieved strong results. However, these models require massive amounts of labeled data for a new domain, hindering the rapid development of new tasks. To address this data-intensiveness problem, domain adaptation approaches (Bapna et al., 2017; Lee and Jha, 2019; Shah et al., 2019; Obeidat et al., 2019; Liu et al., 2020b; He et al., 2020c) have been successfully applied. In this paper, we focus on zero-shot cross-domain transfer learning, which leverages knowledge learned in the source domains and adapts the model to the target domain without any labeled training samples there.
The main challenge of zero-shot slot filling is to identify unseen slot types without any supervision signals in the target domain. Typically, previous methods rely on slot descriptions or example values to bootstrap to new slots by capturing the semantic relationship between slot descriptions and input tokens. These methods can be classified into two categories: one-stage and two-stage. One-stage methods (Bapna et al., 2017; Lee and Jha, 2019; Shah et al., 2019) perform slot filling individually for each slot type. They first generate word-level representations, which then interact with the representation of each slot type description in semantic space; predictions are finally made independently for each slot type based on the fused features. The main drawback is the multiple prediction problem, where a word may be predicted as multiple slot types. In contrast, (Liu et al., 2020a; Liu et al., 2020b) propose a two-stage slot filling framework. They first predict whether each token is a slot entity via a BIO 3-way classifier, then identify its specific slot type based on slot type descriptions. Although the two-stage framework helps learn the general pattern of slot entities, it cannot directly leverage auxiliary description information to help detect BIO labels as the one-stage framework does. Owing to limited labeled training data in the source domains, another common issue of zero-shot slot filling is that existing approaches suffer from weak generalization capability.
Motivated by the above challenges, in this paper we propose a Contrastive Zero-Shot Learning with Adversarial Attack (CZSL-Adv) method for cross-domain slot filling. To leverage auxiliary slot description information for detecting BIO labels, we introduce a contrastive learning loss that maximizes the mutual information between the encoded representation of a raw input and the representation of its counterpart delexicalized with slot descriptions. We aim to map slot value representations to the corresponding slot description representations in the latent space, so that slot descriptions help learn the semantic pattern of related slot entities. To improve the generalization capability of our model, we also propose an adversarial attack training strategy that adds adversarial noise to the inputs in the direction that most significantly increases the model's classification loss. This training strategy further improves the adaptation robustness of our method. Our main contributions are three-fold: (1) We propose a contrastive zero-shot learning method for cross-domain slot filling.
(2) We introduce an adversarial attack training strategy to improve model robustness.
(3) Experiments under zero-shot and few-shot settings show that our proposed CZSL-Adv outperforms state-of-the-art models by large margins. We also provide a comprehensive ablation study and further experimental analysis.

Approach
Fig 1 shows the overall architecture of our proposed CZSL-Adv model. In the first stage, it predicts BIO labels with a contrastive representation loss to help learn the semantic pattern of corresponding slot entities. Then in the second stage, it classifies the slot entities into related types with slot descriptions using an adversarial attack training strategy.

CZSL Model
For a fair comparison, we adopt the same BiLSTM network architecture (Hochreiter and Schmidhuber, 1997) as previous work (Bapna et al., 2017; Lee and Jha, 2019; Shah et al., 2019; Liu et al., 2020a; Liu et al., 2020b). Given an utterance of n tokens w = [w_1, w_2, ..., w_n] and an embedding layer E for utterances, we formulate the first stage as

[h_1, h_2, ..., h_n] = BiLSTM(E(w)), p_i = W h_i + b,

where [p_1, p_2, ..., p_n] are the logits for the 3-way BIO classification. Note that we do not show the CRF layer in Fig 1(a) for simplicity.

The 3-way BIO classification loss learns the general pattern of slot entities, but ignores the related slot description representations. We therefore introduce a contrastive learning loss to leverage auxiliary slot description information for detecting BIO labels. Contrastive learning (CL) has achieved great success in unsupervised visual representation learning (Tian et al., 2019; He et al., 2019; Misra and van der Maaten, 2019; Chen et al., 2020). The main idea behind CL is to learn representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. In this paper, given a raw input utterance, we generate a positive sample by replacing all the slot entity tokens with their corresponding slot labels. Similarly, replacing the slot entity tokens with different slot labels from the whole slot set yields a set of negative samples. Here, we use a fixed replacement probability of p = 0.5 for each slot token individually. We randomly sample a minibatch of N examples and define the contrastive loss on pairs of replaced examples derived from the minibatch as

L_con = (1/N) Σ_k [ d(u_k, u_k^p)^2 + (1/M) Σ_i max(0, s - d(u_k, u_{k,i}^n))^2 ],

where M is the size of the negative sample set and u_k is the k-th input utterance vector in the batch, u_k^p denotes its positive sample and u_{k,i}^n its i-th negative sample, d is the L2 distance function, and s is the margin.
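The margin-based contrastive loss described above can be sketched in numpy. This is a minimal illustration, not the authors' implementation: d is the L2 distance, s is the margin, and the exact averaging over negative terms is an assumption.

```python
import numpy as np

def contrastive_loss(u, u_pos, u_negs, margin=0.15):
    """Margin-based contrastive loss for one utterance (sketch).

    u      : (d,)   encoded raw-utterance vector u_k
    u_pos  : (d,)   positive (delexicalized) sample u_k^p
    u_negs : (M, d) negative samples u_{k,i}^n
    The positive pair is pulled together via the squared L2 distance;
    negatives closer than the margin s are pushed away via a hinge.
    """
    d_pos = np.linalg.norm(u - u_pos)            # L2 distance to the positive
    d_neg = np.linalg.norm(u_negs - u, axis=1)   # distances to the M negatives
    hinge = np.maximum(0.0, margin - d_neg) ** 2 # only near negatives are penalized
    return d_pos ** 2 + hinge.mean()
```

The batch loss would average this quantity over the N utterances in the minibatch.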
Following (Felbo et al., 2017;Liu et al., 2020b), we employ another BiLSTM and an attention layer to generate representations of positive and negative samples. The contrastive loss aims to map slot value contextual representations to the corresponding slot description representations in the latent space. Therefore, slot descriptions can help learn the semantic pattern of related slot entities. Compared to Template Regularization (TR) proposed by Liu et al. (2020b), our CZSL jointly models pairs of positive and negative samples to distinguish semantic representations of different slot types.
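The construction of delexicalized positive and negative samples described above can be sketched as follows. This is an illustrative reading of the procedure: we assume that entity tokens not selected for replacement in a negative sample keep their original word, and the token lists and slot set are hypothetical.

```python
import random

def make_contrastive_samples(tokens, labels, slot_set, p=0.5, num_neg=2, rng=random):
    """Build one positive and num_neg negative delexicalized utterances.

    tokens : e.g. ["play", "taylor", "swift"]
    labels : per-token slot label, "O" for non-entity, e.g. ["O", "artist", "artist"]
    Positive: every slot entity token is replaced by its own slot label.
    Negative: each entity token is replaced, with probability p, by a
    *different* label drawn from the whole slot set.
    """
    positive = [lab if lab != "O" else tok for tok, lab in zip(tokens, labels)]
    negatives = []
    for _ in range(num_neg):
        neg = []
        for tok, lab in zip(tokens, labels):
            if lab != "O" and rng.random() < p:
                neg.append(rng.choice([s for s in slot_set if s != lab]))
            else:
                neg.append(tok)  # assumption: unreplaced tokens stay as-is
        negatives.append(neg)
    return positive, negatives
```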

Adversarial Attack Training
In this section, we introduce an adversarial attack training strategy, shown in Fig 1(b), to improve model robustness. First, we obtain a slot description matrix M_desc ∈ R^{n_s × d_s}, where n_s is the number of slot types and d_s is the dimension of the slot description representation. Following (Shah et al., 2019), we sum the embeddings of the slot description tokens to form each description representation. Then, we perform average pooling over the hidden states of the k-th slot entity's tokens to obtain r_k, and compute the dot product as classification logits s_k = M_desc · r_k. Finally, we compute the classification cross-entropy loss L_slot.
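The second-stage classification above can be sketched in numpy. This is a minimal illustration under stated assumptions: the hidden states, entity spans, and description matrix are toy placeholders.

```python
import numpy as np

def slot_type_logits(hidden_states, spans, desc_matrix):
    """Slot-type logits from description dot products (sketch).

    hidden_states : (n, d)   token hidden states from the first stage
    spans         : list of (start, end) token-index pairs, one per entity
    desc_matrix   : (n_s, d) M_desc, one summed-embedding row per slot type
    Returns a (num_entities, n_s) array of logits s_k = M_desc . r_k,
    where r_k is the average-pooled representation of the k-th entity span.
    """
    logits = []
    for start, end in spans:
        r_k = hidden_states[start:end].mean(axis=0)  # average pooling over the span
        logits.append(desc_matrix @ r_k)             # dot product with each description
    return np.stack(logits)
```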
Due to limited labeled data in the source domains, existing approaches are often vulnerable to noisy input utterances. Hence, apart from the traditional classification cross-entropy loss L_slot, we apply the Fast Gradient Value method (FGV) (Miyato et al., 2017; Vedula et al., 2020) to approximate a worst-case perturbation as a noise vector:

ṽ_noise = ε · g / ||g||_2, where g = ∇_e L_slot.

Here, the gradient g is the first-order derivative of the loss function L_slot w.r.t. the input embedding e, representing the direction that most rapidly increases the loss. We normalize it and scale by a small ε to keep the approximation reasonable. We then add the noise ṽ_noise to the input and perform a second forward pass to obtain a new loss L'_slot. Finally, we use the adversarial attack loss L_adv = L_slot + L'_slot for backpropagation. Adversarial noise enables the model to handle extensively noisy input utterances and can be regarded as a data augmentation mechanism. Experiments also show that the adversarial training strategy effectively improves the performance of our CZSL method.
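The FGV perturbation can be sketched as below. For a self-contained demo we use a toy quadratic "loss" with an analytic gradient; in the real model the gradient would come from backpropagating L_slot through the network, so the demo loss and its gradient are illustrative assumptions.

```python
import numpy as np

def fgv_perturbation(grad, epsilon=0.1):
    """Fast Gradient Value noise: a step of size epsilon along the
    normalized gradient of the loss w.r.t. the input embeddings."""
    return epsilon * grad / (np.linalg.norm(grad) + 1e-12)

def demo_adv_loss(e, t, epsilon=0.1):
    """Toy demo: L(e) = ||e - t||^2, whose gradient w.r.t. e is 2(e - t)."""
    loss = np.sum((e - t) ** 2)
    grad = 2 * (e - t)
    e_adv = e + fgv_perturbation(grad, epsilon)  # second forward on perturbed input
    loss_adv = np.sum((e_adv - t) ** 2)          # perturbation increases the loss
    return loss + loss_adv                       # combined adversarial loss
```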

Setup
Dataset. To evaluate our approach, we conduct experiments on Snips (Coucke et al., 2018), a personal voice assistant dataset that contains 7 domains and 39 slots, where some slots are shared across domains while the others are domain-specific. Table 1 gives detailed statistics of the Snips dataset: for each domain, we report the number of samples, the list of cross-domain shared slots, and the list of domain-specific slots. To test our framework, each time we choose one domain as the target domain and the other six domains as the source domains.
Baselines. In our experiments, we compare our approach with the following zero-shot/few-shot slot filling baselines:
• Concept Tagger (CT) A method proposed in (Bapna et al., 2017), which utilizes slot descriptions (e.g. "date of departure" for slot date of departure) to boost the performance on detecting novel slots in the target domain.
• Robust Zero-shot Tagger (RZT) A method proposed in (Shah et al., 2019), which utilizes both slot descriptions and slot example values (e.g. "iHeart Radio" for slot service) for zero-shot slot filling.
• Coarse-to-fine Approach (Coach) A method proposed in (Liu et al., 2020b), which splits the cross-domain slot filling task into two stages: coarse-grained BIO 3-way classification and fine-grained slot type classification, and uses slot descriptions in the second stage to help recognize unseen slots.
• Coach+TR A variant of Coach, which further applies template regularization to improve slot filling performance on similar or identical slot types, and achieves better results.

Implementation Details
To conduct experiments under zero-shot settings, we follow the setup of (Liu et al., 2020b). First, we combine samples from the remaining six domains for training. Then we split the samples in the target domain into two sets: 500 samples as a validation set and the remainder as a test set. For few-shot (50 samples) experiments, we further add 50 samples from the target domain to the training set. We tune all hyperparameters on the validation set and report the F1-score on the test set. We achieve the best results with s set to 0.15, ε set to 0.1, and M set to 2.
We use character-level and word-level embeddings for each input token, with a total embedding dimension of 400. We set the hidden size of the BiLSTM to 200 and use a dropout rate of 0.3 for all BiLSTM encoders. We use the Adam optimizer (Kingma and Ba, 2014) to optimize all parameters with a learning rate of 0.0005. We set the batch size to 32 and use early stopping with a patience of 5.

Main Results
Table 2 displays the main results of our CZSL-Adv method compared to the state-of-the-art baselines. Our method outperforms the SOTA models by 3.6% on the average F1-score under the zero-shot learning setting, and by 1.76% under the few-shot learning setting, demonstrating the effectiveness of our proposed method. Moreover, CZSL and Adv alone achieve superior performance by 1.74% and 1.05%, respectively, under the zero-shot setting. The contrastive loss helps learn the semantic pattern of related slot entities via the corresponding slot descriptions, and the adversarial attack training strategy also yields a significant improvement. We observe that our method gains more under the zero-shot setting than under the few-shot setting; we hypothesize that CZSL-Adv alleviates data scarcity more effectively than previous models when no target-domain labels are available.

Ablation Analysis
We compare the effect of CZSL and Adv in Table 2. For zero-shot experiments, both CZSL (39.13%) and Adv (38.44%) achieve a better average F1-score than the previous state of the art (37.39%), which shows that both CZSL and Adv contribute to the final improvement. Compared to the full model (40.99%), removing CZSL causes a larger performance drop (-2.55%) than removing Adv (-1.86%), indicating that the improvement comes more from CZSL. Table 3 shows the results on seen and unseen slots in the target domains. Our CZSL-Adv consistently outperforms the baselines on unseen slots under both settings, with only a relatively small drop on seen slots under the few-shot setting. These results confirm that CZSL-Adv is effective in the zero-shot scenario, where sufficient supervision signals are unavailable.

Analysis of Norm of Adversarial Attacks
Fig 2 displays the effect of the norm of the adversarial noise. ε controls the magnitude of the adversarial noise ṽ_noise. We can see that, across different target domains, ε = 0.1 consistently achieves the best performance.

Conclusion
In this paper, we propose a Contrastive Zero-Shot Learning with Adversarial Attack (CZSL-Adv) method for cross-domain slot filling. The main contributions are contrastive representation learning and adversarial attack training: the former leverages slot descriptions to help learn the semantic pattern of related slot entities, and the latter improves model robustness by augmenting the inputs with noise. Extensive experiments show the effectiveness of our proposed method, especially under the zero-shot learning setting.