Adversarial Self-Supervised Learning for Out-of-Domain Detection

Detecting out-of-domain (OOD) intents is crucial for a deployed task-oriented dialogue system. Previous unsupervised OOD detection methods only extract discriminative features of different in-domain intents, while supervised counterparts can directly distinguish OOD and in-domain intents but require extensive labeled OOD data. To combine the benefits of both types, we propose a self-supervised contrastive learning framework that models discriminative semantic features of both in-domain intents and OOD intents from unlabeled data. In addition, we introduce an adversarial augmentation neural module to improve the efficiency and robustness of contrastive learning. Experiments on two public benchmark datasets show that our method consistently outperforms the baselines by a statistically significant margin.


Introduction
Task-oriented dialog systems (Sarikaya, 2017; Akasaki and Kaji, 2017; Gnewuch et al., 2017; Shum et al., 2018; Tulshan and Dhage, 2018) such as Google's DialogFlow or Amazon's Lex have become ubiquitous, allowing people to interact with machines using natural language. In the architecture of a dialogue system, detecting unknown or OOD (out-of-domain) intents from user queries is an essential component whose goal is to know when a query falls outside the range of predefined supported intents. Unlike traditional intent detection tasks, in practical scenarios we do not know the exact number of unknown intents and can barely annotate extensive OOD samples. The lack of real OOD examples leads to poor prior knowledge about these unknown intents, making it challenging to identify OOD samples in a task-oriented dialog system. (Weiran Xu is the corresponding author. Our code is available at https://github.com/parZival27/Adversarial-Self-Supervised-Out-of-Domain-Detection.)
Most unsupervised OOD detection methods follow a two-stage framework: training and detecting. They first train an in-domain intent classifier to extract intent representations, then detect whether a test query is OOD by estimating its probability density. For example, Hendrycks and Gimpel (2017) and Shu et al. (2017) simply apply a threshold to the in-domain classifier's probability estimate. Lin and Xu (2019) employ an unsupervised density-based novelty detection algorithm, the local outlier factor (LOF), to detect unseen intents. However, such neural models can only extract discriminative features of different in-domain intents, since they are trained on in-domain data without access to OOD data. As a result, these methods are known to produce highly overconfident posterior distributions even for abnormal OOD samples (Guo et al., 2017; Liang et al., 2018). For supervised OOD detection, classical methods (Fei and Liu, 2016; Larson et al., 2019) form an (N+1)-class classification problem where the (N+1)-th class represents the unseen intents. Further, Zheng et al. (2020) use labeled OOD data to regularize the model so that its output distribution on OOD inputs is pushed toward the uniform distribution. However, collecting large-scale labeled OOD data is usually difficult and expensive. These drawbacks limit the broad application of supervised OOD detection. In this paper, we aim to capitalize on the benefits of both self-supervised and supervised OOD detection: (1) simultaneously modeling semantic features of both in-domain and OOD data; (2) requiring no labor-intensive OOD annotation.

Figure 1: The overall architecture of our proposed framework. We first train an intent representation extractor using two kinds of objectives: a supervised cross-entropy loss on the in-domain data and a self-supervised contrastive loss on the unlabeled data. Then we extract the representation of the test query to detect OOD using MSP (Maximum Softmax Probability) (Hendrycks and Gimpel, 2017), LOF (Lin and Xu, 2019), or GDA (Xu et al., 2020).
In this paper, we propose a self-supervised contrastive learning framework to model discriminative semantic features of both in-domain intents and OOD intents from unlabeled data. Without access to labeled OOD data, our method learns representations that discriminate between all unlabeled intents at the instance level. When combined with supervised in-domain training, it learns features that are both rich and semantically discriminative. Moreover, to replace the stochastic data augmentation mechanisms used in image processing, such as random cropping and random color distortion (Chen et al., 2020a), we propose an adversarial augmentation neural module that improves the diversity and complexity of pre-defined transformation functions. Specifically, we compute model-agnostic adversarial worst-case perturbations to the inputs in the direction that most increases the original contrastive loss. Intuitively, adversarial learning generates pseudo hard positive pairs and thus improves the efficiency and robustness of contrastive learning. Our contributions are three-fold: (1) We propose a self-supervised learning framework that simultaneously models semantic features of both in-domain and OOD data. (2) We apply an adversarial augmentation mechanism to improve the efficiency and robustness of self-supervised learning.
(3) Experiments conducted on two benchmark OOD datasets show the effectiveness of our proposed method.

Approach
Overall Architecture Fig 1(a) shows the overall architecture of our proposed two-stage framework. We first train an in-domain intent classifier to extract intent representations using two objectives, and then use a detection algorithm, MSP (Hendrycks and Gimpel, 2017), LOF (Lin and Xu, 2019), or GDA (Xu et al., 2020), to detect OOD. In the training stage, we first train a BiLSTM in-domain intent classifier similar to Lin and Xu (2019) on the labeled in-domain data, then continue training with an adversarial contrastive objective on the unlabeled data.
Self-Supervised Contrastive Learning To simultaneously model semantic features of both in-domain and OOD data, we propose a self-supervised contrastive learning framework that utilizes unlabeled data. Following (Chen et al., 2020a; He et al., 2020a; Chen et al., 2020b; Winkens et al., 2020; Jiang et al., 2020), we formulate the contrastive loss for a positive pair of examples (i, j) as:

\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}    (1)

where z_i is the feature vector of the i-th sentence, extracted by concatenating the first and final hidden states of the BiLSTM, sim(·, ·) is a similarity function between two vectors, and \mathbb{1}_{[k \neq i]} ∈ {0, 1} is an indicator function evaluating to 1 iff k ≠ i; τ denotes a temperature parameter. The final loss is computed across all positive pairs, both (i, j) and (j, i), in a mini-batch of N examples. We use back-translation as the data augmentation that generates positive pairs. Previous work (Chen et al., 2020a) has shown the need for richer data augmentations, so we propose an adversarial neural augmentation as follows.
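As a concrete illustration, the loss in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal, hypothetical sketch, not the paper's implementation: it assumes cosine similarity for sim(·, ·) and a batch layout where rows 2k and 2k+1 form a positive pair (e.g., a sentence and its back-translation).

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Contrastive loss of Eq. (1) for a batch of 2N embeddings, assuming
    cosine similarity and that rows 2k and 2k+1 are positive pairs."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine sim
    sim = (z @ z.T) / tau                              # pairwise similarities / temperature
    np.fill_diagonal(sim, -np.inf)                     # the 1[k != i] mask: drop self-pairs
    n = z.shape[0]
    pos = np.arange(n) ^ 1                             # index of each row's positive partner
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), pos].mean()         # average over (i, j) and (j, i)
```

Pairs of identical embeddings yield a much lower loss than mismatched pairs, which is exactly the behavior the instance-level objective relies on.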
Adversarial Neural Augmentation To improve the diversity of data augmentation and avoid hand-crafted engineering, we apply adversarial attacks (Goodfellow et al., 2015; Kurakin et al., 2016; Miyato et al., 2016; Jia and Liang, 2017; Zhang et al., 2019; Ren et al., 2019b) to generate pseudo positive samples. Note that the samples obtained by adversarial attack are in the form of embeddings, which ensures end-to-end training. Specifically, we compute the worst-case perturbation δ that maximizes the original contrastive loss L:

\delta = \arg\max_{\|\delta'\| \le \epsilon} \mathcal{L}(x + \delta'; \theta)

where θ denotes the parameters of the model, x denotes a given sample, and ε is the norm bound of the perturbation δ.
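A standard single-step approximation of such a worst-case perturbation simply rescales the loss gradient to the norm bound ε. The sketch below is a toy illustration under stated assumptions: the analytic gradient of L(x) = ||x||² stands in for the contrastive-loss gradient that would, in practice, come from automatic differentiation, and the vector x stands in for a sentence embedding.

```python
import numpy as np

def fgv_perturbation(grad, eps=4e-3):
    """Rescale the raw loss gradient to norm eps (the Fast Gradient Value idea:
    keep the gradient's direction, unlike FGSM which keeps only its sign).
    The default eps mirrors the paper's reported 4e-3 amplitude."""
    norm = np.linalg.norm(grad)
    return eps * grad / norm if norm > 0 else np.zeros_like(grad)

# Toy stand-in for the contrastive loss: L(x) = ||x||^2, with gradient 2x.
# Both the loss and the input here are hypothetical illustrations.
x = np.array([1.0, 2.0])
grad = 2.0 * x                        # dL/dx: the direction that increases L
delta = fgv_perturbation(grad, eps=0.1)
x_adv = x + delta                     # pseudo adversarial sample x + delta
```

The perturbed point x_adv has exactly norm-ε distance from x and a strictly larger loss value, which is the property the augmentation exploits to build hard positive pairs.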
In the practical implementation, we apply the Fast Gradient Value (FGV) method (Rozsa et al., 2016) to approximate the perturbation δ:

\delta = \epsilon \cdot \frac{g}{\|g\|_2}, \quad g = \nabla_x \mathcal{L}(x_i, x_j; \theta)    (2)

where (x_i, x_j) is the original positive pair generated by back-translation. We normalize g and use a small ε so that the approximation is reasonable. Finally, we obtain the pseudo adversarial samples x_i^{adv} = x_i + δ_i and x_j^{adv} = x_j + δ_j, and consider four contrastive settings: (1) Standard-to-Standard (S2S): the original contrastive loss using (x_i, x_j); (2) Adversarial-to-Adversarial (A2A): the adversarial contrastive loss using (x_i^{adv}, x_j^{adv}); (3) Standard-to-Adversarial (S2A): the mixed contrastive loss using (x_i, x_j^{adv}) or (x_i^{adv}, x_j); (4) Dual Stream (DS): combining S2S and A2A, as Fig 1(c) shows. Experiment 3.4 shows that the last setting works best; we argue that DS captures better feature alignment in the latent space. Besides, we find that applying only the contrastive loss degrades the in-domain intent detection metrics, so we mix the two kinds of objectives during training to avoid catastrophic forgetting (Kirkpatrick et al., 2017). We present the full algorithm in the appendix.

Experiments

Datasets To construct the unlabeled data, we mix 10% of the in-domain data and all of the OOD data in the training set. The total amount of unlabeled data is 1500 in CLINC-OOS+ and 750 in CLINC-Small, where the amount of OOD data is 250 and 100, respectively. Note that during the self-supervised learning phase we do not use label information of the unlabeled data and only perform contrastive learning at the instance level. During the supervised learning phase, we use the remaining in-domain training data for the cross-entropy loss.

Metrics We report both in-domain metrics, accuracy (ACC) and F1-score (F1), and OOD metrics, recall and F1-score (F1). OOD recall and F1-score are the main metrics in this paper.

Baseline Details
We compare our proposed self-supervised methods with two types of OOD detection methods: supervised and fully unsupervised. The former applies a supervised OOD entropy regularization; we use this setting as the reference upper bound for OOD detection results. The latter trains the sentence feature extractor using only in-domain data; we treat this setting as the reference lower bound. For each training method, we use different OOD detection models to verify its performance. The model proposed in this paper thus works in two stages: the feature extractor is trained in the training stage, and OOD detection is then conducted with different detection models in the detection stage.
Training Stage On top of the fully unsupervised setting, we add each of our four proposed adversarial self-supervised learning settings. Standard-to-Standard (S2S): the original setting; the contrastive loss is computed between the original and augmented data, with no adversarial attack involved. Adversarial-to-Adversarial (A2A): adversarial perturbations are first injected into both the original and augmented data, and the contrastive loss is then computed between them. Standard-to-Adversarial (S2A): the contrastive loss is split into two parts; one pairs the attacked original data with the augmented data, and the other pairs the attacked augmented data with the original data. Dual Stream (DS): combines S2S and A2A; the contrastive loss contains two parts, one using the original and augmented data, the other using their adversarially attacked counterparts. Detection Stage As mentioned above, we compare three OOD detection models. MSP (Maximum Softmax Probability) (Hendrycks and Gimpel, 2017) applies a threshold to the maximum softmax probability, where the threshold is set to 0.5. LOF (Local Outlier Factor) (Lin and Xu, 2019) uses the local outlier factor to detect unknown intents. GDA (Gaussian Discriminant Analysis) (Xu et al., 2020) is a generative distance-based classifier for out-of-domain detection with Euclidean and Mahalanobis distances.
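Of the three detectors, MSP is the simplest to sketch. The snippet below is a minimal illustration, assuming a vector of per-query logits from the in-domain classifier; it is not the paper's code.

```python
import numpy as np

def msp_detect(logits, threshold=0.5):
    """Detection stage with MSP: a query is flagged as OOD when the maximum
    softmax probability over in-domain intents falls below the threshold
    (0.5, following Hendrycks and Gimpel, 2017)."""
    logits = logits - logits.max()                  # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over in-domain intents
    return bool(probs.max() < threshold)            # True -> treat as OOD
```

A confident prediction (one dominant logit) stays in-domain, while a near-uniform distribution over intents is flagged as OOD.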
In this paper, the experiments and analysis are mainly conducted around the training stage. Different detection models are used to verify the generalization of our proposed method.

Main Results
Table 2 displays the experimental results. Our method consistently outperforms all the unsupervised baselines in all settings, and even comes close to the supervised oracles. Under the GDA setting, our proposed method outperforms the unsupervised method by 3.94% (OOD F1) and 3.54% (OOD recall) on CLINC-OOS+ and by 4.48% (OOD F1) and 4.12% (OOD recall) on CLINC-Small. We observe similar improvements under the MSP and LOF settings. These results confirm the effectiveness of our self-supervised learning method. Considering the effect of adversarial augmentation, GDA+DS outperforms standard contrastive learning (GDA+S2S, i.e., without the adversarial part) by 1.95% (OOD F1) and 2.32% (OOD recall) on CLINC-OOS+ and by 1.35% (OOD F1) and 1.72% (OOD recall) on CLINC-Small, demonstrating that adversarial attacks improve the efficiency and robustness of contrastive learning. For in-domain ACC and F1, our method also achieves slightly better performance, even close to the (N+1)-class baseline, which suffers a severe drop in OOD metrics on unbalanced data.

Qualitative Analysis
Effect of Unlabeled Data Size. Fig 2 shows the effect of different sizes of unlabeled data on contrastive learning. We extract subsets of the full CLINC-OOS+ unlabeled dataset through random sampling, so that the expected OOD proportion in every subset stays close to that of the full set (16.67%). We choose LOF and GDA for comparison; the lower and upper bounds represent the unsupervised and supervised OOD settings, respectively. Our method achieves superior performance as the amount of unlabeled data grows under both settings. This confirms that our proposed method can learn rich and semantically discriminative features from unlabeled data to facilitate OOD detection. Fig 3 shows the relative increment of the F1-score as the unlabeled data increases uniformly. Specifically, the difference between the current F1-score and the previous F1-score is recorded for every 300 samples added. As the amount of data increases, the increments of the OOD F1-score shrink. This confirms that our proposed method takes full advantage of unlabeled data and achieves impressive performance with only a small amount of it. Overall, our proposed method shows strong robustness and generalization capability. Analysis of Different Contrastive Learning Settings. Table 3 shows the results of the different contrastive learning settings on CLINC-OOS+. DS achieves the best performance in both in-domain and OOD metrics. Comparing A2A and S2A to S2S, we observe that adversarial augmentation improves OOD performance but decreases the in-domain metrics. Therefore, by combining S2S and A2A, DS gets the benefits of both: the OOD improvement and the in-domain performance. Analysis of the Norm of the Adversarial Perturbation. Fig 4 displays the effect of the norm ε of the adversarial noise, which controls the range of the adversarial perturbation δ. For both LOF and GDA, ε ∈ (1.0, 1.5) achieves better performance; a smaller or larger value impairs the capability of contrastive learning.
We argue that small noise cannot increase the complexity of the augmentation, while large noise may hurt the alignment of positive example pairs.

Conclusion
In this paper, we focus on combining the benefits of both unsupervised and supervised OOD detection: simultaneously modeling semantic features of both in-domain and OOD data without requiring labor-intensive OOD annotation. We propose a self-supervised contrastive learning framework that learns rich and semantically discriminative representations from unlabeled data, together with an adaptive end-to-end adversarial augmentation neural module that improves the diversity and complexity of pre-defined transformation functions. Experiments show that our method performs better than unsupervised OOD baselines and even comes close to supervised OOD oracles.

Broader Impact
Task-oriented dialog systems have demonstrated remarkable performance across a wide range of applications, with the promise of a significant positive impact on the way people work and live. However, in scenarios where information is complex and rapidly changing, models routinely face input that is meaningfully different from the typical examples encountered during training. Current models are prone to make unfounded but overconfident predictions on such inputs, which may mislead human judgment and thus impair the safety of models in practical applications. In domains with the greatest potential for societal impact, such as navigation or medical diagnosis, models should be able to detect potentially unknown OOD inputs and be robust to high-entropy inputs in order to avoid catastrophic errors. This work proposes a new adversarial self-supervised learning method for OOD detection. It substantially improves the overall robustness of the model by making full use of unlabeled data, including potentially threatening inputs, through contrastive learning and adversarial attacks, taking a step toward the ultimate goal of safe real-world deployment of task-oriented dialog systems in safety-critical domains. The experimental results are reported on standard benchmark datasets to support reproducible research.

A.1 Implementation Details
We sample the augmentation data in proportion to the size of each dataset's training set and use public APIs from multiple platforms for the back-translation process. To ensure the quality of the augmented data obtained through back-translation, we only keep back-translated sequences whose word overlap with the original text is more than 70% and less than 90%. The total amount of sampled data is equal to 10% of the volume of the in-domain training data. We use pre-trained GloVe embeddings (Pennington et al., 2014) as the word embedding matrix. For the BiLSTM encoder, we set the dimension of the hidden states to 128 and use a dropout rate of 0.5. We train our model with the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 0.001. In the training stage, we first run 20 epochs of supervised training on the in-domain labeled data, followed by 200 epochs of alternate training that add the contrastive learning process on the unlabeled data.
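The 70%-90% band above can be implemented with a simple filter. Since the paper does not specify the exact overlap formula, the word-type overlap used below is an assumption; it merely illustrates the filtering step.

```python
def word_overlap(orig, back):
    """Fraction of the original sentence's word types that also appear in the
    back-translation. NOTE: the exact overlap measure is not given in the
    paper, so this definition is an assumption for illustration."""
    a, b = set(orig.lower().split()), set(back.lower().split())
    return len(a & b) / max(len(a), 1)

def keep_pair(orig, back, low=0.70, high=0.90):
    """Keep a back-translated sentence only if it stays close enough to the
    original to remain a valid positive pair, yet differs enough to be a
    useful augmentation (the reported 70%-90% band)."""
    return low < word_overlap(orig, back) < high
```

The upper bound discards back-translations that came back unchanged, and the lower bound discards ones that drifted too far in meaning.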
The alternate training stage uses early stopping with patience 20. The algorithm of our proposed training procedure is given in Algorithm 1. We use the best F1-scores on the validation set to set the GDA threshold adaptively. Each reported result is the average of 5 runs under the same setting. The amplitude of the adversarial perturbation is chosen heuristically in the range 0 to 1e-2 (2.5/250): MSP and LOF use 4e-3 (1.0/250) and GDA uses 6e-3 (1.5/250). For a fair comparison with the other settings, we set the weights of the two losses equal in DS (α = 1 in Algorithm 1) and S2A (β = 0.5 in Algorithm 1). The training stage of our model lasts about 15 minutes on a single Tesla T4 GPU (16 GB of memory). The model has about 2.52M parameters on average.
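Picking the GDA threshold from validation F1-scores can be sketched as a simple sweep. This is a hypothetical illustration of the adaptive selection, assuming per-query distance scores (e.g., Mahalanobis distances) where higher scores indicate OOD; the paper does not spell out the exact search procedure.

```python
import numpy as np

def pick_threshold(scores, is_ood):
    """Sweep candidate thresholds over validation distance scores and keep
    the one that maximizes OOD F1. Queries with a score above the threshold
    are predicted as OOD."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        pred = scores > t
        tp = np.sum(pred & is_ood)
        fp = np.sum(pred & ~is_ood)
        fn = np.sum(~pred & is_ood)
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)   # F1 = 2TP / (2TP + FP + FN)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1

# Hypothetical validation scores and OOD flags for illustration.
val_scores = np.array([0.1, 0.2, 0.3, 0.9, 1.0])
val_is_ood = np.array([False, False, False, True, True])
best_t, best_f1 = pick_threshold(val_scores, val_is_ood)
```

On this toy data the OOD queries are cleanly separated, so the sweep finds a threshold with perfect validation F1.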

Algorithm 1: Proposed Two-Stage Training
Input: a set of clean sentences x with corresponding ground-truth labels t for the labeled data; feature extractor g; similarity function f; contrastive loss L_NT; cross-entropy loss L_CE
Output: model parameters θ

for each in-domain pre-training epoch do
    for each in-domain mini-batch (x, t) do
        L_i = L_CE(g(x; θ), t)
        Update θ to minimize L_i
    end for
end for
for each mix-up training epoch do
    for each sampled unlabeled mini-batch x do
        Augment x_i to x_j with back-translation
        Generate the corresponding adversarial mini-batch (x_i + δ_i, x_j + δ_j)
        if mode == S2S then
            L_m = L_NT(f(x_i, x_j; θ))
        else if mode == A2A then
            L_m = L_NT(f(x_i + δ_i, x_j + δ_j; θ))
        else if mode == S2A then
            L_m = β L_NT(f(x_i, x_j + δ_j; θ)) + (1 − β) L_NT(f(x_i + δ_i, x_j; θ))
        else if mode == DS then
            L_m = L_NT(f(x_i, x_j; θ)) + α L_NT(f(x_i + δ_i, x_j + δ_j; θ))
        end if
        Update θ to minimize L_m
    end for
    for each in-domain mini-batch (x, t) do
        L_i = L_CE(g(x; θ), t)
        Update θ to minimize L_i
    end for
end for