Deep Unknown Intent Detection with Margin Loss

Identifying unknown (novel) user intents that have never appeared in the training set is a challenging task in dialogue systems. In this paper, we present a two-stage method for detecting unknown intents. We use a bidirectional long short-term memory (BiLSTM) network with margin loss as the feature extractor. With margin loss, we learn discriminative deep features by forcing the network to maximize inter-class variance and to minimize intra-class variance. Then, we feed the feature vectors to a density-based novelty detection algorithm, local outlier factor (LOF), to detect unknown intents. Experiments on two benchmark datasets show that our method yields consistent improvements over the baseline methods.


Introduction
In dialogue systems, it is essential to identify unknown intents that have never appeared in the training set. We can use those unknown intents to discover potential business opportunities. Besides, they can provide guidance for developers and accelerate the system development process. However, this is also a challenging task. On the one hand, it is often difficult to obtain prior knowledge about unknown intents due to the lack of examples. On the other hand, it is hard to estimate the exact number of unknown intents. In addition, since user intents are strongly guided by prior knowledge and context, modeling high-level semantic concepts of intent remains problematic.
Few previous studies are related to unknown intent detection. For example, Kim and Kim (2018) try to optimize the intent classifier and out-of-domain detector jointly, but out-of-domain samples are still needed. Generative methods try to generate positive and negative examples from known classes by using adversarial learning to augment the training data. However, this approach does not work well in discrete data spaces like text, and a recent study (Nalisnick et al., 2019) suggests that it may not work well on real-world data either. Brychcin and Král try to model intents through clustering. Still, clustering does not make good use of the prior knowledge provided by known intents, and the results are usually unsatisfactory.
Although there is a lack of prior knowledge about unknown intents, we can still leverage the advantage of known label information. Scheirer et al. (2013); Fei and Liu (2016) suggest that an m-class classifier should be able to reject examples from unknown classes while performing the m-class classification task. The reason is that not all test classes have appeared in the training set, which forms an (m+1)-class classification problem where the (m+1)-th class represents the unknown class. This task is called the open-world classification problem. The main idea is that if an example is dissimilar to all known intents, it is considered unknown. In this case, we use known intents as prior knowledge to detect unknown intents and simplify the problem by grouping unknown intents into a single class. Bendale and Boult (2016) further extend the idea to deep neural networks (DNNs). Shu et al. (2017) achieve the state-of-the-art performance by replacing the softmax layer of a convolutional neural network (CNN) with a 1-vs-rest layer consisting of sigmoid functions and tightening the decision threshold of the probability output for detection. DNNs such as BiLSTM (Goo et al., 2018; Wang et al., 2018c) have demonstrated the ability to learn high-level semantic features of intents. Nevertheless, it is still challenging to detect unknown intents when they are semantically similar to known intents. The reason is that softmax loss only focuses on whether a sample is correctly classified, and does not require intra-class compactness and inter-class separation. Therefore, we replace softmax loss with margin loss to learn more discriminative deep features. The approach is widely used in face recognition (Ranjan et al., 2017). It forces the model to not only classify correctly but also maximize inter-class variance and minimize intra-class variance. Concretely, we use large margin cosine loss (LMCL) (Wang et al., 2018b) to accomplish it.
It reformulates the softmax loss as a cosine loss with L2 normalization and further maximizes the decision margin in the angular space. Finally, we feed the discriminative deep features to a density-based novelty detection algorithm, local outlier factor (LOF), to detect unknown intents.
We summarize the contributions of this paper as follows. First, we propose a two-stage method for unknown intent detection with BiLSTM. Second, we introduce margin loss on BiLSTM to learn discriminative deep features, which is suitable for the detection task. Finally, experiments conducted on two benchmark dialogue datasets show the effectiveness of the proposed method.


BiLSTM
To begin with, we use BiLSTM (Mesnil et al., 2015) to train the intent classifier and use it as the feature extractor. Figure 1 shows the architecture of the proposed method. Given an utterance with maximum word sequence length $T$, we transform the sequence of input words $w_{1:T}$ into $m$-dimensional word embeddings $v_{1:T}$, which the forward and backward LSTMs use to produce feature representations $x$:

$$\overrightarrow{x_t}, \overrightarrow{c_t} = \mathrm{LSTM}(v_t, \overrightarrow{c_{t-1}}), \qquad \overleftarrow{x_t}, \overleftarrow{c_t} = \mathrm{LSTM}(v_t, \overleftarrow{c_{t+1}})$$

where $v_t$ denotes the word embedding of the input at time step $t$, $\overrightarrow{x_t}$ and $\overleftarrow{x_t}$ are the output vectors of the forward and backward LSTM respectively, and $\overrightarrow{c_t}$ and $\overleftarrow{c_t}$ are the cell state vectors of the forward and backward LSTM respectively.
We concatenate the last output vector of the forward LSTM $\overrightarrow{x_T}$ and the first output vector of the backward LSTM $\overleftarrow{x_1}$ into $x = [\overrightarrow{x_T}; \overleftarrow{x_1}]$ as the sentence representation. It captures the high-level semantic concepts learned by the model. We take $x$ as the input of the next stage.
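To make the feature-extraction step concrete, here is a minimal numpy sketch of a bidirectional LSTM pass that concatenates the last forward output with the first backward output. The cell implementation, dimensions, and random parameters are illustrative assumptions, not the trained model from the paper:

```python
import numpy as np

def lstm_step(v, h, c, W, U, b):
    """One LSTM step: gates computed from input v and previous hidden state h."""
    z = W @ v + U @ h + b                      # stacked gate pre-activations (4d,)
    d = len(c)
    i = 1 / (1 + np.exp(-z[:d]))               # input gate
    f = 1 / (1 + np.exp(-z[d:2*d]))            # forget gate
    o = 1 / (1 + np.exp(-z[2*d:3*d]))          # output gate
    g = np.tanh(z[3*d:])                       # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def bilstm_features(embeddings, params_fwd, params_bwd, d):
    """Run forward and backward passes, then concatenate x_T(fwd) with x_1(bwd)."""
    T = len(embeddings)
    h, c = np.zeros(d), np.zeros(d)
    fwd = []
    for t in range(T):                         # forward pass, left to right
        h, c = lstm_step(embeddings[t], h, c, *params_fwd)
        fwd.append(h)
    h, c = np.zeros(d), np.zeros(d)
    bwd = [None] * T
    for t in reversed(range(T)):               # backward pass, right to left
        h, c = lstm_step(embeddings[t], h, c, *params_bwd)
        bwd[t] = h
    # sentence representation: last forward output ++ first backward output
    return np.concatenate([fwd[-1], bwd[0]])

rng = np.random.default_rng(0)
m, d, T = 8, 4, 5                              # embedding dim, hidden dim, length
make = lambda: (rng.standard_normal((4*d, m)) * 0.1,
                rng.standard_normal((4*d, d)) * 0.1,
                np.zeros(4*d))
emb = rng.standard_normal((T, m))
x = bilstm_features(emb, make(), make(), d)
print(x.shape)                                 # (8,), i.e. 2*d features
```

In practice one would use a deep-learning framework's bidirectional LSTM layer; the sketch only shows where the two output vectors that form the sentence representation come from.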

Large Margin Cosine Loss (LMCL)
At the same time, we replace the softmax loss of BiLSTM with LMCL (Wang et al., 2018b). We define LMCL as the following:

$$L = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{s\,(\cos(\theta_{y_i,i})-m)}}{e^{s\,(\cos(\theta_{y_i,i})-m)} + \sum_{j \neq y_i} e^{s\,\cos(\theta_{j,i})}}$$

constrained by

$$\cos(\theta_{j,i}) = W_j^{T} x_i, \qquad W_j = \frac{W_j^{*}}{\lVert W_j^{*}\rVert}, \qquad x_i = \frac{x_i^{*}}{\lVert x_i^{*}\rVert}$$

where $N$ denotes the number of training samples, $y_i$ is the ground-truth class of the $i$-th sample, $s$ is the scaling factor, $m$ is the cosine margin, $W_j$ is the weight vector of the $j$-th class, and $\theta_{j,i}$ is the angle between $W_j$ and $x_i$. LMCL transforms softmax loss into cosine loss by applying L2 normalization on both features and weight vectors. It further maximizes the decision margin in the angular space. With normalization and the cosine margin, LMCL forces the model to maximize inter-class variance and to minimize intra-class variance. Then, we use the model as the feature extractor to produce discriminative intent representations.
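The loss above can be sketched in a few lines of numpy. The helper name `lmcl_loss` and the toy data are illustrative assumptions, not the paper's implementation; the sketch only computes the loss value, not gradients:

```python
import numpy as np

def lmcl_loss(X, W, y, s=30.0, m=0.35):
    """Large margin cosine loss on a batch.
    X: (N, d) features, W: (d, C) class weight vectors, y: (N,) labels."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize features
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # L2-normalize class weights
    cos = Xn @ Wn                                       # cosine similarities (N, C)
    idx = np.arange(len(y))
    logits = s * cos
    logits[idx, y] = s * (cos[idx, y] - m)              # subtract margin on the target class
    logits -= logits.max(axis=1, keepdims=True)         # numerically stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[idx, y].mean()                         # mean cross-entropy

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 8))
W = rng.standard_normal((8, 4))
y = rng.integers(0, 4, size=16)
loss_margin = lmcl_loss(X, W, y)            # with cosine margin m = 0.35
loss_plain  = lmcl_loss(X, W, y, m=0.0)     # plain normalized softmax loss
print(loss_plain, loss_margin)
```

Setting m = 0 recovers the normalized softmax loss, so the gap between the two values is exactly the extra penalty the margin imposes on the target class.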

Local Outlier Factor (LOF)
Finally, because the discovery of unknown intents is closely related to the context, we feed the discriminative deep features $x$ to the LOF algorithm (Breunig et al., 2000), which detects unknown intents through local density. We compute LOF as the following:

$$LOF_k(A) = \frac{\sum_{B \in N_k(A)} \frac{lrd_k(B)}{lrd_k(A)}}{|N_k(A)|}$$

where $N_k(A)$ denotes the set of k-nearest neighbors of $A$ and $lrd$ denotes the local reachability density. We define $lrd$ as the following:

$$lrd_k(A) = \frac{|N_k(A)|}{\sum_{B \in N_k(A)} \mathrm{reachdist}_k(A, B)}$$

that is, $lrd_k(A)$ is the inverse of the average reachability distance between object $A$ and its neighbors. We define $\mathrm{reachdist}_k(A, B)$ as the following:

$$\mathrm{reachdist}_k(A, B) = \max\{k\text{-}dist(B),\; d(A, B)\}$$

where $d(A, B)$ denotes the distance between $A$ and $B$, and $k\text{-}dist(B)$ denotes the distance of object $B$ to its $k$-th nearest neighbor. If an example's local density is significantly lower than that of its k-nearest neighbors, it is more likely to be regarded as an unknown intent.
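A brute-force sketch of these three definitions in numpy, assuming Euclidean distance and a small toy dataset (in practice one could use scikit-learn's `LocalOutlierFactor`):

```python
import numpy as np

def lof_scores(X, k=3):
    """Compute the LOF score for each row of X with brute-force k-NN search."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    np.fill_diagonal(D, np.inf)                           # exclude self from neighbors
    nn = np.argsort(D, axis=1)[:, :k]                     # indices of k nearest neighbors
    k_dist = np.sort(D, axis=1)[:, k - 1]                 # distance to the k-th neighbor
    # reachdist_k(A, B) = max{ k-dist(B), d(A, B) }
    reach = np.maximum(k_dist[nn], D[np.arange(n)[:, None], nn])
    lrd = 1.0 / reach.mean(axis=1)                        # local reachability density
    # LOF_k(A): average ratio of the neighbors' lrd to A's own lrd
    return lrd[nn].mean(axis=1) / lrd

rng = np.random.default_rng(2)
known = rng.normal(0, 0.3, size=(30, 2))     # dense cluster: features of known intents
outlier = np.array([[5.0, 5.0]])             # isolated point: candidate unknown intent
scores = lof_scores(np.vstack([known, outlier]), k=5)
print(scores[-1])                            # far above 1, so flagged as unknown
```

Points inside the cluster get LOF scores near 1 (their density matches their neighbors'), while the isolated point's score is much larger, which is exactly the signal used to reject it as an unknown intent.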

Datasets
We have conducted experiments on two publicly available benchmark dialogue datasets, SNIPS and ATIS (Tür et al., 2010). The detailed statistics are shown in Table 1.
SNIPS: a personal voice assistant dataset which contains 7 types of user intents across different domains.
ATIS (Airline Travel Information System): a dataset containing recordings of people making flight reservations, with 18 types of user intent in the flight domain.

Baselines
We compare our methods with state-of-the-art methods and a variant of the proposed method.
1. Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2016): Consider the maximum softmax probability of a sample as its score; if a sample does not belong to any known intent, its score will be lower. We apply a confidence threshold on the score as the simplest baseline, where the threshold is set as 0.5.
2. DOC (Shu et al., 2017): The state-of-the-art method, which replaces the softmax layer with a 1-vs-rest layer of sigmoid functions and tightens each class's decision threshold to reject unknown examples.
Table 2: Macro F1-score of unknown intent detection with different proportions (25%, 50% and 75%) of classes treated as known intents on the SNIPS and ATIS datasets.
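The MSP baseline amounts to thresholding the maximum softmax probability. A minimal sketch, with an illustrative function name and toy logits (the -1 label marks the unknown class):

```python
import numpy as np

def msp_predict(logits, threshold=0.5):
    """Label a sample as unknown (-1) when its maximum softmax probability
    falls below the confidence threshold; otherwise return the argmax class."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    pred = p.argmax(axis=1)
    pred[p.max(axis=1) < threshold] = -1                 # reject low-confidence samples
    return pred

logits = np.array([[4.0, 0.5, 0.2],     # confident: predicted as class 0
                   [1.0, 1.1, 0.9]])    # nearly flat: rejected as unknown
print(msp_predict(logits))
```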

LOF (Softmax)
A variant of the proposed method for ablation study. We use softmax loss to train the feature extractor rather than LMCL.

Experimental Settings
We follow the validation setting in (Fei and Liu, 2016; Shu et al., 2017) by keeping some classes out of training as unknown and integrating them back during testing. Then we vary the proportion of known classes in the training set in the range of 25%, 50%, and 75% and use all classes for testing. To conduct a fair evaluation on the imbalanced datasets, we randomly select known classes by weighted random sampling without replacement in the training set: if a class has more examples, it is more likely to be chosen as a known class, while classes with fewer examples still have a chance to be selected. The remaining classes are regarded as unknown, and we remove their examples from the training and validation sets.
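The weighted known-class selection above can be sketched as follows; the function name, seed, and per-class counts are hypothetical, chosen only to illustrate the sampling scheme:

```python
import numpy as np

def sample_known_classes(class_counts, proportion, seed=0):
    """Pick a proportion of classes as 'known' by weighted sampling without
    replacement; classes with more examples are more likely to be chosen."""
    rng = np.random.default_rng(seed)
    labels = np.array(list(class_counts))
    counts = np.array([class_counts[c] for c in labels], dtype=float)
    n_known = max(1, round(proportion * len(labels)))
    chosen = rng.choice(labels, size=n_known, replace=False, p=counts / counts.sum())
    return set(chosen)

# hypothetical per-class example counts for an imbalanced dataset
counts = {"flight": 3000, "airfare": 400, "ground_service": 250, "airline": 150}
known = sample_known_classes(counts, proportion=0.5)
print(known)   # 2 of the 4 classes, biased toward the larger ones
```

Because the sampling is without replacement, no class can be picked twice, and because it is weighted by class size, repeated runs mostly keep the head classes as known while still occasionally selecting a tail class.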
We initialize the embedding layer with GloVe (Pennington et al., 2014) pre-trained word vectors. For the BiLSTM model, we set the output dimension as 128 and the maximum epoch as 200 with early stopping. For LMCL and LOF, we follow the original settings in their papers: we set the scaling factor s as 30 and the cosine margin m as 0.35, as recommended by Wang et al. (2018a). We use macro F1-score as the evaluation metric and report the average result over 10 runs.

Results and Discussion
We show the experiment results in Table 2. Firstly, our method consistently performs better than all baselines in all settings. Compared with DOC, our method improves the macro F1-score on SNIPS by 6.7%, 16.2% and 14.9% in the 25%, 50%, and 75% settings respectively. This confirms the effectiveness of our two-stage approach.
Secondly, our method also outperforms LOF (Softmax). In Figure 2, we use t-SNE (Maaten and Hinton, 2008) to visualize the deep features learned with softmax and LMCL. We can see that the deep features learned with LMCL are intra-class compact and inter-class separable, which benefits novelty detection algorithms based on local density.
Thirdly, we observe that on the ATIS dataset, the performance of unknown intent detection dramatically drops as the number of known intents increases. We think the reason is that the intents of ATIS are all in the same domain and are very similar in semantics (e.g., flight and flight_no). The semantics of the unknown intents can easily overlap with those of the known intents, which leads to the poor performance of all methods.
Finally, compared with ATIS, our approach improves even more on SNIPS. Since the intents of SNIPS originate from different domains, the DNN learns a simple decision function when the known intents are dissimilar to each other. By replacing the softmax loss with the margin loss, we push the network to further reduce the intra-class variance and enlarge the inter-class variance, thus improving the robustness of the feature extractor.

Conclusion
In this paper, we proposed a two-stage method for unknown intent detection. Firstly, we train a BiLSTM classifier as the feature extractor. Secondly, we replace softmax loss with margin loss to learn discriminative deep features by forcing the network to maximize inter-class variance and to minimize intra-class variance. Finally, we detect unknown intents through a novelty detection algorithm. We also believe that broader families of anomaly detection algorithms are applicable to our method.
Extensive experiments conducted on two benchmark datasets show that our method can yield consistent improvements compared with the baseline methods. In future work, we plan to design a solution that can identify the unknown intent from known intents and cluster the unknown intents in an end-to-end fashion.