Unknown Intent Detection Using Gaussian Mixture Model with an Application to Zero-shot Intent Classification

User intent classification plays a vital role in dialogue systems. Since user intent may frequently change over time in many realistic scenarios, unknown (new) intent detection has become an essential problem, whose study has only just begun. This paper proposes a semantic-enhanced Gaussian mixture model (SEG) for unknown intent detection. In particular, we model utterance embeddings with a Gaussian mixture distribution and inject dynamic class semantic information into the Gaussian means, which enables learning more class-concentrated embeddings that facilitate downstream outlier detection. Coupled with a density-based outlier detection algorithm, SEG achieves competitive results on three real task-oriented dialogue datasets in two languages for unknown intent detection. On top of that, we propose to integrate SEG as an unknown intent identifier into existing generalized zero-shot intent classification models to improve their performance. A case study on a state-of-the-art method, ReCapsNet, shows that SEG can push the classification performance to a significantly higher level.


Introduction
Understanding user intent is crucial for developing conversational and dialogue systems. It is essential to accurately identify the intent behind a user utterance to better guide downstream decisions and policies. With the advent of conversational AI, dialogue systems are becoming central tools in many applications such as mobile apps, companion bots, virtual assistants and so on. Since user interests may change frequently over time, the AI agents may continuously see unknown (new) user intents. Manual annotation can hardly catch up with such rapid development, which motivates the problem
of unknown intent detection that has recently attracted increasing interest from both academia and industry. While there have been some pioneering works studying the open-world classification problem in natural language processing (Fei and Liu, 2016;Shu et al., 2017), very few methods are designed for unknown intent detection. To our knowledge, the first work is by Lin and Xu (2019), in which the authors use large margin cosine loss (LMCL) to learn deep discriminative features and then feed them to a density-based outlier detection algorithm to identify unknown intents. Although this method performs well on some benchmark datasets, it has two limitations. (1) In training, LMCL ignores the prior knowledge of class labels, while it has been shown that label correlations captured in the embedding space can improve prediction performance, especially in the zero-shot learning scenarios (Palatucci et al., 2009;Ma et al., 2016). (2) LMCL computes the cosine distance between embeddings in the feature space and trains with a softmax cross-entropy loss, making the embedding distribution of each class long and narrow (Wan et al., 2018), which may be less suitable for applying density-based anomaly detection algorithms to detect unknown intents.
In this paper, we aim to address these limitations and propose a novel semantic-enhanced Gaussian mixture model (SEG) for unknown intent detection. In contrast to the softmax function, the Gaussian mixture model enforces embeddings to form ball-like dense clusters in the feature space, which may be more desirable for outlier detection, especially when using density-based outlier detection algorithms. Furthermore, we propose to inject the semantic information of class labels into the Gaussian mixture distribution by assigning the embeddings of class labels or descriptions to be the means of the Gaussians. This enables SEG to learn more class-concentrated embeddings that can benefit downstream outlier detection. We further use a large margin loss to make SEG learn more discriminative features and employ a density-based outlier detection algorithm LOF (Breunig et al., 2000) to detect unknown intents.
Identifying unknown intents is not enough for some application scenarios where it is important to know what exactly the new intents are, e.g., zero-shot intent classification. Current generalized zero-shot intent classification methods (Chen et al., 2016;Kumar et al., 2017;Xia et al., 2018;Liu et al., 2019) attempt to classify test instances directly by making predictions in the pool of all the seen and unseen intents. However, their prediction performances are quite low, and they are still far from practical use. In this work, we propose to integrate SEG as an unknown intent identifier into the generalized zero-shot intent classification pipeline. The basic idea is that correctly identifying if the intent of an utterance is known or unknown will make the subsequent intent classification task much easier. We conduct a case study on a state-of-the-art zero-shot intent classification method ReCapsNet (Liu et al., 2019). The results show that incorporating SEG successfully improves the performance of ReCapsNet by a large margin. It even pushes the performance to a practical level on the SNIPS dataset (Coucke et al., 2018).
The main contributions of this paper are summarized as follows.
• We propose a semantic-enhanced Gaussian mixture model (SEG) for unknown intent detection by incorporating class semantic information into a Gaussian mixture distribution.
• We explore improving existing generalized zero-shot intent classification systems with an unknown intent identifier. To the best of our knowledge, this is the first attempt to apply unknown intent detection to this task.
• We conduct extensive experiments on three real-world datasets to validate the effectiveness of the proposed SEG model for unknown intent detection and its application in generalized zero-shot intent classification.
The rest of the paper is organized as follows. In Section 2, we review related works on intent classification and open-world classification. In Section 3, we discuss the proposed SEG model in detail.
In Section 4, we present experimental results on unknown intent detection. In Section 5, we apply SEG to improve generalized zero-shot intent classification and conduct a case study. Finally, Section 6 concludes the paper.
Related Work

Intent Classification
User intent classification is an important component of dialogue systems. Great effort has been made to understand user intent across various domains, ranging from search engine questions (Hu et al., 2009) to medical queries (Zhang et al., 2016). Deep learning models including convolutional neural networks (CNN) (Xu and Sarikaya, 2013) and attention-based recurrent neural networks (RNN) (Ravuri and Stolcke, 2015;Liu and Lane, 2016) are commonly used for intent classification. CNN based methods build sentence embeddings by aggregating embeddings of adjacent words, while RNN based methods extract sentence embeddings via encoding word embeddings sequentially. Both types of methods have shown promising results in practice (Yin et al., 2017).
Traditional intent classification methods require a considerable amount of labeled data for each class to train a discriminative classifier, while zero-shot intent classification (Sappadla et al., 2016) addresses the problem that not all intent categories are seen during the training phase, which is an important task in natural language understanding as novel intents may continuously emerge in dialogue systems (Liu and Lane, 2016; Xu and Sarikaya, 2013). Zero-shot intent classification aims to generalize knowledge and concepts learned from seen intents to recognize unseen intents. Early methods (Ferreira et al., 2015a,b; Yazdani and Henderson, 2015) explore the relationship between seen and unseen intents by introducing external resources such as manually defined attributes or label ontologies, but such resources are usually expensive to obtain. To deal with this, some methods (Chen et al., 2016; Kumar et al., 2017) map the utterances and intent labels to an embedding space and then model their relations in that space. Recently, IntentCapsNet-ZS (Xia et al., 2018) extends capsule networks (Sabour et al., 2017) for zero-shot intent classification by transferring the prediction vectors from seen classes to unseen classes. ReCapsNet (Liu et al., 2019) shows that IntentCapsNet-ZS hardly recognizes utterances from unseen intents in the generalized zero-shot classification scenario, and proposes to solve this issue by transferring the transformation matrices from seen intents to unseen intents. In this paper, we use ReCapsNet as an example to show that incorporating an unknown intent identifier in the generalized zero-shot classification pipeline can significantly improve the prediction performance on unseen intents and the overall performance.

Figure 1: Illustration of the proposed framework for unknown intent classification. The backbone network is a self-attention Bi-LSTM encoder, which is trained by the proposed semantic-enhanced large margin Gaussian mixture loss (SEG classifier). In the testing phase, LOF is employed to detect outliers. The predicted outliers will be considered as unseen intent class instances, while the inliers will be classified by the SEG classifier.

Open-world Classification
Most existing classification methods make the closed-world assumption, that is, no new classes can appear in testing. However, the real world is open and dynamic, and in many applications, the AI agent cannot expect to see everything in training, which makes open-world learning or classification an important problem.
There are two major approaches to tackle open-world classification. One is to have the classifier output an additional confidence score that measures the probability that a test sample is seen or unseen. cbsSVM (Fei and Liu, 2016) proposes a center-based similarity (CBS) learning strategy and employs SVM to build 1-vs-rest CBS classifiers. MSP (Hendrycks and Gimpel, 2017) proposes to use the maximum softmax probability as the confidence score. Instead of using softmax as the final output layer, DOC (Shu et al., 2017) builds a multi-class classifier with a 1-vs-rest final layer which contains a sigmoid function for each seen class to reduce the open space risk.
The other approach is to treat open-world classification as an outlier detection problem by exploiting anomaly detection methods such as robust covariance estimators (Rousseeuw and Driessen, 1999), one-class SVM (Schölkopf et al., 2001), isolation forest (Liu et al., 2008) and local outlier factor (Breunig et al., 2000). Robust covariance estimation assumes the data follow a Gaussian distribution and fits an elliptic envelope; outliers are the points that lie far enough from the fitted shape. One-class SVM finds a hyperplane that encircles the positive samples as the decision boundary. Isolation forest uses binary search trees (isolation trees) to isolate samples; because outliers are few and lie far from most samples, they are isolated earlier and end up closer to the root of an isolation tree. Local outlier factor (LOF) is a density-based algorithm that compares the density of a point with those of its neighbors to determine whether the point is anomalous: the lower its relative density, the more likely it is an outlier. In addition, to facilitate anomaly detection, some methods (Lin and Xu, 2019; Wan et al., 2018) use large margin loss functions to learn more discriminative feature representations.

Feature Extraction
Given an utterance $x = \{w_1, w_2, \ldots, w_T\}$ with $T$ words, where $w_t \in \mathbb{R}^{d_w}$ is the embedding of the $t$-th word, each word is encoded sequentially using a bidirectional LSTM (BiLSTM):

$\overrightarrow{h}_t = \mathrm{LSTM}_{fw}(w_t, \overrightarrow{h}_{t-1}), \quad \overleftarrow{h}_t = \mathrm{LSTM}_{bw}(w_t, \overleftarrow{h}_{t+1}),$ (1)

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the hidden states of the word $w_t$ produced by the forward LSTM $fw$ and the backward LSTM $bw$, respectively. The word $w_t$ is encoded as the concatenated hidden state $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, and the hidden state matrix of the utterance is $H = [h_1, h_2, \ldots, h_T] \in \mathbb{R}^{2d_h \times T}$. Furthermore, we use the self-attention mechanism to obtain the sentence embedding. Specifically,

$a = \mathrm{softmax}(W_{s2} \tanh(W_{s1} H)), \quad z = W H a^{\top},$ (2)

where $a \in \mathbb{R}^{T}$ is the self-attention weight vector, $W_{s1} \in \mathbb{R}^{d_a \times 2d_h}$ and $W_{s2} \in \mathbb{R}^{1 \times d_a}$ are trainable parameters, $W \in \mathbb{R}^{d_z \times 2d_h}$ is a trainable feed-forward weight matrix, and $z \in \mathbb{R}^{d_z}$ is the final representation of the utterance $x$.
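As a concrete illustration, the self-attentive pooling step above can be sketched in a few lines of numpy. This is a didactic sketch, not the trained model: the random matrices stand in for learned parameters, and shapes follow the notation in this section.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_pool(H, W_s1, W_s2, W):
    """Self-attentive pooling of BiLSTM hidden states.

    H    : (2*d_h, T) hidden-state matrix of the utterance
    W_s1 : (d_a, 2*d_h) and W_s2 : (1, d_a) attention parameters
    W    : (d_z, 2*d_h) feed-forward projection
    Returns the utterance embedding z of shape (d_z,).
    """
    a = softmax((W_s2 @ np.tanh(W_s1 @ H)).ravel())  # (T,) attention weights
    m = H @ a                                        # (2*d_h,) weighted sum of states
    return W @ m                                     # (d_z,) final representation z

rng = np.random.default_rng(0)
T, d_h, d_a, d_z = 5, 4, 3, 2
H = rng.standard_normal((2 * d_h, T))
z = attention_pool(H,
                   rng.standard_normal((d_a, 2 * d_h)),
                   rng.standard_normal((1, d_a)),
                   rng.standard_normal((d_z, 2 * d_h)))
print(z.shape)  # (d_z,)
```

In practice the BiLSTM states `H` would come from a trained encoder; only the pooling arithmetic is shown here.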

Semantic-Enhanced Large Margin Gaussian Mixture Loss
The softmax cross-entropy loss is widely used in many machine learning problems. However, the embedding distribution of each class learned by the softmax cross-entropy loss tends to be long, narrow, and radiating from the center, with different classes distributed closely next to each other (Wan et al., 2018). Such an embedding distribution may not be ideal for detecting new intent classes, as there might not be much space left for new classes. In contrast, the Gaussian mixture loss can enforce each class to gather into a dense, compact cluster, which may be more desirable for detecting new intents.
Here, we design a semantic-enhanced large margin Gaussian mixture loss for embedding learning.
Large-Margin Cross-Entropy Loss. Given a K-way classification task, we assume the extracted feature vector (embedding) $z$ of the training samples follows a Gaussian mixture distribution, where $\mu_k$ and $\Sigma_k$ are the mean and covariance of class $k$ in the embedding space, respectively, and $p(k)$ is the prior probability of class $k$. The probability density function of $z$ is given by

$p(z) = \sum_{k=1}^{K} \mathcal{N}(z; \mu_k, \Sigma_k)\, p(k),$ (3)

where $\mathcal{N}(z; \mu_k, \Sigma_k)$ is the Gaussian distribution. For the embedding $z_i$ of any training sample $x_i$, the posterior probability that $z_i$ belongs to its class $y_i$ can be expressed as

$p(y_i \mid z_i) = \frac{\mathcal{N}(z_i; \mu_{y_i}, \Sigma_{y_i})\, p(y_i)}{\sum_{k=1}^{K} \mathcal{N}(z_i; \mu_k, \Sigma_k)\, p(k)}.$ (4)

The cross-entropy loss of $z_i$ between the true class label $y_i$ and the inference $p(y_i \mid z_i)$ can then be computed as

$\mathcal{L}_{ce,i} = -\log p(y_i \mid z_i),$ (5)

and the total loss of $N$ training samples is

$\mathcal{L}_{ce} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{ce,i}.$ (6)

Let $d_k$ be the (halved, squared) Mahalanobis distance between $z_i$ and $\mu_k$, i.e.,

$d_k = \frac{1}{2} (z_i - \mu_k)^{\top} \Sigma_k^{-1} (z_i - \mu_k).$ (7)

Then $\mathcal{L}_{ce,i}$ can be expressed as

$\mathcal{L}_{ce,i} = -\log \frac{p(y_i)\, |\Sigma_{y_i}|^{-1/2}\, e^{-d_{y_i}}}{\sum_{k=1}^{K} p(k)\, |\Sigma_k|^{-1/2}\, e^{-d_k}}.$ (8)

Consider a simplified case where $p(k)$ and $\Sigma_k$ are identical for all classes. In this case, the model gives a correct prediction for $z_i$ if the distance of $z_i$ to its class mean $\mu_{y_i}$ is less than or equal to its distance to any other class mean.
In general, a large margin loss helps to improve classification performance. Here, we also introduce a classification margin $m \in [1, +\infty)$ into the cross-entropy loss, which then becomes

$\mathcal{L}^{m}_{ce,i} = -\log \frac{p(y_i)\, |\Sigma_{y_i}|^{-1/2}\, e^{-m d_{y_i}}}{p(y_i)\, |\Sigma_{y_i}|^{-1/2}\, e^{-m d_{y_i}} + \sum_{k \neq y_i} p(k)\, |\Sigma_k|^{-1/2}\, e^{-d_k}}.$ (9)

With the large margin loss, $z_i$ is correctly classified only when its distance to the class mean $\mu_{y_i}$ is significantly less than (no more than $\frac{1}{m}$ of) its distance to any other class mean.

Semantic Enhancement via Class Description
This is one of the key features of our proposed method. We inject the semantic information of each class into the Gaussian mixture model by assigning the embedding learned from the text description $d_k$ of class $k$ to be the class centroid $\mu_k$. The text description $d_k$ can either be a single-word class name or a sentence or paragraph that describes the class. That is,

$\mu_k = \mathrm{feature\_extract}(d_k),$ (10)

where $\mathrm{feature\_extract}(\cdot)$ denotes the feature extraction module in Section 3.1.
Generation Loss. In addition to the cross-entropy loss, we want to maximize the likelihood of the observed embeddings under the Gaussian mixture distribution. Specifically, we minimize the following negative log-likelihood:

$\mathcal{L}_{g} = -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{N}(z_i; \mu_{y_i}, \Sigma_{y_i}) = \frac{1}{N} \sum_{i=1}^{N} \left( d_{y_i} + \frac{1}{2} \log |\Sigma_{y_i}| \right) + \mathrm{const},$ (11)

where const is a constant. As shown in Eq. (11), the generation loss $\mathcal{L}_g$ encourages the embedding $z_i$ to be close to its class centroid $\mu_{y_i}$, which facilitates learning a more class-concentrated embedding distribution that may benefit the downstream outlier detection task. Combining the cross-entropy loss and the generation loss, the total objective function is

$\mathcal{L} = \mathcal{L}^{m}_{ce} + \lambda \mathcal{L}_{g},$ (12)

where $\lambda$ is a trade-off parameter.
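The generation loss and the combined objective can be sketched as follows, again under the simplifying assumption of identity covariances (so Eq. (11) reduces to the mean Mahalanobis distance to the true class centers, with constants dropped). Names are illustrative:

```python
import numpy as np

def generation_loss(Z, y, mu):
    """Negative log-likelihood of embeddings under their own class Gaussians,
    with identity covariances and constants dropped: the mean of d_{y_i}.

    Z : (N, d) embeddings, y : (N,) labels, mu : (K, d) class means.
    """
    diff = Z - mu[y]
    return 0.5 * (diff ** 2).sum(axis=1).mean()

def total_loss(lce, Z, y, mu, lam=0.5):
    """Total objective of Eq. (12): cross-entropy plus lam * generation loss."""
    return lce + lam * generation_loss(Z, y, mu)

mu = np.array([[0.0, 0.0], [4.0, 4.0]])
Z = np.array([[0.1, 0.0], [3.9, 4.2]])
y = np.array([0, 1])
print(generation_loss(Z, y, mu))  # small when embeddings hug their class means
```

Pulling each embedding toward its (semantically assigned) class mean is what tightens the per-class clusters for the downstream LOF step.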

Outlier Detection
By the above feature learning procedure, each utterance $x$ is encoded as an embedding $z$, which is then fed to the well-known outlier detection algorithm LOF (Breunig et al., 2000) to detect new or unknown intents (outliers). LOF is an unsupervised, density-based anomaly detection method built on the following intuition: by comparing the local density of an object to those of its neighbors, one can identify regions of similar density, and objects with substantially lower density than their neighbors are considered outliers. LOF defines the local outlier factor of an object $z$ as

$\mathrm{LOF}_k(z) = \frac{\sum_{o \in N_k(z)} \frac{\mathrm{lrd}(o)}{\mathrm{lrd}(z)}}{|N_k(z)|},$ (13)

where $N_k(z)$ denotes the set of $k$-nearest neighbors of $z$, and lrd denotes the local reachability density, which measures the local density around an object. The local reachability density is defined as the inverse of the average reachability distance between $z$ and its neighbors, i.e.,

$\mathrm{lrd}(z) = \frac{|N_k(z)|}{\sum_{o \in N_k(z)} \text{reach-dist}_k(z, o)}.$ (14)

Here, the reachability distance $\text{reach-dist}_k(z, o)$ is defined as

$\text{reach-dist}_k(z, o) = \max\{k\text{-dist}(o),\; d(z, o)\},$ (15)

where $k\text{-dist}(o)$ denotes the distance of the object $o$ to its $k$-th nearest neighbor, and $d(z, o)$ is the distance between $z$ and $o$.
If the LOF score of an utterance is much larger than 1, the utterance has a substantially lower local density than its neighbors, meaning its embedding is relatively distant from them. Hence, it can be inferred that the utterance is likely to belong to an unknown intent class. Figure 1 illustrates the overall training and testing procedures of the proposed framework for unknown intent detection. The backbone network is a self-attention Bi-LSTM encoder. In the training phase, the encoder is trained by minimizing the semantic-enhanced large margin Gaussian mixture loss (the SEG classifier) as in Eq. (12) on the training samples (seen intent class instances). In the testing phase, user utterances may come from both seen and unseen intent classes. Given an utterance, we first obtain its feature representation z with the trained encoder, and then use LOF to decide whether z is an outlier. If it is, we take the utterance as an instance of some new intent class; otherwise, we classify z to one of the seen intent classes using the SEG classifier.
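As a sanity check, the LOF definitions above can be implemented directly in plain numpy for small data (a didactic sketch; in practice a library implementation would be used):

```python
import numpy as np

def lof_scores(X, k=2):
    """Local outlier factor computed directly from its definition.

    X : (N, d) embeddings; k : number of neighbors.
    Returns an (N,) array; scores much larger than 1 indicate outliers.
    """
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-distances
    nbrs = np.argsort(D, axis=1)[:, :k]          # N_k(z): k nearest neighbors
    k_dist = np.sort(D, axis=1)[:, k - 1]        # k-dist(o)
    # reach-dist_k(z, o) = max(k-dist(o), d(z, o))
    reach = np.maximum(k_dist[nbrs], np.take_along_axis(D, nbrs, axis=1))
    lrd = 1.0 / reach.mean(axis=1)               # local reachability density
    return lrd[nbrs].mean(axis=1) / lrd          # ratio of neighbor lrd to own lrd

# Four points in a tight cluster and one far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = lof_scores(X, k=2)
print(scores.round(2))  # cluster points score ~1; the last point scores >> 1
```

Cluster members get scores near 1 because their density matches their neighbors'; the isolated point's local reachability density collapses, driving its score far above 1.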

Experiments
In this section, we present experimental results on unknown intent detection. Formally, we train an  unknown intent detection system with training data D tr = (X tr , Y tr ), where Y tr ∈ {l 1 , · · · , l K } = C seen (the set of seen intent classes). For test utterances of seen intents, the unknown intent detection system aims to assign correct intent labels to them. For test utterances of unseen intents, the system is expected to identify them as outliers.
SNIPS is an open-source single-turn English corpus, which contains 7 types of user intents across different domains. ATIS is also an English dataset, which contains 18 types of user intents in the airline travel domain. SMP-2018 is a Chinese dialogue corpus for user intent recognition, which contains 30 different types of user intents. The statistics of the datasets are summarized in Table 1. We compare SEG with the following unknown intent detection methods.
• Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2017) considers the maximum softmax probability of a sample as the confidence score to measure the probability that it belongs to a seen intent. The smaller the confidence score is, the more likely it belongs to an unknown intent.
• DOC (Shu et al., 2017) builds m 1-vs-rest sigmoid classifiers for m seen classes respectively. The maximum probability is considered as the confidence of whether the sample belongs to the seen intent.
• Softmax. An ablation of our method SEG that uses the softmax loss instead of the Gaussian mixture distribution to learn discriminative features.
• LMCL (Lin and Xu, 2019) uses large margin cosine loss instead of Gaussian mixture distribution to learn discriminative embeddings.
• SEG/o. A variant of our method SEG. It does not inject the class semantic information into the Gaussian mixture model.

Experimental Setup
We follow the setting of LMCL (Lin and Xu, 2019) for unknown intent detection. Considering that some datasets may be unbalanced, we randomly select seen intents by weighted random sampling over the entire intent set; the remaining intents are regarded as unknown. We randomly select 30% of the samples of each intent to form the test set, and the rest of each seen intent is added to the training set. We also follow LMCL in using the macro F1-score as the evaluation metric, which is sensible because the ATIS dataset is extremely unbalanced. For SNIPS, ATIS and SMP-2018, we use 300-dimensional embeddings pre-trained with FastText, GloVe, and Chinese-Word-Vectors, respectively. For the BiLSTM, we set the number of layers to 2 and the output dimension to 128. In the self-attention layer, we set the attention dimension d_a = 10. After the self-attention layer, we project the feature vector to a d_z-dimensional vector via a linear layer, with d_z = 12 for SNIPS and SMP-2018, and d_z = 4 for ATIS. We report average results over 10 runs. For the loss function, we set the margin m = 1 and the trade-off parameter λ = 0.5.
For MSP, we set the threshold as 0.5 following Lin and Xu (2019). For DOC, we set the threshold as 0.5 as used in the original paper. During training of MSP and DOC, we clip the gradient norm to avoid gradient exploding. For LMCL, we follow the original paper to set the scaling factor s = 30 and the cosine margin m = 0.35. Softmax, LMCL, SEG/o and SEG all use LOF as the outlier detector, and we use the same set of parameters for LOF.

Result Analysis
From Table 2, it can be seen that our method SEG outperforms the baselines in most cases. In particular, on the most challenging dataset, SMP-2018, SEG and SEG/o outperform the others by a large margin, demonstrating their effectiveness. Moreover, we can make the following observations: (1) SEG outperforms SEG/o in most cases, which demonstrates the effectiveness of the proposed semantic enhancement mechanism.
(2) SEG/o generally has higher scores than Softmax and LMCL, especially on the more complex dataset SMP-2018, where significant gaps can be observed. The results indicate the advantage of Gaussian mixture model over Softmax and the variant LMCL for learning class-concentrated embeddings, which are more suitable to be coupled with the outlier detector LOF.
(3) All the methods work well on SNIPS, which is a simple dataset. MSP and DOC outperform the other methods on ATIS when only 25% of the classes are seen. However, as the proportion of seen classes increases, their performance declines significantly. This is because ATIS is severely imbalanced: one intent accounts for 96% of the entire dataset. When there are many seen classes, DOC and MSP cannot learn an effective supervised classifier due to the dominance of that class.

Application in Generalized Zero-shot Intent Classification
In this section, we apply our method SEG to an extended application of unknown intent classification: zero-shot intent classification, which aims to discriminate unseen intents rather than only detect their existence. Specifically, given the training data D_tr = (X_tr, Y_tr) where Y_tr ∈ C_seen, a zero-shot classification system is trained to predict the label ŷ_te of any test sample, which may belong to an unseen class, using knowledge transferred from the seen data. There are two common settings for zero-shot learning: generalized zero-shot classification, where ŷ_te ∈ C_seen ∪ C_unseen, and standard zero-shot classification, where ŷ_te ∈ C_unseen. Here, C_unseen is the set of unseen intent classes. Previous attempts tackle the challenge of zero-shot intent classification from three directions.
(2) How to better utilize this prior knowledge to extract more informative semantic representations, such as data augmentation and hierarchical representations learned by capsule networks (Xia et al., 2018). (3) With the extracted semantic features, how to design a better zero-shot learning strategy, such as reconstructing the weight matrix for unseen intents through relation learning (Liu et al., 2019).
In this work, we improve generalized zero-shot intent classification by integrating the proposed SEG model as a binary unknown intent identifier into the original pipeline. We explore multiple ways of integration and conduct a case study based on a state-of-the-art method ReCapsNet (Liu et al., 2019).

Integrating Unknown Intent Identifier
As shown in Figure 2, a typical generalized zero-shot classification framework can be abstracted into two layers: the encoder layer and the zero-shot classifier layer. In the encoder layer, a user utterance x in text form is first mapped to a semantic representation z_x^ZS. In addition, it is common to encode class information as S for better semantic learning or knowledge transfer, and prior knowledge is usually incorporated at this stage to learn a better semantic representation. The learned representation is then fed to the zero-shot classifier layer, where various zero-shot classification strategies transfer knowledge to new categories. Finally, the system outputs the prediction ŷ_te ∈ C_seen ∪ C_unseen for the utterance x.
We integrate SEG into the pipeline between the encoder layer and the classifier layer, as shown in Figure 3. Given the semantic feature z_x, we predict whether the utterance x is an outlier via p(g | z_x), g ∈ {"seen", "unseen"}. If g = "seen", the intent of the utterance is considered to be a seen one, and we predict it by p(y | z_x, y ∈ C_seen, X_tr, θ), where θ denotes the parameters of the original framework. Otherwise, the intent of the utterance is considered to be unseen, and we predict it via p(y | z_x, y ∈ C_unseen, X_tr, θ).
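The gating logic above can be sketched as follows. All names are illustrative stand-ins: `is_outlier` plays the role of p(g | z_x) (LOF in our framework), and two toy nearest-mean classifiers stand in for the seen- and unseen-intent classifiers of the original pipeline.

```python
import numpy as np

def gated_predict(z, is_outlier, seen_classifier, unseen_classifier):
    """Two-stage generalized zero-shot prediction: gate the embedding as
    seen vs. unseen, then dispatch to the matching classifier."""
    if is_outlier(z):                     # unknown-intent identifier
        return "unseen", unseen_classifier(z)
    return "seen", seen_classifier(z)

def nearest_mean(means):
    """Toy classifier: index of the closest class mean."""
    return lambda z: int(np.argmin(np.linalg.norm(z - means, axis=1)))

seen_means = np.array([[0.0, 0.0], [2.0, 2.0]])
unseen_means = np.array([[8.0, 8.0], [12.0, 0.0]])
# Toy identifier: an embedding far from every seen mean is an outlier.
is_outlier = lambda z: np.linalg.norm(z - seen_means, axis=1).min() > 3.0

print(gated_predict(np.array([0.2, 0.1]), is_outlier,
                    nearest_mean(seen_means), nearest_mean(unseen_means)))
print(gated_predict(np.array([8.3, 7.9]), is_outlier,
                    nearest_mean(seen_means), nearest_mean(unseen_means)))
```

The point of the gate is that each downstream classifier only has to discriminate within its own label pool, which is a much easier problem than predicting over C_seen ∪ C_unseen directly.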
Feature Assemble. We adopt two ways, "Separate" and "Combine", to assemble features for the downstream outlier detection task.
• Separate (Sep). We directly feed the output of the pre-trained SEG encoder, $z_x^{SEG}$, to LOF for outlier detection, i.e.,

$z_x = z_x^{SEG}.$ (17)

• Combine. To take advantage of the original model, we first obtain the original semantic feature representation $z_x^{ZS}$ and define a transform function $f$. Then, $f(z_x^{ZS})$ is concatenated with the pre-trained SEG features $z_x^{SEG}$ to form a combined feature representation:

$z_x = [f(z_x^{ZS}); z_x^{SEG}].$ (18)
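Assuming `z_zs` and `z_seg` are the two feature vectors and `f` an arbitrary transform, the two assembly modes amount to the following sketch:

```python
import numpy as np

def assemble_features(z_zs, z_seg, mode="combine", f=None):
    """Assemble the feature fed to LOF: either the SEG embedding alone
    ("sep"), or the transformed zero-shot feature concatenated with the
    SEG embedding ("combine"). `f` defaults to the identity transform.
    """
    if mode == "sep":
        return z_seg
    f = f or (lambda v: v)
    return np.concatenate([f(z_zs), z_seg])

z_zs, z_seg = np.ones(6), np.zeros(4)
print(assemble_features(z_zs, z_seg, "sep").shape)      # SEG features only
print(assemble_features(z_zs, z_seg, "combine").shape)  # concatenated features
```

"Combine" simply widens the representation with whatever the original zero-shot encoder already captured, which is why it tends to help when that encoder is informative.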

A Case Study on ReCapsNet
ReCapsNet. Recently, ReCapsNet (Liu et al., 2019) has demonstrated state-of-the-art performance in generalized zero-shot intent classification. In this section, we conduct a case study on integrating the new intent identifier into ReCapsNet. The framework of ReCapsNet is illustrated in Figure 4. In the encoder layer, each utterance x is encoded with R semantic capsules $[m_1, m_2, \ldots, m_R]$ as the representations in R different semantic spaces. In addition, the training set D_tr and the class labels L are encoded as S_tr and S_C, respectively. In the zero-shot classifier layer, $z_x^{ZS}$ is fed to a capsule network to make predictions. Each seen class k has R transformation matrices $\{W_{kr}\}_{r=1}^{R}$. In the testing phase, ReCapsNet reconstructs the r-th transformation matrix for each unseen class l as $W_{lr} = \sum_k q_{lk} W_{kr}$, where $q_{lk}$ is the relation between unseen class l and seen class k learned from (S_tr, Y_tr) and S_C by metric learning.
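The reconstruction step can be sketched with a single einsum. The shapes are assumed for illustration; in ReCapsNet the matrices would come from the trained capsule network and the relations Q from metric learning.

```python
import numpy as np

def reconstruct_unseen_matrices(W_seen, Q):
    """Build transformation matrices for unseen classes as relation-weighted
    sums of the seen-class matrices: W_{lr} = sum_k q_{lk} W_{kr}.

    W_seen : (K, R, d_out, d_in) per-seen-class, per-capsule matrices
    Q      : (L, K) learned relations between unseen and seen classes
    Returns an array of shape (L, R, d_out, d_in).
    """
    return np.einsum('lk,krij->lrij', Q, W_seen)

K, L, R = 3, 2, 2
W_seen = np.arange(K * R * 4 * 5, dtype=float).reshape(K, R, 4, 5)
Q = np.array([[0.5, 0.5, 0.0],   # unseen class 0 blends seen classes 0 and 1
              [0.0, 0.2, 0.8]])  # unseen class 1 leans on seen class 2
W_unseen = reconstruct_unseen_matrices(W_seen, Q)
print(W_unseen.shape)  # (L, R, d_out, d_in)
```

Each unseen class thus inherits a transformation matrix that is a convex-style blend of its most related seen classes.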
For the variant "Combine", to exploit the property that each utterance is variously represented in different semantic spaces, as discussed in Liu et al. (2019), we define the semantic feature representation of ReCapsNet as the concatenation of its semantic capsules, $z_x^{ZS} = [m_1; m_2; \ldots; m_R]$.

Experimental Setup. We integrate SEG into the ReCapsNet pipeline with both the "Sep" and "Combine" variants and test the performance of generalized zero-shot classification. Following the settings of generalized zero-shot classification in Liu et al. (2019), we test our methods on two datasets, SNIPS (Coucke et al., 2018) and SMP-2018 (Zhang et al., 2017), and report the micro-averaged recall (accuracy) and F1 scores. The baselines include DeVISE (Frome et al., 2013), CMT (Socher et al., 2013), CDSSM (Chen et al., 2016), Zero-shot DNN (Kumar et al., 2017), IntentCapsNet (Xia et al., 2018), and ReCapsNet (Liu et al., 2019). The average results over 10 runs of our methods and ReCapsNet are reported in Table 3, where the results of the other baselines are taken from Liu et al. (2019).

Table 3: Results of generalized zero-shot intent classification equipped with our new intent identifier SEG. "Seen", "Unseen" and "Overall" respectively denote the prediction performance on the utterances from seen intents, unseen intents, and both seen and unseen intents. The suffixes "/w" and "/o" stand for with and without semantic enhancement, respectively. The top 2 results for each metric are marked in bold.
We use the same setting and hyper-parameters as in ReCapsNet (Liu et al., 2019). We set d z =4 for SNIPS and d z =12 for SMP-2018. The rest of the parameters of SEG are the same as those used in Section 4.2. In addition, we also conduct an ablation study to demonstrate the effectiveness of the proposed semantic enhancement mechanism by testing two variants of our integration ("Sep / o" and "Combine / o") without using it.
Result Analysis. From the results in Table 3, we can make the following observations: (1) All variants of our integration achieve a significant boost in overall accuracy and F1 scores on both datasets, especially on SNIPS, where the improvement is substantial. Each variant yields a qualitative leap in performance on unseen intents. The prediction accuracy (micro-averaged recall) on seen intents may drop compared to ReCapsNet and the other baselines, since some utterances of seen intents are classified as unseen. However, the F1 score on seen intents increases significantly, indicating much higher precision than that of the baselines.
(2) The variants of our integration with semantic enhancement significantly outperform those without using it on predicting unseen intents by very large margins. Although their accuracy scores on seen intents are lower, their overall accuracy and F1 scores are consistently better, which confirms the effectiveness of semantic enhancement.
(3) It can be seen that the "Combine" variants generally perform much better than the "Sep" variants, especially the one with semantic enhancement ("Combine / w"), which performs outstandingly. It surpasses the performance of "Sep / w" in every metric, demonstrating the usefulness of the simple feature assemble strategy of concatenating the feature representations of ReCapsNet and SEG.

Conclusion
In this paper, we have proposed SEG, a semantic-enhanced Gaussian mixture model coupled with an LOF outlier detector, for unknown (new) intent detection. We empirically verified the effectiveness of SEG for unknown intent detection on real dialogue datasets in English and Chinese. Furthermore, we successfully applied SEG to improve generalized zero-shot intent classification and achieved remarkable performance gains over the recent competitive method ReCapsNet. In future work, we plan to conduct more empirical studies on SEG and further improve its performance on new intent identification. We also plan to conduct more case studies applying SEG to boost the performance of current zero-shot intent classification methods.