Reconstructing Capsule Networks for Zero-shot Intent Classification

Intent classification is an important building block of dialogue systems. With the burgeoning of conversational AI, existing systems are not capable of handling numerous fast-emerging intents, which motivates zero-shot intent classification. Nevertheless, research on this problem is still in the incipient stage and few methods are available. A recently proposed zero-shot intent classification method, IntentCapsNet, has been shown to achieve state-of-the-art performance. However, it has two unaddressed limitations: (1) it cannot deal with polysemy when extracting semantic capsules; (2) it hardly recognizes the utterances of unseen intents in the generalized zero-shot intent classification setting. To overcome these limitations, we propose to reconstruct capsule networks for zero-shot intent classification. First, we introduce a dimensional attention mechanism to fight against polysemy. Second, we reconstruct the transformation matrices for unseen intents by utilizing abundant latent information of the labeled utterances, which significantly improves the model generalization ability. Experimental results on two task-oriented dialogue datasets in different languages show that our proposed method outperforms IntentCapsNet and other strong baselines.


Introduction
With the advent of conversational AI, task-oriented spoken dialogue systems are becoming ubiquitous, e.g., chatbots deployed in different applications, or modules integrated in popular virtual personal assistants like Apple Siri or Microsoft Cortana. To improve business effectiveness and user satisfaction, accurately identifying the intents behind user utterances is indispensable. However, intent classification is extremely challenging, not only because user queries are sometimes short and expressed diversely, but also because a system may continuously encounter new or unacquainted intents that pop up quickly from various domains. Conventional intent classification methods (Hu et al., 2009; Tur et al., 2012; Xu and Sarikaya, 2013; Ravuri and Stolcke, 2015; Liu and Lane, 2016) typically train a supervised learning model on large amounts of labeled data, and are not effective in recognizing emerging unseen intents. Several zero-shot learning approaches have attempted to address the challenge of classifying intents whose instances are not present during training. One common idea is to utilize external resources (Ferreira et al., 2015a,b; Yazdani and Henderson, 2015; Kumar et al., 2017; Zhang et al., 2019) such as label ontologies or manually defined attributes. However, such external resources are usually unavailable, as they require substantial extra time and expensive human labour to produce. To implement zero-shot intent classification more easily and intelligently, recent works rely more on the word embeddings of intent labels, which can be easily pretrained on text corpora. Methods proposed by Chen et al. (2016) and Kumar et al. (2017) utilize neural networks to project intent labels and data samples to the same semantic space and then measure their similarity. However, learning a good projection function is usually difficult due to the diversity of user expressions, especially in complex domains such as medical queries (Zhang et al., 2016).

* Equal contribution. † Corresponding author.
Unlike previous models, IntentCapsNet (Xia et al., 2018) employs capsule networks to extract high-level semantic features and then transfers the prediction vectors for seen intents to unseen intents. Although IntentCapsNet has achieved impressive performance on some zero-shot intent classification tasks, it has two unaddressed limitations. (1) The self-attention module of IntentCapsNet cannot handle polysemy, which weakens the representation capacity of semantic capsules. (2) For the generalized zero-shot classification setting, where newly arrived utterances come from both seen and unseen intents, the way IntentCapsNet constructs the prediction vectors can easily cause the model to completely fail in detecting unseen intents, which is clearly undesirable and inadequate for real dialogue systems.

Figure 1: Illustration of our framework ReCapsNet-ZS. In the training process, labeled utterances are first encoded by Bi-LSTM. Then, a set of semantic capsules are extracted via the dimensional attention module. Finally, these semantic capsules are fed to a capsule network to train a model for predicting the seen intents. In the testing process, to predict the unseen intents, a metric learning method is trained on labeled utterances and intent label embeddings to learn the similarities between the unseen and seen intents. Then, the learned similarities and the transformation matrices for the seen intents trained by capsule networks are used to construct the transformation matrices for the unseen intents. When a test utterance arrives, it is first encoded into semantic capsules by the trained Bi-LSTM and dimensional attention module. There are two settings for intent classification. (1) Zero-shot intent classification: only utterances of the unseen intents participate in testing, so each utterance needs to be classified to one of the unseen intents. In this case, only the transformation matrices for the unseen intents are used for prediction. (2) Generalized zero-shot intent classification: test utterances may come from both the seen and unseen intents, so each utterance needs to be classified to either a seen or an unseen intent. In this case, the transformation matrices for the seen and unseen intents are all used for prediction.
In this paper, we propose to reconstruct capsule networks for zero-shot intent classification (ReCapsNet-ZS), which effectively addresses the limitations of IntentCapsNet and adapts well to generalized zero-shot intent classification tasks. As illustrated in Figure 1, ReCapsNet-ZS consists of two components. First, it introduces a dimensional attention module to alleviate the polysemy problem, which helps to extract semantic features for capsule networks. Second, it computes the similarities between unseen and seen intents by utilizing the rich latent information of labeled utterances, and then constructs the transformation matrices for unseen intents with the computed similarities and the trained transformation matrices for seen intents, which greatly improves the generalization ability to unseen intents.
To verify the effectiveness of the proposed ReCapsNet-ZS for zero-shot intent classification, we conduct extensive experiments on two real task-oriented dialogue datasets in English and Chinese respectively. The empirical study validates our proposals and shows promising results of ReCapsNet-ZS, which are significantly better than state-of-the-art methods, especially on the generalized zero-shot intent classification tasks.

Related Works
Zero-shot Intent Classification. Zero-shot learning (Larochelle et al., 2008; Palatucci et al., 2009) aims to use the knowledge learned from seen classes, for which abundant labeled samples are typically available for training, to recognize unseen classes, for which no labeled samples are provided. It has been widely studied in computer vision (Xian et al., 2016) and natural language processing (Sappadla et al., 2016; Zhang et al., 2019).
Zero-shot intent classification is an important and challenging task for many natural language understanding applications (Hu et al., 2009; Liu and Lane, 2016; Xu and Sarikaya, 2013), in which new intents emerge constantly and cannot be easily recognized. Several methods have been proposed to tackle this problem. Ferreira et al. (2015a,b) and Yazdani and Henderson (2015) utilize external resources such as label ontologies or manually defined attributes to find the relationship between seen and unseen intent labels. However, such external resources are usually difficult to obtain, as collecting them is labor intensive and time consuming. Chen et al. (2016) and Kumar et al. (2017) project the utterances and intent labels to the same semantic space and then compute the similarities between utterances and intent labels. However, diverse user expressions may make it difficult to learn a good projection function and thus hurt the classification performance. Recently, Xia et al. (2018) extended capsule networks for zero-shot intent classification by transferring the prediction vectors from seen classes to unseen classes. However, some key issues remain to be resolved, including how to deal with polysemy in word embeddings and how to improve the model generalization ability to unseen intents in the generalized zero-shot intent classification setting.
Capsule Networks. Capsule networks (Sabour et al., 2017) were first proposed to address the shortcomings of convolutional neural networks (CNN) in the domain of computer vision, allowing the networks to learn part-whole relationships. Recently, some studies have attempted to apply capsule networks in natural language processing (Yang et al., 2018; Geng et al., 2019; Xia et al., 2018) and obtained promising results. Yang et al. (2018) first extended capsule networks to text classification. Geng et al. (2019) successfully combined the dynamic routing algorithm with a meta-learning framework for few-shot text classification; however, their model still requires some labeled samples for each class. Xia et al. (2018) proposed a model based on capsule networks for zero-shot intent classification that has achieved state-of-the-art performance, but, as mentioned above, their model has some intrinsic limitations that remain to be addressed.

Problem Formulation
Given the set of all intent labels $Y = Y^s \cup Y^u$, let $Y^s = \{y_1^s, y_2^s, \dots, y_K^s\}$ and $Y^u = \{y_1^u, y_2^u, \dots, y_L^u\}$ be the sets of seen and unseen intent labels respectively. There is no overlap between $Y^s$ and $Y^u$, i.e., $Y^s \cap Y^u = \emptyset$, and $K$ and $L$ are the numbers of seen and unseen intent labels respectively. The embeddings of the seen and unseen intent labels are denoted by $E^s = \{e_1^s, e_2^s, \dots, e_K^s\}$ and $E^u = \{e_1^u, e_2^u, \dots, e_L^u\}$ respectively. Each embedding is a $d$-dimensional vector. For all the seen and unseen intent labels, the associated embeddings are available. The sample (utterance) sets for the seen and unseen intent labels are denoted by $X^s$ and $X^u$ respectively, where $n^s$ is the number of instances of the seen labels and $n^u$ is the number of instances of the unseen labels.
Zero-shot Intent Classification. For this setting, the training set is $X^{tr} = \{X^s, Y^s\}$, and $X^u$ is not available for training. In the test phase, the goal is to assign an unseen intent label $y \in Y^u$ to a given utterance.
Generalized Zero-shot Intent Classification. For this setting, the training procedure is the same as above; the difference is in the test phase, where the goal is to assign an intent label $y \in Y^s \cup Y^u$ to a given utterance.
In this paper, we aim to reconstruct capsule networks for handling both of the two settings of zero-shot intent classification.

Limitations of IntentCapsNet
IntentCapsNet (Xia et al., 2018) is the first work to employ capsule networks for zero-shot intent classification. It exploits the self-attention mechanism to extract semantic features (capsules) of an utterance. For zero-shot intent classification, it utilizes the vote vectors of seen intents and the similarities between seen and unseen intents based on Euclidean distance to make predictions for unseen intents. Although IntentCapsNet has demonstrated strong performance, it has two fundamental limitations.
Limitation 1. The self-attention module of In-tentCapsNet cannot handle the polysemy problem, which limits the representation capacity of semantic capsules.
Typically, a word is represented by a multi-dimensional embedding. Since a word can have different meanings in different contexts, some interesting recent studies (Şenel et al., 2018) suggest that different dimensions of a word embedding may tend to represent different semantic meanings. For example, the word "book" has different meanings in the two utterances "Book a restaurant in Michigan for 4 people" and "Give 4 out of 6 points to this book". For the embedding of the word "book", it is hypothesized that some dimensions may be more indicative of the first meaning ("reserve"), while other dimensions may be more indicative of the second meaning. Clearly, the self-attention mechanism cannot pay more attention to the dimensions that best describe the specific meaning of a word in a given context, as it assigns the same attention score to all the dimensions, which significantly limits the representation capacity of semantic capsules and undermines the performance of capsule networks.
Limitation 2. For the generalized zero-shot classification setting, the method of IntentCapsNet for constructing the prediction vectors is highly likely to cause the model to lose generalization ability to unseen intents.
Here, we provide an analysis of IntentCapsNet for predicting an unseen intent in the generalized zero-shot classification setting. In IntentCapsNet, the probability of a test utterance $x$ belonging to a seen intent label $k$ is computed as:

$$P_k = \Big\| \sum_{r=1}^{R} c_{kr}\, p_{k|r} \Big\| = \Big\| \sum_{r=1}^{R} g_{k,r} \Big\|, \tag{1}$$

where $\|\cdot\|$ is the L2-norm of a vector, $R$ is the number of semantic capsules, $p_{k|r}$ is the prediction vector for the $r$-th semantic capsule with respect to the seen intent $k$, and $c_{kr}$ is the weight of the $r$-th semantic capsule with respect to the seen intent $k$, which is computed by the dynamic routing algorithm of capsule networks. $g_{k,r} = c_{kr}\, p_{k|r}$ is called the $r$-th vote vector for the seen intent $k$. By Eq. (1), we have a tight upper bound for $P_k$:

$$P_k \le \sum_{r=1}^{R} \| g_{k,r} \|. \tag{2}$$

IntentCapsNet computes the probability of $x$ belonging to an unseen intent label $l$ as:

$$P_l = \Big\| \sum_{r=1}^{R} c_{lr}\, u_{l|r} \Big\|, \tag{3}$$

where $u_{l|r}$ is the prediction vector for the $r$-th semantic capsule with respect to the unseen intent $l$, and $c_{lr}$ is the weight of the $r$-th semantic capsule with respect to the unseen intent $l$, which is determined by the dynamic routing algorithm. Here $u_{l|r} = \sum_{k=1}^{K} q_{lk}\, g_{k,r}$, where $K$ is the number of seen intents, $g_{k,r}$ is the $r$-th vote vector for the seen intent $k$, and $q_{lk}$ is the similarity between an unseen intent $y_l^u \in Y^u$ and a seen intent $y_k^s \in Y^s$, computed from the scaled squared Euclidean distance $d(e_k^s, e_l^u)$ between the intent embeddings $e_k^s$ and $e_l^u$ and normalized so that $\sum_{k=1}^{K} q_{lk} = 1$. Since $q_{lk} \in (0, 1)$, $c_{lr} \in (0, 1)$, $\sum_{k=1}^{K} q_{lk} = 1$ and $\sum_{r=1}^{R} c_{lr} = 1$, we have a tight upper bound for $P_l$:

$$P_l \le \| g_{k,r}^{\max} \|, \tag{4}$$

where $g_{k,r}^{\max}$ is the vote vector of maximum norm among $g_{k,r}$, $\forall r \in \{1, 2, \dots, R\}$ and $\forall k \in \{1, 2, \dots, K\}$. Comparing Eq. (2) and Eq. (4), the upper bound of $P_k$ is much larger than that of $P_l$, indicating that for any utterance $x$, it is highly likely that $P(y \in Y^s \mid x)$ is larger than $P(y \in Y^u \mid x)$. Hence, for generalized zero-shot classification, with high probability IntentCapsNet will classify a test utterance to the seen intents, which is also verified by our experiments in Section 5.
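This bound argument can be checked numerically. The sketch below uses illustrative random numbers (not the paper's data; all shapes and variable names are our own): it builds vote vectors for the seen intents, forms the seen- and unseen-intent scores as in the two probability expressions above, and confirms that the unseen-intent score can never exceed the norm of the largest vote vector.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, R, d = 5, 3, 4, 8          # seen intents, unseen intents, capsules, capsule dim

# vote vectors g_{k,r} for the seen intents (already include c_{kr})
g = rng.normal(size=(K, R, d))

# P_k = || sum_r g_{k,r} ||
P_seen = np.linalg.norm(g.sum(axis=1), axis=-1)

# similarities q_{lk} and coupling weights c_{lr}, each normalized to sum to 1
q = rng.random((L, K)); q /= q.sum(axis=1, keepdims=True)
c = rng.random((L, R)); c /= c.sum(axis=1, keepdims=True)

# u_{l|r} = sum_k q_{lk} g_{k,r};  P_l = || sum_r c_{lr} u_{l|r} ||
u = np.einsum('lk,krd->lrd', q, g)
P_unseen = np.linalg.norm(np.einsum('lr,lrd->ld', c, u), axis=-1)

# P_l is a norm of a convex combination of the g_{k,r}, so it is bounded
# by the largest vote-vector norm, while P_seen has no such cap.
g_max = np.linalg.norm(g, axis=-1).max()
assert np.all(P_unseen <= g_max + 1e-9)
```

Because the weights $c_{lr} q_{lk}$ are non-negative and sum to one, the unseen-intent score is always a convex combination of vote vectors, which is exactly why its upper bound is so much tighter.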

The Proposed Approach
To overcome the limitations of IntentCapsNet, we propose to reconstruct capsule networks for zero-shot intent classification. In particular, we introduce two modules to capsule networks: (1) a dimensional attention module that helps to extract more representative semantic capsules, and (2) a new method for constructing the transformation matrices that improves the model generalization ability to unseen intents.

Dimensional Attention Capsule Networks
Pre-processing. An utterance with $T$ words can be represented as $x = \{w_1, w_2, \dots, w_T\}$, where $w_t \in \mathbb{R}^{d_w}$ is the word embedding of the $t$-th word and can be pretrained by the skip-gram model. Each word can be further encoded sequentially using a recurrent neural network such as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997):

$$\overrightarrow{h}_t = \mathrm{LSTM}_{fw}(w_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \mathrm{LSTM}_{bw}(w_t, \overleftarrow{h}_{t+1}),$$

where $\mathrm{LSTM}_{fw}$ and $\mathrm{LSTM}_{bw}$ denote the forward and backward LSTM respectively, and $\overrightarrow{h}_t \in \mathbb{R}^{d_h}$ and $\overleftarrow{h}_t \in \mathbb{R}^{d_h}$ are the hidden states of the word $w_t$ learned from $\mathrm{LSTM}_{fw}$ and $\mathrm{LSTM}_{bw}$ respectively. The entire hidden state of $w_t$ is the concatenation $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] \in \mathbb{R}^{2d_h}$, and the hidden state matrix of the utterance is $H = [h_1, h_2, \dots, h_T] \in \mathbb{R}^{2d_h \times T}$.
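For concreteness, the encoding step can be sketched as follows. A plain tanh RNN stands in for each LSTM direction (an LSTM adds gating, but the hidden-state shapes and the concatenation are the same); all dimensions and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_w, d_h = 6, 10, 4             # words per utterance, word dim, hidden dim

x = rng.normal(size=(T, d_w))      # word embeddings w_1..w_T
W = rng.normal(size=(d_h, d_w)) * 0.1
U = rng.normal(size=(d_h, d_h)) * 0.1

def rnn(seq):
    """A plain tanh RNN standing in for one LSTM direction."""
    h, out = np.zeros(d_h), []
    for w in seq:
        h = np.tanh(W @ w + U @ h)
        out.append(h)
    return out

fw = rnn(x)                        # forward pass over w_1..w_T
bw = rnn(x[::-1])[::-1]            # backward pass, re-aligned to word order

# h_t = [h_t_fw ; h_t_bw], stacked column-wise into H of shape (2*d_h, T)
H = np.stack([np.concatenate([f, b]) for f, b in zip(fw, bw)], axis=1)
assert H.shape == (2 * d_h, T)
```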

Extracting Semantic Capsules with Dimensional Attention
In general, an utterance is composed of multiple semantic features, and these semantic features collectively contribute to a more abstract intent label. For example, the utterance "I want to know the temperature of Hong Kong" is composed of multiple semantic features such as get action (want to know), weather (temperature), and city name (Hong Kong), and these semantic features collectively reflect the intent label "Get Weather". Capsule networks provide a hierarchical reasoning structure for modeling semantic features for intent classification. First, the primary capsules in capsule networks can properly match the multiple semantic features of an utterance. Second, the dynamic routing mechanism of capsule networks can automatically learn the importance weight of each semantic feature and aggregate them into a high-level intent label. It is assumed that a high-level semantic feature of an utterance is largely generated by some of its words that have similar semantic meanings (Xia et al., 2018). To extract the semantic features of an utterance, the key problem is to learn the importance weight of each word for a semantic feature. IntentCapsNet (Xia et al., 2018) utilizes the self-attention mechanism to extract the semantic features (capsules) of each utterance. However, self-attention cannot effectively deal with polysemy. Inspired by the work of Shen et al. (2018), we propose to use the dimensional attention mechanism to alleviate the polysemy problem in extracting semantic features. Dimensional attention can automatically assign different attention scores to different dimensions of a word embedding, which not only helps to solve the polysemy problem to some extent, but also expands the search space of the attention parameters, thus improving model flexibility and effectiveness.
Assume each utterance has $R$ semantic features. We propose to learn a dimensional attention matrix $A_r \in \mathbb{R}^{2d_h \times T}$ that encodes the dimensional attentions of the $T$ words with respect to the $r$-th semantic feature by:

$$A_r = \mathrm{softmax}\big(F_2 \tanh(F_1 H)\big),$$

where $F_1 \in \mathbb{R}^{d_a \times 2d_h}$ and $F_2 \in \mathbb{R}^{2d_h \times d_a}$ are trainable parameters, and $A_r(i, j)$ (the element of $A_r$ in the $i$-th row and $j$-th column) is the importance weight of the $i$-th dimension of the $j$-th word embedding to the $r$-th semantic feature. Compared with self-attention, dimensional attention helps to choose the appropriate dimensions of a word embedding that best express the specific meaning of the word in a given context. After obtaining $A_r$, the $r$-th semantic feature $m_r \in \mathbb{R}^{2d_h}$ is computed by:

$$m_r = \Sigma_{\mathrm{row}}(A_r \odot H),$$

where $\odot$ is element-wise multiplication and $\Sigma_{\mathrm{row}}$ is an operator that sums up the elements of each row. The semantic feature matrix of the utterance is $M = [m_1, m_2, \dots, m_R] \in \mathbb{R}^{2d_h \times R}$.
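A minimal sketch of the dimensional attention computation follows. It assumes a tanh activation inside the attention and a softmax over the word axis, and uses one pair of parameters per semantic feature; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_h, d_a, R = 6, 4, 5, 3
H = rng.normal(size=(2 * d_h, T))            # Bi-LSTM hidden states

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

M = []
for r in range(R):                           # one attention matrix per semantic feature
    F1 = rng.normal(size=(d_a, 2 * d_h)) * 0.1
    F2 = rng.normal(size=(2 * d_h, d_a)) * 0.1
    A_r = softmax(F2 @ np.tanh(F1 @ H), axis=1)  # per-dimension scores over the T words
    m_r = (A_r * H).sum(axis=1)              # row-wise sum of the element-wise product
    M.append(m_r)

M = np.stack(M, axis=1)                      # semantic capsules, shape (2*d_h, R)
assert M.shape == (2 * d_h, R)
```

Note how each row of `A_r` (one embedding dimension) gets its own distribution over the words, which is exactly what lets the model weight dimensions differently per context.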

Improved Max-margin Loss
The semantic features of the utterance can then be fed into a capsule network to learn the intent. First, we transform each semantic feature $m_r$ of the utterance to a prediction vector with respect to each intent:

$$p_{k|r} = W_{kr}\, m_r,$$

where $p_{k|r} \in \mathbb{R}^{d_p}$ is the prediction vector of the $r$-th semantic feature with respect to the $k$-th intent, and $W_{kr} \in \mathbb{R}^{d_p \times 2d_h}$ is the associated transformation matrix.
In training, there are $K$ output capsules, corresponding to the $K$ seen intents. The $k$-th output capsule $o_k$ is the weighted sum of all the prediction vectors:

$$o_k = \sum_{r=1}^{R} c_{kr}\, p_{k|r},$$

where $c_{kr}$ is the coupling coefficient representing the contribution degree of the $r$-th semantic feature to the $k$-th intent, which is computed by the dynamic routing algorithm (Algorithm 1).
Then, a squashing function $\mathrm{squash}(\cdot)$ is applied to $o_k$, and the final output capsule of the $k$-th intent is:

$$v_k = \mathrm{squash}(o_k) = \frac{\|o_k\|^2}{1 + \|o_k\|^2} \frac{o_k}{\|o_k\|}.$$

Now, the probability of the existence of the $k$-th intent can be represented by the length of the output capsule $v_k$. The computation procedure of $v_k$ is shown in Algorithm 1, where $p_{k|r} \cdot v_k$ denotes the inner product between $p_{k|r}$ and $v_k$. To train the dimensional attention capsule network, we propose an improved max-margin loss function consisting of two parts.
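The routing procedure can be sketched as a generic dynamic-routing loop in the spirit of Algorithm 1. The iteration count and zero initialization of the routing logits are assumptions, and the coupling coefficients are normalized over the semantic capsules so that $\sum_r c_{kr} = 1$, as in the analysis of Section "Limitations of IntentCapsNet".

```python
import numpy as np

def squash(o):
    """v = (||o||^2 / (1 + ||o||^2)) * o / ||o||, so ||v|| < 1."""
    n2 = (o * o).sum()
    return (n2 / (1.0 + n2)) * o / np.sqrt(n2 + 1e-9)

def dynamic_routing(p, iters=3):
    """p[k, r] is the prediction vector p_{k|r}; returns output capsules v_k."""
    K, R, d = p.shape
    b = np.zeros((K, R))                       # routing logits
    for _ in range(iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)   # coupling coefficients, sum_r c_kr = 1
        o = (c[..., None] * p).sum(axis=1)     # o_k = sum_r c_{kr} p_{k|r}
        v = np.stack([squash(o_k) for o_k in o])
        b = b + np.einsum('krd,kd->kr', p, v)  # agreement p_{k|r} . v_k
    return v

rng = np.random.default_rng(3)
v = dynamic_routing(rng.normal(size=(4, 3, 6)))    # K=4 intents, R=3 capsules
assert np.all(np.linalg.norm(v, axis=-1) < 1.0)    # squash keeps lengths in (0, 1)
```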
The first part is the max-margin loss on each labeled utterance, which is the original loss function of capsule networks (Sabour et al., 2017):

$$L_{\mathrm{margin}} = \sum_{k=1}^{K} \Big[ y_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - y_k) \max(0, \|v_k\| - m^-)^2 \Big], \tag{11}$$

where $y_k = 1$ if the utterance is of intent label $k$ and $y_k = 0$ otherwise, $\lambda$ is a down-weighting parameter, and $m^+$ and $m^-$ are the margins.
The second part ensures the diversity of the semantic capsules, i.e., different semantic capsules should be generated by different words in an utterance. The importance weight of each word to the $r$-th semantic capsule can be represented by the average value of each column of the dimensional attention matrix $A_r$:

$$s_r = \frac{1}{2d_h} \Sigma_{\mathrm{col}}(A_r), \tag{12}$$

where $s_r \in \mathbb{R}^{1 \times T}$ and $\Sigma_{\mathrm{col}}$ is an operator that sums up the elements of each column. Denote by $S = [s_1^\top, s_2^\top, \dots, s_R^\top] \in \mathbb{R}^{T \times R}$ the importance weight matrix of each word to all the $R$ semantic capsules.
To ensure the diversity of the semantic capsules, a natural idea is to constrain the columns of $S$ to be orthogonal with the following loss function:

$$L_{\mathrm{div}} = \| S^\top S - I \|_F^2, \tag{13}$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix. Combining Eq. (11) and Eq. (13), the overall loss function of the proposed dimensional attention capsule network is:

$$L_{\mathrm{total}} = L_{\mathrm{margin}} + \beta L_{\mathrm{div}}, \tag{14}$$

where $\beta$ is a trade-off parameter. By minimizing $L_{\mathrm{total}}$ with gradient descent methods, all model parameters, including $F_1$, $F_2$ and $W_{kr}$, can be learned.
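The two loss terms can be sketched as follows. The inputs are made-up capsule lengths and a random word-importance matrix; the hyper-parameter values follow those reported in the implementation details.

```python
import numpy as np

m_pos, m_neg, lam, beta = 0.9, 0.1, 0.5, 0.001

def margin_loss(lengths, y):
    """Max-margin loss: lengths[k] = ||v_k||, y[k] in {0, 1}."""
    pos = y * np.maximum(0.0, m_pos - lengths) ** 2
    neg = lam * (1 - y) * np.maximum(0.0, lengths - m_neg) ** 2
    return (pos + neg).sum()

def diversity_loss(S):
    """Push the columns of S (shape T x R) towards orthogonality."""
    R = S.shape[1]
    G = S.T @ S
    return np.linalg.norm(G - np.eye(R)) ** 2   # squared Frobenius norm

rng = np.random.default_rng(4)
lengths = np.array([0.8, 0.2, 0.1])             # ||v_k|| for K = 3 intents
y = np.array([1.0, 0.0, 0.0])                   # true intent is k = 0
S = rng.random((6, 3))                          # word-importance matrix, T = 6, R = 3
total = margin_loss(lengths, y) + beta * diversity_loss(S)
assert total > 0
```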

Zero-shot Intent Classification
To solve zero-shot intent classification with capsule networks, two key problems need to be addressed.
(1) How to find the relationship between unseen and seen intents? (2) How to make predictions for unseen intents?
Measuring Intent Relations. To tackle the first problem, we propose to learn a Mahalanobis distance metric to measure the relationship between unseen and seen intents. Specifically, given the embeddings of an unseen intent $l$ and a seen intent $k$, their squared Mahalanobis distance is:

$$d_M(e_l^u, e_k^s) = (e_l^u - e_k^s)^\top \Omega\, (e_l^u - e_k^s), \tag{15}$$

where $\Omega$ is a learnable covariance matrix which models the correlation between the dimensions of the embedding. Note that IntentCapsNet (Xia et al., 2018) also tries to use Eq. (15) to model the relationship between unseen and seen intents, but it ignores the correlation between dimensions and simply sets $\Omega = \sigma^2 I$ ($\sigma$ is a scaling hyper-parameter), which actually yields a scaled squared Euclidean distance.
As the number of intents is limited, it is difficult to learn a desirable covariance matrix $\Omega$ from the intent embeddings alone. Fortunately, we can leverage the word embeddings of the utterances, which come from the same semantic space as the intent embeddings (pre-trained by the same skip-gram model). Hence, we propose to learn the covariance matrix $\Omega$ from the labeled utterances in the training set. Inspired by the work of Ying and Li (2012), we learn the Mahalanobis distance metric by optimizing the objective:

$$\max_{\Omega \succeq 0}\ \min_{(i,j) \in \mathcal{D}} d_M(z_i^s, z_j^s) \quad \text{s.t.} \quad \sum_{(i,j) \in \mathcal{S}} d_M(z_i^s, z_j^s) \le 1, \tag{16}$$

where $\mathcal{D}$ and $\mathcal{S}$ respectively denote the pair sets in which utterances belong to different classes and the same class, and $z_i^s$ denotes the average of all the word embeddings of utterance $i$. As shown by Ying and Li (2012), optimizing Eq. (16) with respect to $\Omega$ is equivalent to solving an efficient eigenvalue optimization problem. With the learned metric $\Omega$, we can obtain the relationship between any unseen and seen intents by substituting it into Eq. (15). Furthermore, we can compute the similarity between them by $q_{lk} = \exp(-\alpha \cdot d_M(e_l^u, e_k^s))$, where $\alpha$ is a scaling parameter.
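A sketch of the similarity computation follows, with a random positive semi-definite matrix standing in for the learned metric $\Omega$ (the eigenvalue-optimization procedure that actually learns $\Omega$ is not shown); embeddings and sizes are illustrative.

```python
import numpy as np

def mahalanobis_sq(a, b, Omega):
    """Squared Mahalanobis distance (a - b)^T Omega (a - b)."""
    diff = a - b
    return float(diff @ Omega @ diff)

rng = np.random.default_rng(5)
d, K, alpha = 8, 4, 1.0
e_u = rng.normal(size=d)                       # one unseen-intent embedding
E_s = rng.normal(size=(K, d))                  # seen-intent embeddings

# stand-in for the learned metric: any positive semi-definite matrix works here
B = rng.normal(size=(d, d))
Omega = B @ B.T / d

dists = np.array([mahalanobis_sq(e_u, e_k, Omega) for e_k in E_s])
q = np.exp(-alpha * dists)                     # similarities q_{lk} in (0, 1]
assert np.all(q > 0) and np.all(q <= 1)
```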
Constructing Transformation Matrices. Intuitively, if an unseen intent is similar to a seen intent, their corresponding transformation matrices should also be similar. Based on this, to solve the second problem, we propose to derive the transformation matrices of unseen intents from the transformation matrices of seen intents and the similarities between unseen and seen intents, and then make predictions for unseen intents with the derived matrices. Specifically, given a matrix $Q \in \mathbb{R}^{L \times K}$ that encodes the similarities between unseen and seen intents, for an unseen intent $l$ we construct the transformation matrix $W_{lr}$ for the $r$-th semantic capsule by:

$$W_{lr} = \sum_{k=1}^{K} q_{lk}\, W_{kr}, \tag{17}$$

where $q_{lk}$ is the element in the $l$-th row and $k$-th column of $Q$, and $W_{kr}$ is the transformation matrix for the $r$-th semantic capsule with respect to the $k$-th seen intent. By Eq. (17), the transformation matrices for all unseen intents can be obtained. When a test utterance arrives, it can be directly fed into the trained dimensional attention capsule network for intent prediction.
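The construction of the unseen-intent transformation matrices is a similarity-weighted sum, which can be sketched in one line (all shapes and the similarity matrix are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
K, L, R, d_p, d_in = 4, 2, 3, 5, 8

W_seen = rng.normal(size=(K, R, d_p, d_in))    # trained matrices W_{kr}
Q = rng.random((L, K))                         # similarities q_{lk}

# W_{lr} = sum_k q_{lk} W_{kr}
W_unseen = np.einsum('lk,krpq->lrpq', Q, W_seen)
assert W_unseen.shape == (L, R, d_p, d_in)
```

With these matrices in place, an encoded test utterance can be routed exactly as in training, now over the unseen (or seen plus unseen) intent capsules.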

Datasets
We evaluate our model on two real task-oriented dialogue datasets in different languages. Table 1 summarizes the dataset statistics.

SMP-2018.
It is a real Chinese dialogue corpus released at SMP 2018 (The China National Conference on Social Media Processing) for user intent classification tasks (Zhang et al., 2017). The dataset is provided by the iFLYTEK Corporation and can be divided into two parts: chit-chat dialogues and task-oriented dialogues.
Here, we only use the task-oriented dialogues.
Dataset Splitting. For zero-shot intent classification, we take all the samples of seen intents as the training set, and all the samples of unseen intents as the test set. For generalized zero-shot intent classification, we randomly take 70% samples of each seen intent as the training set, and the remaining 30% samples of each seen intent and all the samples of unseen intents as the test set.
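The splitting protocol for the generalized setting can be sketched as follows (dataset contents are toy placeholders, not the real corpora):

```python
import random

def gzsl_split(samples_by_intent, seen, unseen, train_ratio=0.7, seed=0):
    """Generalized zero-shot split: 70% of each seen intent for training;
    the remaining seen samples plus all unseen samples for testing."""
    rng = random.Random(seed)
    train, test = [], []
    for intent, samples in samples_by_intent.items():
        if intent in unseen:
            test += [(s, intent) for s in samples]
        else:
            s = samples[:]
            rng.shuffle(s)
            cut = int(len(s) * train_ratio)
            train += [(x, intent) for x in s[:cut]]
            test += [(x, intent) for x in s[cut:]]
    return train, test

data = {"weather": [f"w{i}" for i in range(10)],
        "music": [f"m{i}" for i in range(10)],
        "movie": [f"v{i}" for i in range(6)]}
train, test = gzsl_split(data, seen={"weather", "music"}, unseen={"movie"})
assert len(train) == 14 and len(test) == 12   # 7+7 train; 3+3+6 test
```

For the standard zero-shot split, the seen branch would simply put all its samples into `train` and the unseen branch into `test`, as described above.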

Baselines
We compare ReCapsNet-ZS with the following state-of-the-art zero-shot learning methods: DeViSE (Frome et al., 2013), CMT (Socher et al., 2013), CDSSM (Chen et al., 2016), Zero-shot DNN (Kumar et al., 2017) and IntentCapsNet (Xia et al., 2018). To make DeViSE and CMT suitable for intent classification, we use a multi-head self-attention Bi-LSTM model to encode the utterances and then feed the final hidden states to their zero-shot learning models. In addition, we conduct an ablation study to evaluate the contribution of each module of ReCapsNet-ZS. "ReCapsNet-ZS-Dim" refers to the model that only uses the dimensional attention mechanism, and "ReCapsNet-ZS-TM" refers to the one that only uses the proposed transformation matrix construction method.

Table 4: Results of generalized zero-shot intent classification. "Seen", "Unseen" and "Overall" respectively denote the performance on the utterances from seen intents, unseen intents, and both seen and unseen intents.

Implementation Details
Parameter Settings. For SNIPS-NLU, we use 300-dim embeddings pre-trained on English Wikipedia (Bojanowski et al., 2017). For SMP-2018, we use 300-dim Chinese word embeddings pre-trained by Li et al. (2018). The main network structure hyperparameters are shown in Table 2. In addition, for the zero-shot classification setting, we set α to 1 for SNIPS-NLU and 10 for SMP-2018 respectively. For the generalized zero-shot classification setting, we set α to 1 for SNIPS-NLU and 5 for SMP-2018 respectively. To avoid overfitting, we use dropout with 0.5 dropout rate on the input of the attention layer. For the loss function, we set λ = 0.5, m + = 0.9, m − = 0.1, β = 0.001, and use the Adam optimizer (Kingma and Ba, 2015) with initial learning rate 0.01.
Evaluation Metrics. We adopt two widely used metrics, accuracy (Acc) and micro-average F1-measure (F1), to evaluate the classification performance. Both metrics are computed as the average weighted by the support of each class, where the support is the sample proportion of the corresponding class.

Result Analysis
Zero-shot Intent Classification. Table 3 summarizes the average results over 10 runs, where the top 2 results are highlighted in bold. The baseline results on SNIPS-NLU are taken from Xia et al. (2018). The results show that ReCapsNet-ZS outperforms all the baselines, demonstrating its superiority in tackling zero-shot intent classification. We can also see that ReCapsNet-ZS performs better than either ReCapsNet-ZS-Dim or ReCapsNet-ZS-TM, which shows the effectiveness of both the dimensional attention mechanism and the transformation matrix construction method.
Generalized Zero-shot Intent Classification. Table 4 shows the average results over 10 runs, where the top 2 results are highlighted in bold. We can make the following observations. 1) The performance is much worse than in the standard zero-shot setting, but this is not surprising, since the seen intent labels are included in the test phase, which makes the problem harder. 2) ReCapsNet-ZS sometimes performs slightly worse than the baselines in detecting seen intents; this is because some baselines tend to classify the test utterances as seen intents, which in turn explains why they perform much worse in detecting unseen intents. 3) ReCapsNet-ZS and ReCapsNet-ZS-TM perform much better than the others in detecting unseen intents, whereas IntentCapsNet and ReCapsNet-ZS-Dim both have 0% Acc and F1. This verifies that the proposed transformation matrix construction method has much better generalization ability in detecting unseen intents. 4) Overall, ReCapsNet-ZS consistently performs the best, which further demonstrates its superiority on the generalized zero-shot intent classification tasks.

Visualization
Dimensional Attentions. Figure 2 visualizes the attention score for each dimension of the same word "book" in two different utterances (contexts) by heatmaps. The utterances are "Book a restaurant in Michigan for 4 people" and "Give 4 out of 6 points to this book", which are taken from SNIPS-NLU. It can be seen that for the two utterances the attention values of the word "book" exhibit completely different patterns, which makes sense as it contains different meanings in different contexts. Furthermore, for each utterance, the attention scores of "book" on different dimensions are also quite different. This shows that the dimensional attention mechanism can effectively capture the semantic differences of the same word in different contexts and encode more useful information than the traditional self-attention method, and thus helps to alleviate the polysemy problem.
Similarity Scores. Figure 3 visualizes the similarity scores between the unseen intent "movie" and the seen intents, learned via metric learning by IntentCapsNet and ReCapsNet-ZS respectively on SMP-2018. It can be seen that IntentCapsNet can only discover a few connections between the unseen and seen intents. In contrast, ReCapsNet-ZS can detect many more connections between them. Further, though the similarity scores of ReCapsNet-

Figure 3: Comparison of the similarity scores between the unseen intent "movie" and the seen intents learned by IntentCapsNet and ReCapsNet-ZS (ours).

Conclusion
In this paper, we have proposed a novel framework to reconstruct capsule networks for zero-shot intent classification and demonstrated empirically that it compares favourably with existing methods on some real dialogue datasets. The performance gains of our method come from two aspects: the introduction of a new dimensional attention module to capsule networks for feature extraction and the proposal of a new transformation scheme for detecting unseen intents.
There are several directions for future work. One is to customize our model for few-shot intent classification. Another is to extend our framework to multi-intent classification. We also plan to apply our model in dialogue systems for low-resource languages such as Cantonese.