Zero-shot User Intent Detection via Capsule Neural Networks

User intent detection plays a critical role in question-answering and dialog systems. Most previous works treat intent detection as a classification problem where utterances are labeled with predefined intents. However, it is labor-intensive and time-consuming to label users’ utterances as intents are diversely expressed and novel intents will continually be involved. Instead, we study the zero-shot intent detection problem, which aims to detect emerging user intents where no labeled utterances are currently available. We propose two capsule-based architectures: IntentCapsNet that extracts semantic features from utterances and aggregates them to discriminate existing intents, and IntentCapsNet-ZSL which gives IntentCapsNet the zero-shot learning ability to discriminate emerging intents via knowledge transfer from existing intents. Experiments on two real-world datasets show that our model not only can better discriminate diversely expressed existing intents, but is also able to discriminate emerging intents when no labeled utterances are available.


Introduction
With the increasing complexity and accuracy of speech recognition technology, companies are striving to deliver intelligent conversation understanding systems as people interact with software agents that run on speaker devices or smartphones via natural language interfaces (Hoy, 2018). Products like Apple's Siri, Amazon's Alexa and Google Assistant are able to interpret human speech and respond to it via synthesized voices.
With recent developments in deep neural networks, user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016; Liu and Lane, 2016; Chen et al., 2016b) have been proposed to classify user intents given their diversely expressed utterances in natural language. The decent performance on intent detection usually comes from deep neural network classifiers optimized on large-scale utterances that are human-labeled among existing predefined user intents.
* Indicates Equal Contribution
As more features and skills are added to devices, expanding their capabilities to new programs, it is common for voice assistants to encounter the scenario where no labeled utterance of an emerging user intent is available in the training data, as illustrated in Figure 1. Current intent detection methods train classifiers in a supervised fashion and are good at discriminating existing intents such as GetWeather and PlayMusic, whose labeled utterances are already available. However, these models are, by design, incapable of detecting utterances of emerging intents like AddToPlaylist and RateABook, since no labeled utterances are available. Moreover, it is labor-intensive and time-consuming to annotate utterances of emerging intents and retrain the whole intent detection model.
Thus, it is imperative to develop intent detection models with the zero-shot learning (ZSL) ability (Lampert et al., 2014;Socher et al., 2013;Changpinyo et al., 2016): the ability to expand classifiers and the intent detection space beyond the existing intents, of which we have labeled utterances during training, to emerging intents, of which no labeled utterances are available.
The research on zero-shot intent detection is still in its infancy. Previous zero-shot learning methods for intent detection utilize external resources such as label ontologies (Ferreira et al., 2015a,b) or manually defined attributes that describe intents (Yazdani and Henderson, 2015) to associate existing and emerging intents, which require extra annotation. Compatibility-based methods for zero-shot intent detection (Chen et al., 2016a; Kumar et al., 2017) assume the capability of learning a high-quality mapping from the utterance to its intent directly, so that such a mapping can be further capitalized on to measure the compatibility of an utterance with emerging intents. However, the diverse semantic expressions may impede the learning of such a mapping.
Figure 1: Illustration of the proposed INTENTCAPSNET-ZSL model for zero-shot intent detection: labeled utterances with existing intents like GetWeather and PlayMusic are used to train an intent detection classifier among existing intents, in which SemanticCaps extract interpretable semantic features and DetectionCaps dynamically aggregate semantic features for intent detection using a novel routing-by-agreement mechanism. For emerging intents, INTENTCAPSNET-ZSL builds zero-shot DetectionCaps that utilize (1) the outputs of SemanticCaps, (2) the routing information on existing intents from DetectionCaps, and (3) similarities of the emerging intent label to existing intent labels to discriminate emerging intents like AddToPlaylist from RateABook. Solid lines indicate the training process and dashed lines indicate the zero-shot inference process.
In this work, we make the very first attempt to tackle the zero-shot intent detection problem with a capsule-based (Hinton et al., 2011;Sabour et al., 2017) model. A capsule houses a vector representation of a group of neurons, and the orientation of the vector encodes properties of an object (like the shape/color of a face), while the length of the vector reflects its probability of existence (how likely a face with certain properties exists). The capsule model learns a hierarchy of feature detectors via a routing-by-agreement mechanism: capsules for detecting low-level features (like nose/eyes) send their outputs to high-level capsules (such as faces) only when there is a strong agreement of their predictions to high-level capsules.
The aforementioned properties of capsule models are appealing for text modeling, specifically, in this case, for modeling the user utterance for intent detection: low-level semantic features such as the get action, time and city name collectively contribute to a more abstract intent (GetWeather). A semantic feature, which may be expressed quite differently among users, can contribute more to one intent than to others. The dynamic routing-by-agreement mechanism can be used to dynamically assign a proper contribution to each semantic feature and aggregate the features to get an intent representation.
More importantly, we discover the potential of zero-shot learning ability on the capsule model, which is not yet widely recognized. It makes the capsule model even more suitable for text modeling when no labeled utterances are available for emerging intents. The ability to neglect the disagreed output of low-level semantics for certain intents during routing-by-agreement encourages the learning of generalizable semantic features that can be adapted to emerging intents. For each emerging intent with no labeled utterances, a Zero-shot DetectionCaps is constructed explicitly by using not only the semantic features that SemanticCaps extracted, but also existing routing agreements from DetectionCaps and similarities of an emerging intent label to existing intent labels.
In summary, the contributions of this work are:
• Expanding capsule neural networks to text modeling, by extracting and aggregating semantics from utterances in a hierarchical manner;
• Proposing a novel and effective capsule-based model for zero-shot intent detection;
• Showing and interpreting the effectiveness of our model on two real-world datasets.

Problem Formulation
In this section, we first define related concepts, and formally state the problem.
Intent. An intent is a purpose, or a goal, that underlies a user-generated utterance (Watson Assistant, 2017). An utterance can be associated with one or multiple intents. We only consider the basic case in which an utterance has a single intent. However, utterances with multiple intents can be handled by segmenting them into single-intent snippets using sequential tagging tools like CRF (Lafferty et al., 2001), which we leave for future work.
Intent Detection. Given a labeled training dataset where each sample has the format $(x, y)$, where $x$ is an utterance and $y$ is its intent label, each training example is associated with one of $K$ existing intents $y \in Y = \{y_1, y_2, ..., y_K\}$. The intent detection task tries to associate an utterance $x_{existing}$ with its correct intent category in the existing intent classes $Y$.
Zero-shot Intent Detection. Given the labeled training set $\{(x, y)\}$ where $y \in Y$, the zero-shot intent detection task aims to detect an utterance $x_{emerging}$ which belongs to one of $L$ emerging intents $z \in Z = \{z_1, z_2, ..., z_L\}$, where $Y \cap Z = \emptyset$.

Approach
We propose two architectures based on capsule models: INTENTCAPSNET that is trained to discriminate among utterances with existing labels, e.g. existing intents for intent detection; INTENTCAPSNET-ZSL that gives zero-shot learning ability to INTENTCAPSNET for discriminating unseen labels, i.e. emerging intents in this case. As shown in Figure 2, the cores of the proposed architectures are three types of capsules: SemanticCaps that extract interpretable semantic features from the utterance, DetectionCaps that aggregate semantic features for intent detection, and Zero-shot DetectionCaps which discriminate emerging intents.

SemanticCaps
In the original capsule model (Sabour et al., 2017), convolution-based PrimaryCaps are introduced as the first layer to obtain different vectorized features from the raw input image. In this work, an intrinsically similar motivation is adopted to extract different semantic features from the raw utterance by a new type of capsule named SemanticCaps. Unlike the PrimaryCaps, which use convolution operators with a large receptive field to extract spatially proximate features, the SemanticCaps are based on a bidirectional recurrent neural network with multiple self-attention heads, where each self-attention head focuses on a certain part of the utterance and extracts a semantic feature that may not be expressed by words in proximity.
Given an input utterance $x = (w_1, w_2, ..., w_T)$ of $T$ words, each word is represented by a vector of dimension $D_W$ that can be pre-trained using a skip-gram language model.
Figure 2: Architecture of INTENTCAPSNET and INTENTCAPSNET-ZSL. During training, utterances with existing intents are fed into the SemanticCaps, which output vectorized semantic features, i.e. semantic vectors. Then DetectionCaps combine these features into higher-level prediction vectors and output an activation vector for intent detection on each existing intent. During inference, emerging utterances take advantage of the SemanticCaps trained in INTENTCAPSNET to extract semantic features from the utterance (shown in 1), then the vote vectors on the existing intents are transferred to emerging intents (shown in 2) using similarities between existing and emerging intents (shown in 3). The obtained activation vectors for emerging intents are used for zero-shot intent detection.
A recurrent neural network such as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) is applied to sequentially encode the utterance into hidden states:
$$\overrightarrow{h}_t = \mathrm{LSTM}_{fw}(w_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \mathrm{LSTM}_{bw}(w_t, \overleftarrow{h}_{t+1}). \tag{1}$$
For each word $w_t$, we concatenate the forward hidden state $\overrightarrow{h}_t$ obtained from the forward $\mathrm{LSTM}_{fw}$ with the backward hidden state $\overleftarrow{h}_t$ from $\mathrm{LSTM}_{bw}$ to obtain a hidden state $h_t$ for the word $w_t$. The whole hidden state matrix can be defined as $H = (h_1, h_2, ..., h_T) \in \mathbb{R}^{T \times 2D_H}$, where $D_H$ is the number of hidden units in each LSTM.
Inspired by the success of self-attention mechanisms (Vaswani et al., 2017; Lin et al., 2017) for sentence embedding, we adopt a multi-head self-attention framework where each self-attention head is encouraged to be attentive to a specific semantic feature of the utterance, such as certain sets of keywords or phrases in the utterance. One self-attention head may be attentive to the "get" action in GetWeather, while another may be attentive to the city name in GetWeather: each head decides for itself which semantics to attend to.
A self-attention weight matrix $A$ is computed as:
$$A = \mathrm{softmax}\left(W_{s2} \tanh\left(W_{s1} H^T\right)\right), \tag{2}$$
where $A \in \mathbb{R}^{R \times T}$, and $W_{s1} \in \mathbb{R}^{D_A \times 2D_H}$ and $W_{s2} \in \mathbb{R}^{R \times D_A}$ are weight matrices for the self-attention. $D_A$ is the hidden unit number of self-attention and $R$ is the number of self-attention heads. The softmax function makes sure that, for each self-attention head, the attentive scores on all the words sum to one. A total number of $R$ semantic features are extracted from the input utterance, each from a separate self-attention head:
$$M = AH, \tag{3}$$
where $M = (m_1, m_2, ..., m_R) \in \mathbb{R}^{R \times 2D_H}$. Each semantic vector $m_r$ will have a distinguishable orientation when the objective is properly regularized (details in Equation 6), as we want each attention head to be attentive to a unique semantic feature of the utterance. The vector representation adopted in capsules is suitable to portray the low-level semantic properties as well as high-level intents of the utterance, where the orientation of a vector represents semantic/intent properties that may slightly vary depending on the expressions. The capsule encourages the learning of generalizable semantic vectors: less informative semantic properties for one intent may not be penalized by their orientations; they simply possess small norms as they are less likely to exist.
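The multi-head self-attention step can be illustrated with a small NumPy sketch. It assumes the structured self-attention form of Lin et al. (2017), i.e. attention weights softmax(W_s2 tanh(W_s1 Hᵀ)) normalized over words and semantic vectors obtained by multiplying the weights with the hidden states. The function names, the toy sizes and the random stand-in for the Bi-LSTM outputs are illustrative assumptions, not part of the described model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_caps(H, W_s1, W_s2):
    """Multi-head self-attention over Bi-LSTM hidden states.

    H    : (T, 2*D_H) hidden states, one row per word
    W_s1 : (D_A, 2*D_H) first attention weight matrix
    W_s2 : (R, D_A) second attention weight matrix, one row per head
    Returns the attention weights A (R, T) and semantic vectors M (R, 2*D_H).
    """
    # Softmax over axis=1 makes each head's scores sum to one over the words.
    A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)
    M = A @ H
    return A, M

# Toy sizes, chosen only for illustration.
T, D_H, D_A, R = 6, 4, 5, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(T, 2 * D_H))       # random stand-in for Bi-LSTM outputs
W_s1 = rng.normal(size=(D_A, 2 * D_H))
W_s2 = rng.normal(size=(R, D_A))

A, M = semantic_caps(H, W_s1, W_s2)
print(A.shape, M.shape)
```

Each row of `A` is a distribution over the words of the utterance, so each row of `M` is a convex combination of hidden states.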

DetectionCaps
The outputs of SemanticCaps are low-level vector representations of R different semantic features extracted from the utterances. To combine these features into higher-level representations, we build DetectionCaps that choose different semantic features dynamically so as to form an intent representation for each intent via an unsupervised routing-by-agreement mechanism.
As a semantic feature may contribute differently in detecting different intents, the DetectionCaps first encode semantic features with respect to each intent:
$$p_{k|r} = m_r W_{k,r}, \tag{4}$$
where $k \in \{1, 2, ..., K\}$ and $r \in \{1, 2, ..., R\}$. $W_{k,r} \in \mathbb{R}^{2D_H \times D_P}$ is the weight matrix of the DetectionCaps, $p_{k|r}$ is the prediction vector of the $r$-th semantic feature for an existing intent $k$, and $D_P$ is the dimension of the prediction vector.
Dynamic Routing-by-agreement. The prediction vectors obtained from SemanticCaps route dynamically to DetectionCaps. The DetectionCaps compute a weighted sum over all prediction vectors:
$$s_k = \sum_{r=1}^{R} c_{kr}\, p_{k|r}, \tag{5}$$
where $c_{kr}$ is the coupling coefficient that determines how informative the $r$-th semantic feature is to the intent $y_k$, i.e. how much it contributes. $c_{kr}$ is calculated by an unsupervised, iterative dynamic routing-by-agreement algorithm (Sabour et al., 2017), which is briefly recalled in Algorithm 1. As shown in this algorithm, $b_{kr}$ is the initial logit representing the log prior probability that a SemanticCap $r$ is coupled to a DetectionCap $k$.
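Algorithm 1 Dynamic routing algorithm
1: procedure DYNAMIC ROUTING($p_{k|r}$, iter)
2:   for all semantic capsule r and intent capsule k: $b_{kr} \leftarrow 0$
3:   for iter iterations do
4:     for all semantic capsule r: $c_r \leftarrow \mathrm{softmax}(b_r)$
5:     for all intent capsule k: $s_k \leftarrow \sum_r c_{kr}\, p_{k|r}$
6:     for all intent capsule k: $v_k \leftarrow \mathrm{squash}(s_k)$
7:     for all semantic capsule r and intent capsule k: $b_{kr} \leftarrow b_{kr} + p_{k|r} \cdot v_k$
8:   end for
9:   Return $v_k$
10: end procedure
The squashing function squash(·) is applied on $s_k$ to get an activation vector $v_k$ for each existing intent class $k$:
$$v_k = \frac{\|s_k\|^2}{1 + \|s_k\|^2} \frac{s_k}{\|s_k\|},$$
where the orientation of the activation vector $v_k$ represents intent properties while its norm indicates the activation probability. The dynamic routing-by-agreement mechanism assigns a low $c_{kr}$ when there is inconsistency between $p_{k|r}$ and $v_k$, which ensures that the outputs of the SemanticCaps get sent to appropriate subsequent DetectionCaps.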
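As a minimal sketch of the routing procedure recalled above, the following NumPy code follows the standard routing-by-agreement loop of Sabour et al. (2017): coupling coefficients are a softmax of the logits over intents, the weighted sum is squashed, and the logits are updated by the agreement between prediction and activation vectors. The array layout, function names, toy sizes and random inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, eps=1e-9):
    """Shrink each row of s to a norm in [0, 1) while keeping its orientation."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def prediction_vectors(M, W):
    """p_{k|r} = m_r W_{k,r}; M is (R, 2*D_H), W is (K, R, 2*D_H, D_P)."""
    return np.einsum("rh,krhd->krd", M, W)

def dynamic_routing(P, n_iter=3):
    """Route prediction vectors P (K, R, D_P) to K intent capsules.

    Returns activation vectors V (K, D_P) and coupling coefficients C (K, R).
    """
    K, R, _ = P.shape
    B = np.zeros((K, R))                        # routing logits b_kr
    for _ in range(n_iter):
        C = softmax(B, axis=0)                  # per semantic capsule r, over intents k
        S = np.einsum("kr,krd->kd", C, P)       # weighted sum s_k
        V = squash(S)                           # activation vectors v_k
        B = B + np.einsum("krd,kd->kr", P, V)   # agreement p_{k|r} . v_k
    return V, C

# Toy sizes and random inputs, for illustration only.
K, R, D_H, D_P = 4, 3, 5, 10
rng = np.random.default_rng(0)
M = rng.normal(size=(R, 2 * D_H))
W = rng.normal(scale=0.1, size=(K, R, 2 * D_H, D_P))

V, C = dynamic_routing(prediction_vectors(M, W), n_iter=3)
predicted_intent = int(np.argmax(np.linalg.norm(V, axis=1)))
print(predicted_intent, np.linalg.norm(V, axis=1))
```

The predicted intent is the capsule whose activation vector has the largest norm, since the norm plays the role of an activation probability.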
Max-margin Loss for Existing Intents. The loss function considers both the max-margin loss on each labeled utterance and a regularization term that encourages each self-attention head to be attentive to a different semantic feature of the utterance:
$$L = \sum_{k=1}^{K} \Big\{ [[y = y_k]] \cdot \max(0, m^+ - \|v_k\|)^2 + \lambda\, [[y \neq y_k]] \cdot \max(0, \|v_k\| - m^-)^2 \Big\} + \alpha \|AA^T - I\|_F^2, \tag{6}$$
where $[[\cdot]]$ is an indicator function, $y$ is the ground-truth intent label for the utterance $x$, $\lambda$ is a down-weighting coefficient, and $m^+$ and $m^-$ are margins. $\alpha$ is a non-negative trade-off coefficient that encourages the discrepancies among different attention heads.
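A small numeric sketch may help make the loss concrete. It assumes the capsule margin loss of Sabour et al. (2017) for the max-margin part and the Frobenius-norm penalty ||AAᵀ − I||² of Lin et al. (2017) for the attention regularizer; the function name `intent_loss` and the toy inputs are hypothetical. The margins 0.9 and 0.1 and λ = 0.5 follow the values reported in the experiment setup, while the value of α below is an arbitrary illustration.

```python
import numpy as np

def intent_loss(V, y, A, alpha, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Max-margin loss on activation vectors plus attention regularizer.

    V     : (K, D_P) activation vectors, one per existing intent
    y     : index of the ground-truth intent
    A     : (R, T) self-attention weights
    alpha : trade-off coefficient of the regularizer (assumed value below)
    """
    norms = np.linalg.norm(V, axis=1)
    is_target = np.zeros_like(norms)
    is_target[y] = 1.0
    # Push the norm of the true intent above m_pos ...
    pos = is_target * np.maximum(0.0, m_pos - norms) ** 2
    # ... and the norms of all other intents below m_neg, down-weighted by lam.
    neg = lam * (1.0 - is_target) * np.maximum(0.0, norms - m_neg) ** 2
    # Penalize overlap between attention heads.
    reg = np.linalg.norm(A @ A.T - np.eye(A.shape[0]), ord="fro") ** 2
    return float(np.sum(pos + neg) + alpha * reg)

# Three intents; the true one (index 0) is strongly active, index 2 is a false alarm.
V = np.array([[0.95, 0.0], [0.05, 0.0], [0.5, 0.0]])
A_distinct = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # heads attend to different words
A_overlap = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # both heads attend to the same word

loss_distinct = intent_loss(V, y=0, A=A_distinct, alpha=0.1)
loss_overlap = intent_loss(V, y=0, A=A_overlap, alpha=0.1)
print(loss_distinct, loss_overlap)
```

In this toy case only the false alarm contributes to the margin term, and the overlapping attention heads incur an extra penalty through the regularizer.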

Zero-shot DetectionCaps
To detect emerging intents effectively, Zero-shot DetectionCaps are designed to transfer knowledge from existing intents to emerging intents.
Knowledge Transfer Strategies. As SemanticCaps are trained to extract semantic features from utterances with various existing intents, a self-attention head which has similar extraction behavior among existing and emerging intents may help transfer knowledge. For example, a self-attention head that extracts the "play" action mentioned by turn on/I want to hear in the beginning of an utterance for PlayMusic is helpful if it is also attentive to expressions for the "add" action like add/I want to have in the beginning of an utterance with an emerging intent AddToPlaylist.
The coupling coefficient $c_{kr}$ learned by DetectionCaps in a totally unsupervised fashion embodies rich knowledge of how informative the $r$-th semantic feature is to the existing intent $k$. We can capitalize on the existing routing information for emerging intents. For example, how the word play routes to GetWeather can be helpful in routing the word add to AddToPlaylist.
The intent labels also contain knowledge of how similar two intents are to each other. For example, an emerging intent AddToPlaylist can be closer to the existing intent PlayMusic than to GetWeather, because the embedding of Playlist is closer to those of Play or Music than to that of Weather.
Build Vote Vectors. As the routing information and the semantic extraction behavior are strongly coupled ($c_{kr}$ is calculated from $p_{k|r}$ iteratively in Lines 4-6 of Algorithm 1) and their products are summarized to get the activation vector $v_k$ for intent $k$ (Lines 5-6 of Algorithm 1), we denote the vectors before summation as vote vectors:
$$g_{k,r} = c_{kr}\, p_{k|r}, \tag{7}$$
where $g_{k,r}$ is the $r$-th vote vector for an existing intent $k$.
Zero-shot Dynamic Routing. The zero-shot dynamic routing utilizes vote vectors from existing intents to build intent representations for emerging intents via a similarity metric between existing intents and emerging intents.
Since there are $K$ existing intents and $L$ emerging intents, the similarities between existing and emerging intents form a matrix $Q \in \mathbb{R}^{L \times K}$. Specifically, the similarity between an emerging intent $z_l \in Z$ and an existing intent $y_k \in Y$ is computed as:
$$q_{lk} = \frac{\exp\{-d(e_{z_l}, e_{y_k})\}}{\sum_{k'=1}^{K} \exp\{-d(e_{z_l}, e_{y_{k'}})\}}, \tag{8}$$
where
$$d(e_{z_l}, e_{y_k}) = (e_{z_l} - e_{y_k})^T \Sigma^{-1} (e_{z_l} - e_{y_k}). \tag{9}$$
$e_{z_l}, e_{y_k} \in \mathbb{R}^{D_I \times 1}$ are intent embeddings computed by the sum of word embeddings of the intent label. $\Sigma$ models the correlations among intent embedding dimensions and we use $\Sigma = \sigma^2 I$. $\sigma$ is a hyper-parameter for scaling. The prediction vectors for emerging intents are thus computed as:
$$p_{l|r} = \sum_{k=1}^{K} q_{lk}\, g_{k,r}. \tag{10}$$
We feed the prediction vectors $p_{l|r}$ to Algorithm 1 and derive activation vectors $n_l$ on emerging intents as the output. The final intent representation $n_l$ for each emerging intent is updated toward the direction where it coincides with representative vote vectors. We can easily classify the utterance of emerging intents by choosing the activation vector with the largest norm: $\hat{z} = \arg\max_{z_l \in Z} \|n_l\|$.
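The zero-shot inference path can be sketched end to end as follows. The sketch assumes that the similarity q_lk is a softmax, over existing intents, of the negative distance d with Σ = σ²I, that vote vectors are the products of coupling coefficients and prediction vectors, and that the routing loop is the standard one from Sabour et al. (2017). All function names, toy sizes and random inputs are hypothetical and only illustrate the data flow.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, eps=1e-9):
    """Shrink each row of s to a norm in [0, 1) while keeping its orientation."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(P, n_iter=3):
    """Routing-by-agreement over prediction vectors P of shape (N, R, D_P)."""
    B = np.zeros(P.shape[:2])
    for _ in range(n_iter):
        C = softmax(B, axis=0)                      # over intents, per semantic capsule
        V = squash(np.einsum("nr,nrd->nd", C, P))
        B = B + np.einsum("nrd,nd->nr", P, V)
    return V, C

def intent_similarity(E_z, E_y, sigma=1.0):
    """Q (L, K): softmax over existing intents of -d, with Sigma = sigma^2 I."""
    diff = E_z[:, None, :] - E_y[None, :, :]        # (L, K, D_I)
    d = np.sum(diff ** 2, axis=-1) / sigma ** 2
    return softmax(-d, axis=1)

def zero_shot_detect(P_existing, E_z, E_y, sigma=1.0, n_iter=3):
    """Return the index of the predicted emerging intent and the matrix Q."""
    _, C = route(P_existing, n_iter)                # routing information on existing intents
    G = C[:, :, None] * P_existing                  # vote vectors g_{k,r}
    Q = intent_similarity(E_z, E_y, sigma)
    P_emerging = np.einsum("lk,krd->lrd", Q, G)     # prediction vectors for emerging intents
    N, _ = route(P_emerging, n_iter)                # activation vectors n_l
    return int(np.argmax(np.linalg.norm(N, axis=1))), Q

# Toy sizes and random inputs, for illustration only.
K, L, R, D_P, D_I = 3, 2, 4, 10, 6
rng = np.random.default_rng(0)
P_existing = rng.normal(scale=0.5, size=(K, R, D_P))
E_y = rng.normal(size=(K, D_I))                     # existing intent label embeddings
E_z = rng.normal(size=(L, D_I))                     # emerging intent label embeddings

z_hat, Q = zero_shot_detect(P_existing, E_z, E_y, sigma=1.0)

# A label embedding close to the first existing intent gets most of the similarity mass.
q_near = intent_similarity(np.array([[0.9, 0.1]]), np.array([[1.0, 0.0], [0.0, 1.0]]))
print(z_hat, Q.round(3), q_near.round(3))
```

A smaller σ sharpens each row of `Q` toward the nearest existing intent, while a larger σ spreads the transferred votes more evenly.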

Experiment Setup
To demonstrate the effectiveness of our proposed models, we apply INTENTCAPSNET to detect existing intents in an intent detection task, and use INTENTCAPSNET-ZSL to detect emerging intents in a zero-shot intent detection task.
Datasets. For each task, we evaluate our proposed models by applying them to two real-world datasets: SNIPS-NLU (https://github.com/snipsco/nlu-benchmark/) and CVA.
Baselines. We first compare the proposed capsule-based model INTENTCAPSNET with other text classification alternatives on the detection of existing intents: 1) TFIDF-LR/TFIDF-SVM: we use TF-IDF to represent the utterance and use logistic regression/support vector machine as classifiers. 2) CNN: a convolutional neural network (Kim, 2014) that uses convolution and pooling operations, which is popular for text classification. 3) RNN/GRU/LSTM/BiLSTM: we adopt different types of recurrent neural networks: the vanilla recurrent neural network (RNN), gated recurrent unit (GRU) (Tang et al., 2015), long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), and bi-directional long short-term memory (Bi-LSTM) (Schuster and Paliwal, 1997). Their last hidden states are used for classification. 4) Self-Attention Bi-LSTM: we apply a Bi-LSTM model with a self-attention mechanism (Lin et al., 2017) and the output sentence embedding is used for classification.
We also compare our proposed model INTENTCAPSNET-ZSL with different zero-shot learning strategies: 1) DeViSE (Frome et al., 2013) finds the most compatible emerging intent label for an utterance by learning a linear compatibility function between utterances and intents; 2) CMT (Socher et al., 2013) introduces non-linearity in the compatibility function. CMT and DeViSE are originally designed for zero-shot image classification based on pretrained CNN features.
We use LSTM to encode the utterance and adopt their zero-shot learning strategies in our task; 3) CDSSM (Chen et al., 2016a) uses CNN to extract character-level sentence features, where the utterance encoder shares the weights with the label encoder; 4) Zero-shot DNN (Kumar et al., 2017) further improves the performance of CDSSM by using separate encoders for utterances and intents. The proposed model INTENTCAPSNET-ZSL can be seen as a hybrid model: it has the advantage of the compatibility models in modeling the correlations between utterances and intents directly, and it also explicitly derives intent representations for emerging intents without labeled utterances.
Table 4: Zero-shot intention detection results using INTENTCAPSNET-ZSL on two datasets. All the metrics (Accuracy, Precision, Recall and F1) are reported as per-class averages weighted by class support.
The dimension of the prediction vector $D_P$ is 10 for both datasets. $D_I = D_W$ because we use the averaged word embeddings contained in the intent label as the intent embedding. An additional input dropout layer with a dropout keep rate of 0.8 is applied to the SNIPS-NLU dataset. In the loss function, the down-weighting coefficient $\lambda$ is 0.5, and the margins $m^+_k$ and $m^-_k$ are set to 0.9 and 0.1 for all the existing intents. The iteration number iter used in the dynamic routing algorithm is 3. The Adam optimizer (Kingma and Ba, 2014) is used to minimize the loss.
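The support-weighted averaging used for the reported metrics can be illustrated with a short, dependency-free sketch: per-class precision, recall and F1 are computed first and then averaged with weights proportional to the number of true utterances of each class. The function name `weighted_prf` and the toy labels are assumptions for illustration and are not taken from the evaluation code.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Per-class precision, recall and F1, averaged with class-support weights."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / total          # share of true utterances with this label
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return precision, recall, f1

# Toy example with two emerging intents.
y_true = ["AddToPlaylist", "AddToPlaylist", "AddToPlaylist", "RateBook"]
y_pred = ["AddToPlaylist", "AddToPlaylist", "RateBook", "RateBook"]

precision, recall, f1 = weighted_prf(y_true, y_pred)
print(precision, recall, f1)
```

With this weighting, the weighted recall coincides with the overall accuracy, which is 3 out of 4 in the toy example.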

Results
Quantitative Evaluation. The intention detection results on two datasets are reported in Table 1, where the proposed capsule-based model INTENTCAPSNET performs consistently better than bag-of-words classifiers using TF-IDF, as well as various neural network models designed for text classification. These results demonstrate the effectiveness of the proposed capsule-based model INTENTCAPSNET in modeling text for intent detection.
Also, we report results on the zero-shot intention detection task in Table 4, where our model INTENTCAPSNET-ZSL outperforms other baselines that adopt different zero-shot learning strategies. CMT has higher precision but low accuracy and recall on the SNIPS-NLU dataset. CDSSM fails on the CVA dataset, probably because the character-level model is suitable for an English corpus but not for CVA, which is in Chinese.
Ablation Study. To study the contribution of different modules of INTENTCAPSNET-ZSL for zero-shot intent detection, we also report ablation test results in Table 4. "w/o Self-attention" is the model without self-attention: the last forward/backward hidden states of the bi-LSTM recurrent encoder are used; "w/o Bi-LSTM" uses the LSTM with only a forward pass; "w/o Regularizer" does not encourage discrepancies among different self-attention heads: it adopts α = 0 in the loss function. Generally, from the lower part of Table 4 we can see that all modules contribute to the effectiveness of the model. On the SNIPS-NLU dataset, each of the three modules has a comparable contribution to the whole model (around 2-3% improvement in F1 score). On the CVA dataset, self-attention plays the most important role, giving the model a 5.2% improvement in F1 score.
Discriminative Emerging Intent Representations. Besides quantitative evidence supporting the effectiveness of INTENTCAPSNET-ZSL, we visualize activation vectors of emerging intents in Figure 3. Since the activation vectors of utterances with emerging intents are of high dimension and we are interested in their orientations, which indicate their intent properties, t-SNE is applied on the normalized activation vectors to reduce the dimension to 2. We color the utterances according to their ground-truth emerging intent labels.
As illustrated in Figure 3, INTENTCAPSNET-ZSL has the ability to learn discriminative intent representations for emerging intents in zero-shot DetectionCaps, so that utterances with different intents naturally have different orientations. Meanwhile, utterances of the same emerging intent but with nuances in expression lie close to each other in the t-SNE space. However, we do observe less satisfactory cases where the model mistakes the emerging intent DecreaseScreenBrightness (No. 9) for ReduceFontSize (No. 10) or SetColdColor (No. 11). When we check the activation vectors of intents in Figure 3, we also find that these three intents tend to have similar representations around the area (15, -5). We think this is due to their inherent similarity, as these three intents all try to tune display configurations.

Interpretability
Capsule models try to bring more interpretability when compared with traditional deep neural networks. We provide case studies here toward the interpretability of the proposed model in 1) extracting meaningful semantic features and 2) transferring knowledge from existing intents to emerging intents.
Extracting Meaningful Semantic Features. To show that SemanticCaps have the ability to extract meaningful semantic features from the utterance, we study the self-attention matrix A within the SemanticCaps and visualize the attention scores of utterances on both existing and emerging intents. From Table 5 we can see that each self-attention head almost always focuses on one unique semantic feature of the utterance. For example, in the intent of PlayMusic one self-attention head always focuses on the "play" action while another attention head focuses on musician names. We also observe that the learned attention adapts well to diverse expressions. For example, the self-attention head in PlayMusic is attentive to various mentions of musician names when they are followed by words like by, play and artist, even when named entities are not tagged and given to the model. The self-attention head that extracts the "search" action in SearchCreativeWork is able to be attentive to various expressions such as find, looking for and show.
Extraction-behavior Transfer by SemanticCaps. More importantly, we observe appealing extraction behaviors of SemanticCaps on utterances of emerging intents as well, even though they are not trained to perform semantic extraction on utterances of emerging intents.
Table 6: Example utterances of emerging intents, grouped by the semantic feature extracted by a self-attention head.
Emerging Intent: RateBook
• Rate Action
  - i d rate this novel a five
  - add the rating for this current series a four out of points
  - i give ruled britannia a rating of five out of
• Book Name
  - give the televised morality series a one
  - i want to give the coming of the terraphiles a rating of
  - the chronicle charlie peace earns stars from me
• Rating Score
  - rate the grisly wife three points out of five
  - i would give this current chronicle three points
  - this saga deserves a score of four
Emerging Intent: AddToPlaylist
• Song/Artist Name
  - add star light star bright to my jazz classics playlist
  - i want a song by john schlitt in the bajo las estrellas playlist
  - put sungmin into my summer playlist
• Playlist Name
  - add an album to my list la mejor msica dance
  - can you add danny carey to my masters of metal playlist
  - i want to put a copy of this tune into skatepark punks
From Table 6 we observe that the same self-attention head that extracts the "play" action in the existing intent PlayMusic is also attentive to words or phrases referring to the "rate" action in the emerging intent RateABook, such as rate, add the rating, and give. Other self-attention heads almost always focus on other aspects of the utterances, such as the book name or the actual rating score.
Such behavior not only shows that SemanticCaps have the capacity to learn an intent-independent semantic feature extractor, which extracts generalizable semantic features that either existing or emerging intent representations are built upon, but also indicates that SemanticCaps have the ability to transfer extraction behaviors among utterances of different intents.
Knowledge Transfer via Intent Similarity. Besides extracting semantic features and utilizing existing routing information, we use similarities between intent embeddings to help transfer vote vectors from INTENTCAPSNET to INTENTCAPSNET-ZSL. We study the similarity distribution of each emerging intent to all existing intents in Figure 4. The y axis is the zero-shot detection accuracy on each emerging intent in the CVA dataset. The x axis measures var(q l ), the variance of the similarity distribution of each emerging intent l to all the existing intents. If an emerging intent has a high variance in the similarity distribution, it means that some existing intents have higher similarities with this emerging intent than others: the model is more certain about which existing intent to transfer the similarity knowledge from, based on intent label similarities. In this case, 13 out of 20 emerging intents with high variances where var(q l ) > 0.005 always have a decent performance (Accuracy 0.83). In contrast, a low variance does not necessarily lead to less satisfactory performance, as some intents can rely on several existing intents more evenly for knowledge transfer, with less confidence in each.
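The variance statistic used in this analysis is straightforward to compute from the similarity matrix. The sketch below uses a made-up matrix with two emerging intents and four existing intents, and assumes the population variance over the existing intents; only the 0.005 threshold is taken from the text.

```python
import numpy as np

# Hypothetical similarity matrix Q: one row per emerging intent, one column per
# existing intent, each row summing to one.
Q = np.array([
    [0.70, 0.10, 0.10, 0.10],   # one existing intent clearly dominates
    [0.25, 0.25, 0.25, 0.25],   # similarity spread evenly over existing intents
])

# Population variance of each row (an assumption about how var(q_l) is defined).
var_q = Q.var(axis=1)

# Threshold on var(q_l) mentioned in the analysis above.
high_variance = var_q > 0.005
print(var_q, high_variance)
```

The first row corresponds to an emerging intent for which the model is fairly certain about which existing intent to transfer from, while the second relies on all existing intents evenly.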