A Self-Attentive Model with Gate Mechanism for Spoken Language Understanding

Spoken Language Understanding (SLU), which typically involves intent determination and slot filling, is a core component of spoken dialogue systems. Joint learning has been shown to be effective in SLU, given that slot tags and intents are supposed to share knowledge with each other. However, most existing joint learning methods share parameters only at the surface level rather than at the semantic level. In this work, we propose a novel self-attentive model with a gate mechanism to fully utilize the semantic correlation between slot and intent. Our model first obtains intent-augmented embeddings using a neural network with a self-attention mechanism. The intent semantic representation is then used as the gate for labelling slot tags. The objectives of both tasks are optimized simultaneously via joint learning in an end-to-end way. We conduct experiments on the popular ATIS benchmark. The results show that our model achieves state-of-the-art performance and outperforms other popular methods by a large margin in terms of both intent detection error rate and slot filling F1-score. This paper gives a new perspective for research on SLU.


Introduction
One long-term goal in the field of artificial intelligence is to build an intelligent human-machine dialogue system that is capable of understanding human language and giving smooth and correct responses. A typical dialogue system is designed to execute the following components: (i) automatic speech recognition converts a spoken query into a transcription, (ii) the spoken language understanding component analyzes the transcription to extract semantic representations, and (iii) the dialogue manager interprets the semantic information and decides the best system action, according to which the system response is generated as a natural language output (Jurafsky, 2000).
In this paper, we focus on spoken language understanding which is a core component of a spoken dialogue system. It typically involves two major tasks, intent determination and slot filling. Intent determination aims to automatically identify the intent of the user as expressed in natural language. Slot filling aims to extract relevant semantic constituents from the natural language sentence towards achieving a goal.
Usually, intent detection and slot filling are carried out separately. However, separate modeling of these two tasks cannot take full advantage of all supervised signals. Joint learning of intent detection and slot filling is worthwhile for three reasons. Firstly, the two tasks usually appear simultaneously in SLU systems. Secondly, the information of one task can be utilized in the other task so that they promote each other, and a joint prediction can be made (Zhang and Wang, 2016). For example, if the intent of an utterance is to find a flight, the utterance is likely to contain the departure and arrival cities, and vice versa. Lastly, slot tags and intents, as semantic representations of user behaviours, are supposed to share knowledge with each other.
Recently, joint models for intent detection and slot filling have achieved much progress. (Xu and Sarikaya, 2013) proposed using a CNN-based triangular CRF for joint intent detection and slot filling. (Guo et al., 2014) proposed using a recursive neural network that learns hierarchical representations of the input text for the joint task. (Liu and Lane, 2016b) describe a recurrent neural network (RNN) model that jointly performs intent detection, slot filling and language modeling. The model keeps updating the intent prediction as each word in the transcribed utterance arrives and uses it as a contextual feature in the joint model.
In this work, we propose a novel model for joint intent determination and slot filling by introducing self-attention and a gating mechanism. Our model can fully utilize the semantic correlation between slot and intent. To the best of our knowledge, this is the first attempt to utilize an intent-augmented embedding as a gate to guide the learning of the slot filling task. To fully evaluate the effectiveness of our model, we conduct experiments on the Airline Travel Information Systems (ATIS) dataset (Hemphill et al., 1990), which is popularly used as a benchmark in related work. Empirical results show that our independent model outperforms the previous best result by 0.56% in terms of F1-score on the slot filling task, and gives excellent performance on the intent detection task. Our joint model further improves the performance and achieves state-of-the-art results on both tasks. The rest of our paper is structured as follows: Section 2 discusses related work, Section 3 gives a detailed description of our model, Section 4 presents experimental results and analysis, and Section 5 summarizes this work and future directions.

Related Work
There is a long research history for spoken dialogue understanding, which emerged in the 1990s from some call classification systems (Gorin et al., 1997) and the ATIS project. In this section, we describe some typical works on intent classification and slot-filling, which are both core tasks of SLU (De Mori, 2007).
For the intent detection task, the early traditional method is to employ n-grams as features with generic entities, such as locations and dates (Zhang and Wang, 2016). This type of method is limited by the dimensionality of the input space. Another popular line of approaches is to train machine learning models on labeled training data (Young, 2002; Hahn et al., 2011). For example, SVM (Haffner et al., 2003) and Adaboost (Schapire and Singer, 2000) have been explored to improve intent detection. Approaches based on neural network architectures have also shown good performance on the intent detection task. Deep belief networks (DBNs) were first used in call routing classification (Deoras and Sarikaya, 2013). More recently, RNNs have shown excellent performance on the intent classification task (Ravuri and Stolcke, 2015).
For the slot filling task, traditional approaches are based on the conditional random field (CRF) architecture, which is strong at sequence labelling (Raymond and Riccardi, 2007). Recently, models based on neural networks and their extensions have shown excellent performance on the slot filling task and outperform traditional CRF models. For example, (Yao et al., 2013) proposed to take words as input in a standard recurrent neural network language model, and then to predict slot labels rather than words on the output side. (Yao et al., 2014b) improved RNNs by using transition features and the sequence-level optimization criterion of CRF to explicitly model dependencies of output labels. (Mesnil et al., 2013) investigated bidirectional and hybrid RNNs for slot filling. (Yao et al., 2014a) introduced the LSTM architecture for this task and obtained a marginal improvement over RNN. Besides, following the success of attention-based models in the NLP field, (Simonnet et al., 2015) applied the attention-based encoder-decoder to the slot filling task, but without LSTM cells.
Recently, there has been some work on learning intent detection and slot filling jointly with neural networks. Slot labels and intents, as semantics of user behaviors, are supposed to share knowledge with each other. (Guo et al., 2014) adapted recursive neural networks (RecNNs) for joint training of intent detection and slot filling. (Xu and Sarikaya, 2013) described a joint model for intent detection and slot filling based on convolutional neural networks (CNN). The proposed architecture can be perceived as a neural network version of the triangular CRF model (Tri-CRF). (Hakkani-Tür et al., 2016) proposed a single recurrent neural network architecture that integrates the three tasks (domain detection, intent detection and slot filling for multiple domains) in one model. (Liu and Lane, 2016a) proposed an attention-based neural network model for joint intent detection and slot filling. Their joint model obtained the best performance of 95.98% slot filling F1-score and 1.57% intent error rate on the ATIS dataset.
Despite the great progress those methods have achieved, it is still a challenging and open task for intent detection and slot filling. Therefore, we are motivated to design a powerful model, which can improve the performance of SLU systems.

Model
In this section, we present our model for the joint learning of intent detection and slot filling. Figure 1 gives an overview of our model. The first layer maps the input sequence into vectors by concatenating its word-level embeddings and character-level embeddings (obtained by convolution). We use these vectors as merged embeddings in downstream layers. In many situations, contextual information is useful in sequence labelling. We therefore leverage context-aware features at each time step; in particular, we use self-attention to produce context-aware representations of the embeddings. A bidirectional recurrent layer then takes the embeddings and context-aware vectors as input to produce hidden states. In the last step, we exploit the intent-augmented gating mechanism to match the slot label. The gate for a specific word is obtained by a linear transformation of the intent embedding and another contextual representation of this word computed by self-attention. We apply an element-wise product between the gate and each BiLSTM output.
Finally, a softmax layer is added to classify the slot labels on top of the gate layer. For simplicity, we only take the weighted average of BiLSTM outputs to predict the intent label.
The design of this structure is motivated by the effectiveness of multiplicative interaction among vectors and by the self-attention mechanism, which has been used successfully in a variety of tasks (Cheng et al., 2016; Vaswani et al., 2017; Lin et al., 2017). It also reflects our finding that the intent is highly correlated with the slot labels in some cases, so the semantics of the intent should be useful for detecting the slot labels.

Embedding Layer
We first convert the indexed words $w = (w_1, w_2, ..., w_T)$ to word-level embeddings $E^w = [e^w_1, e^w_2, ..., e^w_T]$ and character-level embeddings $E^c = [e^c_1, e^c_2, ..., e^c_T]$. Although word embeddings, as provided by well-pretrained GloVe or word2vec vectors, are sufficient for many NLP tasks, character-level information provides additional prior knowledge (e.g. morphemes) to the embedding learning procedure. Morphemically related words lie closer in the vector space, which is useful for identifying slot labels. Character embeddings also alleviate the out-of-vocabulary (OOV) problem in the testing phase. In this paper, we use the character-aware convolution layer of (Kim et al., 2016) for words. The character-level embeddings are generated by convolving over the characters of a word with multiple window sizes to extract n-gram features.
Let $\mathcal{C}$ be the vocabulary of characters and $\mathcal{V}$ the vocabulary of words. The dimensions of the character-level embedding and the word-level embedding are denoted as $d_c$ and $d_w$, respectively. For each word $w_t \in \mathcal{V}$, the characters in $w_t$ constitute the matrix $C_t \in \mathbb{R}^{d_c \times l}$, whose columns correspond to the $l$ character embeddings.
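A narrow convolution is applied between $C_t$ and a filter (or kernel) $H \in \mathbb{R}^{d_c \times w}$, where $w$ is the filter width. After adding a bias $b$ and applying a nonlinearity, we obtain a feature map $f_t \in \mathbb{R}^{l-w+1}$. The final n-gram feature is generated by taking the max-over-time:
$$f_t[i] = \tanh\left(\langle C_t[:, i:i+w-1], H \rangle + b\right) \quad (1)$$
$$c_t = \max_i f_t[i] \quad (2)$$
where $C_t[:, i:i+w-1]$ denotes the $i$-th to $(i+w-1)$-th columns of $C_t$, $\langle \cdot, \cdot \rangle$ is the Frobenius inner product as in (Kim et al., 2016), and the character-level embedding $e^c_t$ is made up of multiple $c_t$ generated by different convolution kernels.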
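The character-level convolution described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `char_cnn_feature` and `char_embedding`, the tanh nonlinearity (taken from Kim et al., 2016) and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def char_cnn_feature(C_t, H, b=0.0):
    """One filter over one word: narrow convolution, tanh, max-over-time.

    C_t: (d_c, l) character embeddings of a word, one column per character.
    H:   (d_c, w) convolution filter of width w.
    Returns the scalar n-gram feature c_t for this filter.
    """
    l = C_t.shape[1]
    w = H.shape[1]
    if l < w:
        raise ValueError("word is shorter than the filter; pad it first")
    # Feature map of length l - w + 1: inner product of each window with H.
    f_t = np.array([np.tanh(np.sum(C_t[:, i:i + w] * H) + b)
                    for i in range(l - w + 1)])
    return f_t.max()

def char_embedding(C_t, filters):
    """Stack the features of all filters into the character-level embedding."""
    return np.array([char_cnn_feature(C_t, H) for H in filters])

rng = np.random.default_rng(0)
d_c, l = 8, 6                                   # toy sizes, not the paper's
C_t = rng.normal(size=(d_c, l))
# Five hypothetical filters for each of the widths 2, 3 and 4.
filters = [rng.normal(size=(d_c, w)) for w in (2, 3, 4) for _ in range(5)]
e_c = char_embedding(C_t, filters)              # shape (15,)
```

Each filter contributes one scalar, so the embedding length equals the number of filters; grouping filters by width gives the per-width parts that the self-attention layer later treats separately.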

Self-Attention
The attention mechanism is usually used to guide the forming of a sentence embedding, where extra knowledge is used to weigh the CNN or LSTM hidden states (e.g. document words sometimes attend to question information). In the slot filling task, however, the input to our model is just one sequence. The attention mechanism used here is therefore called self-attention: the word at each time step attends to all the words in the sentence, which helps to determine which region is likely to be a slot. Since the embedding at each time step consists of multiple parts (i.e. the word embedding and character embeddings of different kernel widths), each part has its own semantic meaning. As shown in Figure 2, we divide the embedding into multiple parts, and the attention of each part is processed within its corresponding dimensions. In this approach, we restrict the interaction among different aspects of the embedding. We hypothesize that different semantic parts are relatively independent and play different roles in our network.
Suppose $M \in \mathbb{R}^{d_m \times T}$ is the matrix containing the sentence hidden vectors $[m_1, ..., m_T]$, where $d_m$ is the dimension of these $T$ vectors. Considering the characteristics of the slot filling task, our aim is to encode each hidden vector into a context-aware representation. We achieve that by using attention over all the sentence hidden vectors $M$. Firstly, we linearly map all the vectors in $M$ to three feature spaces by different projection parameters $W_a$, $W_b$ and $W_c$, so the resulting matrices are expressed as $M_a$, $M_b$ and $M_c$ with the same shape as $M$. These projection matrices are shared across all time steps. Considering the structure of the embedding, which consists of $K$ different parts (we use 4 kinds of embeddings with the same dimension), the transformed matrices are equally split into $K$ parts. The attention weight is then computed by the dot product between $M_a$ and $M_b$. Lastly, the attention output is a weighted sum of $M_c$. Specifically, for each part $k = 1, ..., K$:
$$\alpha_{k,t} = \mathrm{softmax}\left(m_{k,a,t}^{\top} M_{k,b}\right) \quad (3)$$
$$m^{att}_{k,t} = M_{k,c}\, \alpha_{k,t}^{\top} \quad (4)$$
$$m^{att}_t = [m^{att}_{1,t}; ...; m^{att}_{K,t}] \quad (5)$$
where $M_{k,a} \in \mathbb{R}^{(d_m/K) \times T}$ is the $k$-th part of $M_a$, which is transformed from $M$ by $W_a$. The index $t$ is the word position ranging over $T$ time steps, and $m_{k,a,t} \in \mathbb{R}^{d_m/K}$ is the $t$-th column of $M_{k,a}$. $\alpha_{k,t}$ is the vector of attention weights over $M_{k,c}$. The output of the self-attention module at time step $t$ is the concatenation of the $K$ parts, as in Equation 5.
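As a rough illustration of the part-wise self-attention described above, the following NumPy sketch projects $M$ three ways, splits each projection into $K$ row blocks and attends within each block. The function name `split_self_attention`, the unscaled dot product and the toy sizes are assumptions for the example; the text does not state whether the scores are scaled.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_self_attention(M, W_a, W_b, W_c, K):
    """M: (d_m, T); W_a, W_b, W_c: (d_m, d_m). Returns (d_m, T)."""
    d_m, T = M.shape
    assert d_m % K == 0, "d_m must be divisible by the number of parts"
    M_a, M_b, M_c = W_a @ M, W_b @ M, W_c @ M
    parts = []
    for Ma_k, Mb_k, Mc_k in zip(np.split(M_a, K), np.split(M_b, K),
                                np.split(M_c, K)):
        scores = Ma_k.T @ Mb_k                  # (T, T): word t against word s
        alpha = softmax(scores, axis=1)         # each row sums to 1
        # Column t of the result is the alpha[t]-weighted sum of Mc_k columns.
        parts.append(Mc_k @ alpha.T)            # (d_m / K, T)
    return np.concatenate(parts, axis=0)        # concatenate the K parts

rng = np.random.default_rng(0)
d_m, T, K = 8, 5, 4                             # toy sizes
M = rng.normal(size=(d_m, T))
W_a, W_b, W_c = (rng.normal(size=(d_m, d_m)) for _ in range(3))
context = split_self_attention(M, W_a, W_b, W_c, K)   # shape (8, 5)
```

Because the scores are computed inside each block, a block never attends using information from another block, which is the restriction on cross-part interaction that the section motivates.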

BiLSTM
Character embeddings and word embeddings are both important features in our task. To further utilize these features, we associate each embedding with a context-aware representation, which is implemented by the self-attention mechanism. For the current word $w_t$, the input of the recurrent layer at time step $t$ is represented as $x_t$:
$$x_t = e^c_t \oplus e^w_t \oplus e^a_t$$
where $\oplus$ denotes concatenation and $e^a_t$ is the context-aware vector of $w_t$, obtained by applying the self-attention mechanism to the concatenated embeddings $E = [e^c_1 \oplus e^w_1, ..., e^c_T \oplus e^w_T]$. Plain RNNs are difficult to train to capture long-term dependencies because the gradients tend to either vanish or explode. Therefore, more sophisticated activation functions with gating units were designed. We use LSTM (Hochreiter and Schmidhuber, 1997) in this work:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $\odot$ denotes the element-wise product of two vectors. To consider both the previous history and the future history, we use a BiLSTM as the encoder. The bidirectional LSTM (BiLSTM), a modification of the LSTM, consists of a forward and a backward LSTM. The encoder reads the input vectors $x = (x_1, x_2, ..., x_T)$ and generates $T$ hidden states by concatenating the forward and backward hidden states of the BiLSTM:
$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t$$
where $\overleftarrow{h}_t$ is the hidden state of the backward pass and $\overrightarrow{h}_t$ is the hidden state of the forward pass of the BiLSTM at time $t$.
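For readers who want to see the recurrence concretely, here is a small NumPy sketch of a standard LSTM run in both directions and concatenated. It assumes the textbook LSTM formulation with the four gate blocks stacked in one weight matrix; the helper names `lstm_pass` and `bilstm` and the toy sizes are illustrative, and the paper's TensorFlow cell may differ in details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(X, W, U, b, d):
    """One direction. X: (T, d_in); W: (4d, d_in); U: (4d, d); b: (4d,)."""
    h, c = np.zeros(d), np.zeros(d)
    states = []
    for x_t in X:
        z = W @ x_t + U @ h + b                 # all four gates at once
        i = sigmoid(z[:d])                      # input gate
        f = sigmoid(z[d:2 * d])                 # forget gate
        o = sigmoid(z[2 * d:3 * d])             # output gate
        g = np.tanh(z[3 * d:])                  # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)                     # (T, d)

def bilstm(X, fwd, bwd, d):
    """Concatenate forward states with re-aligned backward states."""
    h_f = lstm_pass(X, *fwd, d)
    h_b = lstm_pass(X[::-1], *bwd, d)[::-1]     # run reversed, then flip back
    return np.concatenate([h_f, h_b], axis=1)   # (T, 2d)

rng = np.random.default_rng(0)
T, d_in, d = 6, 10, 4                           # toy sizes
X = rng.normal(size=(T, d_in))

def make_params():
    return (rng.normal(size=(4 * d, d_in)) * 0.1,
            rng.normal(size=(4 * d, d)) * 0.1,
            np.zeros(4 * d))

fwd, bwd = make_params(), make_params()
H_out = bilstm(X, fwd, bwd, d)                  # shape (6, 8)
```

The backward half of the last row depends only on the final input, while the forward half of that row has seen the whole sentence, which is why the two halves are complementary.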

Intent-Augmented Gating Mechanism
As described above, intent information is useful for the slot filling task. To measure the probability of words belonging to target slots and to attend to the ones relevant to the intent, we add a gate to the output of the BiLSTM layer. Let $H \in \mathbb{R}^{2d \times T}$ be a matrix consisting of the hidden vectors $[h_1, ..., h_T]$ produced by the BiLSTM. For each word, we use the self-attention mechanism to form another context-aware representation $s_t$. The gate vector $h^*_t$ is calculated by transforming the concatenation of this context-aware representation and the intent embedding vector $v^{int}$ with a multi-layer perceptron (MLP) network. The intent label is given by the correct label during the training phase, and by the output of the intent classification layer in the test phase. Specifically, for $t = 1, ..., T$:
$$s_t = \mathrm{SelfAtt}(h_t, H)$$
$$h^*_t = \mathrm{MLP}\left([s_t \oplus v^{int}]\right)$$
$$o_t = h_t \odot h^*_t$$
That is, we use element-wise multiplication to model the interaction between the BiLSTM outputs and the gate vector.
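The gating step can be sketched as follows. The text fixes only the overall shape of the computation (an MLP over the concatenated context vector and intent embedding, then an element-wise product with the BiLSTM output); the one-hidden-layer MLP, the tanh and sigmoid activations, and the name `intent_gate` are assumptions made for this illustration.

```python
import numpy as np

def intent_gate(H, S, v_int, W1, b1, W2, b2):
    """Gate BiLSTM outputs with an intent-aware vector.

    H:     (2d, T) BiLSTM outputs.
    S:     (2d, T) context-aware representation of H (e.g. from self-attention).
    v_int: (d_int,) intent embedding, shared by all words of the utterance.
    W1, b1, W2, b2: parameters of an assumed two-layer MLP.
    """
    T = H.shape[1]
    V = np.tile(v_int[:, None], (1, T))             # repeat intent per word
    Z = np.concatenate([S, V], axis=0)              # (2d + d_int, T)
    hidden = np.tanh(W1 @ Z + b1[:, None])          # assumed hidden activation
    gate = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2[:, None])))  # assumed sigmoid
    return H * gate                                  # element-wise product

rng = np.random.default_rng(0)
two_d, T, d_int, d_hid = 4, 3, 2, 5                 # toy sizes
H = rng.normal(size=(two_d, T))
S = rng.normal(size=(two_d, T))
v_int = rng.normal(size=d_int)
W1, b1 = rng.normal(size=(d_hid, two_d + d_int)), np.zeros(d_hid)
W2, b2 = rng.normal(size=(two_d, d_hid)), np.zeros(two_d)
O = intent_gate(H, S, v_int, W1, b1, W2, b2)        # shape (4, 3)
```

With a sigmoid output the gate lies in (0, 1), so it can only attenuate each hidden dimension; a different output activation would change that behaviour.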

Task Learning
The bidirectional recurrent layer converts a sequence of words $w = (w_1, w_2, ..., w_T)$ into hidden states $H = [h_1, ..., h_T]$, which are shared by the two tasks. We use a simple attention pooling function, denoted as $f_{att}$, over $H$ to get an attention-sum vector for intent label classification. The classified label $y^{int}$ is transformed to an embedding $v^{int}$ by the matrix $E^{int}$ for gate computing.
During the training phase, model parameters are updated w.r.t. a cross-entropy loss between the predicted probabilities and the true label. The label with maximum probability is selected as the predicted intent during the testing phase. For the other task, the hidden states processed by our gating layer are used for predicting slot labels.
Slot filling can be defined as a sequence labelling problem, which is to map an utterance sequence $w = (w_1, ..., w_T)$ to its corresponding slot label sequence $y = (y_1, ..., y_T)$. The objective is to maximize the likelihood of the label sequence:
$$p(y \mid w) = \prod_{t=1}^{T} p(y_t \mid w)$$
This is equivalent to minimizing the negative log-likelihood (NLL) of the correct labels for the predicted sequence $y^{slot}$.
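To make the two objectives concrete, the sketch below pools the shared hidden states with attention for the intent classifier, applies a per-word softmax for the slot classifier, and adds the two negative log-likelihoods. The form of the pooling function (a single learned scoring vector `u`), the unweighted sum of the two losses and all names are assumptions; the text only states that both objectives are optimized simultaneously.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, u):
    """H: (2d, T) hidden states; u: (2d,) scoring vector. Returns (2d,)."""
    weights = softmax(u @ H)                    # one weight per time step
    return H @ weights                          # weighted sum of columns

def joint_loss(H, O, u, W_int, W_slot, y_int, y_slot):
    """H: shared states; O: gated states; y_int: int; y_slot: (T,) ints."""
    p_int = softmax(W_int @ attention_pool(H, u))        # (n_intents,)
    p_slot = softmax((W_slot @ O).T, axis=1)             # (T, n_slots)
    loss_int = -np.log(p_int[y_int])
    loss_slot = -np.sum(np.log(p_slot[np.arange(len(y_slot)), y_slot]))
    return loss_int + loss_slot                 # assumed unweighted sum

rng = np.random.default_rng(0)
two_d, T, n_intents, n_slots = 4, 3, 18, 127    # label counts from ATIS
H = rng.normal(size=(two_d, T))
O = rng.normal(size=(two_d, T))
u = rng.normal(size=two_d)
W_int = rng.normal(size=(n_intents, two_d))
W_slot = rng.normal(size=(n_slots, two_d))
loss = joint_loss(H, O, u, W_int, W_slot, 2, np.array([0, 5, 126]))
```

With all weights set to zero both classifiers are uniform, and the loss reduces to log(18) + 3·log(127), which is a handy sanity check for an implementation.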

Dataset
To evaluate the effectiveness of our proposed model, we conduct experiments on the ATIS (Airline Travel Information Systems) dataset, which is widely used as a benchmark in SLU research (Price, 1990). Figure 3 gives an example sentence from the ATIS dataset. The words are labelled with their values according to certain semantic frames. The slot labels of the words are represented in Inside-Outside-Beginning (IOB) format, and the intent is highlighted with a box surrounding it. In this paper, we use the ATIS corpus setting of previous related work (Liu and Lane, 2016a; Mesnil et al., 2015; Liu and Lane, 2015; Xu and Sarikaya, 2013; Tur et al., 2010). The training set contains 4978 utterances from the ATIS-2 and ATIS-3 datasets, and the test set contains 893 utterances from the ATIS-3 NOV93 and DEC94 datasets. There are 127 slot labels and 18 intent types.
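As an illustration of the IOB format mentioned above, the following sketch groups a tagged utterance into slot values. The example sentence and its tags are made up for this illustration in the style of ATIS labels, and `iob_to_slots` is a hypothetical helper, not part of the dataset tooling.

```python
def iob_to_slots(words, tags):
    """Group IOB tags into (slot_name, text) pairs."""
    slots, name, buf = [], None, []
    for word, tag in zip(words, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != name)
        if starts_new:
            if name is not None:                 # close the open slot first
                slots.append((name, " ".join(buf)))
            name, buf = tag[2:], [word]
        elif tag.startswith("I-"):
            buf.append(word)                     # continue the open slot
        else:                                    # "O": outside any slot
            if name is not None:
                slots.append((name, " ".join(buf)))
            name, buf = None, []
    if name is not None:
        slots.append((name, " ".join(buf)))
    return slots

words = "flights from boston to new york today".split()
tags = ["O", "O", "B-fromloc.city_name", "O",
        "B-toloc.city_name", "I-toloc.city_name",
        "B-depart_date.today_relative"]
slots = iob_to_slots(words, tags)
# [('fromloc.city_name', 'boston'), ('toloc.city_name', 'new york'),
#  ('depart_date.today_relative', 'today')]
```

An `I-` tag that does not continue the open slot is treated as the start of a new one, which is a lenient reading of malformed sequences.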

Metrics
The performance of the slot filling task is measured by the F1-score, while the intent detection task is evaluated by the prediction error rate, i.e. the fraction of test utterances whose intent is predicted incorrectly.
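Both metrics are easy to state in code. The sketch below assumes chunk-level F1 in the style of the CoNLL evaluation script, where a slot counts as correct only if its span and label both match; the text does not specify the exact scoring script, so this is one common reading. The function names are illustrative.

```python
def chunks(tags):
    """Return the set of (start, end, slot_name) chunks in an IOB sequence."""
    found, start, name = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):     # sentinel closes last chunk
        inside = tag.startswith("I-") and tag[2:] == name
        if not inside:
            if name is not None:
                found.add((start, i, name))
            if tag == "O":
                start, name = None, None
            else:
                start, name = i, tag[2:]
    return found

def slot_f1(gold_seqs, pred_seqs):
    """Micro-averaged chunk-level F1 over a list of tag sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = chunks(gold), chunks(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

def intent_error_rate(gold, pred):
    """Fraction of utterances whose predicted intent is wrong."""
    return sum(g != p for g, p in zip(gold, pred)) / len(gold)

gold_tags = [["O", "B-a", "I-a", "O", "B-b"]]
pred_tags = [["O", "B-a", "I-a", "O", "O"]]
f1 = slot_f1(gold_tags, pred_tags)       # precision 1.0, recall 0.5
err = intent_error_rate(["atis_flight", "atis_airfare"],
                        ["atis_flight", "atis_flight"])
```

In this toy case one of two gold slots is found with no false positives, so precision is 1, recall is 0.5 and F1 is 2/3, while one of two intents is wrong.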

Training Details
We preprocess the ATIS dataset following (Yao et al., 2013; Liu and Lane, 2016a). To deal with unseen words in the test set, we mark words that appear only once in the training set as UNK, and use this label to represent unseen words in the test set. In addition, each number is converted to the string DIGIT.
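The preprocessing rules above can be sketched as follows. This is one reading of the description: a token made up entirely of digits becomes `DIGIT`, and any token seen at most once in training becomes `UNK`. The helper names and the toy sentences are hypothetical.

```python
from collections import Counter

def normalize(token):
    """Map purely numeric tokens to the placeholder DIGIT."""
    return "DIGIT" if token.isdigit() else token

def build_vocab(train_sents):
    """Keep only tokens that occur more than once in the training set."""
    counts = Counter(normalize(tok) for sent in train_sents for tok in sent)
    return {tok for tok, n in counts.items() if n > 1}

def encode(sent, vocab):
    """Replace singleton and unseen tokens with UNK."""
    return [tok if tok in vocab else "UNK"
            for tok in (normalize(t) for t in sent)]

train = [["flights", "to", "boston"],
         ["flights", "to", "denver", "at", "9"],
         ["fare", "to", "boston", "at", "10"]]
vocab = build_vocab(train)
encoded_train = encode(train[1], vocab)
encoded_test = encode(["flights", "to", "dallas", "at", "7"], vocab)
# Both come out as ['flights', 'to', 'UNK', 'at', 'DIGIT'].
```

Mapping training singletons to `UNK` gives the model examples of the placeholder during training, so it has a learned embedding for words it meets for the first time at test time.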
The model is implemented in the Tensorflow framework (Abadi et al., 2016). At training stage, we use LSTM cell as suggested in (Sutskever et al., 2014) and the cell dimension d is set to be 128 for both the forward and backward LSTM.
We set the dimension of the word embedding $d_w$ to 64 and the dimension of the character embedding $d_c$ to 128. We generate three character-level embeddings using multiple widths and filters (convolution kernel widths $w \in \{2, 3, 4\}$ with 64 filters each), followed by a max pooling layer over time. The dimension of the concatenated embeddings is therefore 256. We make the dimensions of all parts equal for the convenience of dimension splitting during self-attention at a later stage. All the parameters in the network are randomly initialized with a uniform distribution (Sussillo and Abbott, 2014) and are fine-tuned during training. We update parameters with stochastic gradient descent (SGD), with the learning rate controlled by the Adam algorithm (Kingma and Ba, 2014). The model is trained on all the training data with a mini-batch size of 16. To help the model generalize well, the maximum norm for gradient clipping is set to 5. We also apply layer normalization (Ba et al., 2016) on the self-attention layer after we add a residual connection between the output and input. Meanwhile, a dropout rate of 0.5 is applied on the recurrent cell projection layer (Zaremba et al., 2014) and on each attention activation.
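Two details of this setup can be checked in a few lines: the arithmetic behind the 256-dimensional merged embedding, and gradient clipping with a maximum norm of 5. The clipping sketch assumes the norm is taken jointly over all gradients (global-norm clipping); the text does not say whether clipping is global or per tensor, and `clip_by_global_norm` here is a plain NumPy stand-in, not a framework call.

```python
import numpy as np

# Merged embedding size: one word part plus three character parts of 64 each.
d_w = 64
kernel_widths = (2, 3, 4)
filters_per_width = 64
d_merged = d_w + len(kernel_widths) * filters_per_width   # 64 + 3 * 64 = 256
K = 1 + len(kernel_widths)                                 # 4 equal parts

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together so their joint L2 norm is <= max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return [g.copy() for g in grads], global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]           # joint norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Scaling every gradient by the same factor keeps the update direction unchanged and only shortens the step, which is the usual motivation for clipping by a joint norm.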

Independent Learning
The results of separate training for slot filling and intent detection are reported in Table 1 and Table 2, respectively. For the independent slot filling task, we fix the intent information to the ground-truth labels in the dataset. For the independent intent detection task, there is no interaction with slot labels.
Table 1 compares the slot filling F1-score of our proposed architecture with previous work. Our model achieves state-of-the-art results and outperforms the previous best model by 0.56% in terms of F1-score. We attribute the improvement of our model to the following reasons: 1) The attention used in (Liu and Lane, 2016a) is vanilla attention, which is used to compute the decoding states. It is not suitable for our model since the embeddings are composed of several parts. Self-attention allows the model to attend to information jointly from different representation parts, so as to better understand the utterance. 2) The intent-augmented gating layer connects the semantics of sequence slot labels, which captures complex interactions between the two tasks.
Table 1: Slot filling F1-score on ATIS (independent learning).
Methods                                              F1-score
CRF (Mesnil et al., 2013)                            92.94
simple RNN (Yao et al., 2013)                        94.11
CNN-CRF (Xu and Sarikaya, 2013)                      94.35
LSTM (Yao et al., 2013)                              94.85
RNN-SOP (Liu and Lane, 2015)                         94.89
Deep LSTM (Yao et al., 2013)                         95.08
RNN-EM (Peng et al., 2015)                           95.25
Bi-RNN with Ranking Loss (Vu et al., 2016)           95.47
Encoder-labeler Deep LSTM (Kurata et al., 2016)      95.66
Attention BiRNN (Liu and Lane, 2016a)                95.75
BLSTM-LSTM (focus) (Zhu and Yu, 2017)                95.79
Our Model                                            96.35
Table 2 compares the performance of our proposed model to previously reported results on the intent detection task. Our model gives good performance in terms of classification error rate, but not as good as the Attention Encoder-Decoder (with aligned inputs) method (Liu and Lane, 2016a). Their published state-of-the-art result is obtained with an attention-based model built on word-level embeddings, whereas our model introduces character-level embeddings to improve the performance of joint learning. Independent learning for intent classification, however, aims at capturing the global information of an utterance and depends little on the details of specific words. The character-level embeddings introduced in our model therefore slightly hurt independent learning of intent detection, as a trade-off in performance between the two criteria.

Joint Learning
We compare our model against the following baseline models based on joint learning:
• Recursive NN + Viterbi: (Guo et al., 2014) applied the Viterbi algorithm on Recursive NN to improve the result on slot filling.
• Attention Enc-Dec: (Liu and Lane, 2016a) proposed Attention Encoder-Decoder (with aligned inputs) which introduced context vector as the explicit aligned inputs at each decoding step.
• Attention BiRNN: (Liu and Lane, 2016a) introduced attention to the alignment-based RNN sequence labeling model. Such attention provides additional information to the intent classification and slot label prediction.

Ablation Study
The ablation study is performed to evaluate whether and how each part of our model contributes to the full model. To further evaluate the advantages of our gating architecture for joint learning, we ablate three important components. Note that all the variants are based on joint learning with the intent-augmented gate:
• W/O char-embedding, where no character embeddings are added to the embedding layer. The embedding layer is composed of word embeddings only.
• W/O self-attention, where no self-attention is modelled after the embedding layer or in the intent-augmented gating layer. The intent gate is computed from the output of the BiLSTM and the intent embedding.
• W/O attention-gating, where no self-attention mechanism is performed in the intent-augmented gating layer. The gate is computed from the output of the BiLSTM and the intent embedding, but we still use self-attention on top of the embedding layer to augment the context information.
Table 4 shows the joint learning performance of our model on the ATIS dataset when removing one module at a time. We find that all variants of our model perform well based on our gate mechanism. As listed in the table, all features contribute to both the slot filling and the intent classification task. If we remove self-attention from the whole model or just from the intent-augmented gating layer, the performance drops dramatically. This result can be interpreted as follows: the self-attention mechanism computes the context representation of each part separately and enhances the interaction of features within the same aspect. Self-attention thus improves performance by a large margin, which is consistent with the findings of previous work (Vaswani et al., 2017; Lin et al., 2017). If we remove character-level embeddings and only use word-level embeddings, we see a 0.22% drop in terms of F1-score. Though word-level embeddings represent the semantics of each word, character-level embeddings can better handle the out-of-vocabulary (OOV) problem, which is essential for determining the slot labels.

Conclusion
In this paper, we propose a novel self-attentive model gated with intent for spoken language understanding. We apply joint learning to both the intent detection and slot filling tasks. In our model, the self-attention mechanism is introduced to better represent the semantics of the utterance, and the gate mechanism is introduced to make full use of the semantic correlation between slot and intent. Experimental results on the ATIS dataset show the effectiveness of our model, which outperforms the state-of-the-art approach on both tasks. Besides, our model also shows a consistent performance gain over the independently trained models. In future work, we plan to improve our model by introducing extra knowledge.