A Generative Attentional Neural Network Model for Dialogue Act Classification

We propose a novel generative neural network architecture for Dialogue Act classification. Building upon the Recurrent Neural Network framework, our model incorporates a new attentional technique and a label-to-label connection for sequence learning, akin to Hidden Markov Models. Our experiments show that both innovations help our model outperform strong baselines for dialogue act classification on the MapTask and Switchboard corpora. We further analyse empirically the effectiveness of each innovation.


Introduction
Dialogue Act (DA) classification is a sequence-to-sequence learning task where a sequence of utterances is mapped into a sequence of DAs. Some works in DA classification treat each utterance as an independent instance (Julia et al., 2010; Gambäck et al., 2011), which ignores important long-range dependencies in the dialogue history. Other works have captured inter-utterance relationships using models such as Hidden Markov Models (HMMs) (Stolcke et al., 2000; Surendran and Levow, 2006) or Recurrent Neural Networks (RNNs) (Kalchbrenner and Blunsom, 2013; Ji et al., 2016), where RNNs have been particularly successful.
In this paper, we present a generative model of utterances and dialogue acts which conditions on the relevant part of the dialogue history. To this effect, we use the attention mechanism developed originally for sequence-to-sequence models, which has proven effective in Machine Translation (Luong et al., 2015) and DA classification (Shen and Lee, 2016). The intuition is that different parts of an input sequence have different levels of importance with respect to the objective, and this mechanism enables the selection of the important parts. However, the traditional attention mechanism suffers from the attention-bias problem (Wang et al., 2016), where the attention mechanism tends to favor the inputs at the end of a sequence. To address this problem, we propose a gated attention mechanism, where the attention signal is represented as a gate over the input vector.
In addition, when generating a dialogue act, we capture its direct dependence on the previous dialogue act. This is a reasonable source of information which, surprisingly, has not been explored in the RNN literature for DA classification.
Our experiments show that our model significantly outperforms variants that do not have our innovations, i.e., the gated attention mechanism and direct label-to-label dependency.

Model Description
Assume that we have a training dataset $D$ comprising a collection of dialogues, where each dialogue consists of a sequence of utterances $\{y_t\}_{t=1}^{T}$ and the corresponding sequence of dialogue acts $\{z_t\}_{t=1}^{T}$. Each utterance $y_t$ is a sequence of tokens, and its $n$-th token is denoted $y_{t,n}$.
We propose a generative neural model for dialogue, $P_{\boldsymbol{\Theta}}(y_{1:T}, z_{1:T})$, which specifies a joint probability distribution over a sequence of utterances $y_{1:T}$ and the corresponding sequence of dialogue acts $z_{1:T}$. This generative model is then trained discriminatively by maximising the conditional log-likelihood
$$\sum_{(y_{1:T},\, z_{1:T}) \in D} \log P_{\boldsymbol{\Theta}}(z_{1:T} \mid y_{1:T}),$$
where $\boldsymbol{\Theta}$ represents all neural network parameters. Discriminative training is employed in order to match the use of the model for predicting dialogue acts at test time, using $\arg\max_{z_{1:T}} P_{\boldsymbol{\Theta}}(z_{1:T} \mid y_{1:T})$.
The generative story of our model is as follows: (1) generate the dialogue act of the current dialogue turn conditioned on the previous dialogue act and the previous utterance, $P_{\boldsymbol{\Theta}}(z_t \mid z_{t-1}, y_{t-1})$; and (2) generate the current utterance conditioned on the previous utterance and the current dialogue act, $P_{\boldsymbol{\Theta}}(y_t \mid z_t, y_{t-1})$. In other words, $P_{\boldsymbol{\Theta}}(z_{1:T}, y_{1:T})$ is decomposed as:
$$\prod_{t=1}^{T} P_{\boldsymbol{\Theta}}(z_t \mid z_{t-1}, y_{t-1})\, P_{\boldsymbol{\Theta}}(y_t \mid z_t, y_{t-1}). \quad (1)$$
Furthermore, each utterance is generated by a sequential process whereby each token $y_{t,n}$ is conditioned on all the previously generated tokens $y_{t,<n}$, as well as the external conditioning context consisting of the dialogue act $z_t$ and the previous turn's utterance $y_{t-1}$, i.e.,
$$P_{\boldsymbol{\Theta}}(y_t \mid z_t, y_{t-1}) = \prod_{n=1}^{|y_t|} P_{\boldsymbol{\Theta}}(y_{t,n} \mid y_{t,<n}, z_t, y_{t-1}). \quad (2)$$
Importantly, the decomposition of the joint distribution in Equation 1 allows dynamic programming for exact decoding (§2.2). One possible extension of our framework is a higher-order Markov model, although one needs to weigh the increase in the computational complexity of training and decoding against the potential gain in classification quality.
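The factorisation in Equation 1 and the discriminative objective can be illustrated with a minimal sketch. It assumes the two factors have already been evaluated by the neural components and stored in hypothetical tables `log_trans[t][prev][cur]` (for $\log P(z_t \mid z_{t-1}, y_{t-1})$) and `log_emit[t][cur]` (for $\log P(y_t \mid z_t, y_{t-1})$). The single row `log_trans[0][0]` stands in for a start symbol, and the normaliser is computed with an HMM-style forward recursion. These names and the start-symbol convention are assumptions of this sketch, not details given in the text.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def joint_log_prob(acts, log_trans, log_emit):
    """log P(z_{1:T}, y_{1:T}) as the sum of the log-factors in Equation 1."""
    total, prev = 0.0, 0  # prev = 0 selects the start row at t = 0 (assumption)
    for t, z in enumerate(acts):
        total += log_trans[t][prev][z] + log_emit[t][z]
        prev = z
    return total

def log_marginal(log_trans, log_emit):
    """log sum_z P(z_{1:T}, y_{1:T}), via a forward recursion over dialogue turns."""
    K = len(log_emit[0])
    alpha = [log_trans[0][0][k] + log_emit[0][k] for k in range(K)]
    for t in range(1, len(log_emit)):
        alpha = [
            logsumexp([alpha[j] + log_trans[t][j][k] for j in range(K)]) + log_emit[t][k]
            for k in range(K)
        ]
    return logsumexp(alpha)

def conditional_log_prob(acts, log_trans, log_emit):
    """log P(z_{1:T} | y_{1:T}), the quantity maximised in discriminative training."""
    return joint_log_prob(acts, log_trans, log_emit) - log_marginal(log_trans, log_emit)

# Toy dialogue with T = 2 turns and K = 2 dialogue acts (made-up numbers).
L = math.log
log_trans = [
    [[L(0.5), L(0.5)]],                          # t = 0: start row only
    [[L(0.99), L(0.01)], [L(0.01), L(0.99)]],    # t = 1: depends on previous act
]
log_emit = [[L(0.9), L(0.1)], [L(0.4), L(0.6)]]

joint = math.exp(joint_log_prob([0, 0], log_trans, log_emit))
marginal = math.exp(log_marginal(log_trans, log_emit))
posterior = math.exp(conditional_log_prob([0, 0], log_trans, log_emit))
print(joint, marginal, posterior)
```

Because the transition table is indexed by the turn `t`, the sketch accommodates transitions that depend on the previous utterance, as in the model, rather than a single fixed transition matrix.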
We now turn our attention to the neural architecture used to realise the components of our probabilistic model (Figure 1). The conditional probability of the next dialogue act, $P_{\boldsymbol{\Theta}}(z_t \mid z_{t-1}, y_{t-1})$, is defined as a softmax (Equation 3) over $c_t$, the context vector summarising the information from the previous utterance $y_{t-1}$, with softmax parameters that are gated on the previous dialogue act $z_{t-1}$. Due to gating, the number of parameters of the model may increase significantly; therefore, we have also explored a variant where only the bias term $b_z^{(z_{t-1})}$ is gated. The neural model for generating the tokens of the current utterance is defined in terms of a weight matrix $W_{hy}^{(z_t)}$ that is gated based on $z_t$, the context vector $c_t$ summarising the previous utterance, and $h_{t,n-1}$, the state of an utterance-level RNN summarising all the previously generated tokens:
$$h_{t,n-1} = f(h_{t,n-2}, \boldsymbol{E}_{y_{t,n-1}}),$$
where $\boldsymbol{E}_{y_{t,n-1}}$ provides the embedding of the token $y_{t,n-1}$ from the embedding table $\boldsymbol{E}$, and $f$ can be any non-linear function, e.g., the simple sigmoid applied to the elements of a vector, the more complex Long Short-Term Memory unit (LSTM) (Graves, 2013; Hochreiter and Schmidhuber, 1997), or the Gated Recurrent Unit (GRU) (Chung et al., 2014).
In what follows, we elaborate on how best to summarise the information from the previous utterance in $c_t$, and how to decode the best sequence of dialogue acts given a trained model.

The Gated Attention Mechanism
Given a sequence of words in an utterance $\{y_1, \ldots, y_n\}$, we would like to compress its information in $c$, which is then used in the conditioning contexts of other components of the model. Typically, the last hidden state of the utterance-level RNN is taken to be the summary vector: $c = h_n$. However, it has been shown that attending to all RNN states is more effective.
The traditional attention mechanism employs a probability vector $a$ over the words of the input utterance to summarise it. The attention elements in $a$ are typically calculated by a non-linear function $g$ of the current input $y_n$ and the previous hidden state $h_{n-1}$. Once the attention is defined, the representation of the input is constructed as $c = \sum_n a_n h_n$.
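As a minimal sketch of this attention-weighted summary, the snippet below normalises a list of per-token scores with a softmax and forms $c = \sum_n a_n h_n$. The scores are passed in directly because the scoring function $g$ is not spelled out here; the function names and the toy numbers are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into a probability vector."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_summary(hidden_states, scores):
    """Return (c, a) with a = softmax(scores) and c = sum_n a[n] * hidden_states[n]."""
    a = softmax(scores)
    dim = len(hidden_states[0])
    c = [sum(a[n] * hidden_states[n][d] for n in range(len(hidden_states)))
         for d in range(dim)]
    return c, a

# Three hidden states of size 2; equal scores give uniform attention.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_uniform, a_uniform = attention_summary(hidden_states, [0.0, 0.0, 0.0])

# A much larger score on the last state concentrates the attention there,
# which is the behaviour associated with attention bias.
c_biased, a_biased = attention_summary(hidden_states, [0.0, 0.0, 5.0])
print(c_uniform, a_biased)
```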
The problem with this traditional attention model is that the final hidden state is a function of all the inputs, hence it is usually more "informative" than the earlier hidden states due to semantic accumulation (Wang et al., 2016). Thus, most of the attention signal is assigned to the hidden states toward the end of a sequence. In DA classification, this may not be desirable, since an important token with respect to a dialogue act can appear anywhere in an utterance. We call this the attention bias problem.
We propose a novel gated attention mechanism, which is inspired by the gating mechanism in LSTMs, to fix the attention bias problem. Similar to the forget gate of LSTMs, we use the available information to calculate an attention gate that learns whether to allow the whole input signal to pass through or to forget all or a part of the input signal:
$$a_n = g(h_{n-1}, \boldsymbol{E}_{y_n}), \qquad x_n = a_n \odot \boldsymbol{E}_{y_n}, \quad (8)$$
where $\odot$ represents element-wise multiplication. After filtering the important signal from the input token, the information from the tokens is accumulated in the last hidden state of the RNN, which we take as the summary vector $c = h_n$. Note that since the gated attention is applied to the input before the RNN calculations, it is not affected by the attention bias.
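The gating step in Equation 8 can be sketched as follows. The text does not fix the form of $g$ or of the recurrent function, so this sketch assumes a sigmoid over a linear map of the concatenation $[h_{n-1}; E_{y_n}]$ for the gate and a plain tanh RNN cell for the recurrence. The parameter names (`W_g`, `b_g`, `W_h`, `b_h`), the sizes, and the random initialisation are all hypothetical.

```python
import math
import random

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_encode(embeddings, W_g, b_g, W_h, b_h):
    """Encode one utterance; return the summary vector c and the per-token gates."""
    h = [0.0] * len(b_h)          # initial hidden state
    gates = []
    for e in embeddings:          # e plays the role of E_{y_n}
        # a_n = g(h_{n-1}, E_{y_n}): one gate value per embedding dimension
        a = [sigmoid(s + b) for s, b in zip(matvec(W_g, h + e), b_g)]
        # x_n = a_n (element-wise product) E_{y_n}: filter the input before the RNN
        x = [a_i * e_i for a_i, e_i in zip(a, e)]
        # h_n = f(h_{n-1}, x_n), here an assumed tanh RNN cell
        h = [math.tanh(s + b) for s, b in zip(matvec(W_h, h + x), b_h)]
        gates.append(a)
    return h, gates               # c is the last hidden state

rng = random.Random(0)
EMB, HID, N_TOKENS = 4, 3, 5

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_g, b_g = rand_matrix(EMB, HID + EMB), [0.0] * EMB
W_h, b_h = rand_matrix(HID, HID + EMB), [0.0] * HID
utterance = [[rng.uniform(-1.0, 1.0) for _ in range(EMB)] for _ in range(N_TOKENS)]

c, gates = gated_attention_encode(utterance, W_g, b_g, W_h, b_h)
print(c)
```

In this sketch the gate is a vector of the same size as the embedding, so each input dimension can be passed or suppressed independently before it reaches the recurrent state.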

Inference: Viterbi Decoding
For prediction, we choose the sequence of dialogue acts with the highest posterior probability:
$$\arg\max_{z_{1:T}} P_{\boldsymbol{\Theta}}(z_{1:T} \mid y_{1:T}) = \arg\max_{z_{1:T}} P_{\boldsymbol{\Theta}}(z_{1:T}, y_{1:T}).$$
Since the joint probability is decomposed further according to Equation 1, we can make use of dynamic programming to find the highest-probability sequence of dialogue acts. Specifically, the model endows each latent variable $z_t$ with a unary potential function $P_{\boldsymbol{\Theta}}(y_t \mid z_t, y_{t-1})$ and a binary potential function $P_{\boldsymbol{\Theta}}(z_t \mid z_{t-1}, y_{t-1})$. These are akin to the emission and transition functions of an HMM, and are calculated using Equations 2 and 3 respectively. Furthermore, the model has been carefully designed so that the hidden states in the RNNs encoding the utterances to form the context vector $c_t$ (the representation of the previous utterance) are not affected by the sequence of dialogue acts, which is crucial to making the inference amenable to dynamic programming. The resulting inference algorithm is akin to the Viterbi algorithm for HMMs.
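A minimal sketch of such a Viterbi-style decoder is given below. It assumes the unary and binary potentials have been precomputed in log space as hypothetical tables `log_emit[t][cur]` and `log_trans[t][prev][cur]`, with the single row `log_trans[0][0]` acting as a start symbol. The transition table is indexed by the turn `t` because, in the model, transitions also depend on the previous utterance. The toy numbers are made up.

```python
import math

def viterbi(log_trans, log_emit):
    """Return the highest-scoring dialogue act sequence and its joint log-probability."""
    T, K = len(log_emit), len(log_emit[0])
    # Turn 0: only the start row of the transition table is used.
    score = [log_trans[0][0][k] + log_emit[0][k] for k in range(K)]
    backpointers = []
    for t in range(1, T):
        new_score, pointers = [], []
        for k in range(K):
            # Best previous act for ending turn t in act k.
            best_prev = max(range(K), key=lambda j: score[j] + log_trans[t][j][k])
            new_score.append(score[best_prev] + log_trans[t][best_prev][k] + log_emit[t][k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final act.
    last = max(range(K), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

L = math.log

# Example 1: uniform transitions, so the emissions alone decide the path.
uniform = [[L(0.5), L(0.5)], [L(0.5), L(0.5)]]
trans_1 = [[[L(0.5), L(0.5)]], uniform, uniform]
emit_1 = [[L(0.9), L(0.1)], [L(0.2), L(0.8)], [L(0.3), L(0.7)]]
path_1, score_1 = viterbi(trans_1, emit_1)

# Example 2: "sticky" transitions override a weak emission preference at turn 1.
trans_2 = [[[L(0.5), L(0.5)]], [[L(0.99), L(0.01)], [L(0.01), L(0.99)]]]
emit_2 = [[L(0.9), L(0.1)], [L(0.4), L(0.6)]]
path_2, score_2 = viterbi(trans_2, emit_2)
print(path_1, path_2)
```

The second example shows why the label-to-label connection matters for decoding: the second turn is labelled with act 0 even though its emission score alone favours act 1.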

Experiments
Datasets. We conduct our experiments on the MapTask and Switchboard corpora. The MapTask Dialog Act corpus (Anderson et al., 1991) consists of 128 conversations and more than 27000 utterances in an instruction-giving scenario. There are 13 DA types in this corpus. For the experiments, the available data is split into three parts, train/test/validation with 103, 13 and 12 conversations respectively.
The Switchboard Dialog Act corpus (Jurafsky et al., 1997) consists of 1155 transcribed telephone conversations with around 205000 utterances. In contrast with the MapTask conversations, which are task-oriented, the Switchboard corpus consists mostly of general topic conversations.
Baselines. On MapTask, to the best of our knowledge, there is no standard data split. We therefore compare against our own implementation of strong baselines, namely HMM-trigram (Stolcke et al., 2000) and an instance-based random forest classifier (1/2/3-gram features). Ji et al.'s (2016) results for this corpus are obtained by running their publicly available code with the same hyperparameters as those used by our models. We also report the results of Julia et al. (2010) and Surendran and Levow (2006). However, the experimental setup of these two works differs from ours, hence their results are not directly comparable to ours.
On Switchboard, we compare our results with strong baselines using the experimental setup from Kalchbrenner and Blunsom (2013) and Stolcke et al. (2000).
Our Model Configurations. We experiment with several variants of our model to explore the effectiveness of our two improvements: the HMM-like connection and the gated attention mechanism. For the HMM connection, we consider three choices: gating all parameters (Equation 3), gating only the bias, and no connection. For the attention, we consider three choices: our new gated attention mechanism, the traditional attention, and no attention. Thus, in total, we explore nine model variants.
All the model variants are implemented with the CNN package and trained with Adagrad (Duchi et al., 2011) using dropout (Srivastava et al., 2014). They share the same word-embedding size (128) and hidden vector size (64).

Results and Analysis. Table 1 shows the classification accuracy of the nine variants of our model on the MapTask corpus. The classification accuracy of the two best variants of our model and the baselines appears in Tables 2 and 3 for MapTask and Switchboard respectively. The bold numbers in each table show the best accuracy achieved by the systems. As seen in these tables, our best models outperform strong baselines for both corpora.
Table 1 shows that adding the attention mechanism is beneficial, as the traditional attention models always outperform their non-attention counterparts. The gated attention configurations, in turn, outperform those with the traditional attention mechanism by 0.49%-1.21%. Interestingly, the accuracy of Shen and Lee's (2016) classifier, which employs an attention mechanism, is lower than that obtained by Kalchbrenner and Blunsom (2013), whose model does not use attention. We believe that this difference in performance arises not because the attention mechanism is ineffective, but because Shen and Lee (2016) classify each utterance independently. In contrast, Kalchbrenner and Blunsom (2013) take the sequential nature of dialogue acts into account, and run an RNN across the conversation, which conditions the generation of a dialogue act on the dialogue acts and utterances in all the previous dialogue turns.
As seen in Table 1, the performance gain from the HMM connection is larger than the gain from the attention mechanism. Without the attention mechanism, the HMM connection brings an increase of 3.63% with the gated bias HMM configuration and 2.58% with the fully gated HMM configuration. With the use of traditional attention, the improvement is 3.01% for the bias HMM configuration and 3.47% for the gated HMM configuration. Finally, with the gated attention in place, the two HMM configurations improve the accuracy by 3.73%.
We used McNemar's test to assess the statistical significance of the differences between the predictions of different models, and found that our model with both innovations (HMM connections and gated attention) is significantly better than the variant without these innovations (α < 0.01).
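For reference, McNemar's test on paired predictions can be sketched as below. The variant shown uses the chi-square approximation with a continuity correction; which variant was used in our experiments is not specified here, so this choice, the function name, and the toy data are assumptions of the sketch. The p-value relies on the identity that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math

def mcnemar_test(gold, pred_a, pred_b):
    """McNemar's test (chi-square form with continuity correction) for two classifiers.

    Only the discordant items matter:
      only_a = items that A labels correctly and B labels incorrectly
      only_b = items that B labels correctly and A labels incorrectly
    Returns (statistic, p_value).
    """
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    if only_a + only_b == 0:
        return 0.0, 1.0  # the two systems never disagree in correctness
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / (only_a + only_b)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Toy data: 20 utterances whose gold dialogue act is always label 1.
gold = [1] * 20
pred_a = [1] * 18 + [0] * 2            # A is wrong on the last two items
pred_b = [1] * 6 + [0] * 12 + [1] * 2  # B is wrong on twelve items in the middle
stat, p_value = mcnemar_test(gold, pred_a, pred_b)
print(stat, p_value)
```

With very few discordant items, an exact binomial version of the test is usually preferred to the chi-square approximation used in this sketch.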

Conclusions
In this work, we have proposed a new gated attention mechanism and a novel HMM-like connection in a generative model of utterances and dialogue acts. Our experiments show that these two innovations significantly improve the accuracy of DA classification on the MapTask and Switchboard corpora. In the future, we plan to apply these two innovations to other sequence-to-sequence learning tasks. Furthermore, DA classification itself can be seen as a preprocessing step in a dialogue system's pipeline. Thus, we also plan to investigate the effect of improvements in DA classification on the downstream components of a dialogue system.