A Hierarchical Neural Model for Learning Sequences of Dialogue Acts

We propose a novel hierarchical Recurrent Neural Network (RNN) for learning sequences of Dialogue Acts (DAs). The input in this task is a sequence of utterances (i.e., conversational contributions), each comprising a sequence of tokens, and the output is a sequence of DA labels (one label per utterance). Our model leverages the hierarchical nature of dialogue data by using two nested RNNs that capture long-range dependencies at the dialogue level and the utterance level. This model is combined with an attention mechanism that focuses on salient tokens in utterances. Our experimental results show that our model outperforms strong baselines on two popular datasets, Switchboard and MapTask, and our detailed empirical analysis highlights the impact of each aspect of our model.


Introduction
The sequence-labeling task involves learning a model that maps an input sequence to an output sequence. Many NLP problems can be treated as sequence-labeling tasks, e.g., part-of-speech (PoS) tagging (Toutanova et al., 2003; Toutanova and Manning, 2000), machine translation (Brown et al., 1993) and automatic speech recognition (Gales and Young, 2008). Recurrent Neural Nets (RNNs) have been the workhorse model for many NLP sequence-labeling tasks, e.g., machine translation and speech recognition (Amodei et al., 2015), due to their ability to capture the long-range dependencies inherent in natural language.
In this paper, we propose a hierarchical RNN for labeling a sequence of utterances (i.e., contributions) in a dialogue with their Dialogue Acts (DAs). This task is particularly useful for dialogue systems, as knowing the DA of an utterance supports its interpretation, and the generation of an appropriate response. The DA classification problem differs from the aforementioned tasks in the structure of the input and the immediacy of the output. The input in these tasks is a sequence of tokens, e.g., a sequence of words in PoS tagging; while in DA classification, the input is hierarchical, i.e., a conversation comprises a sequence of utterances, each of which has a sequence of tokens (Figure 1). In addition, to be useful for dialogue systems, the DA of an utterance must be determined immediately, hence a bi-directional approach is not feasible.
As mentioned above, RNNs are able to capture long-range dependencies. This ability was harnessed by Shen and Lee (2016) for DA classification. However, they ignored the conversational dimension of the data, treating the utterances in a dialogue as separate instances, an assumption that results in loss of information. To overcome this limitation, we designed a two-layer RNN model that leverages the hierarchical nature of dialogue data: an outer-layer RNN encodes the conversational dimension, and an inner-layer RNN encodes the utterance dimension.
One of the difficulties of sequence labeling is that different elements of an input sequence have different degrees of importance for the task at hand (Shen and Lee, 2016), and the noise introduced by less important elements might degrade the performance of a labeling model. To address this problem, we incorporate into our model the attention mechanism described by Shen and Lee (2016), which has yielded performance improvements in DA classification compared to traditional RNNs.
Our empirical results show that our hierarchical RNN model with an attentional mechanism outperforms strong baselines on two popular datasets: Switchboard (Jurafsky et al., 1997; Stolcke et al., 2000) and MapTask (Anderson et al., 1991). In addition, we provide an empirical analysis of the impact of the main aspects of our model on performance: the utterance RNN, the conversation RNN, and the information sources for the attention mechanism.
This paper is organised as follows. In the next section, we discuss related research in DA classification. In Section 3, we describe our RNN. Our experiments and results are presented in Section 4, followed by our analysis and concluding remarks.

Related Research
Independent DA classification. In this approach, each utterance is treated as a separate instance, which allows the application of general classification algorithms. Julia et al. (2010) employed a Support Vector Machine (SVM) with n-gram features obtained from an utterance-level Hidden Markov Model (HMM) to ascribe DAs to audio signals and textual transcriptions of the MapTask corpus. Webb et al. (2005) used a similar approach, employing cue phrases as features.
Sequence-based DA classification. This approach takes advantage of the sequential nature of conversations. In one of the earliest works in DA classification, Stolcke et al. (2000) used an HMM with a trigram language model to classify DAs in the Switchboard corpus, achieving an accuracy of 71.0%; the trigram language model was employed to calculate the symbol emission probabilities of the HMM. Surendran and Levow (2006) also used an HMM, but employed output symbol probabilities produced by an SVM classifier, instead of emission probabilities obtained from a trigram language model. More recently, the Recurrent Convolutional Neural Network model proposed by Kalchbrenner and Blunsom (2013) achieved an accuracy of 73.9% on the Switchboard corpus. In this model, a Convolutional Neural Network encodes each utterance into a vector, which is then treated as input to a conversation-level RNN; the DA is then classified using a softmax layer applied on top of the hidden states of the RNN.
Attention in Neural Models. Attentional Neural Models have been successfully applied to sequence-to-sequence mapping tasks, notably machine translation and DA classification. Bahdanau et al. (2014) proposed an attentional encoder-decoder architecture for machine translation. The encoder encodes the input sequence into a sequence of hidden vectors; the decoder decodes the information stored in the hidden sequence to generate the output; and the attentional mechanism is used to summarize a sentence into a context vector dynamically, helping the decoder decide which part of the sequence to attend to when generating a target word. As mentioned above, Shen and Lee (2016) employed an attentional RNN for independent DA classification; they achieved an accuracy of 72.6% on textual transcriptions of the Switchboard corpus.

Model Description
Our model conditions on the full dialogue history, rather than the finite history used in Markov models such as maximum entropy Markov models (McCallum et al., 2000): the probability of a label sequence y given a sequence of observed utterances o decomposes as P(y|o) = Π_t P(y_t | y_{1..t-1}, o_{1..t}). We employ neural networks to model the constituent conditional distributions. Our model comprises three main elements (Figure 2): (1) an utterance-level RNN that encodes the information within the utterances; (2) an attentional mechanism that highlights the important parts of an input utterance, and summarizes the information within the utterance into a real-valued vector; and (3) a conversation-level RNN that encodes the information of the whole dialogue sequence. As discussed in Section 1, our hierarchical-RNN design was motivated by the structure of the input data, while the attentional mechanism has proven effective in DA classification (Shen and Lee, 2016).
Utterance-level RNN. This RNN was implemented using an LSTM (Hochreiter and Schmidhuber, 1997; Graves, 2013). First, an embedding matrix maps each token (e.g., word or punctuation marker) into a dense vector representation. Let us denote the sequence of tokens in the t-th utterance by w_1^t, ..., w_{N_t}^t. The utterance-level RNN encodes this sequence into a sequence of hidden vectors h_1^t, ..., h_{N_t}^t, where h_i^t = RNN_utter(e(w_i^t), h_{i-1}^t) and e(w_i^t) denotes the embedding of token w_i^t. The parameters of the utterance RNN and the token embeddings are learned during training.
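As an illustration, the utterance-encoding step can be sketched as follows. This is a minimal sketch, not the implementation: the model above uses an LSTM with learned parameters, whereas here a vanilla tanh RNN with random weights stands in for it, and the vocabulary and dimensions are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and vocabulary; the actual model uses an LSTM
# with much larger, learned embedding and hidden dimensions.
vocab = {"yes": 0, "i": 1, "do": 2, ",": 3}
emb_dim, hid_dim = 4, 5

E = rng.normal(size=(len(vocab), emb_dim))    # token embedding matrix
W_x = rng.normal(size=(hid_dim, emb_dim))     # input-to-hidden weights
W_h = rng.normal(size=(hid_dim, hid_dim))     # hidden-to-hidden weights
b = np.zeros(hid_dim)

def encode_utterance(tokens):
    """Run the utterance-level RNN over one utterance and return
    the hidden vector produced at every token position."""
    h = np.zeros(hid_dim)
    hidden = []
    for tok in tokens:
        x = E[vocab[tok]]                     # embedding lookup
        h = np.tanh(W_x @ x + W_h @ h + b)    # recurrent update
        hidden.append(h)
    return np.stack(hidden)                   # shape: (num_tokens, hid_dim)

H = encode_utterance(["yes", "i", "do"])
print(H.shape)  # (3, 5)
```

The full sequence of hidden vectors is kept (rather than only the final one) because the attention mechanism below needs a vector per token.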
Attentional mechanism. This mechanism summarizes the hidden vectors of the utterance-level RNN into a single vector representing the whole utterance. The attention vector is a sequence of positive numbers that sum to 1, where each number corresponds to a token in an utterance, and represents the importance of the token for understanding the DA associated with the utterance. The final representation z_t of the t-th utterance is the sum of its hidden vectors weighted by the attention weights:

z_t = Σ_i a_i^t h_i^t    (Equation 4)

We posit that the main factors for determining the importance of a token for DA classification are: (1) the meaning of the token, as represented by its embedding vector; and (2) the full context of the conversation, particularly the previous DA. For example, if the DA of an utterance is Yes-No-Question, and there is a "yes" or "no" token in the next utterance, this token is likely to be important. Equation 5 integrates these factors to compute attention scores:

s_i^t = U^T tanh(W^(in) e(w_i^t) + W^(co) [e_a(y_{t-1}); g_{t-1}] + b^(in))    (Equation 5)

where vector e_a(y_{t-1}) denotes the embedding of the previous DA, which is analogous to a token embedding; and vector g_{t-1} is the previous hidden vector of the conversation-level RNN, detailed below, which summarizes the conversation so far. W^(in) and W^(co) are parameter matrices for the input tokens and the conversational context respectively, and U and b^(in) are parameter vectors, all of which are learned during training. The scores s_i^t are mapped into a probability vector by means of a softmax function:

a_i^t = exp(s_i^t) / Σ_j exp(s_j^t)    (Equation 6)

Conversation-level RNN. This RNN is structurally similar to the utterance-level RNN.
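The attention computation (scoring, softmax normalization, weighted sum) can be sketched as follows. The parameter names mirror W^(in), W^(co), U and b^(in) from Equation 5, but the weights are random stand-ins and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions: token embeddings, utterance hidden vectors,
# conversational context, and the attention layer.
emb_dim, hid_dim, ctx_dim, att_dim = 4, 5, 6, 3

W_in = rng.normal(size=(att_dim, emb_dim))   # stands in for W^(in)
W_co = rng.normal(size=(att_dim, ctx_dim))   # stands in for W^(co)
U = rng.normal(size=att_dim)                 # stands in for U
b_in = np.zeros(att_dim)                     # stands in for b^(in)

def attend(X, H, c):
    """X: token embeddings (n, emb_dim); H: utterance-RNN hidden vectors
    (n, hid_dim); c: context vector (previous-DA embedding and g_{t-1})."""
    # Score each token from its embedding plus the conversational context.
    s = np.array([U @ np.tanh(W_in @ x + W_co @ c + b_in) for x in X])
    a = np.exp(s - s.max())
    a /= a.sum()                       # softmax: weights are positive, sum to 1
    z = (a[:, None] * H).sum(axis=0)   # attention-weighted utterance summary
    return a, z

n = 3  # number of tokens in the utterance
a, z = attend(rng.normal(size=(n, emb_dim)),
              rng.normal(size=(n, hid_dim)),
              rng.normal(size=ctx_dim))
print(a.sum(), z.shape)
```

Note that the attention weights score the tokens, while the summary z is formed from the RNN hidden vectors; one of the ablations in Section 5 removes the RNN and applies the weights directly to the embeddings.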
The input to the conversation-level RNN is the sequence of vectors z generated for the utterances in a conversation, which the RNN encodes into a sequence of hidden vectors g:

g_t = RNN_convers(z_t, g_{t-1})    (Equation 7)

This information is then used to generate the output DA:

P(y_t | y_{1..t-1}, o_{1..t}) = softmax(W^(out) g_t + b^(out))    (Equation 8)

where the matrix W^(out), the vector b^(out) and the parameters of the conversation-level network RNN_convers are learned during training. During testing, ideally a given sequence of observed utterances o should be decoded into the label sequence y that maximizes the conditional probability P(y|o) according to the model. However, finding the highest-scoring label sequence is computationally hard, since the conversation-level RNN does not lend itself to dynamic programming. Therefore, we employ a greedy decoding approach: going left to right, at each step we choose the y_t with the highest probability in the local DA distribution. This method is common practice in sequence-labeling RNNs, e.g., in neural machine translation (Bahdanau et al., 2014; Luong et al., 2015).
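The greedy decoding procedure can be sketched as follows. The conversation-level step uses random stand-ins for the learned parameters of Equations 7 and 8, a vanilla tanh update in place of the LSTM, and a hypothetical four-tag DA set; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

DA_TAGS = ["Statement", "Yes-No-Question", "Reply-Y", "Reply-N"]
z_dim, g_dim = 5, 6  # toy utterance-summary and conversation-state sizes

# Random stand-ins for the learned parameters of Equations 7 and 8.
W_g = rng.normal(size=(g_dim, z_dim))
U_g = rng.normal(size=(g_dim, g_dim))
W_out = rng.normal(size=(len(DA_TAGS), g_dim))
b_out = np.zeros(len(DA_TAGS))

def step(z_t, g_prev):
    """One conversation-level step: update g_t, then softmax over DA tags."""
    g_t = np.tanh(W_g @ z_t + U_g @ g_prev)
    logits = W_out @ g_t + b_out
    p = np.exp(logits - logits.max())
    return p / p.sum(), g_t

def greedy_decode(Z):
    """Left-to-right decoding: commit to the locally most probable DA
    at each utterance, carrying the conversation state forward."""
    g, labels = np.zeros(g_dim), []
    for z_t in Z:
        p, g = step(z_t, g)
        labels.append(DA_TAGS[int(p.argmax())])
    return labels

Z = rng.normal(size=(4, z_dim))   # four utterance summary vectors
labels = greedy_decode(Z)
print(labels)
```

Because each step commits to one label before moving on, decoding is linear in the number of utterances, at the cost of forgoing an exact search over label sequences.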

Data sets
We tested our models on the Switchboard corpus (Jurafsky et al., 1997; Stolcke et al., 2000) and the MapTask corpus (Anderson et al., 1991), two popular datasets used for DA classification. At this stage of our research, we consider only transcriptions of the conversations in both corpora; the incorporation of phonetic input (Taylor et al., 1998; Wright Hastie et al., 2002; Julia et al., 2010) is the subject of future work. Thus, we compare our results only with those obtained by systems that employ transcriptions exclusively.
Switchboard corpus. This corpus contains DA-annotated transcriptions of 1155 telephone conversations with no specific topic, with an average of 176 utterances per conversation. Originally, there were approximately 226 DA tags in the corpus, but in the DA-classification literature, the tags are usually clustered into 42 tags. Table 1(a) shows the percentages of the seven most frequent tags in the data. Following Stolcke et al. (2000), in our experiments we use 1115 conversations for training, 21 for development and 19 for testing.
MapTask corpus. This is a richly annotated corpus comprising 128 instruction-following dialogues, with an average of 212 utterances per conversation. Each conversation has an instruction giver and an instruction follower: the instruction giver gives directions with reference to a map, which the instruction follower must follow. The MapTask corpus has 13 DA tags, including the "unclassifiable" tag. Table 1(b) shows the percentages of the seven most frequent tags in the data. We randomly split this data into 80% training, 10% development and 10% test sets, which contain 103, 12 and 13 conversations respectively.

Results
We experimented with different embedding sizes and hidden-layer dimensions for our model HA-RNN, and selected the following, which yielded the best performance with reasonable run times: the word-embedding size was set to 250, and the DA-embedding size to 180; the hidden dimension of the utterance-level RNN was set to 160, and the hidden dimension of the conversation-level RNN to 250. Our model was implemented with the CNN package (github.com/clab/cnn). During training, the negative log-likelihood was optimized using Adagrad (Duchi et al., 2011), with dropout rate 0.5 to prevent over-fitting (Srivastava et al., 2014). Training terminated when the log-likelihood of the development set did not improve. As mentioned in Section 3, during testing, the sequence of output labels was generated with greedy decoding. Statistical significance was computed on the MapTask test data using McNemar's test with α = 0.05 (we could not compute statistical significance for the Switchboard results, because they were obtained from the literature, and we did not have access to per-conversation labels).
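For reference, the per-parameter update rule of Adagrad can be sketched as follows; the learning rate and gradients are illustrative, and in our experiments the optimizer is provided by the toolkit rather than hand-written.

```python
import numpy as np

def adagrad_update(theta, grad, cache, lr=0.1, eps=1e-8):
    """One Adagrad step: accumulate squared gradients per parameter and
    scale each update by the inverse root of the accumulated history."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

theta, cache = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(2):
    # Repeated gradients of the same magnitude: the effective step shrinks.
    theta, cache = adagrad_update(theta, np.array([1.0, 0.1]), cache)
print(theta)
```

A side effect visible even in this toy run is that Adagrad normalizes per-coordinate gradient scale: both coordinates, despite gradients differing by a factor of ten, end up moving by the same amount.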
Switchboard. We compare our model's performance with that of the following strong baselines: (RCNN) the recurrent convolutional neural network model of Kalchbrenner and Blunsom (2013); (RNN-Attentional-C) the attention-based RNN classifier of Shen and Lee (2016); and (HMM-trigram-C) the HMM-based classifier of Stolcke et al. (2000). The results in Table 2 show that our model outperforms these baselines.[3] The higher accuracy of our model compared to classifier-based approaches (i.e., RNN-Attentional-C and HMM-trigram-C) confirms that taking into account dependencies among the DAs through the conversation-level RNN improves accuracy. Furthermore, the better performance of our model compared to RCNN shows that summarizing utterances with an RNN augmented with an attention architecture is more effective than using a convolution architecture for DA sequence labeling.

MapTask. Due to the unavailability of standard training/development/test sets for this dataset, we compare the results obtained by our model with those obtained by our implementation of the following independent DA classifiers: HMM-trigram-C (Stolcke et al., 2000); Random Forest, an instance-based random forest classifier; and Random Forest + prev DA, a random forest classifier that also uses the previous DA tag.

The results in Table 3 show that our model outperforms these baselines (statistically significantly). These results reinforce the insights from the Switchboard corpus, whereby taking into account conversational dependencies between DAs substantially improves DA-labeling performance.[4]

[3] Two other works on Switchboard DA classification (Gambäck et al., 2011; Webb and Ferguson, 2010) used experimental setups that differ from ours, respectively obtaining accuracies of 77.85% and 80.72%. However, these results are not directly comparable to Stolcke et al.'s (2000) or ours, and are therefore excluded from our comparison.

[4] Two studies on MapTask DA classification were performed under experimental setups that differ from ours: Julia et al. (2010) employed HMM+SVM on text transcriptions and audio signals, obtaining an accuracy of 55.4% for transcriptions only; Surendran and Levow (2006) used Viterbi+SVM, posting a classification accuracy of 59.1% for transcriptions, the best result among systems that employ transcription data exclusively. Unfortunately, Julia et al.'s description of their MapTask subset is not sufficient to replicate their experiment, and Surendran and Levow's data split is not accessible. Notwithstanding the difference in conditions, our model's accuracy is superior to theirs.

Architectural analysis
We investigate the influence of the main components of our model on performance by creating variants of our model through the addition or removal of connections or layers. We then compare the performance of these variants with that of the original model in terms of DA-classification accuracy and negative log-likelihood on the test, development and training partitions of our datasets. As done in Section 4, statistical significance is calculated for the test partitions of both datasets using McNemar's test with α = 0.05.
Does an RNN at the utterance level help? To answer this question, we create a variant, denoted woUttRNN, where the attentional coefficients are applied directly to the token embeddings. Thus, Equation 4 is changed to Equation 9:

z_t = Σ_i a_i^t e(w_i^t)    (Equation 9)

As seen in Tables 4 and 5, removing the utterance-level RNN (woUttRNN) reduces the accuracy and increases the negative log-likelihood for the training, development and test partitions of both datasets. These changes are statistically significant for the test sets.
Which sources of information are critical for computing the attentional component? In our main model, HA-RNN, we calculate the attentional signal using information from the previous DA, the previous hidden vector of the conversation-level RNN, and the embeddings of the tokens. To determine the contribution of the first two sources to the performance of the model, we create two variants of HA-RNN: woDA2Attn, which employs only the previous conversation-level RNN hidden vector; and woHid2Attn, which employs only the previous DA. Thus, in woDA2Attn, Equation 5 becomes Equation 10, and in woHid2Attn, Equation 5 becomes Equation 11:

s_i^t = U^T tanh(W^(in) e(w_i^t) + W^(co) g_{t-1} + b^(in))    (Equation 10)

s_i^t = U^T tanh(W^(in) e(w_i^t) + W^(co) e_a(y_{t-1}) + b^(in))    (Equation 11)
As seen in Tables 4 and 5, both of these resources provide valuable information, but the changes in performance due to the omission of these resources are smaller than those obtained with woUttRNN. Removing the DA connection (woDA2Attn) or the previous conversation-level RNN hidden vector (woHid2Attn) leads to statistically significant drops in accuracy and increases in negative log-likelihood on the test partitions of both datasets. The changes in performance with respect to the development and training sets vary across the datasets. As seen in Table 4, both models exhibit accuracy drops (and small increases in negative log-likelihood) on the Switchboard development set, but small accuracy increases (and negative log-likelihood drops) on the Switchboard training set -an indication of over-fitting. In contrast, as seen in Table 5, both models yield a negligible or no drop in accuracy on the MapTask development set, while both yield a drop in accuracy on the training set.
How important is the RNN at the conversation level? To answer this question, we create a variant of our HA-RNN model, denoted woConvRNN, where the recurrent connections between the units in the conversation-level RNN are removed: the LSTM basis function is calculated with a fixed vector g_0 instead of the previous time step's vector. Thus, Equation 7 becomes Equation 12:

g_t = RNN_convers(z_t, g_0)    (Equation 12)

As seen in Tables 4 and 5, HA-RNN outperforms woConvRNN on the training/development/test partitions of both datasets. The difference between the performance of HA-RNN and woConvRNN is statistically significant for the test sets.
How effective are the DA connections? We have seen that the DA connections improve our model's performance when they are used to calculate the attentional signal. However, intuitively, the previous DA can also directly provide information about the current DA; for example, it is often the case that a Yes-No-Question is followed by Reply y or Reply n. To reflect this observation, we create another model, denoted wDA2DA, which has an additional direct connection between the previous DA and the current DA. That is, Equation 8 becomes Equation 13:

P(y_t | y_{1..t-1}, o_{1..t}) = softmax(W^(out) g_t + W^(DA) e_a(y_{t-1}) + b^(out))    (Equation 13)

where W^(DA) is an additional parameter matrix applied to the previous-DA embedding. As seen in Tables 4 and 5, wDA2DA performs much worse than HA-RNN. We posit that this happens due to the exposure-bias problem (Ranzato et al., 2015): during training, the model has access to the correct DA of the previous utterance, but during testing, the decoding process has access only to predicted DAs, which may lead to the propagation of errors. To quantify the effect of this problem on our model, we designed another experiment where the variants of our model can access the correct DA even during testing; the results for the test partitions of both datasets appear in Table 6.
The results in Table 6 show that exposure bias has different effects on the different variants of our model. As expected, woDA2Attn, which does not consider the previous DA, exhibits no change in performance between the oracle and greedy conditions. The models that employ a DA connection to compute the attention signal (HA-RNN, woUttRNN, woHid2Attn, woConvRNN) show a slight improvement in accuracy when using the correct DA as input, instead of the predicted DA. In contrast, wDA2DA shows large improvements when using the correct DA (3.5% on Switchboard and 6.8% on MapTask), becoming the best-performing model on both datasets. This improvement may be attributed to the direct connection between the DAs in this model, which increases the influence of previous DAs on the prediction of the current DA: previous DA predictions that are largely correct will substantially improve the performance of wDA2DA, while noisy DA predictions will have the opposite effect.
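The greedy/oracle contrast can be illustrated with a toy stand-in classifier. This is a hypothetical hand-written rule, not our model: it merely shows how, when the previous label feeds the next prediction, one early error corrupts all subsequent context in the greedy condition, whereas the oracle condition is insulated from it.

```python
def predict(utterance, prev_da):
    """Hypothetical stand-in classifier whose output depends on the
    previous DA, mimicking a direct DA-to-DA connection."""
    if prev_da == "Yes-No-Question" and "yes" in utterance:
        return "Reply-Y"
    if utterance.endswith("?"):
        return "Yes-No-Question"
    return "Statement"

utts = ["do you see the lake", "yes I do"]   # the question lacks a "?" -> early error
gold = ["Yes-No-Question", "Reply-Y"]

# Greedy condition: feed the model its own previous prediction.
greedy, prev = [], "<start>"
for u in utts:
    prev = predict(u, prev)
    greedy.append(prev)

# Oracle condition: feed the gold previous DA instead.
oracle = [predict(u, p) for u, p in zip(utts, ["<start>"] + gold[:-1])]

print(greedy)   # the first error propagates: ['Statement', 'Statement']
print(oracle)   # the gold context recovers the reply: ['Statement', 'Reply-Y']
```

The gap between the two output sequences is the toy analogue of the oracle-minus-greedy accuracy gap reported in Table 6.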

Attentional Analysis
We analyze how our model HA-RNN distributes attention over the tokens in an utterance in order to identify the tokens in focus. Figure 3 shows how the attentional vector highlights the most important tokens in sample utterances in the context of the DA-classification task. For example, in "yes I do", the most important token for identifying the Reply y class is "yes", which receives most of the probability mass from the attention mechanism. Table 7 shows the most attended tokens for four classes of DA in MapTask. We compiled these lists by computing the average attention that a token received over all the utterances in a DA class (excluding tokens that appear fewer than 5 times). As shown in Table 7, both of the important tokens in Figure 3, "move" and "yes", appear in their respective DA columns. Two of the most common labels, Acknowledge and Reply y, have very similar attended tokens; in fact, many utterances in Acknowledge and Reply y have the same text form, so the distinction between the two classes is highly dependent on the conversational context. Also, note that although Reply n is not one of the most common DAs in MapTask, our model can still learn the most important tokens for this DA.
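The per-class averaging used to compile such lists can be sketched as follows; the example records and attention weights are toy values for illustration, and the minimum-count threshold is lowered so the toy data clears it.

```python
from collections import defaultdict

def top_attended(examples, min_count=5, k=3):
    """Average the attention weight each token receives over all utterances
    of a DA class, drop tokens seen fewer than `min_count` times, and
    return the k most-attended tokens per class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for da, tokens, weights in examples:
        for tok, w in zip(tokens, weights):
            sums[da][tok] += w
            counts[da][tok] += 1
    ranked = {}
    for da in sums:
        avg = {t: sums[da][t] / counts[da][t]
               for t in sums[da] if counts[da][t] >= min_count}
        ranked[da] = sorted(avg, key=avg.get, reverse=True)[:k]
    return ranked

# Toy attention outputs: (DA label, tokens, attention weight per token).
examples = [
    ("Reply-Y", ["yes", "i", "do"], [0.8, 0.1, 0.1]),
    ("Reply-Y", ["yes", "okay"], [0.7, 0.3]),
]
ranked = top_attended(examples, min_count=1)
print(ranked)
```

Averaging (rather than summing) keeps frequent but weakly attended tokens from dominating the lists, while the count threshold filters out rare tokens whose averages would be unreliable.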

Conclusions
In this paper, we proposed a novel hierarchical RNN for learning sequences of DAs. Our model leverages the hierarchical nature of dialogue data by using two nested RNNs that capture long-range dependencies at the conversation level and the utterance level. We further combine the model with an attention mechanism to focus on salient tokens in utterances. Our experimental results show that our model outperforms strong baselines on two popular datasets: Switchboard and MapTask. In the future, we plan to address the exposure bias problem, and incorporate acoustic features and speaker information into our model.