A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification

Recognising dialogue acts (DA) is important for many natural language processing tasks such as dialogue generation and intention recognition. In this paper, we propose a dual-attention hierarchical recurrent neural network for DA classification. Our model is partially inspired by the observation that conversational utterances are normally associated with both a DA and a topic, where the former captures the social act and the latter describes the subject matter. However, such a dependency between DAs and topics has not been utilised by most existing systems for DA classification. With a novel dual task-specific attention mechanism, our model is able, for utterances, to capture information about both DAs and topics, as well as information about the interactions between them. Experimental results show that by modelling topic as an auxiliary task, our model can significantly improve DA classification, yielding better or comparable performance to the state-of-the-art method on two public datasets.


Introduction
Dialogue Acts (DA) are semantic labels of utterances, which are crucial to understanding communication: much of a speaker's intent is expressed, explicitly or implicitly, via social actions (e.g., questions or requests) associated with utterances (Searle 1969). Recognising DA labels is important for many natural language processing tasks. For instance, in dialogue systems, knowing the DA label of an utterance supports its interpretation as well as the generation of an appropriate response. In the security domain, being able to detect intention in conversational texts can effectively support the recognition of sensitive information exchanged in email conversations within a company, which can be extremely valuable for IT managers or the security department (Verma, Shashidhar, and Hossain 2012).
A wide range of techniques have been investigated for DA classification. Early work on DA classification was mostly based on general machine learning techniques such as Hidden Markov models (HMM) (Stolcke et al. 2000), dynamic Bayesian networks (Dielmann and Renals 2008), and Support Vector Machines (SVM) (Liu 2006). Recent studies of DA classification have seen an increasing uptake of deep learning techniques, with promising results. Kalchbrenner and Blunsom (2013) model a DA sequence with a recurrent neural network (RNN), where sentence representations are constructed by means of a convolutional neural network (CNN). Kumar et al. (2017) propose a hierarchical, bidirectional long short-term memory (Bi-LSTM) model with a conditional random field (CRF) for DA classification, achieving an overall accuracy of 79.2% on the SWDA dataset. There is also work exploring different deep learning architectures (e.g., hierarchical CNN or RNN/LSTM) to incorporate context information for DA classification, showing that incorporating context information improves DA classification (Liu et al. 2017).
Most of the deep learning approaches to DA classification utilise the dependencies from data, e.g., the dependency between adjacent utterances (Ji, Haffari, and Eisenstein 2016) as well as the implicit and intrinsic dependencies among DAs (Kumar et al. 2017). It has been observed that conversational utterances are normally associated with both a dialogue act and a topic, where the former captures the social act (e.g., promising) and the latter describes the subject matter (Wallace et al. 2013). In addition, the set of DAs associated with a conversation is likely to be affected by the topic of the conversation. For instance, DAs such as request and suggestion might appear more frequently in conversations relating to topics about work. However, such a reasonable source of information, surprisingly, has not been explored in the deep learning literature for DA classification. We hypothesize that modelling the topics of utterances as an auxiliary task may effectively support dialogue act classification.
In this paper, we propose a dual-attention hierarchical recurrent neural network for dialogue act classification. Our model is distinguished from existing methods in a few aspects. First, compared to the flat structure employed by existing models (Khanpour, Guntakandla, and Nielsen 2016; Ji, Haffari, and Eisenstein 2016; Tran, Zukerman, and Haffari 2017b), our hierarchical recurrent neural network can represent the input at the word, utterance, and conversation levels, preserving the natural hierarchical structure of a conversation. Second, our model is able to incorporate rich context information for DA classification with a novel task-specific dual-attention mechanism. Employing attention into our model sheds light on the observation that different dialogue acts are semantically related to different words in an utterance (Tran, Zukerman, and Haffari 2017a). Third, apart from incorporating the commonly used dependencies between utterances, our dual-attention mechanism can further capture, for utterances, information about both dialogue acts and topics. This is a useful source of context information which has not previously been explored in existing deep learning models for DA classification.
We evaluate our model against several strong baselines (Kalchbrenner and Blunsom 2013; Lee and Dernoncourt 2016; Khanpour, Guntakandla, and Nielsen 2016; Ji, Haffari, and Eisenstein 2016; Kumar et al. 2017) on the task of dialogue act classification. Extensive experimentation conducted on two publicly available datasets (namely Switchboard (Jurafsky 1997) and DailyDialog (Li et al. 2017)) shows that by modelling the topic information of utterances as an auxiliary task, our model can significantly improve DA classification, yielding comparable performance to state-of-the-art deep learning methods (Kumar et al. 2017) in classification accuracy.

Related Work
Dialogue Act (DA) recognition is a supervised classification task, where each utterance in a conversation is assigned a DA label. Broadly speaking, methods for DA classification can be divided into two categories, i.e., instance-based methods and sequence labelling methods. Instance-based methods treat each utterance as an independent data point and predict the DA label for each utterance separately, e.g., naive Bayes (Grau et al. 2004) and maximum entropy (Ang, Liu, and Shriberg 2005). In contrast, sequence labelling methods cast DA recognition as a sequence labelling task in which the dependency among consecutive utterances is taken into consideration; example methods include Hidden Markov Models (HMM) (Stolcke et al. 2000) and Conditional Random Fields (CRF) (Kim, Cavedon, and Baldwin 2010).
Recently, deep learning has been widely applied in many natural language processing tasks, including DA classification. Kalchbrenner and Blunsom (2013) proposed to model a DA sequence with a recurrent neural network (RNN) where sentence representations were constructed by means of a convolutional neural network (CNN). Lee and Dernoncourt (2016) tackled DA classification with a model built upon RNNs and CNNs. Specifically, their model can leverage the information of preceding texts, which can effectively help improve the DA classification accuracy. More recently, a latent variable recurrent neural network was developed for jointly modelling sequences of words and discourse relations between adjacent sentences (Ji, Haffari, and Eisenstein 2016). In their work, the shallow discourse structure is represented as a latent variable and the contextual information from preceding utterances is modelled with an RNN. Kumar et al. (2017) proposed a hierarchical, bidirectional long short-term memory (Bi-LSTM) model with a CRF for DA classification, where the inter-utterance and intra-utterance information is encoded by a hierarchical Bi-LSTM and the dependency between DA labels is captured by a CRF.
In addition to modelling dependency between utterances, various contexts have also been explored for improving DA classification or modelling DA under multi-task learning. For instance, Wallace et al. (2013) proposed a generative joint sequential model to classify both DA and topics of patient-doctor conversations. Their model is similar to the factorial LDA model (Paul and Dredze 2012), which generalises LDA to assign each token a K-dimensional vector of latent variables. The model of Wallace et al. only assumed that each utterance is generated conditioned on the previous and current topic/DA pairs. In contrast, our model is able to model the dependencies of all utterances of a conversation, and hence can better capture the effect between DAs and topics. Qin, Wang, and Kim (2017) introduced a joint model for identifying salient discussion points in spoken meetings as well as labelling discourse relations. They assumed that the interaction between content and discourse relations might improve the classification performance on both phrase selection and DA classification. A tree-structured discourse was constructed to jointly model the content and discourse relations. Lexical and syntactic features were utilised for the two tasks, such as TF-IDF scores for words and part of speech (POS) tags. Shen and Lee (2016) proposed a neural attention model for DA detection and key term extraction, showing that the attention mechanism is effective for sequence classification.

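Let $D = \{(C_n, Y_n, Z_n)\}_{n=1}^{N}$ be a training corpus, where $C_n = \{u_t\}_{t=1}^{T}$ is a conversation containing a sequence of $T$ utterances, and $Y_n = \{y_t\}_{t=1}^{T}$ and $Z_n = \{z_t\}_{t=1}^{T}$ are the corresponding sequences of dialogue act (labels) and topics for $C_n$, respectively. Each utterance $u_t = \{w_t^i\}_{i=1}^{K}$ of a conversation $C_n$ is a sequence of $K$ words. Our goal is to learn a model from $D$, such that, given an unseen conversation $C_u$, the model can predict the dialogue act (labels) of the utterances of $C_u$.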
Figure 1 gives an overview of the proposed Dual-Attention Hierarchical recurrent neural network (DAH). We adopt a shared utterance encoder for the input, which encodes each word $w_t^i$ of an utterance $u_t$ into a vector $h_t^i$. The dialogue act attention and topic attention mechanisms capture DA and topic information as well as the interactions between them. The outputs of the dual-attention are then encoded in the corresponding conversation-level sequence taggers (i.e., $g_t$ and $s_t$), based on the corresponding utterance representations and target labels.

Shared Utterance Encoder
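In our model, we adopt a shared utterance encoder to encode the input utterances. Such a design is based on the rationale that the shared encoder can transfer knowledge between two tasks and reduce the risk of overfitting. Specifically, the shared utterance encoder is implemented using the bidirectional gated recurrent unit (BiGRU) (Cho et al. 2014), which encodes each utterance $u_t = \{w_t^i\}_{i=1}^{K}$ of a conversation $C_n$ as a series of hidden states $\{h_t^i\}_{i=1}^{K}$. Here, $i$ indicates the timestamp within a sequence, and we define $h_t^i$ as follows:

$$h_t^i = \mathrm{concat}(\overrightarrow{h}_t^i, \overleftarrow{h}_t^i),$$

where $\mathrm{concat}(\cdot, \cdot)$ concatenates the hidden states of the forward and backward GRUs. The forward GRU computes

$$r^i = \sigma(W_r e_t^i + U_r \overrightarrow{h}_t^{i-1} + b_r),$$
$$z^i = \sigma(W_z e_t^i + U_z \overrightarrow{h}_t^{i-1} + b_z),$$
$$n^i = \tanh(W_n e_t^i + r^i \odot (U_n \overrightarrow{h}_t^{i-1}) + b_n),$$
$$\overrightarrow{h}_t^i = (1 - z^i) \odot n^i + z^i \odot \overrightarrow{h}_t^{i-1},$$

where $\overrightarrow{h}_t^{i-1}$ is the hidden state for word $w_t^{i-1}$, $e_t^i$ is the word embedding of $w_t^i$, and $r^i$, $z^i$, $n^i$ are the reset, update, and new gates, respectively. Sigmoid (denoted as $\sigma$) and tanh functions are applied to each element of their vector arguments as pointwise operations, and $\odot$ denotes element-wise multiplication. The weight matrices $W_r, W_z, W_n, U_r, U_z, U_n$ and bias vectors $b_r, b_z, b_n$ are parameters that need to be estimated. Finally, the backward GRU encodes $u_t$ from the reverse direction (i.e., $w_t^K, \ldots, w_t^1$) and generates $\{\overleftarrow{h}_t^i\}_{i=1}^{K}$ following the same formulation as the forward GRU.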
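As a rough illustration of the shared encoder, the sketch below runs a standard GRU recurrence in both directions and concatenates the two hidden states per word. It is a minimal pure-Python sketch, not the authors' implementation: the parameter names (`W_r`, `U_r`, `b_r`, and so on), the toy dimensions, the random initialisation, and the zero initial state are all assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def gru_step(e, h_prev, p):
    """One GRU step: reset gate r, update gate z, new gate n, then the new state."""
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], e), matvec(p["U_r"], h_prev), p["b_r"])]
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_z"], e), matvec(p["U_z"], h_prev), p["b_z"])]
    n = [math.tanh(a + ri * b + c) for a, ri, b, c in
         zip(matvec(p["W_n"], e), r, matvec(p["U_n"], h_prev), p["b_n"])]
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]

def init_params(emb_dim, hid_dim, rng):
    # Hypothetical small random initialisation, for illustration only.
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in "rzn":
        params["W_" + gate] = mat(hid_dim, emb_dim)
        params["U_" + gate] = mat(hid_dim, hid_dim)
        params["b_" + gate] = [0.0] * hid_dim
    return params

def run_gru(embeddings, params, hid_dim):
    h = [0.0] * hid_dim  # assumed zero initial state
    states = []
    for e in embeddings:
        h = gru_step(e, h, params)
        states.append(h)
    return states

def bigru_encode(embeddings, fwd, bwd, hid_dim):
    forward = run_gru(embeddings, fwd, hid_dim)
    # Encode the reversed utterance, then re-align states with word positions.
    backward = run_gru(embeddings[::-1], bwd, hid_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]  # concatenation per word

rng = random.Random(0)
EMB_DIM, HID_DIM = 4, 3
utterance = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(5)]
states = bigru_encode(utterance,
                      init_params(EMB_DIM, HID_DIM, rng),
                      init_params(EMB_DIM, HID_DIM, rng),
                      HID_DIM)
```

Each element of `states` has twice the hidden size because it joins the forward and backward states for the same word position.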

Task-specific Attention
Recall that one of the key challenges of our model is to capture, for each utterance, information about both dialogue acts and topics, as well as information about the interactions between them. We address this challenge by incorporating into our model a novel task-specific dual-attention mechanism, which accounts for both DA and topic information extracted from utterances. In addition, DAs and topics are semantically relevant to different words in an utterance. With the proposed attention mechanism, our model can also assign different weights to the words of an utterance by learning the degree of importance of the words to the DA or topic labelling task, i.e., promoting the words which are important to the task and reducing the noise introduced by less important words.
For each utterance $u_t$, the dialogue act attention calculates a weight vector $\{\alpha_t^i\}_{i=1}^{K}$ for $\{h_t^i\}_{i=1}^{K}$, the hidden states of $u_t$. $u_t$ can then be represented as a weighted combination vector

$$l_t = \sum_{i=1}^{K} \alpha_t^i h_t^i.$$

In contrast to the traditional attention mechanism (Bahdanau, Cho, and Bengio 2014), which only depends on one set of hidden vectors from the Seq2Seq decoder, the dialogue act attention in DAH relies on two sets of hidden vectors, i.e., $g_{t-1}$ of the conversation-level DA tagger and $s_{t-1}$ of the conversation-level topic tagger, so that the interaction between DAs and topics in each task-specific attention mechanism can capture, for utterances, information about both DAs and topics. Specifically, the weights for the dialogue act attention are calculated by

$$\alpha_t^i = \frac{\exp(o_t^i)}{\sum_{j=1}^{K} \exp(o_t^j)},$$

where the score $o_t^i$ is computed from $h_t^i$, $g_{t-1}$ and $s_{t-1}$.

The topic attention layer has a similar architecture to the dialogue act attention layer, and takes as input both $s_{t-1}$ and $g_{t-1}$. Similar to the dialogue act attention, the weight vector $\{\beta_t^i\}_{i=1}^{K}$ for the topic attention output $v_t = \sum_{i=1}^{K} \beta_t^i h_t^i$ is obtained by normalising scores computed from $h_t^i$, $s_{t-1}$ and $g_{t-1}$. Note that $W^{(act)}$, $V^{(act)}$, $U^{(act)}$, $b^{(act)}$, $W^{(topic)}$, $V^{(topic)}$, $U^{(topic)}$ and $b^{(topic)}$ are vectors of parameters that need to be learned during training.
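The dual-attention idea can be sketched as follows: each task scores every word using the word's hidden state together with the previous states of both taggers, normalises the scores with a softmax, and returns the weighted sum of hidden states. The additive (Bahdanau-style) scoring form, the projection vector `w`, and the helper `toy_params` are assumptions made for illustration; the exact scoring function is not recoverable from the text here.

```python
import math

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def task_attention(hidden_states, g_prev, s_prev, p):
    """Return (weights, context vector) for one utterance and one task."""
    # The contribution of both taggers is identical for every word: compute once.
    tagger_part = [a + b + c for a, b, c in
                   zip(matvec(p["V"], g_prev), matvec(p["U"], s_prev), p["b"])]
    scores = []
    for h in hidden_states:
        pre = [math.tanh(a + t) for a, t in zip(matvec(p["W"], h), tagger_part)]
        scores.append(sum(w * x for w, x in zip(p["w"], pre)))  # assumed projection
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context

def toy_params(att_dim, hid_dim, tag_dim, scale):
    # Hypothetical deterministic parameters so the example is reproducible.
    def mat(rows, cols):
        return [[scale * ((r + c) % 3 - 1) for c in range(cols)] for r in range(rows)]
    return {"W": mat(att_dim, hid_dim), "V": mat(att_dim, tag_dim),
            "U": mat(att_dim, tag_dim), "b": [0.0] * att_dim, "w": [1.0] * att_dim}

hidden_states = [[0.1, 0.4], [0.9, -0.2], [-0.3, 0.5]]  # one vector per word
g_prev = [0.2, -0.1]   # previous state of the DA tagger
s_prev = [0.05, 0.3]   # previous state of the topic tagger

# Separate parameter sets give the two task-specific attentions.
alpha, l_t = task_attention(hidden_states, g_prev, s_prev, toy_params(3, 2, 2, 0.5))
beta, v_t = task_attention(hidden_states, g_prev, s_prev, toy_params(3, 2, 2, -0.5))
```

Both attentions read the same encoder states and the same pair of tagger states; only their parameters differ, which is what lets the two tasks weight the words of an utterance differently.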

Conversational Sequence Tagger
Dialogue act sequence tagger. The conversational dialogue act sequence tagger predicts the next DA $y_t$ conditioned on the attention vector $l_t$ and all previously predicted DAs $\{y_i\}_{i=1}^{t-1}$ (cf. Figure 1). Formally, this conditional probability can be formulated as

$$p(y_t \mid l_t, y_1, \ldots, y_{t-1}), \qquad (9)$$

where

$$p(y_t \mid l_t, y_1, \ldots, y_{t-1}) = \mathrm{softmax}(g(g_t, l_t)). \qquad (10)$$

Here $C = \{u_t\}_{t=1}^{T}$ is the sequence of all utterances seen so far and $T$ is the length of a conversation; $g_t$ is the hidden state of the conversational DA tagger for the $t$-th utterance, $l_t$ is the attention vector of $u_t$, $g(\cdot)$ is a linear transformation function, and $A$ and $b$ are model parameters which need to be learned during training.
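A small sketch of the output layer in Equation (10) may help. It assumes that the linear transformation is applied to the concatenation of the tagger state and the attention vector, with weight matrix `A` and bias `b`; the concatenation and all toy values are assumptions made for illustration.

```python
import math

def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def da_distribution(g_t, l_t, A, b):
    """Distribution over DA labels from the tagger state and attention vector."""
    features = g_t + l_t  # list concatenation (assumed input to the linear map)
    logits = [sum(w * x for w, x in zip(row, features)) + bias
              for row, bias in zip(A, b)]
    return softmax(logits)

g_t = [0.2, -0.4]        # toy tagger hidden state
l_t = [0.1, 0.3, -0.2]   # toy attention vector
A = [[0.5, 0.0, 1.0, 0.0, 0.0],   # one row per DA label (three toy labels)
     [0.0, 0.5, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0, -0.1]

probs = da_distribution(g_t, l_t, A, b)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The predicted label is simply the index with the highest probability; the topic tagger would use the same construction with its own state and attention vector.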
Vector $g_t$ is calculated by a GRU (denoted as $f$). In training, teacher forcing (Williams and Zipser 1989) with a ratio of 0.5 is used for the label $y_{t-1}$ in order to limit the accumulation of prediction errors.

Topic sequence tagger. The conversational topic sequence tagger is designed to predict $z_t$ conditioned on $v_t$ and all previously predicted topics $\{z_i\}_{i=1}^{t-1}$. Similar to the formulation of the dialogue act tagger, we have

$$p(z_t \mid v_t, z_1, \ldots, z_{t-1}) = \mathrm{softmax}(g(s_t, v_t)).$$

Here $C = \{u_t\}_{t=1}^{T}$ is also the sequence of all utterances seen so far, $s_t$ is the hidden state of the conversational topic tagger for the $t$-th utterance, $v_t$ is the attention vector of $u_t$, and $A$ and $b$ are model parameters.
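Teacher forcing can be sketched as a choice, at every step, between feeding the gold previous label and feeding the model's own previous prediction. The sketch assumes that the value 0.5 is the probability of feeding the gold label; `toy_step` is a hypothetical stand-in for the tagger and `<s>` is a hypothetical start symbol.

```python
import random

def decode_with_teacher_forcing(gold, step, ratio, rng, start="<s>"):
    """Run the tagger over one conversation; return (predictions, fed_labels)."""
    prev = start
    predictions, fed = [], []
    for y_gold in gold:
        fed.append(prev)
        y_hat = step(prev)  # model prediction given the previous label
        predictions.append(y_hat)
        # With probability `ratio` feed the gold label next, else the prediction.
        prev = y_gold if rng.random() < ratio else y_hat
    return predictions, fed

def toy_step(prev_label):
    # Hypothetical stand-in for the DA tagger: always predicts "sd".
    return "sd"

gold = ["qw", "sd", "b", "sv"]
rng = random.Random(0)
_, fed_gold = decode_with_teacher_forcing(gold, toy_step, 1.0, rng)   # always gold
_, fed_pred = decode_with_teacher_forcing(gold, toy_step, 0.0, rng)   # never gold
_, fed_mixed = decode_with_teacher_forcing(gold, toy_step, 0.5, rng)  # mixed
```

A ratio of 1.0 reproduces fully supervised decoding and 0.0 reproduces free-running decoding; intermediate values expose the tagger to some of its own mistakes during training.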
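Let $\Theta$ be all the model parameters that need to be estimated for the DAH model. We can then estimate $\Theta$ based on $D = \{(C_n, Y_n, Z_n)\}_{n=1}^{N}$, where $Y_n$ and $Z_n$ are the DA and topic label sequences of conversation $C_n$, by minimising the objective function below, which seeks to jointly optimise the prediction for both dialogue acts and topics:

$$\mathcal{L}(\Theta) = -\sum_{n=1}^{N} \left[ \log p(Y_n \mid C_n) + \alpha \log p(Z_n \mid C_n) \right].$$

The hyper-parameter $\alpha$ controls the contribution of the conversational topic tagger towards the objective function. In our experiments, $\alpha = 0.1$ is determined empirically.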
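The joint objective can be illustrated for a single conversation as a weighted sum of two negative log-likelihoods. The sketch assumes the per-step output distributions are already available as dictionaries; the labels and probabilities below are made-up toy values, and only the weight 0.1 comes from the text.

```python
import math

def sequence_nll(distributions, labels):
    """Negative log-likelihood of the gold labels under per-step distributions."""
    return -sum(math.log(dist[label]) for dist, label in zip(distributions, labels))

def joint_loss(da_dists, da_labels, topic_dists, topic_labels, alpha=0.1):
    # DA loss plus the topic loss scaled by the auxiliary-task weight alpha.
    return (sequence_nll(da_dists, da_labels)
            + alpha * sequence_nll(topic_dists, topic_labels))

# Toy two-utterance conversation (hypothetical probabilities).
da_dists = [{"sd": 0.7, "qw": 0.3}, {"sd": 0.4, "qw": 0.6}]
da_labels = ["sd", "qw"]
topic_dists = [{"music": 0.5, "pets": 0.5}, {"music": 0.5, "pets": 0.5}]
topic_labels = ["music", "music"]

loss = joint_loss(da_dists, da_labels, topic_dists, topic_labels, alpha=0.1)
da_only = joint_loss(da_dists, da_labels, topic_dists, topic_labels, alpha=0.0)
```

Setting `alpha` to zero recovers a DA-only objective, so the weight directly controls how much the auxiliary topic task influences training.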

Experimental Settings

Datasets
We evaluate the performance of our model on two publicly available dialogue datasets, namely, Switchboard (Jurafsky 1997) and DailyDialog (Li et al. 2017).

Switchboard Dialogue Act Corpus (SWDA). The SWDA dataset consists of 1,155 two-sided telephone conversations labelled with 66 conversation-level topics (e.g., weather climate, air pollution) and 42 utterance-level dialogue acts (e.g., statement-opinion, statement-non-opinion, wh-question). The average speaker turns per conversation, tokens per conversation, and tokens per utterance are 195.2, 1,237.8, and 7.0, respectively.

DailyDialog Corpus (DyDA). The DyDA dataset contains 13,118 human-written daily conversations, which are labelled with 10 different topics (e.g., tourism, politics, finance) at the conversation level as well as four dialogue act classes at the utterance level, i.e., inform, question, directive and commissive. The former two classes are information transfer acts, while the latter two are action discussion acts. The average speaker turns per conversation, tokens per conversation, and tokens per utterance are 7.9, 114.7, and 14.6, respectively. The definition of the four mutually-exclusive categories of dialogue acts is as follows (Li et al. 2017):

• "Inform class contains all statements and questions by which the speaker is providing information";
• "Questions class is labelled when the speaker wants to know something and seeks for some information";
• "Directives class contains dialogue acts like request, instruct, suggest and accept/reject offer";
• "Commissive class is about accept/reject request or suggestion and offer".

Implementation Details
For both experimental datasets (SWDA and DyDA), the top 15,000 words with the highest frequency are indexed.For SWDA, the standard split is adopted based on  1.
The input data is represented with 300-dimensional GloVe word embeddings (Pennington, Socher, and Manning 2014) in order to capture word similarity and accelerate model training. The shared encoder is a BiGRU with two layers, whereas the conversational sequence tagger is a GRU containing a single layer. We set the dimension of the hidden layers (i.e., $h_t^i$, $g_t$ and $s_t$) to 100 and apply a dropout layer (Srivastava et al. 2014) to both the shared encoder and the sequence tagger at a rate of 0.2. The Adam optimiser (Kingma and Ba 2014) is used for training with an initial learning rate of 0.001 and a weight decay of 0.0001. Each utterance in a mini-batch is padded to the maximum length for that batch, and the maximum batch size allowed is 10.

• Bi-LSTM-CRF: A hierarchical bidirectional LSTM with a CRF as the top layer to classify dialogue acts (Kumar et al. 2017).

Note that while all the aforementioned baselines model the dependency between the dialogue acts of a sequence of utterances, only the JAS model has attempted to model both dialogue acts and topics. All baselines above use the same test dataset as our model.
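The hyperparameters reported in this section can be gathered into a single configuration, and the per-batch padding can be sketched in a few lines. The dictionary keys, the padding index `0`, and the reading of the batch size as a count of utterances are assumptions made for illustration; the numeric values are those stated in the text.

```python
CONFIG = {
    "vocab_size": 15000,     # most frequent words that are indexed
    "embedding_dim": 300,    # GloVe embedding size
    "hidden_dim": 100,       # size of h, g and s
    "encoder_layers": 2,     # BiGRU utterance encoder
    "tagger_layers": 1,      # GRU conversation-level taggers
    "dropout": 0.2,
    "learning_rate": 0.001,  # Adam initial learning rate
    "weight_decay": 0.0001,
    "max_batch_size": 10,
}

def pad_batch(utterances, pad_id=0):
    """Pad every utterance (a list of word ids) to the longest one in the batch."""
    max_len = max(len(u) for u in utterances)
    return [u + [pad_id] * (max_len - len(u)) for u in utterances]

def make_batches(utterances, batch_size):
    """Split utterances into consecutive batches, padding each batch separately."""
    return [pad_batch(utterances[start:start + batch_size])
            for start in range(0, len(utterances), batch_size)]

padded = pad_batch([[5, 6, 7], [8]])
toy_corpus = [[1] * (1 + i % 4) for i in range(23)]  # 23 toy utterances
batches = make_batches(toy_corpus, CONFIG["max_batch_size"])
```

Padding to the longest utterance of each batch, rather than of the whole corpus, keeps short batches short.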

Dialogue Acts Classification
Table 2 shows the dialogue act classification results of our model and the baselines on the SWDA dataset. Among the baseline models, Bi-LSTM-CRF achieved the best classification performance with 79.2% accuracy. It can also be observed that the deep learning models (e.g., Bi-LSTM-CRF, DRLM-Cond) in general give better performance than the traditional statistical models (i.e., HMM and JAS).
The SAH model, which only models dialogue acts, obtains 74.1% accuracy, which is better than JAS and RCNN. By jointly modelling dialogue acts and topics, the DAH model achieves an overall accuracy of 78.3%, which is a significant performance boost over SAH (i.e., 4.2 percentage points higher; paired t-test p < 0.01). This result shows that the performance of DA classification can be improved significantly by using topic information. When comparing DAH with the baseline models, we can see that DAH achieves comparable performance to the state-of-the-art model Bi-LSTM-CRF (i.e., 78.3% vs. 79.2%). Although Bi-LSTM-CRF outperforms DAH, the architecture of DAH is simpler: Bi-LSTM-CRF employs a bidirectional LSTM in the conversational layer, and its DA classifier is a CRF, which is more complicated than the softmax of DAH.

Results on the DyDA dataset. We also evaluated our models on the DyDA dataset. As for the baselines, we ran and report the results for JAS and DRLM-Cond, as only the source code for these two models is publicly available. Nevertheless, one should note that DRLM-Cond is the second-best performing baseline on the SWDA dataset. We fine-tuned the model parameters for both JAS and DRLM-Cond to make the comparison as fair as possible.
As can be seen from Table 3, DRLM-Cond performs better than JAS and achieves an overall accuracy of 81.1%. Our DAH and SAH models, in contrast, give much better performance, with both models outperforming DRLM-Cond by more than 3.2% on utterance-level dialogue act classification. As with the SWDA dataset, DAH outperforms the SAH model on DyDA.

To summarise, our DAH model achieves comparable performance to the state-of-the-art for dialogue act classification on the SWDA dataset; it also gives the best classification performance on the DyDA dataset. Experimental results demonstrate that modelling conversational topic information as an auxiliary task does improve the classification of dialogue acts.

Analysing the Effectiveness of Joint Modelling Dialogue Act and Topic
In this section, we provide detailed analysis on why DAH can yield better performance than SAH by jointly modelling dialogue acts and topics.
Figure 2 shows the normalized confusion matrix derived from 10 DA classes of SWDA for both the DAH and SAH models. It can be observed that DAH yields improvement on recall for many DA classes compared to SAH, e.g., 17.8% improvement on bk and 7% on sv. For bk (Response Acknowledge), which has the highest improvement level, we see that the improvement largely comes from the reduction in misclassifying bk as b (Acknowledge Backchannel). The key difference between bk and b is that an utterance labelled with bk has to be produced within a question-answer context, whereas b is a "continuer" simply representing a response to the speaker. It is not surprising that SAH makes poor predictions for the utterances of these two DAs: they share many syntactic cues, e.g., indicator words such as 'okay', 'oh', and 'uh-huh', which can easily confuse the model. When comparing the topic distribution of the utterances under the bk and b categories (cf. Figure 4), we found that topics relating to personal leisure (e.g., music, exercise and fitness, pets, and gardening) are much more prominent in bk than in b. By leveraging the topic information, DAH can better handle the confusion cases and hence improve the prediction for bk significantly.
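For readers who want to reproduce this kind of analysis, the sketch below builds a confusion matrix normalised per gold label, so that each diagonal entry is the recall of that class. Row-wise normalisation is an assumption about how the matrices in Figure 2 were normalised, and the gold and predicted labels are made-up toy data.

```python
from collections import Counter

def normalised_confusion(gold, predicted, labels):
    """Row-normalised confusion matrix: matrix[g][p] = P(predicted p | gold g)."""
    counts = Counter(zip(gold, predicted))
    matrix = {}
    for g in labels:
        row_total = sum(counts[(g, p)] for p in labels)
        matrix[g] = {p: (counts[(g, p)] / row_total if row_total else 0.0)
                     for p in labels}
    return matrix

# Toy example: half of the bk utterances are confused with b.
gold = ["bk", "bk", "bk", "bk", "b", "b"]
predicted = ["bk", "b", "b", "bk", "b", "b"]
matrix = normalised_confusion(gold, predicted, ["bk", "b"])
recall_bk = matrix["bk"]["bk"]
```

The off-diagonal entry `matrix["bk"]["b"]` is the quantity discussed above: the share of bk utterances that the model labels as b.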
There are also cases where DAH performs worse than SAH. Take the DA pair of qo (Open Question) and qw (wh-question) as an example. qo refers to questions like 'How about you?' and its variations (e.g., 'What do you think?'), whereas qw represents wh-questions, which are much more specific in general (e.g., 'What other long range goals do you have?'). SAH gives quite decent performance in distinguishing the qw and qo classes. This is somewhat reasonable, as linguistically the utterances of these two classes are quite different, i.e., a qw utterance expresses a very specific question and is relatively lengthy, whereas qo utterances tend to be very brief. We see that DAH performs worse than SAH, with quite a large percentage of qw utterances misclassified as qo. This is likely due to the fact that there is no significant difference between the topic distributions of qw and qo, as shown in Figure 4, and incorporating the topic information into DAH actually makes these two DAs less distinguishable for the model.
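The topic distribution used in this comparison follows the definition given in the caption of Figure 4: the number of utterances with topic k and DA label d, divided by the number of utterances with DA label d. A minimal sketch, using made-up (DA label, topic) pairs, is:

```python
from collections import Counter

def topic_distribution(pairs, da_label):
    """Share of each topic among the utterances carrying the given DA label."""
    joint = Counter(pairs)  # counts of (DA label, topic) pairs
    total = sum(count for (d, _), count in joint.items() if d == da_label)
    return {k: count / total for (d, k), count in joint.items() if d == da_label}

# Toy utterance annotations: (DA label, conversation topic).
pairs = [("bk", "music"), ("bk", "music"), ("bk", "pets"),
         ("b", "crime"), ("b", "music")]

dist_bk = topic_distribution(pairs, "bk")
dist_b = topic_distribution(pairs, "b")
```

Comparing `dist_bk` with `dist_b` topic by topic is the kind of inspection that separates bk from b above, and that finds little difference between qw and qo.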
We also conducted a similar analysis on the DyDA dataset, with the confusion matrices shown in Figure 3.

Finally, we show in Figure 5 a DA attention visualisation example of SAH and DAH for an utterance from SWDA. It can be seen that SAH gives very high weight to the word "because" and de-emphasizes other words. By modelling both DAs and topics with the dual-attention mechanism, DAH can capture more important words for the task (e.g., "reasonable", "ever", etc.) and correctly predicts the DA label as sd.

Conclusion
In this paper, we developed a dual-attention hierarchical recurrent neural network for dialogue act classification. Compared to the flat structure employed by existing models, our hierarchical model can better preserve the hierarchical structure of natural language conversations. More importantly, with the proposed task-specific dual-attention mechanism, our model is able to capture information about both dialogue acts and topics, as well as information about the interactions between them. Experimental results based on two public benchmark datasets show that modelling conversational topic information as an auxiliary task can effectively improve dialogue act classification, and that our model is able to achieve comparable performance to the state-of-the-art deep learning methods for DA classification.

Figure 1: Overview of the dual-attention hierarchical recurrent neural network.

Figure 2: The normalized confusion matrix of 10 DAs using SAH (left) and DAH (right) in SWDA.

Figure 3: The normalized confusion matrix of the DAs using SAH (left) and DAH (right) in DyDA.

Figure 4: Topic distribution (the distribution of topic k under a DA label d is calculated by using the number of utterances associated with topic k and DA label d divided by the total number of utterances associated with the DA label d) of b, bk, qw and qo on 12 most prominent topics (1: gun control, 2: air pollution, 3: music, 4: universal public service, 5: crime, 6: pets, 7: latin america, 8: exercise and fitness, 9: basketball, 10: gardening, 11: space flight and exploration, 12: ethics in government).

Figure 5: DA attention visualisation using SAH (upper) and DAH (lower). The true label of the utterance above is sd; SAH predicts DA as sv and DAH predicts DA as sd.

Table 1: |C| is the number of Dialogue Act classes, |T| is the number of conversation-level topic classes, |V| is the vocabulary size. Training, Validation and Testing indicate the number of conversations/utterances in the respective splits.

Table 2: DA classification results on the SWDA dataset.

Table 3: DA classification results on the DyDA dataset.

By examining the classification performance of DAH and SAH on each dialogue act type, we see that both models achieve fairly similar performance on the Inform and Questions classes, but DAH outperforms SAH on Directives and Commissive by more than 4% in F1 scores. This again indicates that conversation-level topic information is helpful for dialogue act recognition.