Integrating User History into Heterogeneous Graph for Dialogue Act Recognition

Dialogue Act Recognition (DAR) is a challenging problem in Natural Language Understanding which aims to attach Dialogue Act (DA) labels to each utterance in a conversation. However, due to the informality and diversity of natural language expressions, previous studies cannot fully recognize the specific expressions given by users. To solve this problem, we propose a Heterogeneous User History (HUH) graph convolution network, which utilizes the user's historical answers, grouped by DA labels, as additional clues to recognize the DA labels of utterances. To handle the noise caused by introducing the user's historical answers, we design a set of denoising mechanisms, including a History Selection process, a Similarity Re-weighting process, and an Edge Re-weighting process. We evaluate the proposed method on two benchmark datasets, MSDialog and MRDA. The experimental results verify the effectiveness of integrating the user's historical answers and show that our proposed model outperforms the state-of-the-art methods.


Introduction
Dialogue Act Recognition (DAR) is an important but challenging task in Natural Language Understanding (NLU), which aims to attach Dialogue Act (DA) labels to each utterance in a conversation and thereby recognize the speaker's intention. Automatic DAR can serve many applications such as question answering, speech recognition, and dialogue systems (Higashinaka et al., 2014; Khanpour et al., 2016). In this work, rather than recognizing a single DA per utterance, we focus on recognizing multiple DAs in a multi-party conversation (e.g., a forum thread). The latter is more difficult but also more common in the real world. Figure 1 shows an example of multiple DA recognition in a tech forum.
Previous studies have proposed deep learning models that approach DAR as a multi-class classification problem (Lee and Dernoncourt, 2016) or a sequence labeling problem (Kumar et al., 2018; Chen et al., 2018; Raheja and Tetreault, 2019; Li et al., 2019). Most of these approaches assume that utterances are sequentially organized, ignoring the rich interaction among multiple users in a conversation. To alleviate this problem, some recent studies exploit graph-structured networks (Ghosal et al., 2019; Hu et al., 2019), which leverage the speaker interactions of the interlocutors to model conversational context. However, due to the informality and diversity of natural language expressions, the same intention can take very rich forms of expression. In forums where many users participate, it is common for different users to express their intentions with personalized expressions. When encountering a non-obvious intention or an uncommon expression pattern, existing deep learning methods, which compress the utterance features into a low-dimensional vector, may suffer degraded DA recognition performance.
Intuitively, the user's historical expressions, extracted according to DA category, can serve as DA-specific clues that help the encoding process recognize the user's obscure or uncommon expressions. For example, as shown in Figure 1, we intend to recognize three DA labels for T4. Although we can easily label T4 as Greetings/Gratitude (GG) based on the utterance features alone (i.e., "Thanks"), detecting the Follow-up Question (FQ) label requires additional information from T2. Moreover, T4 appears to be a question, with no explicit Negative Feedback (NF). However, from the user's historical answers that reflect the NF label, such as H1 and H2, we find that the phrase "are you sure" indicates negative feedback: the user doubts something.
Inspired by this, we propose a Heterogeneous User History (HUH) graph convolution network, which integrates utterances, the conversation, and the user's historical answers into multiple DA recognition. The proposed DA recognition process is divided into two stages. In the 1st-phase, we extract the hidden features of each utterance by stacking a convolutional neural network (CNN) utterance encoder and a Bi-directional Long Short-Term Memory (BiLSTM) utterance context encoder, and we use the hidden features to predict the initial score of DA labels for each utterance. In the 2nd-phase, we use the user's historical answers as additional information to construct a Heterogeneous User History (HUH) graph, and we use a Relational Graph Convolutional Network (RGCN) (Schlichtkrull et al., 2018) to learn this graph and model the interaction between utterances and the user's historical answers before recognizing the user's intentions.
Despite the benefits of the user's historical answers, improper use may inevitably introduce a lot of noise. We thus further design a set of denoising mechanisms. First, we use a History Selection process with similarity measures to filter out the irrelevant historical answers. Then, we adopt the initial score of DA labels learned in the 1st-phase to re-weight the similarity matrix between the user's historical answers and the conversation, and to re-weight the edges in the proposed HUH graph, thereby reducing the noise caused by introducing supplementary information.
The main contributions of our work can be summarized as follows: 1) We propose a novel Heterogeneous User History (HUH) graph convolution network, which models the interaction between users and integrates utterances, the conversation, and the user's historical answers for recognizing user intent. To the best of our knowledge, we are the first to integrate the user's historical answers into a heterogeneous graph for DAR. 2) To alleviate the noise caused by introducing the user's historical answers, we design a set of denoising mechanisms, including a History Selection process, a Similarity Re-weighting process, and an Edge Re-weighting process. 3) We evaluate our model on two benchmark datasets, MSDialog and MRDA. Compared to state-of-the-art DAR methods, our model achieves better performance, and the experimental results verify the effectiveness of incorporating the user's historical answers.

Related Work
The goal of DAR is to assign DA labels to each utterance in a conversation. Early studies on DAR are mostly based on statistical machine learning methods, approaching the task as a multi-class classification or sequence labeling problem with models such as Hidden Markov Models (HMM) (Stolcke et al., 2000), Support Vector Machines (SVM) (Surendran and Levow, 2006), and Bayesian Networks (Keizer et al., 2002). Recent studies on DAR have proposed deep learning models and obtained promising results. Deep learning approaches typically model the interaction between adjacent utterances (Lee and Dernoncourt, 2016). Some researchers capture the dependencies among both utterances and labels with Conditional Random Fields (CRF) (Kumar et al., 2018; Chen et al., 2018; Raheja and Tetreault, 2019; Li et al., 2019). Furthermore, Colombo et al. (2020) leverage a sequence-to-sequence approach to model both the conversation and the global tag dependencies. Besides, some studies explore joint models that solve DAR and sentiment classification simultaneously in a unified framework (Cerisara et al., 2018; Kim and Kim, 2018; Qin et al., 2020). However, these methods assume that utterances are sequentially organized, ignoring the rich interaction process between users in a conversation.
Some recent studies design graph-structured networks to model speaker interaction in a conversation. Hu et al. (2019) first propose a graph-structured network (GSN) to model graph-structured dialogues for response generation. Ghosal et al. (2019) leverage self- and inter-speaker interactions of the interlocutors to model conversational context for emotion recognition. Although effective, these methods do not make full use of the user information in the conversation. A recent study (Wen et al., 2018) encodes the user's historical answers to boost the community question answering (CQA) task, but it considers only a single user and ignores the noise caused by introducing multiple users' historical answers. In this paper, we propose a Heterogeneous User History (HUH) graph convolution network, which integrates utterances, the conversation, and the user's historical answers into multiple DA recognition. The experimental results verify the effectiveness of integrating the user's historical answers and show that our proposed model outperforms the state-of-the-art methods.

Methodology
Before describing our proposed model, we first introduce the basic mathematical notation and terminology for the problem of DAR. The task of DAR takes a conversation C as input, which contains a sequence of utterances {U_1, U_2, ..., U_N}. For each utterance U_t (the t-th utterance) in a conversation, we predict a subset of DA labels y_t = {y_t^1, y_t^2, ..., y_t^S} describing the functionality of the utterance, drawn from a candidate set of DA labels D = {d_1, d_2, ..., d_S}, where y_t^j ∈ {0, 1} indicates whether the t-th utterance is labeled with DA label d_j. For each DA label, we use the user's historical answers belonging to that label to construct a label node. Specifically, for the j-th DA label d_j, we retrieve a series of the user's historical answers belonging to this label, select the top-K answers most relevant to the conversation, and sum them by similarity weight to generate a label node e_j corresponding to d_j. Here, K is a hyperparameter.
The overall architecture of our proposed model HUH is shown in Figure 2. HUH mainly contains two phases: 1) The 1st-phase, as shown in the left part, aims to predict the initial score of DA labels for each utterance as a guide to the 2nd-phase process. 2) The 2nd-phase, as shown in the right part, constructs a Heterogeneous User History graph to integrate utterance, conversation, and user's historical answers into multiple DA recognition. Additionally, the proposed denoising mechanisms are presented at the corresponding phase in Figure 2. In the following sections, the details of our framework are given.

1st-phase: Encoding Utterances
In the 1st-phase, we predict the initial score of DA labels for each utterance. For each word in an utterance, we first convert it into a pre-trained word-level embedding, using GloVe (Pennington et al., 2014) as initialization, and obtain the representation of the t-th utterance as U_t = {w_t^1, w_t^2, ..., w_t^L}. We then use a CNN (Kim, 2014) followed by max-pooling to extract local utterance features:

    û_t = MaxPooling(CNN(U_t))    (1)

Based on the local utterance features extracted by the CNN, a BiLSTM is applied to gather features from the context:

    h_t^f = LSTM_f(û_t, h_{t-1}^f)    (2)
    h_t^b = LSTM_b(û_t, h_{t+1}^b)    (3)
    ĥ_t = [h_t^f, h_t^b]    (4)

where ĥ_t ∈ R^{2d_h} is the sequential context-aware representation of the t-th utterance and d_h is the hidden size of the BiLSTM. During training, we save ĥ_t as the utterance-level representation used for the user's historical answers and provide it to the 2nd-phase.
With the extracted local textual features û_t and context-aware features ĥ_t, we predict the initial score of DA labels for each utterance:

    p̂_t = W_α [û_t, ĥ_t] + b_α    (5)
    ŷ_t = sigmoid(p̂_t)    (6)

where [û_t, ĥ_t] denotes the concatenated result and W_α, b_α are weight parameters to be learned. ŷ_t ∈ R^S represents the initial score of DA labels for the t-th utterance, where S is the number of DA labels. In the 1st-phase we can also compute a DA label prediction loss here, denoted loss_1p. Although this prediction is not used as the final output, loss_1p is a useful auxiliary loss for training the 1st-phase.
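As a toy illustration of Formulas (5)-(6), the sketch below scores one utterance with plain Python lists; `initial_scores`, `W`, and `b` are hypothetical names, and a real implementation would use learned tensors rather than hand-written loops:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def initial_scores(u_t, h_t, W, b):
    """Initial DA-label scores for one utterance (toy sketch of (5)-(6)).

    u_t: local CNN features; h_t: BiLSTM context features (plain lists).
    W is an S x (len(u_t)+len(h_t)) weight matrix, b a length-S bias.
    """
    x = u_t + h_t                           # concatenation [u_t, h_t]
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
              for row, b_j in zip(W, b)]    # Formula (5)
    return [sigmoid(z) for z in logits]     # Formula (6): one score per label
```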

2nd-phase: Integrating User's History
To capture the interaction between users and properly integrate the user's historical answers into the utterance encoding, we present a novel Heterogeneous User History graph convolution network. We denote our Heterogeneous User History graph as G = (V, E, R, W), where V stands for the node representations, E represents the edges between nodes, and R and W are the types and weights of the edges.

Graph Node
There are two kinds of nodes in our heterogeneous graph: utterance nodes and label nodes.

Utterance Node: To represent utterance nodes, we share the same encoder (parameter sharing) with the 1st-phase to obtain the sequentially encoded feature vector h_i for all i ∈ [1, 2, ..., N], where N is the number of utterances in the conversation.
Label Node: For each DA label there is a corresponding label node, and we use the user's historical answers to generate its representation. First, we retrieve all the historical answers of the users enrolled in the conversation from the training set and group the answers according to DA labels. Then, for the j-th DA label, we select the top-K historical answers (denoted H(j)) most relevant to the current conversation and convert them into the utterance-level representations learned in the 1st-phase. We sum these historical answers by weight to generate the corresponding label node e_j for all j ∈ [1, 2, ..., S]:

    e_j = Σ_{k=1}^{K} α_k ĥ_k    (7)

where S is the number of DA labels, ĥ_k is the utterance-level representation from Formula (4) learned in the 1st-phase, and the weights are simply initialized as α_k = 1/K.
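As a sketch, the weighted sum of Formula (7) over plain-list representations might look like this (`label_node` is an illustrative name, not the paper's code; with no weights given, it falls back to the uniform 1/K initialization):

```python
def label_node(history_reps, weights=None):
    """Build a label node e_j as a weighted sum of the top-K historical-answer
    representations for one DA label (toy sketch of Formula (7))."""
    K = len(history_reps)
    if weights is None:
        weights = [1.0 / K] * K        # uniform initialization alpha_k = 1/K
    dim = len(history_reps[0])
    e_j = [0.0] * dim
    for w, h in zip(weights, history_reps):
        for d in range(dim):
            e_j[d] += w * h[d]
    return e_j
```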

Graph Edge
We define the following types of edges between pairs of nodes to encode various structural information in our graph.

Speaker Edge: Every speaker in the conversation is affected by himself and by other speakers, resulting in two different edge types: one to one's self and one to others. In addition, the impact between utterances depends on their relative positions in the conversation: before or after. As a result, there are 2 × 2 = 4 different speaker edge types, and there can be N^2 speaker edges in a conversation, where N is the number of utterances. For each speaker edge, we use a similarity-based attention module to obtain the edge weight:

    α_{i,j} = softmax_j(h_i^T W_e [h_1, h_2, ..., h_N])    (8)

so that each utterance node h_i, i ∈ [1, 2, ..., N], which has incoming edges from utterance nodes h_1, ..., h_N, receives a total weight contribution of 1.

Label Edge: We use a label edge to connect a label node with an utterance node. As an illustration, given a label node e_j and an utterance node h_i, we construct a directed label edge from e_j to h_i, denoted e_j → h_i. Each label node e_j, j ∈ [1, ..., S], is connected to all utterance nodes h_i, i ∈ [1, ..., N], so there can be N × S label edges in a conversation. We initialize the weights of these label edges as:

    β_{i,j} = 1/S    (9)

where S is the number of DA labels, giving each label node the same weight.
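The requirement that the incoming speaker-edge weights of an utterance node sum to 1 is naturally met by a softmax over raw similarity scores, sketched below (the similarity scores themselves are assumed to come from the attention module, not computed here):

```python
import math

def speaker_edge_weights(sims):
    """Normalize raw similarity scores over the incoming speaker edges of one
    utterance node (sketch of the attention in Formula (8)). The softmax makes
    the incoming edge weights sum to 1, as the paper requires."""
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]
```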

Message Passing
To consider the various relationships between nodes, we use a relation-specific message passing strategy inspired by RGCN (Schlichtkrull et al., 2018), which can be formulated as:

    z_i^{(l+1)} = σ( Σ_{r∈R} Σ_{j∈N_i^r} (α_{i,j} / c_{i,r}) W_r^{(l)} z_j^{(l)} + α_{i,i} W_0^{(l)} z_i^{(l)} )    (10)

where z_i^{(l)} is the representation of graph node i at layer l, α_{i,j} and α_{i,i} are edge weights, and N_i^r denotes the set of neighbor indices of node i under relation r ∈ R. c_{i,r} is a problem-specific normalization constant that can either be learned or chosen in advance (such as c_{i,r} = |N_i^r|), σ is an activation function (ReLU), and W_r^{(l)} and W_0^{(l)} are learnable parameters.
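The message-passing rule can be sketched with scalar node features (a toy sketch: real RGCN layers apply per-relation weight matrices to vectors, and the tuple-based edge encoding here is an assumption for illustration):

```python
from collections import defaultdict

def rgcn_layer(nodes, edges, W_rel, W_self):
    """One relation-specific message-passing step in the spirit of Formula (10).

    nodes : list of scalar node features
    edges : list of (src, dst, relation, weight) tuples
    W_rel : dict relation -> scalar weight; W_self : self-loop weight
    Messages under each relation are normalized by c_{i,r} = |N_i^r|.
    """
    # group incoming edges per (dst, relation) for the 1/c_{i,r} normalization
    incoming = defaultdict(list)
    for src, dst, rel, w in edges:
        incoming[(dst, rel)].append((src, w))
    out = []
    for i, z in enumerate(nodes):
        agg = W_self * z                      # self-connection term
        for (dst, rel), nbrs in incoming.items():
            if dst != i:
                continue
            c = len(nbrs)                     # c_{i,r} = |N_i^r|
            for src, w in nbrs:
                agg += w * W_rel[rel] * nodes[src] / c
        out.append(max(0.0, agg))             # ReLU activation
    return out
```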

Denoising Mechanisms
Inevitably, introducing users' historical answers from conversations on different topics brings a lot of noise, and it is difficult to extract useful information from the user's historical answers with the simple averaging of Formula (7) and the uniform weighting of Formula (9). To alleviate the noise and make full use of the user's historical answers, we design a set of denoising mechanisms, including a History Selection process to filter out historical answers that are less relevant to the conversation, a Similarity Re-weighting process to re-weight the similarity matrix between the historical answers and the conversation, and an Edge Re-weighting process to re-weight the label edge weights.
History Selection: To filter out irrelevant history from the large set of the user's historical answers, we propose a coarse-grained History Selection process. In this process, we retrieve all the historical answers of the users enrolled in the conversation from the training set and group the answers according to DA labels. For the i-th historical answer under the j-th DA label (denoted U_i^j), we convert each word in U_i^j into pre-trained word embeddings and apply word-level max-pooling to generate the utterance-level representation u_i^j ∈ R^d:

    u_i^j = MaxPooling(w_1, w_2, ..., w_{L_u})    (11)

where d is the dimension of the word embeddings and L_u is the length of the historical answer. Then, for the current conversation C, we convert each word in C into pre-trained word embeddings and apply max-pooling at the word level and the utterance level to generate the conversation-level representation c ∈ R^d:

    c = MaxPooling(u_1, u_2, ..., u_N),  u_t = MaxPooling(w_t^1, w_t^2, ..., w_t^{L_c})    (12)

where L_c is the length of an utterance in the conversation and N is the number of utterances. We then calculate the cosine similarity between each historical answer U_i^j and the conversation C and select the top-K historical answers (denoted H(j)) most relevant to the current conversation:

    H(j) = TopK_i( cos(u_i^j, c) )    (13)

Similarity Re-weighting: History Selection is a coarse-grained process, which cannot accurately measure the relevance between the user's historical answers and the current conversation. Therefore, we design a Similarity Re-weighting process to re-weight the similarity matrix. Considering a conversation with N utterances C = {h_1, h_2, ..., h_N} and the user's historical answers for the j-th DA label with K utterances H(j) = {ĥ_1^j, ĥ_2^j, ..., ĥ_K^j}, we calculate a similarity matrix M ∈ R^{N×K}:

    M_{n,k} = f(h_n, ĥ_k^j)    (14)

where f is a similarity function, chosen here as:

    f(x, y) = x^T W_s y    (15)

where W_s is a parameter to be learned. We then use the initial score of DA labels ŷ ∈ R^{N×S} from Formula (6) learned in the 1st-phase to re-weight the similarity matrix.
First, we transpose ŷ and select the j-th row of ŷ^T as the attention of each utterance under the j-th DA label. Next, we use the attention ŷ^j ∈ R^{N×1} to re-weight the similarity matrix M into M̃ ∈ R^{N×K}, and apply column-wise max-pooling to obtain the weight vector m ∈ R^K. Finally, we sum the user's historical answers by these weights to obtain the representation of label node e_j ∈ R^{2d_h}, which replaces the original Formula (7):

    M̃ = ŷ^T[j] ⊙ M,  m = ColumnMaxPooling(M̃),  e_j = Σ_{k=1}^{K} m_k ĥ_k^j    (16)

where [j] denotes selecting the j-th row.

Edge Re-weighting: Each utterance in a conversation is related to only a few DA labels, so irrelevant label nodes bring noise to the utterance. The main idea of this process is to re-weight the edges between the utterance nodes and the label nodes so that irrelevant label nodes receive lower weights. We use the prediction result p̂ ∈ R^{N×S} from Formula (5) learned in the 1st-phase to re-weight the label edge weights as:

    β_{i,j} = softmax_j(p̂_i)    (17)

which replaces the original Formula (9).
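A minimal sketch of the History Selection and Similarity Re-weighting steps on plain Python lists (cosine similarity stands in for the learned bilinear f, and all function names are hypothetical):

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_history(conv_rep, history_reps, K):
    """History Selection (sketch): keep the K historical answers most similar
    to the conversation representation, by cosine similarity."""
    scored = [(cosine(conv_rep, h), idx) for idx, h in enumerate(history_reps)]
    return [idx for _, idx in heapq.nlargest(K, scored)]

def reweight_label_node(M, y_col, history_reps):
    """Similarity Re-weighting (sketch): scale row n of the similarity matrix M
    by the utterance's initial score y_col[n] for this label, max-pool each
    column to get per-answer weights m, then take the weighted sum of answers
    as the new label node e_j (replacing the uniform average)."""
    N, K = len(M), len(M[0])
    m = [max(y_col[n] * M[n][k] for n in range(N)) for k in range(K)]
    dim = len(history_reps[0])
    return [sum(m[k] * history_reps[k][d] for k in range(K)) for d in range(dim)]
```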

Dialogue Act Recognition
The local utterance feature vector û_t (from the utterance encoder), the contextually encoded feature vector ĥ_t (from the utterance context encoder), and g_t (from the Heterogeneous User History graph) are concatenated and fed into a fully-connected network to obtain the final prediction:

    p_t = sigmoid(W_α [û_t, ĥ_t, g_t] + b_α)    (18)

where W_α is the same weight matrix used in the 1st-phase; we reload W_α from the 1st-phase for further training, and to keep the dimensions consistent we set g_t to zeros in the 1st-phase. Since our Dialogue Act Recognition task is a multi-label classification problem, we use a sigmoid activation and Binary Cross-Entropy (BCE) as the loss function.
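For reference, the per-utterance BCE objective over S independent label probabilities can be sketched as follows (a standard formulation; the clipping constant is an implementation detail, not from the paper):

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Binary Cross-Entropy for a multi-label prediction (sketch): each DA
    label is an independent binary decision, so BCE is averaged over labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # clip for numerical stability
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)
```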

Datasets and Evaluation Metrics
We evaluate the performance of our model on two benchmark datasets used in several prior studies on the DAR task. The MSDialog dataset is a labeled dialog dataset of question answering (QA) interactions between information seekers and providers from an online forum on Microsoft products. The dataset contains more than 2,000 multi-turn QA dialogs with 10,020 utterances, annotated at the utterance level with a subset of 12 user intent types. The ICSI Meeting Recorder Dialogue Act Corpus (MRDA) (Janin et al., 2003; Shriberg et al., 2004; Ang et al., 2005) contains 72 hours of naturally occurring multi-party meetings, transcribed into 75 word-level conversations. The original MRDA tag set has 11 general tags and 39 specific tags. Following previous work on multi-label classification (Qu et al., 2019), we adopt label-based accuracy (i.e., Hamming score) and micro-F1 score as our main evaluation metrics.
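Under one common definition of these multi-label metrics, the Hamming score averages the per-example overlap between predicted and gold label sets, while micro-F1 pools counts over all examples; a sketch (the exact variant used by prior work may differ):

```python
def hamming_score(y_true, y_pred):
    """Label-based accuracy for multi-label outputs: per example,
    |intersection| / |union| of predicted and gold label sets, averaged."""
    scores = []
    for t, p in zip(y_true, y_pred):
        t, p = set(t), set(p)
        scores.append(len(t & p) / len(t | p) if t | p else 1.0)
    return sum(scores) / len(scores)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives over all examples."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        t, p = set(t), set(p)
        tp += len(t & p)
        fp += len(p - t)
        fn += len(t - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```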

Implementation Details
In our experiments, we split the training/validation/testing sets following Yu et al. (2019) for MSDialog and Lee and Dernoncourt (2016) for MRDA. For both datasets, we first strip punctuation, then convert the text to lower case and tokenize it with NLTK. Pre-trained GloVe embeddings of 100 dimensions are adopted as word-level embeddings; out-of-vocabulary words are initialized by randomly sampling values from the standard normal distribution. The maximum utterance length is set to 800 in MSDialog and 80 in MRDA. All hyper-parameters are optimized on the validation set using accuracy. For the CNN, we use filters of sizes 3, 4, and 5 with 200 feature maps on each dataset. The hidden size of the LSTM and GCN is set to 400 in MSDialog and 200 in MRDA. We use the Adam optimizer with learning rate 1e-3. The hyperparameter K is set to 10 in MSDialog and 90 in MRDA. Since MRDA conversations are much longer than MSDialog ones (roughly 1,000 utterances vs. 10), we split the MRDA conversations into smaller parts containing at most 90 utterances.
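The MRDA splitting step can be sketched as simple fixed-size chunking (the exact split boundaries used in the experiments are an implementation detail and an assumption here):

```python
def split_conversation(utterances, max_len=90):
    """Split a long conversation into chunks of at most `max_len` utterances,
    as done for MRDA; order within and across chunks is preserved."""
    return [utterances[i:i + max_len]
            for i in range(0, len(utterances), max_len)]
```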

Baselines
For a comprehensive evaluation of our proposed model HUH, we compare it with the following baselines: 1) HEC (Kumar et al., 2018) builds a hierarchical BiLSTM-CRF model for DAR, which learns representations at multiple levels. 2) CNN-CR (Qu et al., 2019) designs a CNN model that incorporates context information with a window size of 3. 3) CASA (Raheja and Tetreault, 2019) proposes a context-aware self-attention mechanism coupled with a hierarchical recurrent neural network for DAR. 4) GA-Seq (Colombo et al., 2020) leverages a sequence-to-sequence approach to improve the modeling of tag sequentiality. 5) CRNN (Yu et al., 2019) is an adapted Convolutional Recurrent Neural Network that models the interactions between utterances in long-range context. 6) DialogueGCN (Ghosal et al., 2019) proposes a graph-based model that leverages self- and inter-speaker interactions of the interlocutors to model conversational context for emotion recognition. 7) BERT (Devlin et al., 2019) is a pre-trained language model that has been applied to many NLU applications; we encode each utterance with BERT followed by a feedforward network for classification.

Experimental Results
2) The major difference between our proposed model and the strong baseline DialogueGCN lies in three aspects. First, we integrate the user's historical answers into our model rather than focusing only on the utterances in the conversation. Second, we divide the recognition process into two organically combined phases and use the initial score of DA labels to guide the Heterogeneous User History graph-based prediction. Third, we employ a set of denoising mechanisms to filter out irrelevant content from the user's historical answers. These three modifications improve the performance significantly. 3) Compared to the large pre-trained model BERT with 110M parameters, our model achieves comparable performance with a much smaller size (20M parameters). Moreover, by replacing the utterance encoder in our model with BERT and fine-tuning on the target datasets, a further improvement is gained and our model achieves state-of-the-art performance.

Ablation Study
To analyze the effectiveness of the different components of HUH, we report an ablation test in terms of: 1) w/o graph: We use the initial score of DA labels from the 1st-phase as the final result. 2) w/o user's history: We discard the label nodes in our proposed Heterogeneous User History graph. 3) w/o denoise: We replace the History Selection process with random selection and discard the Similarity Re-weighting and Edge Re-weighting processes. The results are listed in Table 1. Comparing our proposed HUH with w/o user's history, we conclude that adding the user's historical answers improves the performance of the DAR task. However, historical answers containing irrelevant information can sometimes play a negative role, which makes w/o denoise perform worse than HUH. Our model outperforms w/o graph by a large margin, demonstrating the contribution of the graph structure and user history in our model. Table 2 shows the F1 score of HUH, DialogueGCN, and CRNN for the DA labels on the MSDialog dataset.

Quantitative Analysis
We can observe that our model achieves the best performance on all DA labels. Besides, the three models achieve satisfying performance on OQ, PA, and GG, but not on RQ, NF, and JK. We attribute the poor performance on the latter mainly to the vagueness of their sentence patterns. We observe that HUH and DialogueGCN perform much better than CRNN on the Repeat Question (RQ) label, mainly because our model and DialogueGCN use a graph-structured model to capture the interaction between users, and the RQ label requires considering long-range context from different users in the conversation. Furthermore, our model outperforms the other methods on NF by a noticeable margin. A possible reason is that negative feedback is a non-obvious intention, which is difficult for a model to recognize from scratch. By introducing the user's historical answers as clues, our model can more easily extract the hidden features of the utterance and improve performance on the less evident labels.
To further analyze the performance of the different models, some examples from MSDialog are shown in Table 3. We find that a model needs to detect the semantic relevance between "webcam work with skype" and "access camera for skype" to recognize the Repeat Question (RQ) label in U3, while previous methods that model conversations sequentially (such as CRNN) lack the ability to capture long-range interactions between utterances. In addition, we observe that the previous methods fail to recognize the Negative Feedback (NF) contained in U5. This might be because the negative feedback expressed by "spend years and nothing changes" is not obvious. Compared with DialogueGCN and CRNN, our method recognizes the NF label accurately.

Conclusion
In this paper, we focus on the task of multiple DA recognition in multi-party conversations. We propose a Heterogeneous User History (HUH) graph convolution network that jointly models utterances, the conversation, and the user's historical answers. To handle the noise caused by introducing the user's historical answers, we design a set of denoising mechanisms, including a History Selection process, a Similarity Re-weighting process, and an Edge Re-weighting process. We evaluate the proposed method on two benchmark datasets, MSDialog and MRDA. The experimental results verify the effectiveness of integrating the user's historical answers and show that our proposed model outperforms the state-of-the-art methods.