Hierarchy Response Learning for Neural Conversation Generation

Neural encoder-decoder models have shown great promise in conversation generation. However, they cannot perceive and express intention effectively, and hence often generate dull and generic responses. Unlike past work that diversifies the output at the word or discourse level with a flat model, we propose a hierarchical generation model that captures different levels of diversity using conditional variational autoencoders. Specifically, a hierarchical response generation (HRG) framework is proposed to capture the conversation intention in a natural and coherent way. It has two modules: an expression reconstruction model that captures the hierarchical correlation between expression and intention, and an expression attention model that effectively combines expressions with content. Finally, the training procedure of HRG is improved by introducing a reconstruction loss. Experimental results show that our model generates responses with more appropriate content and expression.


Introduction
Neural conversation generation (Xu et al., 2017), focusing on responding to humans intelligently on a variety of topics, has drawn great attention from both academia and industry. The sequence-to-sequence model (Seq2Seq) (Sutskever et al., 2014) is a neural generation model that maximizes the probability of generating a response given the dialogue context. It enables the incorporation of rich context to generate coherent responses in an unsupervised manner. However, Seq2Seq models were found to suffer from the so-called safe response problem (Xu et al., 2017), i.e., they tend to generate dull, generic, and repetitive responses (e.g., "I think so", "I don't know", etc.) rather than meaningful and conscious expressions. Xu et al. (2017) ascribed this to the fundamental nature of statistical models, since the distributions of most pieces of information are relatively sparse compared to the safe response patterns in open domain conversations. Some works attempted to improve the architecture of Seq2Seq models, for example by introducing reinforcement learning to encourage responses with long-term payoff. Another important reason is that the response generation model cannot express intention and emotion internally. Thus, one line of research has focused on forcing the model to simulate some human skills by augmenting the input with rich meta information. For example, some recent works biased the responses toward specific personas (Li et al., 2016b) or emotions (Huber et al., 2018).
Usually, in human conversation, a speaker participates in the dialogue through the following steps. The speaker first decides what the intention is, reflecting inner feelings or opinions. In the speaker's knowledge base, there may be a variety of appropriate expressions that represent the current intention. A meaningful response is then produced by choosing one of the expressions and filling it with relevant content. For example, if a man wants to ask the way to the park, he first selects an appropriate expression from a cluster of expressions, e.g., "Where is the . . . ?", "How do I get to the . . . ?", and "Is the . . . far from here?", and then replaces ". . . " with the destination, i.e., the park. As a crucial feature of natural conversation, dialog acts (Poesio and Traum, 1998) have been widely used in dialogue managers to represent intentions. Existing works introduce dialog acts to label a cluster of responses, and a latent variable is learned to select a dialog act for response generation (Zhao et al., 2017; Serban et al., 2017a). However, this does not effectively capture output diversity, since the natural correlation between expression patterns and dialog acts is not learned. Intuitively, another latent level can be introduced to generate different expressions from the same dialog act, and a hierarchical structure can be used to model the response generation process. That is, a knowledge base is first constructed over pairs of expressions and dialog acts to capture the latent correlation between them. Then, a variety of expressions can be selected from the knowledge base and filled with response content based on the latent correlation.
Learning such a hierarchical model is quite challenging in large-scale conversation generation for the following reasons. First, the semantic world is populated with a vast number of expressions, each of which corresponds to a specified label reflecting a kind of dialog act; obtaining high-quality expression-act data is impractical, particularly in open domain conversations. Second, it is difficult to incorporate expression and content into the generation model in a natural and coherent way because they have different semantic representation patterns. Last, this process cannot be efficiently optimized using stochastic gradient descent (SGD) akin to backpropagation on feedforward neural networks.
To tackle these challenges, we propose to take advantage of the hierarchical nature of response generation. In particular, we investigate: (1) how to automatically learn a hierarchical model that naturally captures the response generation process; (2) how to adaptively learn and adjust the influence ratio between expression and content. Our solutions to these questions result in a new architecture for neural response generation. In particular, a novel hierarchical response generation (HRG) framework is proposed to effectively capture the process of response generation. An expression reconstruction model with a two-level probability structure is introduced to randomly generate expressions, and an expression attention model is proposed to effectively fill the expressions with content. Finally, an efficient training method is proposed to learn the model within the framework of conditional variational autoencoders (CVAE) (Doersch, 2016). The main contributions are outlined as follows:
• We propose to investigate the problem of generating varied and meaningful responses by imitating the human response process with a hierarchical response generation model.
• We propose an end-to-end framework to incorporate the expression and response content into dialog generation. Our model is interpretable and even controllable compared to traditional generation models.
• We empirically demonstrate that our approach generates responses with better expressions and content than traditional generation models.

Problem Statement
Our problem is formulated as follows: given a dialog context C and a dialog act a of the response to be generated, the goal is to generate a response y = (y_1, y_2, ..., y_n) that is coherent with the dialog act a. Essentially, the model estimates the probability P(y|C, a) = ∏_t P(y_t | y_{<t}, C, a). A simple implementation is to directly embed the act information into the Seq2Seq model. However, as shown in our experiments, it still suffers from the safe response problem.
In this paper, we propose a novel hierarchical model to imitate the human thinking process in conversation generation. The hierarchical generation process is: (1) for each dialog act a, a set Ω(a) is constructed to contain all the corresponding expressions of a; (2) to generate an expression, a dialog act a is first selected, and then an appropriate expression e is selected from Ω(a); (3) a response is obtained by filling the expression e with relevant content according to the dialog context C. This hierarchical model allows us to express the responses with diverse expression templates of the same dialog act by drawing different samples from Ω(a). However, in addition to the difficulty of constructing a high-quality set Ω(a), we also need to maximize the probability of each y in the training set with the objective P(y|C, a) = ∫_{e∈Ω(a)} P(y|C, e) de, which is difficult to compute by numerical methods.
In our approach, the expression e is modeled as a conditional distribution over the dialog act a, i.e., p_θ(e|a). The response is then generated by feeding the expression e obtained from p_θ(e|a) into the model, i.e., P(y|C, e), e ∼ p_θ(e|a). Now, the training objective is simplified as follows:

P(y|C, a) = ∫ P(y|C, e) p_θ(e|a) de.  (1)

This objective can be transformed into the variational lower bound of the CVAE (Doersch, 2016), and thus can be optimized efficiently. Specifically, the variational approximation q_φ(e|y) is constructed to approximate the intractable posterior. Assuming that the meaning of expression e is independent of C, we train the model by maximizing the variational lower bound

L(θ, φ; y, C, a) = −KL(q_φ(e|y) || p_θ(e|a)) + E_{e∼q_φ(e|y)}[log p_θ(y|C, e)],  (2)

where KL(·||·) is the Kullback-Leibler divergence measuring the distance between two distributions.
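For concreteness, when the two distributions in the KL term are diagonal Gaussians (as they are for the continuous latent variable later on), the divergence has a standard closed form; the sketch below is ours, with illustrative names and toy dimensions.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

# KL of a distribution against itself is zero
print(kl_diag_gaussians([0.3, -0.1], [0.0, 0.0], [0.3, -0.1], [0.0, 0.0]))  # → 0.0
# KL(N(1, 1) || N(0, 1)) = 0.5
print(kl_diag_gaussians([1.0], [0.0], [0.0], [0.0]))  # → 0.5
```

In practice this term is computed per batch element and added to the reconstruction loss with the opposite sign, since the lower bound is maximized.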
Note that in our problem statement, we assume that the dialog act of the to-be-generated response is given in advance rather than predicted from the context. Many existing studies (Sacks et al., 1978; Young et al., 2013; Daniel Jurafsky, 2017) have explored dialog act interactions in dialog systems and proposed methods to decide the most appropriate dialog act for the response. In this paper, we focus only on response generation; during testing, we simply specify a dialog act to the model. We leave act prediction to future work.

Hierarchical Response Generation
Building upon encoder-decoder models (e.g., Seq2Seq, HRED (Serban et al., 2016)), a Hierarchical Response Generation (HRG) framework is proposed to generate more diverse expressions for conversation generation. As shown in Figure 1, HRG contains two main modules: the Expression Reconstruction model and the Expression Attention model. A training method for the hierarchical HRG model is described in Section 3.3.

Expression Reconstruction
To maximize the objective Eq. (2), we first need to model the networks q_φ(e|y) and p_θ(e|a). The task of the network q_φ(e|y) is to capture expression representations from the responses, while p_θ(e|a) samples an expression representation from the distribution associated with the specified dialog act. As shown in Figure 2, in the CVAE framework the response is first encoded as a latent variable, and a decoder is then introduced to reconstruct the response. But this generation process is neither interpretable nor controllable. Thus, we propose a novel method to reconstruct the expression with multiple latent variables by establishing links between q_φ(e|y) and p_θ(e|a).
Modeling q_φ(e|y). It has been found that convolutional layers can extract common patterns within local regions of the input utterance (Kim, 2014). Therefore, the text convolutional network proposed in (Kim, 2014) is leveraged to mine the relationship between expression patterns and responses. In particular, the network q_φ(e|y) consists of several convolutional filtering, local contrast normalization, and max-pooling layers, followed by several fully connected linear layers. Formally, given a response y, the expression representation is computed as e = CNN(y). In the experiments, we found that the convolutional layers can effectively extract the expression representation while discarding content-related information.
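A minimal sketch of the conv-and-max-pool pattern behind q_φ(e|y), in pure Python with toy sizes: each filter slides over windows of word vectors and max-over-time pooling keeps one feature per filter, as in Kim (2014). The sizes and random inputs are illustrative only; the real model uses learned filters and multiple layers.

```python
import math
import random

random.seed(0)

EMB, WIN, FILTERS = 4, 2, 3  # toy sizes, not the paper's configuration

def conv_max_pool(word_vecs, filters):
    """One convolutional layer with window size WIN followed by
    max-over-time pooling: each filter yields a single feature."""
    feats = []
    for f in filters:
        scores = []
        for i in range(len(word_vecs) - WIN + 1):
            # flatten the window of WIN word vectors
            window = [v for w in word_vecs[i:i + WIN] for v in w]
            scores.append(math.tanh(sum(a * b for a, b in zip(window, f))))
        feats.append(max(scores))  # max-over-time pooling
    return feats  # the discourse-level "expression" vector e

filters = [[random.gauss(0, 0.1) for _ in range(EMB * WIN)] for _ in range(FILTERS)]
sentence = [[random.gauss(0, 1) for _ in range(EMB)] for _ in range(5)]  # 5 "words"
e = conv_max_pool(sentence, filters)
print(len(e))  # → 3 (one feature per filter)
```

The max-pooling step is what lets the representation ignore where in the utterance a pattern occurs, which matches the intuition that expression templates are position-independent.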
Modeling p_θ(e|a). Given a dialog act a, the network p_θ(e|a) outputs an expression representation associated with a. To capture output diversity, it must be able to generate different expressions given the same dialog act. Classical models (e.g., a linear layer) cannot represent this feature. However, the inverse network p_θ(a|e) is easy to model with a linear layer, a = g(e) = W_f e + b_f. Motivated by this observation, we propose an Inverse Linear Layer to model the network p_θ(e|a), where the dialog act a is mapped inversely into the expression representation e by solving g^{−1}.
The Penrose Theorem (Ben-Israel and Greville, 1976) gives and proves a general solution of the equation AX = B. To solve g^{−1}, we simplify the Penrose Theorem under an extra constraint and obtain the following corollary:

Corollary 1. If AA^+ = I, then x = A^+ b + (I − A^+A) z is a solution of Ax = b for any vector z, where A^+ is the Moore-Penrose inverse of A and I is an identity matrix.
The transformation of Ax inevitably loses information about x, and thus a random vector z is supplemented in the solution. According to Corollary 1, the expression e can be represented by the inverse linear layer as

e = W_f^+ (a − b_f) + (I − W_f^+ W_f) z.  (6)

The expression is uniquely determined by two independent variables, the dialog act a and the latent variable z. Given the same or similar dialog context, there may exist many valid expressions for responses with the same dialog act a, each corresponding to a certain configuration of z. This representation allows us to express responses with diverse expression templates of the specified dialog act by drawing samples from the learned distribution of z. The network p_θ(e|a) can then be easily computed by setting p_θ(a, z) = p_θ(a) p_θ(z), since a and z are uncorrelated.
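As an illustration, the inverse linear layer can be checked numerically. The pure-Python sketch below uses a toy 2x3 matrix A as a stand-in for W_f (bias dropped) and assumes A has full row rank, so A^+ = A^T (A A^T)^{-1} and AA^+ = I; it verifies that every choice of the free vector z yields a preimage e that maps back to the same act vector a.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def inv2(M):  # inverse of a 2x2 matrix
    (a11, a12), (a21, a22) = M
    det = a11 * a22 - a12 * a21
    return [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]

# Toy act-projection matrix A (2 acts x 3 expression dims), full row rank
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
A_pinv = matmul(transpose(A), inv2(matmul(A, transpose(A))))  # A^+ = A^T (A A^T)^{-1}

a = [1.0, 2.0]        # target dialog-act vector
z = [0.5, -0.3, 0.2]  # free latent vector (any values work)

# e = A^+ a + (I - A^+ A) z : the z term lives in the null space of A
P = matmul(A_pinv, A)  # projection A^+ A
e = [sum(A_pinv[i][k] * a[k] for k in range(2))
     + sum(((1.0 if i == j else 0.0) - P[i][j]) * z[j] for j in range(3))
     for i in range(3)]

recon = [sum(A[i][j] * e[j] for j in range(3)) for i in range(2)]
print([round(v, 9) for v in recon])  # → [1.0, 2.0]: A e recovers a exactly
```

Varying z moves e within the null space of A, which is exactly the mechanism by which one dialog act maps to many expression representations.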
Reconstructing Expression. As shown in Figure 1, during training the expression representation e is first captured from the ground-truth response by the network q_φ(e|y). The expression e is decomposed into multiple independent variables, i.e., a and z, and these variables are then composed to reconstruct the expression e via the network p_θ(e|a). These variables provide different discourse-level information, forcing the decoder to focus on multiple pieces of global information simultaneously. As shown in Figure 2, different from the CVAE, some variables in HRG (i.e., a) are interpretable, making the model controllable, while others (i.e., z) are continuous, reflecting latent features. During testing, the model is controlled by specifying an appropriate dialog act to express the intention.

Expression Attention
There are two main ways to incorporate the expression representation into the decoder. First, the concatenation of the context and the expression representation e is used to initialize the recurrent state of the decoder RNN with a nonlinear transformation; during decoding, the decoder RNN then decodes words based only on the current state and the previous word embedding w. Second, the concatenation of the fixed e and w is fed into the decoder to update its state at each step. Formally, the first way updates the state according to [w, 0 · e], while the second updates it according to [w, 1 · e]. However, with these methods the content (or the expression) may become so dominant that the responses carry no effective expression (or no meaningful content).

In this paper, we propose an expression attention (EA) model that attends to different parts of the expression representation at each step by adaptively learning a vector α = {α_i ∈ (0, 1)}. A balance between content and expression is kept by feeding [w, α · e] into the decoder. In particular, before decoding, an expression state is initialized as q_0 = e to record the current expression representation. At step t, a strength gate β_t is computed from the previously decoded word y_{t−1} and the previous decoder state s_{t−1}:

β_t = Sigmoid(W_s [y_{t−1}; s_{t−1}]),  f_t = β_t ⊗ q_{t−1},

where ⊗ is element-wise multiplication and Sigmoid(x) = 1/(1 + exp(−x)); the expression state is thus weakened by the amount β_t at each step. The decoder updates its state conditioned on the previous token y_{t−1} and the current output f_t:

s_t = f_dec(s_{t−1}, [y_{t−1}; f_t]).

This is a dynamic process in which the expression is adjusted adaptively according to the current environment and model behavior, in contrast to the two existing methods above. After step t, a self-update strategy updates the expression state based on the context vector c_t (computed by the attention mechanism (Luong et al., 2015)) and the current decoder state s_t.
This process is formulated as

q_t = (q_{t−1} − f_t) ⊗ Sigmoid(W_u [c_t; s_t]).

The expression representation is integrated into the decoder gradually until the expression state decays to zero over multiple iterations. The expression reconstruction and attention models provide discourse-level and token-level randomness, respectively, which prevents the decoder from generating the next token as a mere neural probabilistic language model, ignoring the dialog context and the current decoding state.
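A toy sketch of the gating dynamics described above, in pure Python: the gate inputs are random stand-ins for the projections of [y_{t−1}; s_{t−1}] and [c_t; s_t], and the update rule is our reading of the description. The point it illustrates is that, because both gates lie in (0, 1), the expression state shrinks toward zero as decoding proceeds.

```python
import math
import random

random.seed(1)
D = 4  # toy expression-state size

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def ea_step(q, gate_logits, update_logits):
    """One EA step (toy): gate part of the expression state into the
    decoder input, then decay the remaining state."""
    beta = sigmoid(gate_logits)             # strength gate from y_{t-1}, s_{t-1}
    f = [b * x for b, x in zip(beta, q)]    # gated expression fed to the decoder
    u = sigmoid(update_logits)              # self-update gate from c_t, s_t
    q_next = [(x - fx) * ux for x, fx, ux in zip(q, f, u)]
    return f, q_next

q = [1.0] * D
for t in range(50):  # the state shrinks at every step
    _, q = ea_step(q,
                   [random.gauss(0, 1) for _ in range(D)],
                   [random.gauss(0, 1) for _ in range(D)])
print(max(abs(x) for x in q) < 1e-3)  # → True: the state has decayed toward zero
```

Each coordinate is multiplied per step by (1 − β)·u with both factors in (0, 1), so decay is guaranteed regardless of the gate inputs; only its speed is learned.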

Reconstruction Training
The learned parameters include the vocabulary embeddings and the parameters of the encoder-decoder component and HRG. Following Section 3.1, we first identify two key assumptions: both z and a are indigenous properties of the expression e, and the meaning of z is independent of the dialog act a. Based on them, we update the objective Eq. (2) as

L(θ, φ; y, C, a) = −KL(q_φ(z|y) || p_θ(z)) − KL(q_φ(a|y) || p_θ(a)) + E_{z∼q_φ(z|y), a∼q_φ(a|y)}[log p_θ(y|C, a, z)].  (12)
Now, the term KL(q_φ(z|y) || p_θ(z)) is a KL-divergence between two multivariate Gaussian distributions, which can be computed in closed form (Doersch, 2016). Unlike z, the dialog act a follows a discrete distribution. Minimizing the term KL(q_φ(a|y) || p_θ(a)) is much simpler than in the continuous case; it can be evaluated by

KL(q_φ(a|y) || p_θ(a)) = Σ_{i=1}^{A} q_φ(a_i|y) log [ q_φ(a_i|y) / (p_θ(a_i) + ε) ],

where A is the number of dialog acts and ε = 10^{−6} is used to prevent division by zero. We denote the network q_φ(a|y) as the act classifier; its probability is evaluated by softmax(W_f CNN(y) + b_f). As shown in Figure 1, note that the inverse linear layer shares the parameters W_f and b_f with q_φ(a|y).
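The discrete KL term can be computed directly; the sketch below uses illustrative distributions over four acts (matching the number of act categories in the experiments) and the same ε-smoothing as the formula above.

```python
import math

EPS = 1e-6  # prevents division by zero, as in the paper

def kl_categorical(q, p, eps=EPS):
    """KL(q || p) over A discrete dialog acts, with eps-smoothed p."""
    return sum(qi * math.log(qi / (pi + eps)) for qi, pi in zip(q, p) if qi > 0)

q_act = [0.7, 0.1, 0.1, 0.1]      # illustrative act classifier output q_phi(a|y)
p_act = [0.25, 0.25, 0.25, 0.25]  # illustrative prior p_theta(a)

print(kl_categorical(q_act, p_act))  # positive: distributions differ
print(kl_categorical(q_act, q_act))  # near zero: identical distributions
```

Because p_θ(a) is itself learned, the ε term matters early in training, when the prior can assign near-zero mass to some acts.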
In the training process, by introducing the reparameterization trick (Kingma and Welling, 2014), we obtain the variables z and a from the RecNet q_φ(z|y) and the act classifier q_φ(a|y), and then feed them into the inverse linear layer to capture expression representations. During testing, by specifying a dialog act a, the decoder generates the response according to a sample from the PriNet p_θ(z).

Implementation Details
The DailyDialog Corpus (Li et al., 2017b) is used to evaluate the proposed model. It contains 13,118 multi-turn human-human dialogs annotated with dialog acts and emotions, covering 10 main topics of daily life. In this corpus, the dialog act categories are {Inform, Question, Directive, Commissive}. In our experiments, HRG is combined with the HRED model as the expression-aware chatting machine (ECM). PyTorch 1 is used to implement the proposed model. All RNN modules are 2-layer gated recurrent units (GRU) (Cho et al., 2014) with 500 hidden cells per layer and separate parameters. Word embeddings have size 300 and are initialized from GloVe embeddings 2 . The size of the latent variable z is set to 300. The maximum dialog turn is 5 (10 utterances).
The models are trained end-to-end using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 30, a learning rate of 0.001, and gradient clipping at 50. To overcome the latent variable vanishing problem in CVAE, we use the heuristic method of Bowman et al. (2016) to encode meaningful information in z. That is, we multiply the first KL term in Eq. (12) by a scalar that starts at 0 and linearly increases to 1 over the first 10,000 batches.
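The annealing schedule amounts to a single clipped linear function of the batch index; a minimal sketch (the function name is ours):

```python
def kl_weight(batch_idx, warmup=10000):
    """Linear KL annealing: weight goes 0 -> 1 over the first `warmup` batches,
    then stays at 1."""
    return min(1.0, batch_idx / warmup)

print(kl_weight(0), kl_weight(5000), kl_weight(20000))  # → 0.0 0.5 1.0
```

During training the KL term of the loss is simply multiplied by kl_weight(step), letting the decoder learn to reconstruct before the latent variable is regularized.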

Baselines
We compare our hierarchical model with two popular baselines: (1) HAE, a HRED-based model that we design to embed the dialog act information in the decoder; (2) kgCVAE, a knowledge-guided model that introduces dialog acts to guide the learning of the CVAE (Zhao et al., 2017). A variant of the proposed model is implemented to verify the effectiveness of the expression attention (EA) model; we denote the model without EA as w/o EA. The hyperparameters of the baselines and variants are the same as those of ECM.
For HAE, we initialize the embedding of dialog acts using three different methods: (1) RD (random): initializing the embedding randomly; (2) LG (logic-related): training a Skip-Gram model (Mikolov et al., 2013) to maximize the co-occurrence probability among acts that appear within a window w in the sequence of dialog acts for each dialog (w is set to 1); (3) CT (content-related): training an act classifier q_φ(a|y) on the pairs of utterances and dialog acts in the training set, and using each row of W_f as an embedding. The size of the act embeddings is set to 300, the same as the output of EA. The concatenation of the dialog act embedding and the previous word embedding is fed into the decoder of HRED to update its state at each decoding step. HAE is trained to minimize the standard cross-entropy loss of the decoder RNN without any auxiliary loss.

Quantitative Analysis
Automatically evaluating the quality of a dialog model remains an open question. To evaluate how semantically relevant the responses are, we report results for three word embedding-based similarity metrics proposed by Liu et al. (2016): Greedy Matching (GDY), Embedding Average (AVG), and Vector Extrema (EXT). To evaluate whether a response follows the dialog act, we adopt act accuracy (ACC), the agreement between the ground-truth dialog act and the dialog act predicted by an act classifier. We trained the act classifier; its precision and average recall on the test set are 83.4% and 74.3%, respectively.
In addition to the automatic evaluation metrics, a manual evaluation metric (MUL) is used to evaluate both the response content and expression, where three workers are employed to score each response in terms of Content (rating scale 0, 1, 2) and Act (rating scale 0, 1). Content is evaluated based on whether the response is appropriate and natural for the dialog context, while Act is based on whether the expression agrees with the ground-truth act. The Content rating is a widely accepted metric proposed by Shang et al. (2015), and the workers can easily assign the Act rating given the context, since the number of acts in our experiment is small.
During testing, to efficiently measure output diversity, we generate N responses from the HAE models using beam search. For kgCVAE and ECM, we sample N times from the latent variable and use only greedy decoding. For the HAEs and ECM, we specify a as the act of the ground-truth response.
The automatic evaluation metrics compare the generated responses r_{ij} with the ground truth g_i of the conversation. We compute the score of a model over all M test samples as

score = (1/M) Σ_{i=1}^{M} max_{j∈[1,N]} d(r_{ij}, g_i),  (15)

where d(·) is one of the automatic metrics described above and N is empirically set to 10. Note that the maximum in Eq. (15) is more appropriate for measuring output diversity than the average: averaging may let safe responses score higher than meaningful and diverse responses when most of the valid responses are unrelated to the ground truth, whereas taking the maximum greatly reduces this error as the number of samples increases.

As the results show, ECM generates responses more relevant to the topic than the HAE models and kgCVAE. ECM also obtains a higher act accuracy score than HAE, since the second KL term of Eq. (12) forces the predicted act distribution to approximate the ground truth. ECM without EA (w/o EA) achieves the best act accuracy but poor performance on the embedding-based similarity metrics, which indicates that EA is an efficient model for dynamically balancing expression and content.
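The per-case max scoring of Eq. (15) can be sketched as follows; the overlap function is a toy stand-in for the embedding metrics (GDY/AVG/EXT), and the sample responses are illustrative.

```python
def corpus_score(distance, hyps, refs):
    """Eq. (15) style scoring: for each test case take the best of the N
    sampled responses, then average over the M test cases."""
    assert len(hyps) == len(refs)
    return sum(max(distance(h, ref) for h in samples)
               for samples, ref in zip(hyps, refs)) / len(refs)

def overlap(h, r):
    """Toy similarity: fraction of reference words covered by the hypothesis."""
    hs, rs = set(h.split()), set(r.split())
    return len(hs & rs) / max(len(rs), 1)

# M = 1 test case with N = 2 sampled responses: a safe response and a topical one
hyps = [["i think so", "the park is near the station"]]
refs = ["the park is near here"]
print(round(corpus_score(overlap, hyps, refs), 2))  # → 0.8
```

The max rewards the single best sample, so a model that mixes safe responses with occasional topical ones is not penalized the way it would be under an average.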
On the other hand, HAE-CT scores higher than the other HAE models both on the embedding-based metrics and on accuracy, which suggests that the act classifier preserves act-related information effectively. Note that the act accuracy of kgCVAE is not given because its response act is an internal parameter predicted from the dialog context rather than an input at test time. Compared to ECM, kgCVAE may give the decoder a wrong direction by approximating ground-truth responses with different dialog acts.

Table 2: Manual evaluation results. The percentage of responses with each Content-Act rating pair; for instance, 2-1 means the Content rating is 2 and the Act rating is 1.

Table 2 shows the manual evaluation results, where content and expression are considered simultaneously. Responses generated by w/o EA tend to contain obvious act information but little content, while HAE generates responses with lower Content-Act scores. Compared to the other methods, the responses generated by ECM keep a good balance between Content and Act. In our experiments, we also find that HAE-CT still faces serious safe response problems, whereas EA provides token-level randomness that keeps the decoder from generating the next token as a mere neural probabilistic language model.

Discussion. ECM clearly outperforms w/o EA not only in the automatic evaluation but also in the manual evaluation. Although ECM has more trainable parameters than w/o EA, the improvement is mainly due to the architecture of EA: the EA model involves only two learned matrices, i.e., W_s and W_u, described in Section 3.2. Compared to the parameters of the multi-layer HRED, the expression reconstruction model, and the vocabulary embeddings, the number of parameters in EA is negligible. Therefore, we conclude that EA plays a key role in improving output diversity with few parameters thanks to its efficient architecture.
Table 3 shows responses generated by kgCVAE and ECM. In Example 1, speaker "A" begins with an open domain demand (directive). ECM generated highly diverse answers covering the multiple dialog acts that were fed into the model during decoding. Further, we notice that the generated response with the inform act (i.e., sample 1) has an expression similar to the ground-truth one, implying that the latent z is able to capture expression-sensitive variations. This verifies the effectiveness of the hierarchical generation process: ECM obtains an effective expression representation and fills it with appropriate content from the dialog context. Example 2 is a situation where the waiter "A" tells the customer "B" that the order has been completed. ECM takes the directive act as input and generates multiple responses that give "A" some suggestions (or commands). All the responses reflect similar behavior with different expression styles. In contrast, kgCVAE is capable of generating some diverse responses, but it cannot accurately understand the intention of "A", so its responses lack coherence. The human-human dialogues in the dataset follow dialog flow patterns such as Question-Inform and Directive-Commissive (Li et al., 2017b). kgCVAE predicts the dialog act correctly in Example 1 but wrongly in Example 2, since the pattern Inform-Directive is not common.

Qualitative Analysis
In our work, a CNN module is leveraged to filter out the content-related information of utterances and obtain a discourse-level representation, i.e., an expression vector in which meaningful expression information is preserved. CNN models have been shown to be efficient for NLP and have achieved excellent results in sentence modeling and classification. We therefore conjecture that the expression vectors are highly correlated with the dialog acts, each reflecting a concrete expression representation of the specified dialog act. Figure 3 visualizes the expression vectors of the test dataset in 2D space using t-SNE (van der Maaten and Hinton, 2008). We find that the expression vectors are clustered into meaningful groups associated with the dialog acts, which confirms that the CNN is an effective tool for extracting expression information.

Related Works
The vanilla Seq2Seq model usually ends up with generic and dull responses. To tackle this problem, one line of research has focused on forcing the model to imitate some human skills by augmenting the input with rich meta information. For example, some works separately gave chatbots the abilities of emotion (Zhou et al., 2018), persona (Li et al., 2016b), vision (Huber et al., 2018; Wu et al., 2018), and reasoning over knowledge bases (Zhu et al., 2017). In this work, we consider open domain dialogue generation with dialog acts; however, only a few works (Zhao et al., 2017; Serban et al., 2017a) on open domain end-to-end modeling take dialog acts into account.
On the other hand, many attempts have been made to improve the architecture of Seq2Seq models by changing the training method. Li et al. (2016a) attributed safe response problems to the use of the MLE objective. Other works separately attempted to replace the MLE method with maximum mutual information (Li et al., 2016a), reinforcement learning (Li et al., 2016c), and adversarial learning (Xu et al., 2017; Li et al., 2017a). Serban et al. (2017b) viewed the dialog context as prior knowledge and combined the HRED model with the CVAE framework. Zhao et al. (2017) further introduced dialog acts to guide the learning of the CVAE. In our paper, we use the CVAE to learn the hierarchical generation model.

Conclusion and Future Work
In this paper, we investigate the problem of generating meaningful responses by imitating the hierarchical process of human response. Specifically, a hierarchical response generation model is proposed to generate expressions and fill them with appropriate content naturally and coherently. The experimental results show that an HRED model equipped with HRG can generate responses that are appropriate not only in content but also in expression. Unlike existing works, our model is interpretable and controllable.
In future work, we will explore act interactions within HRG: instead of specifying a dialog act manually, the most appropriate one can be decided automatically.