Modeling Semantic Relationship in Multi-turn Conversations with Hierarchical Latent Variables

Multi-turn conversations have complex semantic structures, and generating coherent and diverse responses given previous utterances remains a challenge. In practice, a conversation takes place against a shared background, while the query and the response are usually the most closely related utterances: they are consistent in topic but differ in content. However, little work focuses on this hierarchical relationship among utterances. To address this problem, we propose a Conversational Semantic Relationship RNN (CSRR) model that constructs the dependency explicitly. The model contains latent variables at three levels of a hierarchy: the discourse-level one captures the global background, the pair-level one represents the topic information shared by the query and the response, and the utterance-level ones represent differences in content. Experimental results show that our model significantly improves the quality of responses in terms of fluency, coherence, and diversity compared to baseline methods.


Introduction
Inspired by the observation that real-world human conversations are usually multi-turn, some studies have focused on multi-turn conversations and taken context (history utterances in previous turns) into account for response generation. Modeling the relationship between the response and the context is essential for generating coherent and logical conversations. Current approaches employ hierarchical architectures to model this relationship. Some use a context RNN to integrate historical information, Tian et al. (2017) sum up all utterances weighted by the similarity score between an utterance and the query, and Zhang et al. (2018) apply an attention mechanism over history utterances. In addition, Xing et al. (2018) add word-level attention to capture fine-grained features.
In practice, we usually need to understand the meaning of utterances and capture their semantic dependency, not just word-level alignments (Luo et al., 2018). As shown in Table 1, this short conversation is about speaker A asking about the current situation of speaker B. At the beginning, they talk about B's position. Then, in the last two utterances, both speakers think about how B can come back. A mentions "umbrella", while B wants A to "pick him/her up". Moreover, there is no "word-to-word" matching between the query and the response. Unfortunately, the aforementioned hierarchical architectures do not model the meaning of each utterance explicitly and have to summarize the meaning of utterances on the fly while generating the response, so there is no guarantee that the inferred meaning is faithful to the original utterance. To address this problem, variational autoencoders (VAEs) (Kingma and Welling, 2014) are introduced to learn the meaning of utterances explicitly, and a reconstruction loss is employed to make sure the learned meaning is faithful to the corresponding utterance. In addition, more variation is introduced at the utterance level to help generate more diverse responses. However, all these frameworks ignore the practical situation that a conversation usually takes place against a background with two speakers communicating interactively, and that the query is the utterance most relevant to the response. Hence we need to pay more attention to the relationship between the query and the response. To generate a coherent and engaging conversation, the query and the response should be consistent in topic and differ somewhat in content; the logical connection between them ensures that the conversation can go on smoothly.
On these grounds, we propose a novel Conversational Semantic Relationship RNN (CSRR) to explicitly learn the semantic dependency in multi-turn conversations. CSRR employs hierarchical latent variables based on VAEs to represent the meaning of utterances while learning the relationship between the query and the response. Specifically, CSRR captures the background of the conversation with a discourse-level latent variable, models the semantics shared by the query and the response (e.g., the topic) with a common latent variable for the query-response pair, and finally models the specific meaning of the query and of the response with a separate latent variable for each, capturing the content difference. With these latent variables, we can learn the relationship between utterances hierarchically, especially the logical connection between the query and the response. Most importantly, the latent variables are constrained to reconstruct the original utterances according to the hierarchical structure we define, so that the semantics are preserved as they flow through the latent variables. Experimental results on two public datasets show that our model outperforms baseline methods in generating high-quality responses.

Approach
Given $n$ input messages $\{u_t\}_{t=0}^{n-1}$, we consider the last one, $u_{n-1}$, as the query and the others as context. $u_n$ denotes the corresponding response.
The proposed model is shown in Figure 1. We add latent variables at three levels of a hierarchy to HRED. $z^c$ controls the whole background in which the conversation takes place, $z^p$ captures the consistency of topic between the query and response pair, and $z^q$ and $z^r$ model the content difference in the query and the response, respectively. To simplify the equations, we use $n-1$ and $n$ in place of $q$ and $r$.

Context Representation
Each utterance $u_t$ is encoded into a vector $v_t$ by a bidirectional GRU (BiGRU), $f^{utt}_\theta$. For the inter-utterance representation, we follow Park et al. (2018) and compute the context state $h^c_t$ with a GRU. $z^c$ is the discourse-level latent variable, and its prior distribution is a standard Gaussian, that is, $p_\theta(z^c) = \mathcal{N}(z^c \mid 0, I)$. For the inference of $z^c$, we use a BiGRU $f^c$ to run over all utterance vectors $\{v_t\}_{t=0}^{n}$ of a training conversation. The parameters of the approximate posterior are then computed from the output of $f^c$, where MLP(·) is a feed-forward network and the Softplus function, a smooth approximation to the ReLU function, is used to ensure positiveness (Park et al., 2018; Serban et al., 2017; Chung et al., 2015).
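The posterior parameterization described above can be illustrated with a minimal sketch. It assumes that the Softplus output is read as a diagonal variance and that a sample is drawn with the usual reparameterization trick; the names `softplus`, `sample_diag_gaussian`, `mu` and `raw_var` are hypothetical, and the toy numbers stand in for the outputs of the MLP.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x)); positive for inputs of moderate size."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_diag_gaussian(mu, raw_var, rng):
    """Draw z = mu + sqrt(var) * eps, with var = softplus(raw_var) and eps ~ N(0, 1)."""
    var = [softplus(r) for r in raw_var]
    z = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
    return z, var

# Toy values standing in for the two MLP outputs (assumed, not from the paper).
rng = random.Random(0)
mu = [0.0, 0.5, -0.5]
raw_var = [-2.0, 0.0, 2.0]
z, var = sample_diag_gaussian(mu, raw_var, rng)
```

Applying Softplus to the unconstrained output keeps every variance positive, which is what allows the square root and the later KL computation to be well defined.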

Query-Response Relationship Modeling
In VAEs, text can be generated from latent variables (Shen et al., 2017). Motivated by this, we add two kinds of latent variables: a pair-level one, and utterance-level ones for the query and the response.
As depicted in Figure 1, $h^c_{n-1}$ encodes all context information from utterance $u_0$ to $u_{n-2}$. We use $z^p$ to model the topic of the query and response pair. Under the same topic, there are always some differences in content between the query and the response, which are represented by $z^q$ and $z^r$, respectively. We first define the prior distribution of $z^p$ as a Gaussian conditioned on the context, with mean $\mu_{n-1}$ and diagonal variance $\sigma_{n-1}$. Since $z^q$ ($z_{n-1}$) and $z^r$ ($z_n$) are also under the control of $z^p$, we define their prior distributions conditioned on $z^p$, with their own means and diagonal variances; here, $i = n-1$ or $n$. The posterior distributions are given by $q_\phi(\cdot)$, a recognition model used to approximate the intractable true posterior distribution, again with means and diagonal variances. Note that in Equations 16 and 17, both $v_{n-1}$ and $v_n$ are taken into consideration, while Equations 18 and 19 use $z^p$ and the corresponding $v_i$.
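Because every prior and posterior here is described by a mean and a diagonal variance, the divergence between a posterior and its prior has a closed form. The following is a minimal sketch of that standard formula for two diagonal Gaussians; the function name `kl_diag_gaussians` and the toy values are hypothetical and not taken from the paper.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total

# Toy posterior and prior parameters for a 2-dimensional latent variable.
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

The divergence is zero when the posterior equals the prior and grows as the two move apart, which is why it acts as a regularizer on each level of the hierarchy.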

Training
Because latent variables exist for both utterances of the query-response pair, we use the decoder $f^{dec}_\theta$ to generate both $u_{n-1}$ and $u_n$. The training objective is to maximize a variational lower bound (Equation 21), which consists of two parts: the reconstruction term and the KL divergence terms for the three kinds of latent variables.
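A bound with this structure, one reconstruction term and one KL term per kind of latent variable, could take the form below. This is a sketch under assumptions rather than a reproduction of Equation 21: $u_{<n-1}$ is shorthand introduced here for the context utterances, the conditioning sets are abbreviated by $\cdot$, and the KL terms of the lower levels are understood in expectation over the higher-level latent variables.

```latex
\begin{aligned}
\log p_\theta(u_{n-1}, u_n \mid u_{<n-1}) \;\ge\;
  & \; \mathbb{E}_{q_\phi}\!\left[ \log p_\theta(u_{n-1}, u_n \mid z^c, z^p, z_{n-1}, z_n, u_{<n-1}) \right] \\
  & - \mathrm{KL}\!\left( q_\phi(z^c \mid \cdot) \,\|\, p_\theta(z^c) \right) \\
  & - \mathrm{KL}\!\left( q_\phi(z^p \mid \cdot) \,\|\, p_\theta(z^p \mid \cdot) \right) \\
  & - \sum_{i=n-1}^{n} \mathrm{KL}\!\left( q_\phi(z_i \mid \cdot) \,\|\, p_\theta(z_i \mid \cdot) \right)
\end{aligned}
```

The first line rewards faithful reconstruction of the query and the response, while the three KL lines keep the discourse-level, pair-level and utterance-level posteriors close to their priors.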

Experimental Settings
Datasets: We conduct our experiments on the Ubuntu Dialog Corpus (Lowe et al., 2015) and the Cornell Movie Dialog Corpus (Danescu-Niculescu-Mizil and Lee, 2011). As Cornell Movie Dialog does not provide a separate test set, we randomly split the corpus with the ratio 8:1:1. For each dataset, we keep conversations with more than 3 utterances. The number of multi-turn conversations in the train/valid/test sets is 898142/19560/18920 for Ubuntu Dialog and 36004/4501/4501 for Cornell Movie Dialog.
Hyper-parameters: In our model and all baselines, the Gated Recurrent Unit (GRU) (Cho et al., 2014) is selected as the fundamental cell in the encoder and decoder layers, and the hidden dimension is 1,000. We set the word embedding dimension to 500, and all latent variables have a dimension of 100. For optimization, we use Adam, together with utterance drop (Park et al., 2018) and KL annealing; the utterance drop ratio is 0.25.
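The two training tricks named above can be sketched as follows. The sketch assumes a linear annealing schedule and a replacement of dropped utterance vectors by a generic placeholder vector; the schedule shape, `warmup_steps`, `unk_vector` and both function names are hypothetical, since the text only states that KL annealing is used and that the drop ratio is 0.25.

```python
import random

def kl_weight(step, warmup_steps):
    """Linearly increase the KL weight from 0 to 1 over warmup_steps updates (assumed schedule)."""
    return min(1.0, step / warmup_steps)

def utterance_drop(utterance_vectors, unk_vector, drop_ratio, rng):
    """Replace each utterance vector by unk_vector with probability drop_ratio."""
    return [unk_vector if rng.random() < drop_ratio else v
            for v in utterance_vectors]

# Toy usage: three 2-dimensional utterance vectors and the ratio from the text.
rng = random.Random(0)
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
unk = [0.0, 0.0]
dropped = utterance_drop(vectors, unk, 0.25, rng)
weight_midway = kl_weight(50, 100)
```

Both tricks are commonly used to discourage the decoder from ignoring the latent variables: the annealed weight delays the KL penalty, and dropping utterance vectors forces the model to rely on the latent variables for context.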

Evaluation Design
Open-domain response generation does not have a standard criterion for automatic evaluation, such as BLEU (Papineni et al., 2002) for machine translation. Our model is designed to improve the coherence/relevance and diversity of generated responses. To measure the performance effectively, we use 5 automatic evaluation metrics along with human evaluation.
Average, Greedy and Extrema: Rather than calculating token-level or n-gram similarity as perplexity and BLEU do, these three metrics are embedding-based and measure the semantic similarity between the words in the generated response and the ground truth (Serban et al., 2017; Liu et al., 2016). We use word2vec embeddings trained on the Google News Corpus in this section. Please refer to Serban et al. (2017) for more details.
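A minimal sketch of the three embedding-based metrics, as they are commonly defined, is given below. The toy two-dimensional embedding table `emb` and all function names are hypothetical stand-ins; a real evaluation would look words up in the pretrained word2vec vectors instead.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is empty or all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lookup(tokens, emb):
    """Embeddings of the tokens that are in the vocabulary."""
    return [emb[t] for t in tokens if t in emb]

def average_vec(tokens, emb):
    """Average: mean of the word vectors of a sentence."""
    vecs = lookup(tokens, emb)
    return [sum(col) / len(vecs) for col in zip(*vecs)] if vecs else []

def extrema_vec(tokens, emb):
    """Extrema: per dimension, the value with the largest magnitude."""
    vecs = lookup(tokens, emb)
    return [max(col, key=abs) for col in zip(*vecs)] if vecs else []

def average_score(hyp, ref, emb):
    return cosine(average_vec(hyp, emb), average_vec(ref, emb))

def extrema_score(hyp, ref, emb):
    return cosine(extrema_vec(hyp, emb), extrema_vec(ref, emb))

def greedy_score(hyp, ref, emb):
    """Greedy: match each word to its most similar word on the other side, both ways."""
    def one_way(src, tgt):
        src_v, tgt_v = lookup(src, emb), lookup(tgt, emb)
        if not src_v or not tgt_v:
            return 0.0
        return sum(max(cosine(s, t) for t in tgt_v) for s in src_v) / len(src_v)
    return 0.5 * (one_way(hyp, ref) + one_way(ref, hyp))

# Toy embedding table (assumed values, for illustration only).
emb = {"open": [1.0, 0.0], "a": [0.5, 0.5],
       "terminal": [0.0, 1.0], "shell": [0.1, 0.9]}
hyp = ["open", "a", "shell"]
ref = ["open", "a", "terminal"]
scores = (average_score(hyp, ref, emb),
          greedy_score(hyp, ref, emb),
          extrema_score(hyp, ref, emb))
```

In this toy case the response and the ground truth share no final word, yet all three scores stay high because "shell" and "terminal" have similar vectors, which is the behaviour that distinguishes these metrics from n-gram overlap.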
Dist-1 and Dist-2: Following the work of Li et al. (2016), we apply Distinct to report the degree of diversity. Dist-1/2 is defined as the ratio of unique uni/bi-grams over all uni/bi-grams in generated responses.
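The definition above translates directly into a few lines of code. The sketch below assumes responses are already tokenized into lists of words and pools n-grams over all responses; the function name `distinct_n` and the toy responses are hypothetical.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to all n-grams, pooled over tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Two toy responses that share most of their words.
responses = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
dist_1 = distinct_n(responses, 1)  # 5 unique unigrams out of 8
dist_2 = distinct_n(responses, 2)  # 4 unique bigrams out of 6
```

A model that keeps producing the same generic reply yields few unique n-grams relative to the total, so its Dist-1 and Dist-2 scores are low.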
Human Evaluation: Since automatic evaluation results may not be fully consistent with human judgements (Liu et al., 2016), human evaluation is necessary. Inspired by Luo et al. (2018), we use the following three criteria. Fluency measures whether the generated responses have grammatical errors. Coherence denotes the semantic consistency and relevance between a response and its context. Informativeness indicates whether the response is meaningful and uses words well. A generic reply should have the lowest Informativeness score. Each of these scores ranges from 1 to 5. We randomly sample 100 examples from the test set and generate a total of 400 responses using the models mentioned above. All generated responses are scored by 7 annotators, who are postgraduate students and not involved in other parts of the experiment.

Results of Automatic Evaluation
The left part of Table 2 reports the automatic evaluation on the test set. The proposed CSRR model significantly outperforms the baselines on the three embedding-based metrics on both datasets. This improvement indicates that our semantic relationship modeling better reflects the structure of real-world conversations and that the responses generated by our model are more relevant to the context. As for diversity, CSRR also obtains the highest Dist-1 and Dist-2 scores. On the Ubuntu Dialog dataset, VHRED+w.d performs worst. With the help of the discourse-level latent variable and utterance drop, VHCR+u.d achieves better performance. On the Cornell Movie dataset, however, HRED performs worst. Park et al. (2018) explained this difference empirically by noting that the Cornell Movie Dialog dataset is small in size but very diverse and complex in content and style, and that models like HRED often fail to generate appropriate responses for the context.

Results of Human Evaluation
The right part of Table 2 reports the human evaluation results on 400 (100×4) responses. First, the CSRR model clearly receives the best scores on all three aspects, which supports the effectiveness of CSRR in generating high-quality responses. Second, because of the discourse-level and pair-level latent variables, responses are more coherent. Since these two kinds of variables learn high-level semantic information, the utterance-level ones can focus on diversity of expression, which also improves sentence fluency and informativeness.
For easy questions, such as greetings (Example 4), both HRED and CSRR perform well. In contrast, VHRED+w.d and VHCR+u.d tend to generate generic and meaningless responses. For hard questions, such as technical ones (Examples 1 to 3), the proposed CSRR clearly outperforms the baselines. Note that VHCR shows the effectiveness of $z^c$, and it can also be regarded as an ablation of CSRR that illustrates the contribution of $z^p$. From these cases, we empirically find that with the help of $z^p$, responses generated by CSRR are not only relevant to and consistent with the context, but also informative and meaningful.

Conclusion and Future Work
In this work, we propose a Conversational Semantic Relationship RNN model to learn the semantic dependency in multi-turn conversations. We apply a hierarchical strategy to obtain context information and add latent variables at three levels of a hierarchy to capture the semantic relationship. According to automatic and human evaluation, our model significantly improves the quality of generated responses, especially in coherence, sentence fluency and language diversity.
In the future, we will model the semantic relationship in previous turns and also introduce reinforcement learning to control the process of topic changes.