Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity

We present three enhancements to existing encoder-decoder models for open-domain conversational agents, aimed at effectively modeling coherence and promoting output diversity: (1) We introduce a measure of coherence as the GloVe embedding similarity between the dialogue context and the generated response, (2) we filter our training corpora based on the measure of coherence to obtain topically coherent and lexically diverse context-response pairs, (3) we then train a response generator using a conditional variational autoencoder model that incorporates the measure of coherence as a latent variable and uses a context gate to guarantee topical consistency with the context and promote lexical diversity. Experiments on the OpenSubtitles corpus show a substantial improvement over competitive neural models in terms of BLEU score as well as metrics of coherence and diversity.


Introduction
End-to-end neural response generation methods are promising for developing open-domain dialogue systems as they allow learning from very large unlabeled datasets (Shang et al., 2015; Sordoni et al., 2015; Vinyals and Le, 2015). However, these models have also been shown to generate generic, uninformative, and incoherent replies (e.g., "I don't know." in Figure 1), mainly because neural systems tend to settle for the most frequent options, thus penalizing length and favoring high-frequency word sequences (Sountsov and Sarawagi, 2016). To address these problems, Li et al. (2016a) and Li et al. (2017a) attempt to promote diversity by improving the objective function, but do not model diversity explicitly. Serban et al. (2017) focus on model structure without any changes to the objective function. Other works control the style of the output by leveraging external resources, such as a sentiment classifier and time annotation (Hu et al., 2017) or dialogue acts, or focus on well-structured input such as paragraphs (Li and Jurafsky, 2017).
This paper extends previous attempts to model diversity and coherence by enhancing all three aspects of the learning process: the data, the model, and the objective function. While previous research has addressed these aspects individually, this paper is the first to address all three in a unified framework. Instead of using existing linguistic knowledge or labeled datasets, we aim to control for coherence by learning directly from data, using a fully unsupervised approach. This is also the first work to encode and evaluate coherence explicitly in the dialogue generation task, as opposed to using diversity, style, or other properties of responses as a proxy.
In this work, given a dialogue history, we regard as a coherent response an utterance that is thematically correlated and naturally continuing from the previous turns, as well as lexically diverse. For example, in Figure 1 the response "Specifically the stove." is a very natural and coherent response, elaborating on the topic of kitchen introduced in the previous two utterances and containing rich thematic words, whereas the response "Let's go for a walk." is unrelated and uninteresting.
In order to obtain coherent responses, we present three generic enhancements to existing encoder-decoder (E-D) models:
1. We define a measure of coherence simply as the averaged word embedding similarity between the words of the context and the response computed using GloVe vectors (Pennington et al., 2014).
2. We filter a corpus of conversations based on our measure of coherence, which leaves us with context-response pairs that are both topically coherent and lexically diverse.
3. We train an E-D generator recast as a conditional Variational Autoencoder (cVAE) model that incorporates two latent variables, one for encoding the context and another for conditioning on the measure of coherence, trained jointly as in Hu et al. (2017). We then decode using a context gate (Tu et al., 2017) to control the generation of words that directly relate to the most topical words of the context and promote coherence.
Experiments on the OpenSubtitles corpus (Lison and Meena, 2016) demonstrate the effectiveness of the overall approach. Our models achieve a substantial improvement over competitive neural models. We provide an ablation analysis quantifying the contribution of explicitly modeling coherence in our models. All our experimental code is freely available on GitHub.¹

Coherence-based Dialogue Generation
Our model aims to generate responses given a dialogue context, incorporating measures of coherence estimated purely from the training data. We propose the following enhancements to the attention-based E-D architecture (Bahdanau et al., 2015; Luong et al., 2015):
• We introduce a stochastic latent variable z conditioned on the previous dialogue context to store the global information about the conversation (Bowman et al., 2016; Chung et al., 2015; Li and Jurafsky, 2017; Hu et al., 2017).
• We force the model to condition on the measure of coherence explicitly by encoding a latent variable (code) c learned from data.
• We incorporate a context gate (Tu et al., 2017) that dynamically controls the ratio at which the generated words in the response derive directly from the coherence-enhanced dialogue context or the previously generated parts of the response.
¹ https://github.com/XinnuoXu/CVAE_Dial
In the rest of this section, we introduce the measure of coherence (Section 2.1), we present an overview of our model (Section 2.2), and finally describe the model in detail (Sections 2.3-2.4).

Measure of Dialogue Coherence
Semantic vector space models of language represent each word with a real-valued word embedding vector (Pennington et al., 2014). By simply taking a weighted average of all its word embeddings, a whole sentence can be mapped into the semantic vector space. We define the coherence of a dialogue as the average distance between semantic vectors of preceding dialogue context and its response.
Let $x = \{x_1, \ldots, x_j, \ldots, x_J\}$ represent a dialogue context and $y = \{y_1, \ldots, y_i, \ldots, y_I\}$ a response, where $J$ and $I$ are the numbers of words in the dialogue context and its response, respectively. Semantic vector space models map each word $x_j$ into an embedding $x_j^{emb}$, and each $y_i$ into $y_i^{emb}$. The semantic representation of a dialogue context $x$ is then $x^{emb} = \sum_{j=1}^{J} w_j x_j^{emb}$; for a response $y$, it is $y^{emb} = \sum_{i=1}^{I} v_i y_i^{emb}$. Here, $w_j$ and $v_i$ are importance weights for each word in the sentence. The measure of coherence is then defined as the cosine similarity of the two semantic vectors of the dialogue context and its response:
$$C(x, y) = \cos\left(x^{emb}, y^{emb}\right) \tag{1}$$
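The coherence measure of Eq (1) can be illustrated with a minimal sketch. The toy two-dimensional vectors, the uniform default weights, the skipping of out-of-vocabulary words, and the function names are assumptions made for illustration; they are not the trained GloVe vectors or the weighting scheme used in the experiments.

```python
import math

# Toy embeddings (hypothetical values, standing in for trained GloVe vectors).
TOY_EMBEDDINGS = {
    "kitchen": [1.0, 0.0],
    "stove": [0.9, 0.1],
    "walk": [0.0, 1.0],
}


def sentence_vector(tokens, embeddings, weights=None):
    """Weighted sum of word vectors; out-of-vocabulary tokens are skipped.

    With weights=None every known token gets the same weight (an assumption).
    """
    known = [tok for tok in tokens if tok in embeddings]
    if not known:
        return None
    dim = len(embeddings[known[0]])
    vec = [0.0] * dim
    for tok in known:
        w = 1.0 / len(known) if weights is None else weights.get(tok, 0.0)
        for d, value in enumerate(embeddings[tok]):
            vec[d] += w * value
    return vec


def coherence(context_tokens, response_tokens, embeddings, weights=None):
    """Cosine between the context vector and the response vector (Eq 1)."""
    x_emb = sentence_vector(context_tokens, embeddings, weights)
    y_emb = sentence_vector(response_tokens, embeddings, weights)
    if x_emb is None or y_emb is None:
        return 0.0
    dot = sum(a * b for a, b in zip(x_emb, y_emb))
    norm = math.sqrt(sum(a * a for a in x_emb)) * math.sqrt(sum(b * b for b in y_emb))
    return dot / norm if norm > 0 else 0.0
```

In this toy setting, a response mentioning "stove" scores close to 1 against a context about the kitchen, whereas a response about a "walk" scores 0.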

Model Overview
End-to-end response generation for dialogue can be formalized as follows: given a dialogue context $x$, a dialogue generator generates the next utterance $y$. During training, the aim of the dialogue generator is to maximize the probability $p(y|x)$ over the training dataset. To encode dialogue contexts that adequately incorporate coherence information, we build our generator on the cVAE model of Hu et al. (2017), which has been used to control text generation with respect to linguistic properties such as tense or sentiment. In our model, the response $y$ is generated conditioned on the previous conversation $x$, a diversity-promoting latent variable $z$, and a latent variable $c$ indicating dialogue coherence; $z$ and $c$ are independent. The generation probability $p(y|x)$ is defined in terms of these variables in Eq (2). Unfortunately, optimizing Eq (2) directly during training is intractable; therefore, we apply variational inference and optimize the variational lower bound of Eq (3) instead. In Eq (3), $p(y|x, z, c)$ is the probability of generating utterance $y$ given $x$, $z$ and $c$; $q(z|x, y)$ stands for the approximate posterior distribution of the latent variable $z$ conditioned on the dialogue context $x$ and the gold response $y$; $p(c|x, y)$ is the measure of coherence between context $x$ and response $y$; $p(z|x)$ is the true prior distribution of $z$ conditioned only on the dialogue context $x$; $D_{KL}(\cdot \| \cdot)$ denotes the KL-divergence. We assume that both $q(z|x, y)$ and $p(z|x)$ are Gaussian, with mean vectors $\mu_{appr}$, $\mu_{true}$ and covariance matrices $\Sigma_{appr}$, $\Sigma_{true}$, respectively.
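For orientation, the marginal likelihood and the variational objective described in this section can be written out as below. This is a sketch assembled from the terms listed above and the cVAE formulation of Hu et al. (2017). The first line is simply the law of total probability under the stated independence of $z$ and $c$; the factor $p(c \mid x)$ and the exact grouping of the expectation are assumptions, not necessarily the form used in the experiments.

```latex
\begin{align*}
p(y \mid x) &= \int_{z} \int_{c} p(y \mid x, z, c)\, p(z \mid x)\, p(c \mid x)\; \mathrm{d}c\, \mathrm{d}z \\
\mathcal{L}(x, y) &= \mathbb{E}_{q(z \mid x, y)\, p(c \mid x, y)}
  \big[ \log p(y \mid x, z, c) \big]
  - D_{KL}\big( q(z \mid x, y) \,\|\, p(z \mid x) \big)
\end{align*}
```

The first term of $\mathcal{L}$ rewards reconstructing the gold response from the sampled $z$ and the measured coherence $c$; the second keeps the approximate posterior close to the prior, so that sampling from the prior at inference time remains meaningful.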

Model Details
Optimizing Eq (3) consists of two parts: (1) minimizing the KL-divergence between the approximate posterior distribution and the true prior distribution of z, (2) maximizing the probability of generating the gold response y conditioned on dialogue context x and coherence factors z and c. Figure 2 shows the pipeline of the training procedure.
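The KL term in part (1) has a closed form when both distributions are Gaussian. The sketch below assumes diagonal covariance matrices, a common simplification that is not stated in the text; the function name is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), in nats.

    Assumes diagonal covariances, given as lists of per-dimension variances.
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total
```

The divergence is zero when the approximate posterior and the prior coincide and grows as their means or variances drift apart.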
Encoder: First, we encode a dialogue context $x$ into a hidden state $h$ using the context encoder, which is based on Recurrent Neural Networks (RNNs). Then the posterior network encodes both the dialogue context $x$ and the gold response $y$ into a hidden state $h_{appr}$, followed by two linear transformations $f_{appr}(\cdot)$ and $g_{appr}(\cdot)$ that map $h_{appr}$ into the mean vector $\mu_{appr}$ and the covariance matrix $\Sigma_{appr}$. The latent variable $z$ can then be sampled from the distribution $\mathcal{N}(\mu_{appr}, \Sigma_{appr})$:
$$\mu_{appr} = f_{appr}(h_{appr}), \quad \Sigma_{appr} = g_{appr}(h_{appr}), \quad z \sim \mathcal{N}(\mu_{appr}, \Sigma_{appr}) \tag{4}$$
The prior network in Figure 2 takes a form similar to the posterior network:
$$\mu_{true} = f_{true}(h_{true}), \quad \Sigma_{true} = g_{true}(h_{true}) \tag{5}$$
where $h_{true}$ is the final hidden state of an RNN encoding only the dialogue context $x$, and $f_{true}(\cdot)$, $g_{true}(\cdot)$ are linear transformations. Code $c$ is given by the coherence measure from Eq (1).
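Sampling $z$ from the posterior network is usually implemented with the reparameterization trick, so that gradients can flow through the sampling step. The sketch below assumes a diagonal covariance predicted as a log-variance vector; both are assumptions (the text only states that two linear maps produce the mean and the covariance), and all names are hypothetical.

```python
import math
import random


def linear(weights, bias, h):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]


def sample_latent(h, f_weights, f_bias, g_weights, g_bias, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    f_* plays the role of f_appr (mean); g_* plays the role of g_appr and is
    ASSUMED to output a diagonal log-variance.
    """
    mu = linear(f_weights, f_bias, h)
    log_var = linear(g_weights, g_bias, h)
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    return z, mu, log_var
```

The same function can stand in for the prior network by passing the parameters of $f_{true}$ and $g_{true}$ together with $h_{true}$.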

Decoder: We build an attention-based decoder (Bahdanau et al., 2015; Luong et al., 2015) using RNNs to generate responses conditioned on the encoded dialogue context $h$, the diversity signal $z$, and the coherence signal $c$. We concatenate the latent variables $z$ and $c$ to the context encoder hidden state $h$ and feed them into the decoder as the initial hidden state $s_0$, similar to Hu et al. (2017). During the decoding process, the $I$ tokens of the response are generated sequentially by an RNN $g(\cdot)$; $s_i$ is the hidden state of the decoder at time step $i$, which is conditioned on the previously generated token $y_{i-1}$, the previous hidden state $s_{i-1}$, and the weighted attention vector $a_i$:
$$a_i = \sum_{j=1}^{J} w_{ij} h_j \tag{8}$$
where $J$ is the number of tokens of the dialogue context; $h_j$ is the $j$-th hidden state of the encoder; the attention weight $w_{ij}$ of each context hidden state $h_j$ is computed following Luong et al. (2015).
Figure 2: The training process of the generative model. First, the dialogue context is encoded: h is the final hidden state of the context encoder. Then we derive the diversity-promoting latent variable z. Next, we compute the latent variable c that corresponds to the measure of coherence between the dialogue context x and the generated response y. We concatenate all three vectors into s to feed the decoder. a is the attention matrix calculated for every time step of the decoding process.
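The weighted attention vector used by the decoder can be sketched as follows. The dot-product score is one of the scoring variants proposed by Luong et al. (2015); which variant was used here is not stated, so the choice below is an assumption, as are the function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_vector(decoder_state, encoder_states):
    """Return (a_i, weights): a_i is the weighted sum of encoder states.

    Scores are dot products between the decoder state and each encoder
    state (ASSUMED scoring function).
    """
    scores = [sum(s * h for s, h in zip(decoder_state, h_j)) for h_j in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    a_i = [sum(w * h_j[d] for w, h_j in zip(weights, encoder_states)) for d in range(dim)]
    return a_i, weights
```

With a zero decoder state all context positions receive equal weight; a decoder state aligned with one encoder state concentrates the weight on that position.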
Context Gate: To increase the influence of code $c$, we introduce the context gate $k$. Unlike Tu et al. (2017), whose context gate assigns an element-wise weight to the input signal deriving from the encoder RNN, we build the context gate conditioned only on the coherence signal. The gate is defined using the sigmoid function $\sigma$; a bias term $\lambda$;³ the target value $c$ of the measure of coherence, calculated by $C(x, y)$ (see Section 2.1); and $c_i$, the measure of coherence between the dialogue context and the generated prefix at time step $i$, calculated by $C(x, y_{<i})$. The context gate is then applied to the decoder state $s_i$ in Eq (7) by element-wise multiplication. The coherence-informed context gate aims to dynamically control the ratio at which the preceding dialogue context and the previously generated tokens of the current response contribute to the generation of the next token in the response.
³ We set λ empirically on the development set.
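To make the gating mechanism concrete, the sketch below shows a scalar sigmoid gate that scales the decoder state. The functional form inside the sigmoid, the gap between target and prefix coherence plus the bias $\lambda$, is purely an assumption for illustration; only the ingredients ($\sigma$, $\lambda$, $c$, $c_i$, element-wise scaling of $s_i$) come from the description above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def context_gate(target_c, prefix_c, lam):
    """Scalar gate in (0, 1).

    ASSUMED form: sigmoid of the gap between the target coherence c and the
    coherence c_i of the prefix generated so far, shifted by the bias lam.
    """
    return sigmoid((target_c - prefix_c) + lam)


def apply_gate(k, state):
    """Element-wise scaling of the decoder hidden state s_i by the gate k."""
    return [k * s for s in state]
```

Under this assumed form, the gate opens further the more the generated prefix falls short of the target coherence.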

Training
Our generator is trained similarly to Hu et al. (2017). The objective function (Eq (11)) is a weighted combination of three losses: generation ($L_G$), coherence ($L_c$), and diversity ($L_z$). To teach the generator to produce responses close to the training data, we maximize the generation probability of the training response, $\log p(y|x)$, given the dialogue context according to Eq (2). During training, we set $L_G = -\log p(y|x)$ and minimize the corresponding negated lower bound (cf. Eq (3)). Apart from the generation loss, the coherence measure provides an extra learning signal $L_c$ (Eq (13)), which pushes the generator to produce responses that match the coherence signal given by the latent variable $c$.
In Eq (13), $p(c) = \mathcal{N}(0, 1)$ is the prior distribution of the coherence variable $c$. To keep the loss differentiable, we cannot sample words from the response vocabulary. Instead, we define $G(x, z, c) = y^s = \{y^s_1, \ldots, y^s_i, \ldots, y^s_I\}$ as the sequence of output word probability distributions. $p(c|x, G(x, z, c))$ is predicted by the coherence measure defined in Eq (1), with $y^{emb}$ computed from the soft outputs $y^s$ and $M_{glv}$, the word embedding matrix trained using GloVe (Section 2.1). The last component in Eq (11) is the independence constraint $L_z$, which forces the soft distribution over the generated response $G$ to be diverse, so that it is able to faithfully reproduce the latent variable $z$; here $q(z|x, G(x, z, c))$ is predicted by the posterior network with $y^s_j$ as the soft input to the RNN encoder at each time step $j$.
Figure 3 shows the inference process of the generative model. Given a dialogue context $x$ and an expected coherence value $c$, the context encoder first encodes the dialogue context into a hidden state $h$. The prior network then generates a sample $z$ conditioned on the dialogue context. The decoder is initialized with $s$, i.e., the concatenation of $h$, $z$ and $c$. During decoding, the next word is generated via the context gate modulating between the attention-reweighted context and the previously generated words of the response.
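The soft response embedding used for the coherence loss in this section can be sketched as the expected word embedding under each output distribution, averaged over time steps. The uniform step weights and the function name are assumptions; `embedding_matrix` plays the role of $M_{glv}$ with one row per vocabulary word.

```python
def soft_response_embedding(soft_outputs, embedding_matrix, weights=None):
    """Differentiable-style stand-in for y_emb built from soft outputs.

    soft_outputs: I probability distributions, each over the V vocabulary words.
    embedding_matrix: V rows of dimension D (the role of M_glv).
    weights: per-step importance weights v_i (ASSUMED uniform if omitted).
    """
    steps = len(soft_outputs)
    dim = len(embedding_matrix[0])
    if weights is None:
        weights = [1.0 / steps] * steps
    y_emb = [0.0] * dim
    for v_i, dist in zip(weights, soft_outputs):
        for prob, word_vec in zip(dist, embedding_matrix):
            for d in range(dim):
                y_emb[d] += v_i * prob * word_vec[d]
    return y_emb
```

When every distribution is one-hot, this reduces to the ordinary weighted average of the embeddings of the chosen words, so the soft and hard versions of the coherence measure agree in that limit.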

Dataset for Generator
We train and evaluate our models on the OpenSubtitles corpus (Lison and Tiedemann, 2016) with automatic dialogue turn segmentation (Lison and Meena, 2016). A training pair consists of a dialogue context and a corresponding response. We consider three consecutive turns as the dialogue context and the following turn as the response. From a total of 65M instances, we select those that have context and response lengths of less than 120 and 30 words, respectively. We create two datasets: OST, the unfiltered data, and fOST, a filtered subset that keeps only pairs with coherence score $C(x, y) \geq 0.68$.⁵ Filtering of the OpenSubtitles corpus is motivated by the fact that, once the video and audio modalities that the subtitles originally accompanied are removed, we are very often left with incomplete and incoherent dialogues. Therefore, by keeping dialogues with high coherence scores, we aim to build a high-quality corpus with (1) more semantically coherent and topically related contexts and responses, and (2) fewer general and dull responses. Table 3 shows the coherence and diversity metrics (cf. Section 4.2) for OST and fOST. Unsurprisingly, coherence is much higher for fOST than for OST, and diversity is slightly higher. We list dialogue examples for different coherence scores in Supplemental Material B.
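The selection and filtering steps described above can be sketched as a single pass over tokenized context-response pairs. The length limits and the 0.68 threshold come from the text; `coherence_fn` is a hypothetical callable standing in for the measure of Section 2.1, and the function name is an assumption.

```python
def filter_pairs(pairs, coherence_fn, max_context_len=120, max_response_len=30,
                 min_coherence=0.68):
    """Keep (context, response) token-list pairs that pass both filters.

    Length filter: context shorter than 120 words, response shorter than 30.
    Coherence filter: coherence_fn(context, response) >= min_coherence.
    """
    kept = []
    for context, response in pairs:
        if len(context) >= max_context_len or len(response) >= max_response_len:
            continue
        if coherence_fn(context, response) >= min_coherence:
            kept.append((context, response))
    return kept
```

Passing `min_coherence=float("-inf")` applies only the length filter, which corresponds to building the unfiltered dataset.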

Dataset for Coherence Measure
In order to accurately measure coherence on our domain using the semantic distance as defined in Section 2.1, we train GloVe embeddings on the full OpenSubtitles corpus (i.e. 100K movies).

Experiments
Our generator model, ablative variants, and baselines are implemented using the publicly available OpenNMT-py framework (Klein et al., 2017) based on Bahdanau et al. (2015) and Luong et al. (2015). We used the publicly available glove-python package⁸ to implement our coherence measure.
We experiment with two versions of our model: (1) cVAE with the coherence context gate as described in Section 2.3 (cVAE-XGate), (2) cVAE with the original context gate implementation of Tu et al. (2017) (cVAE-CGate). For each of these, we consider the main variant where the input coherence measure c is preset to a fixed ideal value as estimated on development data (1.0 for OST and 0.95 for fOST), as well as an oracle variant where we use the true coherence measure between the context and the gold-standard response in the test set (indicated with "(C)" in Tables 2 and 3).
⁵ The coherence score is calculated as shown in Eq (1). We observed that the scores on the training set follow a normal distribution with a slight tail on the negatively correlated side, so we fit a normal distribution to the data with parameters N(0.25, 0.22) and set the cut-off to +2σ. A histogram of coherence scores is shown in Figure 5 in Supplemental Material A.
⁷ Note that Distinct-1 and Distinct-2 are computed on randomly selected subsets of 4k responses.
⁸ https://github.com/maciejkula/glove-python
Figure 3: The inference process of the generative model, where the latent variable c is given as an input.

Parameter Settings
We set our model parameters based on preliminary experiments on the development data.
We use 2-layer RNNs with LSTM cells (Hochreiter and Schmidhuber, 1997) with input/hidden dimension of 128 for both the context encoder and the decoder. The dropout rate is set to 0.2 and the Adam optimizer (Kingma and Ba, 2015) is used to update the parameters. A vocabulary of 25,000 words is shared between the encoder and the decoder.
Both the posterior network and the prior network for latent variable learning are built with 2-layer LSTM RNNs with input/hidden dimension of 64. The dimension of the latent variable z is set to 20. As for the encoder and decoder, the dropout rate is 0.2 and the Adam optimizer is used to update the parameters.
The window size for GloVe computation in our coherence measure is set to 10.

Evaluation metrics
We use a number of metrics to evaluate the outputs of our models:
• BLEU, B1, B2, B3: the word-overlap score against gold-standard responses (Papineni et al., 2002), used by the vast majority of recent dialogue generation works (Yao et al., 2017; Li et al., 2017a, 2016c; Sordoni et al., 2015; Li et al., 2016a; Ghazvininejad et al., 2017). BLEU in this paper refers to the default BLEU-4, but we also report lower n-gram scores (B1, B2, B3).
• Coh: our novel GloVe-based coherence score calculated using Eq (1), showing the semantic similarity between dialogue contexts and generated responses.
• D-1, D-2, D-Sent: common metrics used to evaluate the diversity of generated responses (e.g. Li et al., 2016a; Xu et al., 2017; Dhingra et al., 2017): the proportion of distinct unigrams, bigrams, and sentences in the outputs.
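The diversity metrics in the list above can be sketched as follows. The normalization by the total number of n-grams (rather than, for example, the total number of tokens) is an assumption, and the function names are hypothetical.

```python
def distinct_n(responses, n):
    """Proportion of distinct n-grams among all n-grams in the responses.

    responses: list of token lists. Normalizing by the total n-gram count
    is an ASSUMPTION; other normalizations are in use.
    """
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def distinct_sentences(responses):
    """Proportion of distinct whole responses (D-Sent)."""
    if not responses:
        return 0.0
    return len({tuple(tokens) for tokens in responses}) / len(responses)
```

A system that answers "I don't know." to every context yields values close to zero on all three metrics, which is the failure mode these scores are meant to expose.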

Results
All model variants described in Section 4 are trained on both the OST and fOST datasets. Tables 2 and 3 present the scores of all models tested on the OST and fOST test sets, respectively. Note that in addition to testing the models on the respective test sections of their training datasets, we also test them on the other dataset (OST-trained models on fOST and vice versa). This way, we can observe the performance of the fOST-trained models in noisier contexts and see how well the OST-trained models perform when evaluated against coherent responses only. Given all the evaluated model variants, we can observe the effects and contributions of the individual components of our setup:
• Data filtering: The models trained on fOST consistently outperform the same models trained on OST, for all evaluation metrics and on both test sets. This shows that coherence-based training data filtering is generally beneficial.
• cVAE-Context Gate models: Nearly all cVAE-based models perform markedly better than the baselines w.r.t. BLEU, coherence, and diversity.¹⁰ If we look at models trained on OST and tested on fOST (the top half of Table 3), we can see that all cVAE-based models, especially cVAE-XGate, are able to learn to produce coherent and diverse responses even when trained on a noisy, incoherent corpus. Examples of responses generated by the baseline MMI model and by cVAE-XGate in Figure 4 show that cVAE-XGate mostly produces more diverse and coherent responses than MMI.
• Preset c vs. oracle models with gold-standard c: Table 2 shows that on the noisy OST test set, cVAE-based models using the gold-standard value of c achieve higher BLEU scores than models using preset c. This is expected since many gold-standard responses in the unfiltered set have a low coherence score; the model can generate a more generic response if the gold-standard c is low. The models with preset c always attempt to generate coherent responses, which is apparent from the other metrics: Coh and D-Sent are consistently higher than for models using gold-standard c. On the fOST test set, where only high-coherence responses are expected, models using fixed c consistently reach higher scores in all metrics including BLEU (see Table 3). This shows that in general, using a preset constant value of c works well, even better than using the gold-standard c.
¹⁰ We performed paired bootstrap resampling for the best cVAE model and the best baseline model in each experiment set (Table 2 and Table 3), as is done for MT (Koehn, 2004), which confirmed statistical significance at the 99% confidence level for all cases except for models trained on fOST and tested on OST (bottom half of Table 2).
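The paired bootstrap resampling test used for the significance results (Koehn, 2004) can be sketched as below. For brevity, the sketch compares sums of per-response scores on each resample; since BLEU is a corpus-level metric, a faithful implementation would recompute corpus BLEU on every resampled test set instead. The function name, the number of resamples, and the seed are assumptions.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    scores_a, scores_b: per-response scores for the same test items
    (a simplification; see the note above on corpus-level BLEU).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Resample test items with replacement, using the same items for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / num_samples
```

A returned fraction of at least 0.99 corresponds to the 99% confidence level reported above.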
In sum, using our coherence measure both for data filtering and inside the models leads to improved output quality.
Figure 4: B-GT is the ground-truth response from the test set. The three sequential dialogue turns on the left are the preceding dialogue context used to generate the responses. Corresponding topical phrases are underlined. We can see that cVAE-XGate (1.0) mostly produces markedly more coherent and specific outputs than MMI (1-5). In some cases, it is comparable with MMI (6-7), and occasionally it is less coherent (8).

Related Work
Our work fits into the very active area of end-to-end generative conversation models, where neural E-D approaches were first applied by Vinyals and Le (2015) and extended by many others since. Many works address the lack of diversity and coherence in E-D outputs (Sountsov and Sarawagi, 2016) but, unlike our work, do not attempt to model coherence directly: Li et al. (2016a) use anti-LM reranking; Li et al. (2016c) modify the beam search decoding algorithm, similar to Shao et al. (2017), who additionally use a self-attention model. Mou et al. (2016) predict keywords for the output in a preprocessing step, while Wu et al. (2018) preselect a vocabulary subset to be used for decoding. Li et al. (2016b) focus specifically on personality generation (using personality embeddings), and other work promotes topic-specific outputs by language-model rescoring and sampling.
Many recent works explore the use of additional training signals and VAE setups in dialogue generation. In contrast to this paper, they do not focus explicitly on coherence: Asghar et al. (2017) use reinforcement learning (RL) with human-provided feedback; Li et al. (2017a) use an RL scenario with length as the reward signal; Li et al. (2017b) add an adversarial discriminator to provide RL rewards (discriminating between human and machine outputs); Xu et al. (2017) use a full adversarial training setup. The most recent works explore the usage of VAEs: Cao and Clark (2017)
We also draw on ideas from other areas than dialogue generation to build our models: Tu et al.

Conclusions and Future Work
We showed that explicitly modeling coherence and optimizing towards coherence and diversity leads to better-quality outputs in dialogue response generation. We introduced three extensions to current encoder-decoder response generation models: (1) we defined a measure of coherence based on GloVe embeddings (Pennington et al., 2014), (2) we filtered the OpenSubtitles training corpus (Lison and Meena, 2016) based on this measure to obtain coherent and diverse training instances, (3) we trained a cVAE model based on Hu et al. (2017) and Tu et al. (2017) that uses our coherence measure as one of the training signals. Our experimental results showed a considerable improvement in output quality over competitive models, which demonstrates the effectiveness of our approach.
In future work, we plan to replace the GloVe-based measure of coherence with a trained discriminator that distinguishes between coherent and incoherent responses (Li and Jurafsky, 2017). This will allow us to extend the notion of coherence to account for phenomena such as topic shifts. We also plan to verify the results with a human evaluation study.