Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints

Neural conversation models tend to generate safe, generic responses for most inputs. This is due to the limitations of likelihood-based decoding objectives in generation tasks with diverse outputs, such as conversation. To address this challenge, we propose a simple yet effective approach for incorporating side information in the form of distributional constraints over the generated responses. We propose two constraints that help generate more content rich responses that are based on a model of syntax and topics (Griffiths et al., 2005) and semantic similarity (Arora et al., 2016). We evaluate our approach against a variety of competitive baselines, using both automatic metrics and human judgments, showing that our proposed approach generates responses that are much less generic without sacrificing plausibility. A working demo of our code can be found at https://github.com/abaheti95/DC-NeuralConversation.


Introduction
Recent years have seen growing interest in neural generation methods for data-driven conversation. This approach has the potential to leverage massive conversational datasets on the web to learn open-domain dialogue agents, without relying on hand-written rules or manual annotation. Such response generation models could be combined with traditional dialogue systems to enable more natural and adaptive conversation, and could support new applications such as predictive response suggestion (Kannan et al., 2016). However, many challenges remain.
A major drawback of neural conversation generation is that it tends to produce too many "safe" or generic responses, for example: "I don't know" or "What are you talking about ?". This is a pervasive problem that has been independently reported by multiple research groups (Li et al., 2016a; Li et al., 2016c). The effect is due to the use of conditional likelihood as a decoding objective. Maximizing conditional likelihood is a suitable choice for text-to-text generation tasks such as machine translation, where the source and target are semantically equivalent; in conversation, however, there are many acceptable ways to respond. Simply choosing the most predictable reply often leads to very dull conversation. Figure 1 illustrates the problem with conditional likelihood using an example. After encoding the source message using a bidirectional LSTM with attention, and fixing the first two words of the response, we show the highest ranked words (according to log-likelihood scores) taken from a list of stop words in contrast to those selected from a list of topic words. As illustrated in the figure, response generation that is based on maximum likelihood is biased towards stop words and therefore results in responses that are safe (likely to be plausible in the context of the input), but also bland (they do not contribute any new information to the conversation). This motivates the need for augmenting the decoding objective to encourage the use of more content words.
Figure 1: Illustration of the dull response problem in maximum likelihood neural conversation generation using an example from the OpenSubtitles corpus. Function (stop) words tend to receive higher log probabilities than content (topic) words. The highest likelihood stop words and topic words in this context are listed.
To address the dull-response problem in neural conversation, we propose a new decoding objective that flexibly incorporates side information in the form of distributional constraints. We explore two constraints. The first encourages the distribution over topics and syntax in the response to match that found in the user's input. To estimate these distributions, we leverage the unsupervised model of topics and syntax proposed by Griffiths et al. (2005). The second constraint encourages generated responses to be semantically similar to the user's input; semantic similarity is measured using fixed-dimensional sentence embeddings (Arora et al., 2016).
After introducing distributional constraints into the decoding objective, we empirically demonstrate, in an evaluation that is based on human judgments, that our approach generates more content-rich responses when compared with two competitive baselines: Maximum Mutual Information (MMI) (Li et al., 2016a), in addition to an approach that conditions on topic models as additional context in neural conversation (Xing et al., 2017). While encouraging the model to generate less bland responses can be risky, we find that our approach achieves comparable plausibility while introducing significantly more content.

Neural Conversation Generation
As a starting point for our approach we leverage the Seq2Seq model (Sutskever et al., 2014;Bahdanau et al., 2014) which has been used as a basis for a broad range of recent work on neural conversation (Kannan et al., 2016;Li et al., 2016a;Shao et al., 2017). This model consists of two parts, an encoder and a decoder both of which are typically stacked LSTM layers. The encoder reads the input sequence and creates a hidden representation. The decoder conditions on this representation, using attention, and generates the response using a neural network language model (Bengio et al., 2003;Sutskever et al., 2011).

Distributional Topic and Semantic Similarity Constraints
Neural generation models select a response, Ŷ, by maximizing a decoding objective, typically using greedy beam search from left to right over partially completed responses, which are scored using the decoder RNN language model. A commonly used decoding objective is the conditional likelihood of the target given the source, P(Y|X):
$$\hat{Y} = \operatorname*{argmax}_{Y} \; \log P(Y \mid X) \qquad (1)$$
As discussed in Section 1, models trained to maximize conditional likelihood tend to assign low probability to content words as compared to (more frequent) function words, leading to bland, generic responses most of the time. To ameliorate this, we introduce distributional constraints in the form of additional terms in the decoding objective that favor hypotheses containing more content words that are similar to the source in the topical and semantic sense. For the constraint in the topic domain, we are interested in the topic probability distributions of the source, X, and target, Y, namely P(T|X) and P(T|Y), where T is a random variable defined over k topics. Then we can modify the decoding objective from Eq 1:
$$\hat{Y} = \operatorname*{argmax}_{Y} \; \big\{ \log P(Y \mid X) + \alpha\, \Delta\big(P(T \mid X),\, P(T \mid Y)\big) \big\} \qquad (2)$$
Here, ∆ is a similarity function between the two probability distributions and α is a tunable hyperparameter that adjusts the impact of this constraint.
Much recent work has investigated how to encode the semantic meaning of a sentence into a fixed high-dimensional embedding space (Kiros et al., 2015; Wieting and Gimpel, 2017). Given such an embedding representation of X and Y, one can compute the semantic similarity between the two and, similar to Eq 2, add a semantic similarity constraint to the likelihood objective as follows:
$$\hat{Y} = \operatorname*{argmax}_{Y} \; \big\{ \log P(Y \mid X) + \beta\, \nabla\big(\mathrm{Emb}(X),\, \mathrm{Emb}(Y)\big) \big\} \qquad (3)$$
where Emb() is a function that maps an utterance to a semantic vector representation, ∇ is a function that computes the similarity of the two embeddings, and β is a tunable parameter.
Both of the constraint terms from Eq 2 and Eq 3 are additive in nature and thus can be combined in a straightforward fashion. This formulation allows us to systematically combine information from three different models to produce better responses in terms of topic and semantic relevance. Conceptually, the likelihood term governs the grammatical structure of the response while the topic and semantic constraints drive content selection (Nenkova and Passonneau, 2004;Barzilay and Lapata, 2005).
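Since both constraint terms are additive, the combined objective can be written out explicitly. The expression below is a sketch assembled from the definitions above (the likelihood term plus the two weighted similarity terms); writing the likelihood term as a log probability is an assumption, chosen to match the log-likelihood scores used in beam search.

```latex
\hat{Y} = \operatorname*{argmax}_{Y} \Big\{ \log P(Y \mid X)
  + \alpha\, \Delta\big(P(T \mid X),\, P(T \mid Y)\big)
  + \beta\, \nabla\big(\mathrm{Emb}(X),\, \mathrm{Emb}(Y)\big) \Big\}
```

Setting α = 0 or β = 0 recovers the single-constraint objectives, and setting both to zero recovers plain conditional likelihood decoding.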

Decoding with Distributional Constraints
In Section 3, we defined two constraints (one topic constraint and one semantic constraint) for use in the decoding objective. Incorporating these constraints during decoding requires that they factorize in a way that is compatible with left-to-right beam search over words in the response. The standard approach to computing posterior distributions in topic models requires a probabilistic inference procedure over the entire source and target. Furthermore, computing semantic representations can involve the use of complex neural architectures. Both of these procedures are difficult to integrate into decoding, because they are computationally expensive and would need to be called repeatedly within the inner loop of the decoder. In addition, when performing left-to-right beam search, as is common practice in neural generation, the complete response is generally not available. To address these challenges, we propose using simple additive variants of these methods that factorize over words, which we found to enable efficient decoding without sacrificing performance.

Topic Similarity
Estimating the topic distribution of the source, P(T|X), and response, P(T|Y), is a key step in implementing the topic-similarity constraint. HMM-LDA is a generative model that is able to separate topic and syntax words, by inferring topic distributions in a corpus while flexibly modeling function words. We briefly summarize this model before describing our implementation.

Syntax-Topics model
Griffiths et al. (2005) suggested an unsupervised generative model that simultaneously labels each word in a document with a syntax ($c$) and topic ($z$) state. They modify the Latent Dirichlet Allocation (LDA) model to include a syntactic component akin to a Hidden Markov Model (HMM). In LDA, each topic $z$ is associated with a probability distribution over the vocabulary, $\phi^{(z)}$. HMM-LDA adds a distribution over words for each syntactic class $c$, written $\phi^{(c)}$. A special class, $c = 0$, is reserved for topics. The transition from class $c_{i-1}$ to class $c_i$ follows a multinomial distribution $\pi^{(c_{i-1})}$. Each document has an associated distribution over topics $\theta^{(d)}$; each word, $w_j$, in the document has an associated latent topic variable, $z_j$, that is drawn from $\theta^{(d)}$. Markov Chain Monte Carlo (MCMC) inference is used to infer values for the hidden topic and syntax variables associated with a given document collection. To estimate topic and syntax distributions, we performed collapsed Gibbs sampling over our training corpus of conversations, where each conversation is treated as a document. One sample of the hidden variables was used to estimate model parameters after 2,500 iterations of burn-in. Our code for training the HMM-LDA model is available online.
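The generative story summarized above can be illustrated with a small sampler. This is a minimal sketch under stated assumptions: the function names, the toy vocabulary, and all parameter values are hypothetical, and it only generates data from fixed parameters; it does not perform the collapsed Gibbs sampling used for training.

```python
import random


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(theta_d, phi_topic, phi_class, pi, length, rng, start_class=1):
    """Generate `length` word ids following the HMM-LDA generative story.

    theta_d   : the document's distribution over topics
    phi_topic : phi_topic[z] is topic z's distribution over the vocabulary
    phi_class : phi_class[c] is class c's distribution over the vocabulary
                (entry 0 is unused, since class 0 delegates to the topics)
    pi        : pi[c_prev] is the transition distribution over the next class
    """
    words, classes, topics = [], [], []
    prev_class = start_class  # assumption: an arbitrary initial class
    for _ in range(length):
        z = sample_index(theta_d, rng)           # z_j ~ theta^(d)
        c = sample_index(pi[prev_class], rng)    # c_j ~ pi^(c_{j-1})
        if c == 0:
            w = sample_index(phi_topic[z], rng)  # topic word ~ phi^(z_j)
        else:
            w = sample_index(phi_class[c], rng)  # syntax word ~ phi^(c_j)
        words.append(w)
        classes.append(c)
        topics.append(z)
        prev_class = c
    return words, classes, topics


# Toy parameters (hypothetical): two topics, one topic class and one syntax class.
VOCAB = ["the", "is", "gun", "police", "love", "heart"]
THETA_D = [0.7, 0.3]
PHI_TOPIC = [[0, 0, 0.5, 0.5, 0, 0], [0, 0, 0, 0, 0.5, 0.5]]
PHI_CLASS = [None, [0.6, 0.4, 0, 0, 0, 0]]
PI = [[0.3, 0.7], [0.5, 0.5]]

if __name__ == "__main__":
    ids, cls, _ = generate_document(THETA_D, PHI_TOPIC, PHI_CLASS, PI, 8, random.Random(0))
    print([(VOCAB[w], c) for w, c in zip(ids, cls)])
```

In this toy setting every word emitted from class 0 is a content word and every word emitted from class 1 is a function word, which is the separation the decoding constraint relies on.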

Estimating Topic Distributions with HMM-LDA
To compute distributional topic constraints in neural response generation, we first need an efficient method for estimating topic distributions that factorizes over words, given a point estimate of an HMM-LDA model's parameters. We would like to estimate topic distributions based on the content words contained in a sentence and ignore function words. HMM-LDA provides us with topic, $\phi^{(z)}$, and syntax, $\phi^{(c)}$, distributions over the vocabulary of words, w ∈ V. Treating a sentence as a bag of words, we can estimate its distribution over topics as a sum of topic distributions over all words normalized by sentence length. However, we found that this approach does not work well in practice because it gives equal weight to topic and syntax words. To address this issue, we weighted each word's topic distribution P(T|w) by its probability of being generated by the topic component of the HMM-LDA model (i.e. P(C = 0|w)). The topic distribution of a sentence, S, is estimated as:
$$P(T \mid S) = \frac{1}{Z} \sum_{w \in S} P(C = 0 \mid w)\, P(T \mid w)$$
where $Z = \sum_{w \in S} P(C = 0 \mid w)$ is a normalizing constant that corresponds to the expected number of content words in the sentence. As mentioned earlier, a more accurate estimate of the topic distribution could be obtained using MCMC inference or by applying the forward-backward algorithm. However, these methods are computationally expensive and not well-suited to the decoding framework used in neural generation. The method described above allows us to efficiently compute the topic distribution of a sentence for use in the topic constraint in Eq 2. For the similarity function, ∆, we simply use the vector dot product, which is closely related to cosine similarity. This formulation has the advantage that it enables memoization during decoding. Another advantage is that it captures the ratio of topic to syntax words through the weights P(C = 0|w). Therefore, the overall constraint has the effect of keeping the syntax-topics ratio in generated hypotheses similar to that of the source.
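The weighted estimate described above can be sketched in a few lines. The function names and the toy probability tables below are hypothetical, and the handling of out-of-vocabulary words and of sentences with no content mass are assumptions made for the example, not details taken from the text.

```python
def sentence_topic_distribution(words, p_topic_given_word, p_content_given_word):
    """Estimate P(T|S) as a content-weighted average of per-word topic distributions.

    p_topic_given_word[w]   : list of P(T = t | w) over the k topics
    p_content_given_word[w] : P(C = 0 | w), the probability that w is a topic word
    """
    num_topics = len(next(iter(p_topic_given_word.values())))
    weighted_sum = [0.0] * num_topics
    z = 0.0  # normalizer Z: expected number of content words
    for w in words:
        if w not in p_topic_given_word:
            continue  # assumption: out-of-vocabulary words are skipped
        weight = p_content_given_word[w]
        z += weight
        for t, p in enumerate(p_topic_given_word[w]):
            weighted_sum[t] += weight * p
    if z == 0.0:
        # assumption: fall back to a uniform distribution when nothing is known
        return [1.0 / num_topics] * num_topics
    return [s / z for s in weighted_sum]


def topic_similarity(source_dist, target_dist):
    """Dot-product similarity between two topic distributions."""
    return sum(a * b for a, b in zip(source_dist, target_dist))


# Toy tables (hypothetical values) with two topics.
P_TOPIC = {"the": [0.5, 0.5], "police": [0.8, 0.2], "gun": [0.9, 0.1], "love": [0.1, 0.9]}
P_CONTENT = {"the": 0.05, "police": 0.8, "gun": 0.9, "love": 0.9}

if __name__ == "__main__":
    source = sentence_topic_distribution(["the", "police"], P_TOPIC, P_CONTENT)
    for candidate in (["the", "gun"], ["the", "love"]):
        target = sentence_topic_distribution(candidate, P_TOPIC, P_CONTENT)
        print(candidate, topic_similarity(source, target))
```

Because both the weighted sum and Z are sums over words, one plausible way to use this during beam search is to carry them along with each partial hypothesis and update them when a word is appended, instead of recomputing over the whole prefix.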

Semantic Similarity
To define the semantic similarity constraint we first encode a semantic representation of the source and target into a fixed-dimensional embedding space. There are many sentence embedding methods that could be used; however, we want this encoding to be relatively efficient as it will be used many times during beam search. Arora et al. (2016) recently proposed a simple sentence embedding method, which was shown to have competitive performance across a variety of tasks. Their approach uses a weighted average of word embeddings where each word is weighted by $\frac{a}{a + P(w)}$; here, P(w) is the unigram probability and a is a hyperparameter. Such a weighting scheme reduces the impact of frequent words (typically function words) in the overall sentence embedding. Next, the first principal component of all the sentence embeddings in the corpus is removed. Arora et al. (2016) point out that the first principal component has high cosine similarity with common function words. Removing this component gives sentence embeddings that encapsulate the semantic meaning of the sentence. We use this technique in our implementation of Emb() in Eq 3. For the similarity function, ∇, we use the dot product. Analogous to the topic constraint described above, this approach to measuring semantic similarity also decomposes over words and works well in the decoding framework.
Table 2 (caption fragment): Column 2 shows the total number of dialogues after all pre-processing and Column 3 shows the number of sampled dialogues in the test set.
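The weighted-average embedding and the removal of a common direction can be sketched as follows. This is an illustration under assumptions: the function name, the default value of `a`, and the toy vectors are hypothetical, and the common direction is taken as a precomputed input rather than estimated from a corpus.

```python
import math


def sif_embedding(words, word_vectors, unigram_prob, a=1e-3, common_direction=None):
    """Weighted average of word vectors with weight a / (a + P(w)).

    If `common_direction` is given (assumed to be the first principal component
    of the corpus sentence embeddings, computed elsewhere), the projection of
    the sentence embedding onto that direction is removed.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for w in words:
        if w not in word_vectors:
            continue  # assumption: out-of-vocabulary words are skipped
        weight = a / (a + unigram_prob.get(w, 0.0))
        for i, x in enumerate(word_vectors[w]):
            total[i] += weight * x
        count += 1
    if count == 0:
        return total
    emb = [x / count for x in total]
    if common_direction is not None:
        norm = math.sqrt(sum(x * x for x in common_direction))
        unit = [x / norm for x in common_direction]
        proj = sum(e * u for e, u in zip(emb, unit))
        emb = [e - proj * u for e, u in zip(emb, unit)]
    return emb


def semantic_similarity(emb_x, emb_y):
    """Dot-product similarity between two sentence embeddings."""
    return sum(a * b for a, b in zip(emb_x, emb_y))


# Toy data (hypothetical): a frequent function word and a rare content word.
WORD_VECTORS = {"the": [1.0, 0.0], "gun": [0.0, 1.0]}
UNIGRAM_PROB = {"the": 0.05, "gun": 0.0001}

if __name__ == "__main__":
    print(sif_embedding(["the", "gun"], WORD_VECTORS, UNIGRAM_PROB))
```

With these toy values the frequent word contributes far less to the sentence embedding than the rare one, which is the intended effect of the weighting.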

Datasets
For training purposes we use OpenSubtitles (Tiedemann, 2009), a large corpus of movie subtitles (roughly 60M-70M lines) that is freely available and has been used in a broad range of recent work on data-driven conversation. OpenSubtitles does not contain speaker annotations on the dialogue turns, so, as previously noted, the data is somewhat noisy when used for learning data-driven conversation models. Nonetheless, it is possible to create a useful corpus of conversations from this data by assuming each line corresponds to a full speaker turn. Although this assumption is often violated, prior work has successfully trained and evaluated neural conversation models using this corpus. In our experiments we used a preprocessed version of this dataset distributed by Li et al. (2016a). The dataset contains a large number of two-turn dialogues, out of which we sampled 23M to use as our training set and 10k as a validation set.
Due to the noisy nature of the OpenSubtitles conversations we do not use them for evaluation. Instead, we leverage the Cornell Movie Dialogue Corpus (Danescu-Niculescu-Mizil and Lee, 2011), which is much smaller but contains accurate speaker annotations. We extracted all two-turn conversations (source-target pairs) from this corpus and removed those with fewer than three or more than 25 words. After this, we divided the remaining conversations into three buckets based on source length. The numbers can be found in Table 2. From each bucket we randomly sampled ≈333 dialogues for a total of 1000 dialogues in our test set. We evaluate all models on this test set. Since automatic metrics do not correlate well with human judgment, we manually tuned the hyperparameters (α and β) on a small development set (4 dialogues from each bucket, giving a 12-sentence development set that is disjoint from the test set). We manually inspected the responses generated by the model on the development set for different values of α and β and chose those that performed best.

Experimental Conditions and Baselines
During learning we use the same hyperparameters for all models; these are displayed in Table 1, and are based on those reported by Li et al. (2016a). We compare our approach with the following baselines:
MMI: We re-implemented the MMI-bidi method proposed by Li et al. (2016a). MMI is a particularly appropriate baseline for comparison, as it encourages responses that have higher relevance to the input, in contrast to conditional likelihood, which tends to favor responses with higher unconditional probability. MMI-bidi generates B candidates using beam search on a Seq2Seq model trained to maximize the conditional likelihood of the target given the source, P(Y|X), then re-ranks them using a separately trained source-given-target model, P(X|Y). Combining both directions in this way has the effect of maximizing mutual information (Li et al., 2016a).
TA-Seq2Seq: Another relevant baseline is the TA-Seq2Seq model of Xing et al. (2017), which integrates information from a pre-trained topic model into neural response generation using an attention mechanism to condition on relevant topic words. They evaluate their model on a dataset of Chinese forum posts. Unfortunately we could not use the code provided by the authors due to a data mismatch (their model makes use of user identities, which are not available in the OpenSubtitles corpus). We therefore compare with a reimplementation of their approach in which we modify each source sentence to include a list of the 20 most relevant topic words from HMM-LDA and then train using the same Seq2Seq framework with attention. This enables the model to condition on the relevant topic words. In addition to incorporating attention over topics, Xing et al. also introduced an approach to biased generation; to replicate this, we add a constant factor to all topic words during prediction.
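The re-ranking step of the MMI-bidi baseline can be illustrated with a short sketch. The exact way the two directions are combined is not specified in this section, so the weighted sum and the weight `lam` below are assumptions, and the candidate scores are made-up numbers rather than model outputs.

```python
def mmi_bidi_rerank(candidates, lam=0.5):
    """Re-rank beam candidates using forward and backward log-probabilities.

    candidates : list of (response, log P(Y|X), log P(X|Y)) tuples, where the
                 backward score comes from a separately trained model
    lam        : hypothetical weight on the backward (source-given-target) score
    """
    return sorted(candidates, key=lambda c: c[1] + lam * c[2], reverse=True)


# Made-up scores: the generic reply is likely under P(Y|X) but explains the
# source poorly under P(X|Y); the specific reply shows the opposite pattern.
BEAM = [
    ("i don't know", -2.0, -12.0),
    ("the police took his gun", -4.0, -5.0),
]

if __name__ == "__main__":
    for response, forward, backward in mmi_bidi_rerank(BEAM):
        print(response, forward + 0.5 * backward)
```

With `lam=0` the ranking reduces to plain conditional likelihood, which is why the generic reply wins in that setting and loses once the backward score is taken into account.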

Results and Analysis
Our proposed decoding objective constraints (topic and semantic) are complementary to the MMI objective, which encourages diversity and relevance to the source input. Therefore, in addition to comparing against the baselines described above, we evaluated three variants of our model: (1) maximum conditional likelihood combined with semantic and topic distributional constraints with a beam size of 10 (DC-10), (2) the same configuration with MMI-bidi re-ranking using a beam size of 10 (DC-MMI10), and (3) MMI-bidi re-ranking with a beam size of 200 (DC-MMI200). We test all configurations on the 1000-conversation test set described in Section 5 and compare them on automatic metrics and in a crowdsourced human evaluation. We do not consider TA-200 (TA-Seq2Seq, Beam=200), DC-200 and MMI-10 for human evaluation as they appear to perform worse than other model variants in automatic metrics and also on our set of development sentences. Sample responses for all the remaining models are presented in Table 3.

Automatic Metrics
Following Li et al. (2016a), we report distinct-1 and distinct-2, which measure the diversity of responses. These are the ratios of types to tokens for unigrams and bigrams, respectively. We also report BLEU-1 scores following previous work; however, it should be noted that BLEU-1 is not generally accepted to correlate with human judgments in conversation generation tasks (Liu et al., 2016), as there are many acceptable ways to reply to an input which may not match a reference response. Lastly, we compare the percentage of stop words in the responses generated by each model (smaller values, closer to the distribution of human conversations, are preferred). The automatic evaluation is presented in Table 4.
Table 3: Sample responses of all the models on the dev set.
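The distinct-1 and distinct-2 metrics can be computed with a few lines of code. The sketch below assumes the denominator is the total number of n-grams across all responses; some implementations divide by the total number of generated words instead, so the function name and this choice are assumptions.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams (types) to all n-grams (tokens) over all responses.

    responses : list of tokenized responses (each a list of strings)
    """
    seen = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0


if __name__ == "__main__":
    generated = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
    print(distinct_n(generated, 1), distinct_n(generated, 2))
```

A system that repeats the same generic reply for many inputs adds tokens without adding types, which drives both ratios down.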
For brevity we define aliases for each system in the 2nd column of Table 4, which are used in subsequent discussion. The human responses are diverse and also generally longer than automatically generated responses. MMI200 has higher diversity than TA-Seq2Seq in terms of distinct-1 and distinct-2. This illustrates the importance of re-ranking using MMI. Our approach produces almost twice as many distinct unigrams and bigrams. We also observe that MMI200 and TA-Seq2Seq achieve higher BLEU scores than our models; however, this is not surprising since our models are designed to generate more interesting responses containing rarer content words that are less likely to appear in reference responses.
Table 4: Automatic metrics evaluation. The 3rd and 4th columns show the ratio of types to tokens for unigrams and bigrams respectively. The 7th column shows the percentage of stop words generated by the models in their responses.
Table 5: Human judgments for Plausibility of the different models. Each numerical cell contains a percentage value corresponding to its row, truncated to two decimal places.
As expected we observe that MMI200 and TA-10 have a higher percentage of stop-words than human responses. According to the human evaluation discussed in Section 7.2, these models were also found to have lower content richness.

Human Evaluation
We conducted a survey on the crowd-sourcing platform Amazon Mechanical Turk. Every model response is scored on two categories: 1) Plausibility: is the response plausible for the given source? and 2) Content Richness: does the response add new information to the conversation? We asked the evaluators to respond on a 5-point scale to the questions above (Strongly Agree, Agree, Unsure, Disagree, Strongly Disagree). These were later collapsed to 3 categories (Agree, Unsure, Disagree). The results for plausibility and content richness of our model, in addition to the MMI and TA-Seq2Seq baselines and human responses, are presented in Table 5. We observe that the MMI200 and TA-10 models achieve slightly better plausibility scores since they tend to generate safe, dull responses. However, we find that when using a beam size of 200 and MMI re-ranking, our approach which incorporates distributional constraints, DC-MMI200, achieves competitive plausibility while achieving significantly higher content richness.
Table 6: Comparing model variations by reducing the beam size to 10, and comparing decoder constraints without MMI re-ranking.

Statistical Significance of Results
To verify the statistical significance of our findings, we conducted a pairwise bootstrap test (Efron and Tibshirani, 1994; Berg-Kirkpatrick et al., 2012) comparing the difference between the percentage of Agree annotations (the Yes column in Table 5). We computed p-values for each pair of models: MMI200 vs DC-MMI200 and TA vs DC-MMI200. For plausibility, we did not find a significant difference in either comparison (p-value ≈ 0.25), while for content richness both differences were found to be significant (p-value $< 10^{-4}$). To summarize: our model significantly beats both baselines in terms of content richness, while the difference in plausibility was not found to be statistically significant.
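A paired bootstrap test of this kind can be sketched as follows. The sketch follows the shifted-threshold variant commonly attributed to Berg-Kirkpatrick et al. (2012) and assumes one binary Agree label per test item and system; the function name, the number of resamples, and the per-item label format are assumptions, not details from the evaluation above.

```python
import random


def paired_bootstrap_p_value(agree_a, agree_b, num_samples=10000, seed=0):
    """Estimate a one-sided p-value for 'system A gets more Agree labels than B'.

    agree_a, agree_b : per-item 0/1 labels for the two systems on the same items
    The test resamples items with replacement and counts how often the
    resampled difference exceeds twice the observed difference.
    """
    if len(agree_a) != len(agree_b):
        raise ValueError("both systems must be judged on the same items")
    n = len(agree_a)
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(agree_a, agree_b)]
    observed = sum(diffs) / n
    exceed = 0
    for _ in range(num_samples):
        sample = rng.choices(diffs, k=n)  # resample paired items
        if sum(sample) / n > 2 * observed:
            exceed += 1
    return exceed / num_samples


if __name__ == "__main__":
    # Made-up labels: A is judged Agree on 80 of 100 items, B on 40 of 100.
    a = [1] * 80 + [0] * 20
    b = [1] * 40 + [0] * 60
    print(paired_bootstrap_p_value(a, b, num_samples=2000))
```

Resampling the paired differences, rather than each system independently, keeps the comparison on identical test items, which is what makes the test paired.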

Pairwise Evaluation of Interestingness
To further validate our claims we also did a side-by-side comparison study between MMI200 and DC-MMI200. For every test case, we showed Mechanical Turk workers the source sentence along with responses generated by both systems and asked them to select which is more interesting. We observe that in 56% out of 1000 cases, DC-MMI200 was rated as the more interesting response. The result is statistically significant with p-value $< 4 \times 10^{-4}$ (using an exact binomial test).
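An exact binomial test against a fair-coin null hypothesis can be computed directly with integer arithmetic. In the sketch below the function name is hypothetical, and the count of 560 wins is an assumption obtained by reading "56% out of 1000" as exactly 560 cases; whether the reported test was one-sided or two-sided is not stated.

```python
from math import comb


def binomial_tail_p_value(successes, trials):
    """One-sided exact p-value P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return tail / 2 ** trials  # exact integer counts, converted to float once


if __name__ == "__main__":
    one_sided = binomial_tail_p_value(560, 1000)
    two_sided = min(1.0, 2 * one_sided)
    print(one_sided, two_sided)
```

Summing binomial coefficients as integers avoids the underflow that can occur when multiplying very small floating-point probabilities for large numbers of trials.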

Model Variations
To see the effectiveness of our decoding constraints separately, we compare the best performing DC-MMI200 model with DC-10 and DC-MMI10, both of which use a beam size of 10; DC-10 does not include MMI re-ranking. The results of the Mechanical Turk evaluation, following the approach described in Section 7.2, are presented in Table 6. We observe that with a beam size of 10 our model is able to generate content-rich responses, but suffers in terms of plausibility. The values in the table suggest that the decoding constraints defined in this work successfully inject content words into candidate hypotheses and that MMI is able to effectively choose plausible candidates. In the case of DC-10 and DC-MMI10, both models generate the same candidates, but MMI is able to re-rank the results and thus improves plausibility.

Related Work
Conversational agents primarily fall into two categories: task-oriented dialogue systems (Williams et al., 2013; Wen et al., 2015) and chatbots (Weizenbaum, 1966), although there have been some efforts to integrate the two (Dodge et al., 2015). Some of the earliest work on data-driven chatbots (Ritter et al., 2011) explored the use of phrase-based Statistical Machine Translation (SMT) on large numbers of conversations gathered from Twitter (Ritter et al., 2010). Subsequent progress on the use of neural networks in machine translation inspired the use of Sequence-to-Sequence (Seq2Seq) models for data-driven response generation (Shang et al., 2015; Sordoni et al., 2015; Li et al., 2016a). Our approach, which incorporates distributional constraints into the decoding objective, is related to prior work on posterior regularization (Mann and McCallum, 2008; Ganchev et al., 2010; Zhu et al., 2014). Posterior regularization introduces similar distributional constraints on expectations computed over unlabeled data using a model's parameters. These are typically added to the learning objective for semi-supervised scenarios where available labeled data is limited. In contrast, our approach introduces distributional constraints into the decoding objective as a way to combine neural conversation models trained on large quantities of conversational data with separately trained models of topics and semantic similarity that can drive content selection.
There are numerous examples of related work on improving neural conversation models. Shao et al. (2017) introduced a stochastic approach to beam search that does segment-by-segment re-ranking to promote diversity. Zhang et al. (2018) develop models which converse while assuming a persona defined by a short description of attributes. Wang et al. (2017) suggested decoding methods that influence the style and topic of the generated response. Bosselut et al. (2018) develop discourse-aware rewards with reinforcement learning (RL) to generate long and coherent texts. Li et al. (2016c) applied deep reinforcement learning to dialogue generation to maximize the long-term reward of the conversation, as opposed to directly maximizing the likelihood of the response. This line of work was further extended with adversarial learning (Li et al., 2017) that rewards generated conversations that are indistinguishable from real conversations in the data. Lewis et al. (2017) applied reinforcement learning with dialogue rollouts to generate replies that maximize expected reward, while learning to generate responses from a crowdsourced dataset of negotiation dialogues. Crowd-workers have also been used to gather a corpus of 100K information-seeking QA dialogues that are answerable using text spans from Wikipedia. Niu and Bansal (2018) designed a number of weakly-supervised models that generate polite, neutral or rude responses. Their fusion model combines a language model trained on polite utterances with the decoder. In the second method they prepend the utterance with a politeness label and scale its embedding to vary politeness. The third model is Polite-RL, which assigns a reward based on a politeness classifier. Gimpel et al. (2013) explored methods for increasing the diversity of N-best lists in machine translation by introducing a pairwise dissimilarity function. Similar ideas have been explored in the context of neural generation models (Vijayakumar et al., 2016; Li and Jurafsky, 2016; Li et al., 2016b).
Following previous work we evaluated our approach using a combination of automatic metrics and human judgments. Some recent work has explored the possibility of adversarial evaluation of neural conversation models (Lowe et al., 2017; Li et al., 2017).

Conclusions
We presented an approach to generate more interesting responses in neural conversation models by incorporating side information in the form of distributional constraints. When using maximum likelihood decoding objectives, neural conversation models tend to generate safe responses, such as "I don't know", for most inputs. Our proposed approach provides a flexible method of incorporating a broad range of distributional constraints into the decoding objective. We proposed and empirically evaluated two constraints that factorize over words, and therefore naturally fit into the commonly used left-to-right beam search decoding framework. The first encourages the use of more relevant topic words in the response; the second encourages semantic similarity between the source and target. We empirically demonstrated, through human evaluation, that when taken together these constraints lead to responses that contribute significantly more information to the conversation, while maintaining plausibility in the context of the input.