Diversifying Dialogue Generation with Non-Conversational Text

Neural network-based sequence-to-sequence (seq2seq) models strongly suffer from the low-diversity problem when it comes to open-domain dialogue generation. As bland and generic utterances usually dominate the frequency distribution in our daily chitchat, avoiding them to generate more interesting responses requires complex data filtering, sampling techniques or modifying the training objective. In this paper, we propose a new perspective to diversify dialogue generation by leveraging non-conversational text. Compared with bilateral conversations, non-conversational text are easier to obtain, more diverse and cover a much broader range of topics. We collect a large-scale non-conversational corpus from multi sources including forum comments, idioms and book snippets. We further present a training paradigm to effectively incorporate these text via iterative back translation. The resulting model is tested on two conversational datasets from different domains and is shown to produce significantly more diverse responses without sacrificing the relevance with context.


Introduction
Seq2seq models have achieved impressive success in a wide range of text generation tasks.In opendomain chitchat, however, people have found the model tends to strongly favor short, generic responses like "I don't know" or "OK" (Vinyals and Le, 2015;Shen et al., 2017a).The reason lies in the extreme one-to-many mapping relation between every context and its potential responses (Zhao et al., 2017;Su et al., 2018).Generic utterances, which can be in theory paired with most context, usually dominate the frequency distribution in the dialogue training corpus and thereby pushes the model to * Equal contribution.

Conversational Text
Context 暗恋的人却不喜欢我 (Translation) The one I have a crush on doesn't like me.

Non-Conversational Text
Forum Comments

暗恋这碗酒，谁喝都会醉啊
Crush is an alcoholic drink, whoever drinks it will get intoxicated.

何必等待一个没有结果的等待
Why wait for a result without hope

真诚的爱情之路永不会是平坦的
The course of true love never did run smooth (From A Midsummer Night's Dream) Table 1: A daily dialogue and non-conversational text from three sources.The contents of non-conversational text can be potentially utilized to enrich the response generation.
blindly produce these safe, dull responses (Su et al., 2019b;Csáky et al., 2019) Current solutions can be roughly categorized into two classes: (1) Modify the seq2seq itself to bias toward diverse responses (Li et al., 2016a;Shen et al., 2019a).However, the model is still trained on the limited dialogue corpus which restricts its power at covering broad topics in opendomain chitchat.(2) Augment the training corpus with extra information like structured world knowledge, personality or emotions (Li et al., 2016b;Dinan et al., 2019), which requires costly human annotation.
In this work, we argue that training only based on conversational corpus can greatly constrain the usability of an open-domain chatbot system since many topics are not easily available in the dialogue format.With this in mind, we explore a cheap way to diversify dialogue generation by utilizing large amounts of non-conversational text.Compared with bilateral conversations, non-conversational text covers a much broader range of topics, and can be easily obtained without further human annotation from multiple sources like forum comments, idioms and book snippets.More importantly, nonconversational text are usually more interesting and contentful as they are written to convey some spe-cific personal opinions or introduce a new topic, unlike in daily conversations where people often passively reply to the last utterance.As can be seen in Table 1, the response from the daily conversation is a simple comfort of "Head pat".Nonconversational text, on the contrary, exhibit diverse styles ranging from casual wording to poetic statements, which we believe can be potentially utilized to enrich the response generation.
To do so, we collect a large-scale corpus containing over 1M non-conversational utterances from multiple sources.To effectively integrate these utterances, we borrow the back translation idea from unsupervised neural machine translation (Sennrich et al., 2016;Lample et al., 2018b) and treat the collected utterances as unpaired responses.We first pre-train the forward and backward transduction model on the parallel conversational corpus.The forward and backward model are then iteratively tuned to find the optimal mapping relation between conversational context and non-conversational utterances (Cotterell and Kreutzer, 2018).By this means, the content of non-conversational utterances is gradually distilled into the dialogue generation model (Kim and Rush, 2016), enlarging the space of generated responses to cover not only the original dialogue corpus, but also the wide topics reflected in the non-conversational utterances.
We test our model on two popular Chinese conversational datasets weibo (Shang et al., 2015a) and douban (Wu et al., 2017).We compare our model against retrieval-based systems, style-transfer methods and several seq2seq variants which also target the diversity of dialogue generation.Automatic and human evaluation show that our model significantly improves the responses' diversity both semantically and syntactically without sacrificing the relevance with context, and is considered as most favorable judged by human evaluators1 .

Related Work
The tendency to produce generic responses has been a long-standing problem in seq2seq-based open-domain dialogue generation (Vinyals and Le, 2015;Li et al., 2016a).Previous approaches to alleviate this issue can be grouped into two classes.
The first class resorts to modifying the seq2seq architecture itself.For example, Shen et al. (2018a); Zhang et al. (2018b) changes the train-ing objective to mutual information maximization and rely on continuous approximations or policy gradient to circumvent the non-differentiable issue for text.Li et al. (2016d);Serban et al. (2017a) treat open-domain chitchat as a reinforcement learning problem and manually define some rewards to encourage long-term conversations.There is also research that utilizes latent variable sampling (Serban et al., 2017b;Shen et al., 2018bShen et al., , 2019b)), adversarial learning (Li et al., 2017;Su et al., 2018), replaces the beam search decoding with a more diverse sampling strategy (Li et al., 2016c;Holtzman et al., 2019) or applies reranking to filter generic responses (Li et al., 2016a;Wang et al., 2017).All of the above are still trained on the original dialogue corpus and thereby cannot generate out-of-scope topics.
The second class seeks to bring in extra information into existing corpus like structured knowledge (Zhao et al., 2018;Ghazvininejad et al., 2018;Dinan et al., 2019), personal information (Li et al., 2016b;Zhang et al., 2018a) or emotions (Shen et al., 2017b;Zhou et al., 2018).However, corpus with such annotations can be extremely costly to obtain and is usually limited to a specific domain with small data size.Some recent research started to do dialogue style transfer based on personal speeches or TV scripts (Niu and Bansal, 2018;Gao et al., 2019;Su et al., 2019a).Our motivation differs from them in that we aim at enriching general dialogue generation with abundant non-conversational text instead of being constrained on one specific type of style.
Back translation is widely used in unsupervised machine translation (Sennrich et al., 2016;Lample et al., 2018a;Artetxe et al., 2018) and has been recently extended to similar areas like style transfer (Subramanian et al., 2019), summarization (Zhao et al., 2019) and data-to-text (Chang et al., 2020).To the best of our knowledge, it has never been applied to dialogue generation yet.Our work treats the context and non-conversational text as unpaired source-target data.The backtranslation idea is naturally adopted to learn the mapping between them.The contents of nonconversational text can then be effectively utilized to enrich the dialogue generation.

Dataset
We would like to collect non-conversational utterances that stay close with daily-life topics and can be potentially used to augment the response space.The utterance should be neither too long nor too short, similar with our daily chitchats.Therefore, we collect data from the following three sources: 1. Forum comments.We collect comments from zhihu2 , a popular Chinese forums.Selected comments are restricted to have more than 10 likes and less than 30 words3 .
2. Idioms.We crawl idioms, famous quotes, proverbs and locutions from several websites.These phrases are normally highly-refined and graceful, which we believe might provide a useful augmentation for responses.
3. Book Snippets.We select top 1,000 favorite novels or prose from wechat read4 .Snippets highlighted by readers, which are usually quintessential passages, and with the word length range 10-30 are kept.
We further filter out sentences with offensive or discriminative languages by phrase matching against a large blocklist.The resulting corpus contains over 1M utterances.The statistics from each source are listed in Table 2.
4 Approach . ., T M } denotes our collected corpus where T i is a non-conversational utterance.As the standard seq2seq model trained only on D tends to generate over-generic responses, our purpose is to diversify the generated responses by leveraging the non-conversational corpus D T , which are semantically and syntactically much richer than responses contained in D. In the following section, we first go through several baseline systems, then introduce our proposed method based on back translation.

Retrieval-based System
The first approach we consider is a retrieval-based system that considers all sentences contained in D T as candidate responses.As the proportion of generic utterances in D T is much lower than that in D, the diversity will be largely improved.
Standard retrieval algorithms based on contextmatching (Wu et al., 2017;Bartl and Spanakis, 2017) fail to apply here since non-conversational text does not come with its corresponding context.Therefore, we train a backward seq2seq model on the parallel conversational corpus D to maximize p(X i |Y i ).The score assigned by the backward model, which can be seen as an estimation of the point-wise mutual information, is used to rank the responses (Li et al., 2016a) 5 .The major limitation of the retrieval-based system is that it can only produce responses from a finite set of candidates.The model can work well only if an appropriate response already exists in the candidate bank.Nonetheless, due to the large size of the non-conversational corpus, this approach is a very strong baseline.

Weighted Average
The second approach is to take a weighted average score of a seq2seq model trained on D and a language model trained on D T when decoding responses.The idea has been widely utilized on domain adaptation for text generation tasks (Koehn and Schroeder, 2007;Wang et al., 2017;Niu and Bansal, 2018).In our scenario, basically we hope the generated responses could share the diverse topics and styles of the non-conversational text, yet stay relevant with the dialogue context.The seq2seq model S2S is trained on D as an indicator of how relevant each response is with the context.A language model L is trained on D T to measure how the response matches the domain of D T .The decoding probability for generating word w at time step t is assigned by: where α is a hyperparameter to adjust the balance between the two.Setting α = 1 will make it degenerate into the standard seq2seq model while α = 0 will totally ignore the dialoge context.

Multi-task
The third approach is based on multi-task learning.A seq2seq model is trained on the parallel conversational corpus D while an autoencoder model is trained on the non-parallel monologue data D T .Both models share the decoder parameters to facilitate each other.The idea was first experimented on machine translation in order to leverage large amounts of target-side monolingual text (Luong et al., 2016;Sennrich et al., 2016).Luan et al. (2017) extended it to conversational models for speaker-role adaptation.The intuition is that by tying the decoder parameters, the seq2seq and autoencoder model can learn a shared latent space between the dialogue corpus and non-conversational text.When decoding, the model can generate responses with features from both sides.

Back Translation
Finally, we consider the back translation technique commonly used for unsupervised machine translation (Artetxe et al., 2018;Lample et al., 2018a).The basic idea is to first initialize the model properly to provide a good starting point, then iteratively perform backward and forward translation to learn the correspondence between context and unpaired non-conversational utterances.
Initialization Unlike unsupervised machine translation, the source and target side in our case come from the same language, and we already have a parallel conversational corpus D, so we can get rid of the careful embedding alignment and autoencoding steps as in Lample et al. (2018b).For the initialization, we simply train a forward and backward seq2seq model on D. The loss function is: where P f and P b are the decoding likelihood defined by the forward and backward seq2seq model respectively.We optimize Eq. 2 until convergence.Afterwards, the forward and backward seq2seq can learn the backbone mapping relation between a context and its response in a conversational structure.
Backward After the initialization, we use the backward seq2seq to create pseudo parallel training examples from the non-conversational text D T .The forward seq2seq is then trained on the pseudo pairs.The objective is to minimize: where we approximate the arg max function by using a beam search decoder to decode from the backward model P b (u|T i ).Because of the nondifferentiability of the arg max operator, the gradient is only passed through P f but not P b6 .As P b is already well initialized by training on the parallel corpus D, the back-translated pseudo where the arg max function is again approximated with a beam search decoder and the gradient is only backpropagated through P b .Though X i has its corresponding Y i in D, we drop Y i and instead train on forward translated pseudo pairs {X i , f (X i )}.
As P f is trained by leveraging data from D T , f (X i ) can have superior diversity compared with Y i .
The encoder parameters are shared between the forward and backward models while decoders are separate.The backward and forward translation are iteratively performed to close the gap between P f and P b (Hoang et al., 2018;Cotterell and Kreutzer, 2018).The effects of non-conversational text are strengthened after each iteration.Eventually, the forward model will be able to produce diverse responses covering the wide topics in D T .Algorithm 1 depicts the training process.

Datasets
We conduct our experiments on two Chinese dialogue corpus Weibo (Shang et al., 2015b) and Douban (Wu et al., 2017).Weibo 7 is a popular Twitter-like microblogging service in China, on which a user can post short messages, and other 7 http://www.weibo.com/users make comment on a published post.The postcomment pairs are crawled as short-text conversations.Each utterance has 15.4 words on average and the data is split into train/valid/test subsets with 4M/40k/10k utterance pairs.Douban8 is a Chinese social network service where people can chat about different topics online.The original data contains 1.1M multi-turn conversations.We split them into two-turn context-response pairs, resulting in 10M train, 500k valid and 100K test samples.

General Setup
For all models, we use a two-layer LSTM (Hochreiter and Schmidhuber, 1997) encoder/decoder structure with hidden size 500 and word embedding size 300.Models are trained with Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.15.We set the batch size as 256 and use the gradients clipping of 5. We build out vocabulary with character-based segmentation for Chinese.For non-Chinese tokens, we simply split by space and keep all unique tokens that appear at least 5 times.Utterances are cut down to at most 50 tokens and fed to every batch.We implement our models based on the OpenNMT toolkit (Klein et al., 2017) and other hyperparameters are set as the default values.

Compared Models
We compare our model with the standard seq2seq and four popular variants which were proposed to improve the diversity of generated utterances.All of them are trained only on the parallel conversational corpus: Standard The standard seq2seq with beam search decoding (size 5).
MMI The maximum mutual information decoding which reranks the decoded responses with a backward seq2seq model (Li et al., 2016a).The hyperparameter λ is set to 0.5 as suggested.200 candidates per context are sampled for re-ranking Diverse Sampling The diverse beam search strategy proposed in Vijayakumar et al. ( 2018) which explicitly controls for the exploration and exploitation of the search space.We set the number of groups as 5, λ = 0.3 and use the Hamming diversity as the penalty function as in the paper.
Nucleus Sampling Proposed in Holtzman et al. (2019), it allows for diverse sequence generations.Instead of decoding with a fixed beam size, it samples text from the dynamic nucleus.We use the default configuration and set p = 0.9.CVAE The conditional variational autoencoder (Serban et al., 2017b;Zhao et al., 2017) which injects diversity by imposing stochastical latent variables.We use a latent variable with dimension 100 and utilize the KL-annealing strategy with step 350k and a word drop-out rate of 0.3 to alleviate the posterior collapse problem (Bowman et al., 2016).
Furthermore, we compare the 4 approaches mentioned in §4 which incorporate the collected nonconversational text: Retrieval-based ( §4.1) Due to the large size of the non-conversational corpus, exact ranking is extremely slow.Therefore, we first retrieve top 200 matched text with elastic search based on the similarity of Bert embeddings (Devlin et al., 2019).Specifically, we pass sentences through Bert and derive a fixed-sized vector by averaging the outputs from the second-to-last layer (May et al., 2019) 9 .The 200 candidates are then ranked with the backward score10 .
Weighted Average ( §4.2) We set λ = 0.5 in eq. 1, which considers context relevance and diversity with equal weights.
Multi-task (( §4.3))We concatenate each contextresponse pair with a non-conversational utterance and train with a mixed objective of seq2seq and autoencoding (by sharing the decoder).
Back Translation ( §4.4)We perform the iterative backward and forward translation 4 times for both datasets.We observe the forward cross entropy loss converges after 4 iterations.

Results
As for the experiment results, we report the automatic and human evaluation in §6.1 and §6.2 respectively.Detailed analysis are shown in §6.3 to elaborate the differences among model performances and some case studies.

Automatic Evaluation
Evaluating dialogue generation is extremely difficult.Metrics which measure the word-level overlap like BLEU (Papineni et al., 2002) have been widely used for dialogue evaluation.However, these metrics do not fit into our setting well as we would like to diversify the response generation with an external corpus, the generations will inevitably differ greatly from the ground-truth references in the original conversational corpus.Though we report the BLEU score anyway and list all the results in Table 3, it is worth mentioning that the BLEU score itself is by no means a reliable metric to measure the quality of dialogue generations.
Diversity Diversity is a major concern for dialogue generation.Same as in (Li et al., 2016a), we measure the diversity by the ratio of distinct unigrams (Dist-1) and bigrams (Dist-2) in all generated responses.As the ratio itself ignores the frequency distribution of n-grams, we further calculate the entropy value for the empirical distribution of n-grams (Zhang et al., 2018b).A larger entropy indicates more diverse distributions.We report the entropy of four-grams (Ent-4) in Table 3.Among models trained only on the conversational corpus, the standard seq2seq performed worst as expected.All different variants improved the diversity more or less.Nucleus sampling and CVAE generated most diverse responses, especially Nucleus who wins on 6 out of the 8 metrics.By incorporating the non-conversational corpus, the diversity of generated responses improves dramatically.The retrieval-based system and our model perform best, in most cases even better than human references.This can happen as we enrich the response generation with external resources.The diversity would be more than the original conversational corpus.Weighted-average and multi-task models are relatively worse, though still greatly outperforming models trained only on the conversational corpus.We can also observe that our model improves over standard seq2seq only a bit after one iteration.As more iterations are added, the diversity improves gradually.
Relevance Measuring the context-response relevance automatically is tricky in our case.The typical way of using scores from forward or backward models as in Li and Jurafsky (2017)  for the seq2seq model trained on only on the conversational corpus and thus would be assigned very low scores.Apart from the BLEU-2 score, we further evaluate the relevance by leveraging an adversarial discriminator (Li et al., 2017).As has been shown in previous research, discriminative models are generally less biased to high-frequent utterances and more robust against their generative counterparts (Lu et al., 2017;Luo et al., 2018).The discriminator is trained on the parallel conversational corpus distinguish correct responses from randomly sampled ones.We encode the context and response separately with two different LSTM neural networks and output a binary signal indicating relevant or not11 .The relevance score is defined as the success rate that the model fools the adversarial classifier into believing its generations (Adver in

Human Evaluation
Apart from automatic evaluations, we also employed crowdsourced judges to evaluate the quality of generations for 500 contexts of each dataset.We focus on evaluating the generated responses regarding the (1) relevance: if they coincide with the context (Rel), (2) interestingness: if they are interesting for people to continue the conversation (Inter) and (3) fluency: whether they are fluent by grammar (Flu) 13 .Each sample gets one point if judged as yes and zero otherwise.Each pair is judged by three participants and the score supported by most people is adopted.The averaged scores are summarized in Table 4.We compare the standard seq2seq model, nucleus sampling which performs best among all seq2seq variants, and the four approaches leveraging the non-conversational text.All models perform decently well as for fluency except the weighted average one.The scores for diversity and relevance generally correlate well with the automatic evaluations.approaches in its capability to properly make use of the non-conversational corpus.

Effect of Iterative Training
To show the importance of the iterative training paradigm, we visualize the change of the validation loss in Figure 2 14 .The forward validation loss is computed as the perplexity of the forward seq2seq on the pseudo context-response pairs obtained from the backward model, vice versa for backward loss.It approximately quantifies the KL divergence between them two (Kim and Rush, 2016;Cotterell and Kreutzer, 2018).As the iteration goes, the knowledge from the backward model is gradually distilled into the forward model.The divergence between them reaches the lowest point at iteration 4, where we stop our model.shown in Table 6.To find semantically novel responses, we segment them into phrases and find those containing novel phrases that do not exist on the conversational corpus.As in the first example of Table 6, the word 胖若两人 only exists in the non-conversational corpus.The model successfully learnt its semantic meaning and adopt it to generate novel responses.It is also common that the model learns frequent syntax structures from the non-conversational corpus.In the second example, it learnt the pattern of "To ... is easy, to ... is hard", which appeared frequently in the non-conversational corpus, and utilized it to produce novel responses with the same structure.Note that both generations from the BT model never appear exactly in the non-conversational corpus.
It must generate them by correctly understanding the meaning of the phrase components instead of memorizing the utterances verbally.

Conclusion and Future Work
We propose a novel way of diversifying dialogue generation by leveraging non-conversational text.
To do so, we collect a large-scale corpus from forum comments, idioms and book snippets.By training the model through iterative back translation, it is able to significantly improve the diversity of generated responses both semantically and syntactically.We compare it with several strong baselines and find it achieved the best overall performance.The model can be potentially improved by filtering the corpus according to different domains, or augmenting with a retrieve-and-rewrite mechanism, which we leave for future work.

Figure 1 :
Figure 1: Comparison of four approaches leveraging the non-conversational text.S2S f w , S2S bw and LM indicate the forward, backward seq2seq and language model respectively.(d) visualizes the process of one iteration for the back translation approach.Striped component are not updated in each iteration.

Figure 2 :
Figure 2: Change of validation loss across iterations.

Table 2 :
Statistics of Non-Conversational Text.
Model Training Process pair {b(T i ), T i } can roughly follow the typical human conversational patterns.Training P f on top of them will encourage the forward decoder to generate utterances in the domain of T i while maintaining coherent as a conversation.ForwardThe forward translation follows a similar step as back translation.The forward seq2seq P f translates context into a response, which in return form a pseudo pair to train the backward model P b .The objective is to minimize:

Table 3 :
is not suitable as our model borrowed information from extra resources.The generated responses are out-of-scope Automatic on Weibo and Douban datasets.Upper areas are models trained only on the conversational corpus.Middle areas are baseline models incorporating the non-conversational corpus.Bottom areas are our model with different number of iterations.Best results in every area are bolded.

Table 4 :
Human Evaluation Results 不知道该怎么办 (I don't know what to do.) [Iteration 1]: 单身不可怕，单身不可怕(Being single is nothing, being single is nothing.)[Iteration 4]: 斯人若彩虹，遇上方知有(Every once in a while you find someone who's iridescent, and when you do, nothing will ever compare.)

Table 5 :
Example of response generation in different iterations.
Table 5 further displays examples for different iterations.Iteration 0 generates mostly generic responses.Iteration 1 starts to become more diverse but still struggle with fluency and relevance.In the final iteration, it can learn to incorporate novel topics from the non-conversational text yet maintaining the relevance with context.Diversity of Generation We find the back translation model can generate both semantically and syntactically novel responses.Some examples are 14 Iteration 0 means before the iteration starts but after the initialization stage, equal to a standard seq2seq.