SMRTer Chatbots: Improving Non-Task-Oriented Dialog with Simulated Multi-Reference Training

Non-task-oriented dialog models suffer from poor quality and non-diverse responses. To overcome limited conversational data, we apply Simulated Multiple Reference Training (SMRT; Khayrallah et al., 2020), and use a paraphraser to simulate multiple responses per training prompt. We find SMRT improves over a strong Transformer baseline as measured by human and automatic quality scores and lexical diversity. We also find SMRT is comparable to pretraining in human evaluation quality, and outperforms pretraining on automatic quality and lexical diversity, without requiring related-domain dialog data.


Introduction
Non-task-oriented dialog is a low-resource NLP task. While large and noisy related corpora exist (e.g. movie subtitles, social media, and IRC logs; Serban et al., 2018), the publicly-released curated corpora are small. Serban et al. note that smaller corpora have lower lexical diversity and topic coverage, leading to models with poor-quality, non-diverse responses. Pretraining on larger data may improve performance, but requires a large dialog corpus in the right language and a related domain.
We leverage Simulated Multiple Reference Training (SMRT; Khayrallah et al., 2020) to overcome sparse dialog data. SMRT uses a word-level knowledge distillation-inspired objective and a paraphraser to simulate multiple references per training example. Khayrallah et al. introduce SMRT for machine translation (MT) and simulate training on all translations for a source sentence, assuming: (1) all paraphrases of a target are translations of the source; and (2) all translations of the source are paraphrases of the target. Assumption (1) holds for dialog, but (2) does not: valid chatbot responses vary in meaning. SMRT thus captures syntactic diversity, though it cannot represent all semantic variations.
[Table 1: an example training prompt ("Study, study, study. I want to learn a lot."), its reference response ("You are going to take courses?"), and sampled paraphrases of the response.]
We apply SMRT to chatbots and find that it: (1) improves human and automatic quality scores; (2) improves lexical diversity; and (3) performs as well as pretraining in human evaluation, with better performance on automatic measures of diversity and quality.

Method
We model the non-task-oriented dialog system (chatbot) task as conditional language modeling. These models are typically trained using Negative Log Likelihood (NLL) with respect to a single reference. An alternative approach is Knowledge Distillation (Hinton et al., 2015; Kim and Rush, 2016), which assumes access to a teacher distribution q(y | x) and minimizes the cross entropy with the teacher's probability distribution. SMRT uses a paraphraser as the teacher: the paraphraser conditions on the reference (rather than the source x) and generates a paraphrase y′. Additionally, SMRT samples a new paraphrase of the reference every epoch. The SMRT training objective for the i-th target word in the reference y, given the prompt x, with a target vocabulary V, is:

    − Σ_{v∈V} q(y′_i = v | y, y′_{j<i}) log p(y′_i = v | x, y′_{j<i})

The paraphraser q and chatbot p each condition on the previously sampled paraphrase tokens (y′_{j<i}).
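As a concrete illustration, here is a minimal sketch of this word-level objective for a single target position. The toy probability vectors stand in for the paraphraser (teacher) and chatbot (student) distributions; this is not the authors' implementation.

```python
import math

def smrt_loss(teacher_probs, student_probs):
    """Word-level SMRT loss for one target position.

    teacher_probs: paraphraser distribution q(y'_i = v | y, y'_{j<i}) over V
    student_probs: chatbot distribution   p(y'_i = v | x, y'_{j<i}) over V
    Returns the cross entropy -sum_v q(v) * log p(v).
    """
    return -sum(q * math.log(p)
                for q, p in zip(teacher_probs, student_probs) if q > 0)

# Toy vocabulary of size 4: the teacher spreads mass over two plausible
# paraphrase tokens, unlike a one-hot NLL target.
q = [0.7, 0.3, 0.0, 0.0]   # paraphraser (teacher)
p = [0.6, 0.2, 0.1, 0.1]   # chatbot (student)
loss = smrt_loss(q, p)

# With a one-hot teacher on token 0, the same loss reduces to standard NLL:
nll = smrt_loss([1.0, 0.0, 0.0, 0.0], p)
```

The one-hot case shows how NLL is a special case of this objective; training toward the full teacher distribution is what lets SMRT reward multiple valid continuations.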

Dialog models
We train Transformer (Vaswani et al., 2017) chatbots in FAIRSEQ using parameters from the FLORES benchmark for low-resource MT (Guzmán et al., 2019) for both a standard NLL baseline and SMRT. Following Khayrallah et al. (2020), we sample from the 100 highest-probability tokens of the paraphraser distribution at each time step (Fan et al., 2018).
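The top-k sampling step (Fan et al., 2018) can be sketched as follows. A toy distribution and k=2 are used for readability; in the paper the distribution is the paraphraser's and k=100.

```python
import random

def sample_top_k(probs, k, rng):
    """Sample a token id from the k highest-probability entries,
    implicitly renormalizing their mass (top-k sampling)."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    r = rng.random() * total
    acc = 0.0
    for i in top:
        acc += probs[i]
        if r <= acc:
            return i
    return top[-1]  # guard against floating-point round-off

rng = random.Random(0)
probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # toy next-token distribution
draws = [sample_top_k(probs, k=2, rng=rng) for _ in range(1000)]
# With k=2, only the two most probable tokens (ids 0 and 1) can be drawn,
# and id 0 is drawn more often (0.5 vs 0.2 before renormalization).
```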
We train and evaluate on DailyDialog (Li et al., 2017), a high quality corpus with multiple references for evaluation. We train on the ∼ 80,000 turns of English-learners practicing 'daily dialogues' in various contexts, e.g., chatting about vacation or food.
See Appendix A for full details for replication.

Paraphraser
We use the state-of-the-art PRISM multilingual paraphraser (Thompson and Post, 2020a,b). It is trained as a multilingual MT model on ∼ 100 million sentence pairs in 39 languages. Paraphrasing is treated as zero-shot translation (e.g., English to English).

Evaluation Protocols
Human

Results
SMRT is preferred over the baseline system in human evaluation (Table 2), and it also outperforms the baseline on the automatic quality metrics (Table 3). Our baseline outperforms nearly all systems in Gupta et al. (2019) on these metrics, suggesting it is a strong baseline. SMRT also has higher lexical diversity than the baseline, though not as high as the human reference responses (Table 4).
Table 2: Human preference between the baseline and SMRT: baseline 35.8%, SMRT 43.5%, tie 20.6%.
Table 4: Type/Token ratio for the baseline and SMRT. SMRT has higher lexical diversity than the baseline.

Analysis
SMRT outperforms a strong baseline; here we analyze it in additional settings: pretraining and MMI.

Pretraining
Pretraining is another way of incorporating auxiliary data into the model. We pretrain on the OpenSubtitles corpus (OS; Lison and Tiedemann, 2016), which consists of ∼ 200 million turns from movie subtitles. Like DailyDialog, it consists of conversational data on a variety of topics. After pretraining on OS, we fine-tune on DailyDialog.
Results In the human evaluation (Table 5), SMRT performs comparably to baseline pretraining. In automatic evaluation (Table 6), SMRT outperforms pretraining. We also combine SMRT with pretraining: this again performs comparably to baseline pretraining in human evaluation, and better in automatic evaluation. Finally, we compare SMRT with and without pretraining, and find the pretrained variant is preferred in human evaluation, while the two perform similarly on the automatic metrics. Pretraining improves the NLL baseline's diversity, but SMRT's diversity is still higher, and combining SMRT with pretraining improves diversity compared to pretraining alone (Table 7).
Overall, SMRT performs on par with pretraining in terms of human evaluation of quality, with better diversity and better automatic metrics of quality.

Discussion

It can be hard to find dialog corpora that are large, domain-relevant, and in-language.
Unlike pretraining, SMRT incorporates non-dialog data. PRISM was trained to translate, and is leveraged as a paraphrase model via zero-shot translation. It is not trained to generate dialog, yet we still leverage it to improve a chatbot.
The paraphraser is trained on less data (∼ 100 million sentence pairs, with ∼ 17 million English sentences) than is used for OpenSubtitles pretraining (∼ 200 million turns, all in English), so its competitive performance is not simply a result of more data.
PRISM was trained on formal text: Wikipedia, news (Global Voices and SETimes), parliamentary proceedings (EuroParl), and official documents (United Nations). DailyDialog, by contrast, is well matched to OpenSubtitles, and yet SMRT performs as well as pretraining on OS. This suggests SMRT is effective at leveraging non-dialog data, which is crucial when no in-domain, in-language dialog data is available.

MMI
Maximum Mutual Information (MMI) decoding, which scores candidates by (1 − λ) log p(y|x) + λ log p(x|y), is commonly used in dialog to increase response diversity (Li et al., 2016); however, we did not find it helpful in our experiments. Following MMI-bidi, we rerank a 100-best list with a reverse model. When comparing both models with MMI, we find humans prefer SMRT to the baseline (Table 9). MMI degrades automatic measures of quality (Table 10) and diversity (Table 11) for both the baseline and SMRT compared to standard decoding. The quality degradation is similar for both, but the diversity degradation is more pronounced for SMRT.
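The MMI-bidi reranking step can be sketched as follows. The log-probabilities and the λ = 0.3 setting here are purely illustrative (hypothetical scores, not from our models).

```python
def mmi_rerank(candidates, lam):
    """Rerank an n-best list by (1 - lam) * log p(y|x) + lam * log p(x|y).

    candidates: list of (response, forward_logprob, reverse_logprob),
    where the reverse model scores the prompt given the response.
    """
    def score(c):
        _, fwd, rev = c
        return (1 - lam) * fwd + lam * rev
    return max(candidates, key=score)[0]

# Hypothetical scores: a dull response is likely under the forward model,
# but tells the reverse model little about what the prompt was.
nbest = [
    ("i do not know .", -2.0, -9.0),
    ("the tea here is lovely .", -3.0, -2.0),
]
best = mmi_rerank(nbest, lam=0.3)  # the reverse score now penalizes dullness
```

With λ = 0 this reduces to picking the forward model's 1-best; increasing λ trades forward likelihood for responses that are informative about the prompt.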

Examples
For a training pair and paraphrased responses, see Table 1. SMRT decreases the number of dull and off-topic answers; see Table 8. (For MMI, we sweep λ over 0.1, 0.2, 0.3, 0.4, and 0.5; 0.1 performs best on the automatic quality metrics, so we use it for the MMI analysis.) In prompt (a), the baseline is off-topic. Pretraining expresses sympathy, but is unhelpful. SMRT and pretrained SMRT give relevant responses. In (b), the baseline has the right general topic but is a poor response. Both SMRT variants and the pretrained baseline respond well. For more examples, see Appendix C.

Simulated Multiple Reference Training
Since it trains toward a distribution rather than a one-hot vector, SMRT may produce better-calibrated confidence estimates.

Conclusion
SMRT improves upon a strong Transformer baseline in quality and diversity. It also has human evaluation quality comparable to pretraining, with better automatic quality and lexical diversity. This method, which works even in settings where pretraining is impractical due to a lack of in-domain, same-language dialog data, has high potential for impact in creating chatbots for more languages.

A.1 Dialog Models
We train Transformer conditional language models in FAIRSEQ using parameters from the FLORES benchmark for low-resource machine translation (Guzmán et al., 2019) for both the baseline and SMRT. We use the publicly released SMRT fork of FAIRSEQ (Ott et al., 2019; Khayrallah et al., 2020), along with the PRISM M39V1 paraphraser (Thompson and Post, 2020a). We use a 5-layer encoder and decoder, 512-dimensional embeddings, and 2 encoder and decoder attention heads. We regularize with 0.2 label smoothing and 0.4 dropout. We optimize using Adam with a learning rate of 10^-3. We train for 100 epochs, and select the best checkpoint based on validation set perplexity. We generate with a beam size of 10 and no length penalty. Figure 1 shows the train command for SMRT; Figure 2 shows the train command for the NLL baseline.
We train and evaluate on the DailyDialog corpus (Li et al., 2017), as released by ParlAI (Miller et al., 2017). We pretrain on the OpenSubtitles corpus (OS; Lison and Tiedemann, 2016). Since SMRT compares the distributions over tokens from the paraphraser and the chatbot, their vocabularies must match, so we apply the PRISM SentencePiece model (Kudo and Richardson, 2018) to the DailyDialog and OpenSubtitles corpora. The ParlAI release of DailyDialog is tokenized and lowercased; since the data the paraphraser is trained on is not, we detokenize and recase the DailyDialog data. We then provide the PRISM dictionary when running FAIRSEQ-PREPROCESS (see Figure 3).

A.2.1 Human Evaluation
We randomly sample 500 prompt-response pairs from the test set, and filter out any that are not distinct, leaving 482 pairs.

A.2.2 Automatic Quality Evaluation
In Appendix B we report the full automatic evaluation results for 14 metrics across both the single-reference and multi-reference evaluation from the multi-reference automatic evaluation framework for DailyDialog released by Gupta et al. (2019), which is computed using NLG-EVAL (Sharma et al., 2017). These include word-overlap metrics: BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and ROUGE-L (Lin, 2004), as well as embedding-based metrics: SkipThought (Kiros et al., 2015), embedding average (Forgues et al., 2014), vector extrema, and Greedy Matching (Rus and Lintean, 2012). For reading ease, we report metrics scaled between 0 and 100 rather than 0 and 1.
baseline: I'm afraid I'm not sure.
SMRT: I'm sorry, I'm going to buy it!
PT + baseline: Anything else?
PT + SMRT: That sounds like a good idea.
Table 19: An evaluation prompt and example output. The baseline is off-topic, but SMRT is better here. Baseline and SMRT pretraining each give good responses.

prompt:
Believe it or not, tea is the most popular beverage in the world after water.
baseline: Really? What's the nature of this city?
SMRT: Really? I am so surprised.
PT + baseline: That's true. Tea is the most popular in the world.
PT + SMRT: Yes, I agree with you, but I don't know what to do.
Table 20: An evaluation prompt and example output. The baseline is off-topic, but SMRT and baseline pretraining give good (but very different) responses. Pretrained SMRT is less ideal, though could be a fine response in some contexts.

prompt:
Oh, good. How old is your niece?
baseline: I'm afraid I'm not sure.
SMRT: Eighteen.
PT + baseline: She's 42.
PT + SMRT: She is ten years old.

You can buy a ticket from the buyer's cheque.
PT + baseline: Sure. Go straight ahead and turn left.
PT + SMRT: Go straight ahead and turn left, turn left, turn left and turn left.

Oh, we never speak Spanish.
PT + baseline: I don't know.
PT + SMRT: No, I don't speak English in the class.
Table 24: An evaluation prompt and example output. The baseline pretraining gives an 'I don't know' variant. The baseline is okay, the SMRT systems have better output.