Don’t Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training

Generative dialogue models currently suffer from a number of problems which standard maximum likelihood training does not address. They tend to produce generations that (i) rely too much on copying from the context, (ii) contain repetitions within utterances, (iii) overuse frequent words, and (iv) at a deeper level, contain logical flaws. In this work we show how all of these problems can be addressed by extending the recently introduced unlikelihood loss (Welleck et al., 2019a) to these cases. We show that appropriate loss functions which regularize generated outputs to match human distributions are effective for the first three issues. For the last important general issue, we show that applying unlikelihood to collected data of what a model should not do is effective for improving logical consistency, potentially paving the way to generative models with greater reasoning ability. We demonstrate the efficacy of our approach across several dialogue tasks.


Introduction
Open-ended tasks such as dialogue reveal a number of issues with current neural text generation methods. In more strongly grounded tasks such as machine translation and image captioning, current encoder-decoder architectures provide strong performance, and the model mostly takes word-level decisions correctly. However, critical failings are exposed in less constrained generation: reliance on repetitive copying and overuse of frequent words, and an inability to maintain logical coherence. The former shows the learning objective is faulty in that it cannot match simple statistics of the training data, while the latter touches more on the heart of artificial intelligence: these models do not understand what they are saying. For example, Figure 1 shows how the 345M-parameter GPT2 model (Radford et al., 2019) can give high probability to contradictory generations.
* Work done while at Facebook AI Research (FAIR).
In this work, we show how the recently introduced unlikelihood objective (Welleck et al., 2019a) can be generalized to remedy these problems. Unlikelihood is a technique developed for removing repetition in language model completions. It works by adding an extra term to the objective that forces repetitions to have low probability, alleviating the degeneration problems highlighted in prior work. In fact, unlikelihood can be seen as a much more general framework, as we will see.
We first generalize unlikelihood to a different domain: dialogue, where we measure statistics of the training distribution in terms of contextual copies, within-utterance repeats, and vocabulary usage. We then develop loss functions that control these statistics, providing improved metrics on several tasks. Secondly, we show how the same tools can be used to address deeper semantic issues in such models. By leveraging existing natural language inference (NLI) data (Welleck et al., 2019b) as supervision against poor quality generations, we train models that assign low probability to generating incoherent and contradictory text. Overall, our approach yields more consistent dialogue models across several axes, and provides a promising framework for further advances.
Code and pre-trained models will be made available.†

Dialogue Unlikelihood Training
Dialogue Generation. Dialogue generation consists of predicting an utterance $y = (y_1, \dots, y_{|y|})$ given a context $x = \{s_1, \dots, s_k, u_1, \dots, u_t\}$ that consists of initial context sentences $s_{1:k}$ (e.g., scenario, knowledge, personas, etc.) followed by dialogue history utterances $u_{1:t}$ from speakers who take consecutive turns.
Likelihood Training. Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ derived from a collection of human-human interactions, the standard approach to generative training for dialogue tasks is maximum likelihood estimation (MLE), which minimizes:
$$\mathcal{L}_{\text{MLE}}^{(i)}(p_\theta, x^{(i)}, y^{(i)}) = -\sum_{t=1}^{|y^{(i)}|} \log p_\theta\big(y_t^{(i)} \mid x^{(i)}, y_{<t}^{(i)}\big),$$
where $x^{(i)}$ is a gold context (dialogue history and initial context sentences), $y^{(i)}$ is a gold next utterance, and $y_t^{(i)}$ is the t-th token of $y^{(i)}$. Likelihood-based (greedy or beam) decoding applied after training a model with this objective yields sequences with statistics that do not match the original human training sequence distribution.

Unlikelihood Training
To control for such distribution mismatches, we employ the unlikelihood loss (Welleck et al., 2019a), generalizing it to our setting, and developing a particular form of the loss function for each type of mismatch.
The general form of the unlikelihood loss penalizes a set of tokens $\mathcal{C}_t$ at each time-step:
$$\mathcal{L}_{\text{UL}}^{(i)}(p_\theta, \mathcal{C}_{1:T}, x, y) = -\sum_{t=1}^{|y|} \sum_{y_c \in \mathcal{C}_t} \beta(y_c) \log\big(1 - p_\theta(y_c \mid x, y_{<t})\big),$$
where $\mathcal{C}_t \subseteq \mathcal{V}$ is a subset of the vocabulary, and $\beta(y_c)$ is a candidate-dependent scale that controls how much the candidate token should be penalized. The overall objective in unlikelihood training then consists of mixing the likelihood and unlikelihood losses,
$$\mathcal{L}_{\text{ULE}}^{(i)} = \mathcal{L}_{\text{MLE}}^{(i)}(p_\theta, x^{(i)}, y^{(i)}) + \alpha \, \mathcal{L}_{\text{UL}}^{(i)}(p_\theta, \mathcal{C}_{1:T}, x, y), \tag{1}$$
where $\alpha \in \mathbb{R}$ is the mixing hyper-parameter. Likelihood tries to model the overall sequence probability distribution, while unlikelihood corrects for known biases. It does this via the set of negative candidates $\mathcal{C}_t$ calculated at each step t, where we are free to select candidate generation functions depending on the biases to be mitigated. Likelihood pushes up the probability of a gold token $y_t^{(i)}$ while unlikelihood pushes down the probability of negative candidate tokens $y_c \in \mathcal{C}_t$.
In Welleck et al. (2019a) the context x consists of a ground-truth sequence ($x = x^{(i)}$), the target y is either a ground-truth sequence ($y = y^{(i)}$) or a model-generated sequence ($y = \hat{y}$), and the per-token scale parameter $\beta(y_c)$ is 1.
† https://parl.ai/projects/dialogue_unlikelihood/
In this paper, we demonstrate how unlikelihood can be used as a general framework by applying it to the dialogue domain. We show how varying the contexts x, targets y, candidates C and scaling β can be used to improve the coherence and language modeling quality of dialogue models. To do this, we now consider the different biases we wish to mitigate, and construct a specific unlikelihood loss for each in turn.
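As a rough illustration of how the likelihood and unlikelihood terms combine, the sketch below evaluates both on plain Python dictionaries of per-step token probabilities. The function names, the dictionary representation, and the small clamp `eps` are assumptions made for this example; they are not taken from the released code.

```python
import math

def mle_loss(gold_probs):
    """Negative log-likelihood of the gold tokens: -sum_t log p(y_t | x, y_<t)."""
    return -sum(math.log(p) for p in gold_probs)

def unlikelihood_loss(step_dists, candidates, beta=lambda token: 1.0, eps=1e-12):
    """-sum_t sum_{c in C_t} beta(c) * log(1 - p(c | x, y_<t)).

    step_dists: one dict per time-step mapping token -> model probability.
    candidates: one set C_t of negative candidate tokens per time-step.
    eps: clamp to avoid log(0) when a candidate has probability 1 (assumption).
    """
    total = 0.0
    for dist, cands in zip(step_dists, candidates):
        for c in cands:
            total -= beta(c) * math.log(max(1.0 - dist.get(c, 0.0), eps))
    return total

def mixed_loss(gold_probs, step_dists, candidates, alpha=1.0, beta=lambda token: 1.0):
    """Likelihood term plus alpha times the unlikelihood term."""
    return mle_loss(gold_probs) + alpha * unlikelihood_loss(step_dists, candidates, beta)

# One decoding step where the token "a" is a negative candidate.
dists = [{"a": 0.5, "b": 0.5}]
loss = mixed_loss([0.5], dists, [{"a"}], alpha=2.0)
```

With an empty candidate set at every step the unlikelihood term vanishes and the objective reduces to standard MLE, so the choice of candidates and of the scale β is what specializes the loss to a given bias.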

Repetition and Copying
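Generative dialogue models are known to both (i) rely too much on copying existing context knowledge or dialogue history; and (ii) repeat themselves within individual utterances. To address this with unlikelihood, we define two types of negative candidate tokens which either appear in a repeating n-gram from the context or from the generated label itself,
$$\mathcal{C}_t^{\text{context-copy}} = \begin{cases} \{y_t\} & \text{if } y_t \in \text{repeat context n-grams} \\ \emptyset & \text{otherwise,} \end{cases}$$
$$\mathcal{C}_t^{\text{label-repeat}} = \begin{cases} \{y_t\} & \text{if } y_t \in \text{repeating label n-grams} \\ \emptyset & \text{otherwise,} \end{cases}$$
where $y_t$ is a token in a repeating context n-gram when $y_t$ is part of an n-gram that already appeared in the context tokens x, and is in a repeating label n-gram when $y_t$ is part of an n-gram that already appeared in $y_{<t}$. Given a ground-truth context $x^{(i)}$, we apply these two forms of unlikelihood to a model-generated sequence $\hat{y}^{(i)}$. In summary, we either apply the per-example loss
$$\mathcal{L}_{\text{CUL}}^{(i)} = \mathcal{L}_{\text{UL}}^{(i)}\big(p_\theta, \mathcal{C}_{1:|y|}^{\text{context-copy}}, x^{(i)}, \hat{y}^{(i)}\big)$$
for controlling context copies, or
$$\mathcal{L}_{\text{LUL}}^{(i)} = \mathcal{L}_{\text{UL}}^{(i)}\big(p_\theta, \mathcal{C}_{1:|y|}^{\text{label-repeat}}, x^{(i)}, \hat{y}^{(i)}\big)$$
for controlling label repeats. We also consider mixing the two losses to mitigate both issues.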
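The two candidate sets can be sketched as follows. This is one possible reading of the definitions above, not the released implementation: the helper names are hypothetical, the n-gram order `n=3` is an assumed default, and a token is marked whenever it lies inside an n-gram that matches the context or an earlier n-gram of the generation.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order of position."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def context_copy_candidates(context, generated, n=3):
    """C_t = {y_t} if y_t lies in a generated n-gram that also occurs in the context."""
    seen = set(ngrams(context, n))
    cands = [set() for _ in generated]
    for i, gram in enumerate(ngrams(generated, n)):
        if gram in seen:
            for t in range(i, i + n):
                cands[t].add(generated[t])
    return cands

def label_repeat_candidates(generated, n=3):
    """C_t = {y_t} if y_t lies in an n-gram that already started earlier in the generation."""
    seen = set()
    cands = [set() for _ in generated]
    for i, gram in enumerate(ngrams(generated, n)):
        if gram in seen:
            for t in range(i, i + n):
                cands[t].add(generated[t])
        seen.add(gram)
    return cands

context = "i like cats a lot".split()
generated = "well i like cats too".split()
copy_cands = context_copy_candidates(context, generated)
# -> [set(), {'i'}, {'like'}, {'cats'}, set()]
```

Positions with an empty set contribute nothing to the unlikelihood term, so only tokens that take part in a copy or a repeat are pushed down.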

Vocabulary Usage
Neural sequence models trained with maximum likelihood generate sequences with token distributions that differ from those of human text (Dinan et al., 2020). In particular, these models tend to produce high frequency tokens too often and low frequency tokens too rarely, where frequency is defined by the human token distribution. We address this with unlikelihood by penalizing tokens according to the mismatch between the model and ground-truth unigram distributions. Specifically, we first maintain an empirical estimate of the model's unigram distribution $p_{\text{model}}(y_t)$ and the human distribution $p_*(y_t)$:
$$p_{\text{model}}(y_t) = \frac{\text{count}(y_t)}{|Y|},$$
where Y is a collection of token predictions on a subset of training data $\mathcal{D}$ (e.g. the preceding k = 256 batches), and $\text{count}(y_t)$ is the number of occurrences of $y_t$ in Y; $p_*(y_t)$ is the corresponding relative frequency in human text. The model estimate is computed using model sequences ($y = \hat{y}$), defining Y as the collection of all tokens in all $\hat{y}$. We wish to push down the probability of tokens appearing too often, i.e. when $p_{\text{model}}(y_t) > p_*(y_t)$. For the unlikelihood loss, each step's candidate is thus the current token, $\mathcal{C}_t^{\text{identity}} = \{y_t\}$, and each token's unlikelihood loss is scaled according to the mismatch between the approximated model and human distributions,
$$\beta(y_c) = p_{\text{model}}(y_c) \log\left(\frac{p_{\text{model}}(y_c)}{p_*(y_c)}\right).$$
The unlikelihood loss for a token $y_c$ is non-zero when the token occurs more often in the model's estimated unigram distribution. In summary, the resulting per-example loss is
$$\mathcal{L}_{\text{UL}}^{(i)}\big(p_\theta, \mathcal{C}_{1:|y|}^{\text{identity}}, x^{(i)}, y\big),$$
where y is a model-generated sequence.
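A minimal sketch of the vocabulary scale, assuming it takes the form p_model · log(p_model / p_*) over empirical unigram frequencies. The function names, the `eps` guard for tokens never seen in human text, and the clamp at zero (so that under-used tokens are never penalized) are assumptions made for this illustration.

```python
import math
from collections import Counter

def unigram_dist(tokens):
    """Empirical unigram distribution: count(token) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def vocab_beta(token, p_model, p_human, eps=1e-8):
    """Scale for one token: p_model * log(p_model / p_human), clamped at zero.

    The clamp (assumption) keeps the penalty at zero for tokens the model
    uses less often than humans do.
    """
    pm = p_model.get(token, 0.0)
    ph = p_human.get(token, 0.0)
    if pm <= 0.0:
        return 0.0
    return max(0.0, pm * math.log(pm / max(ph, eps)))

# Toy example: the model overuses "the" and underuses "cat".
p_model = unigram_dist(["the", "the", "the", "cat"])
p_human = unigram_dist(["the", "cat", "cat", "dog"])
scale_the = vocab_beta("the", p_model, p_human)  # positive: overused token
scale_cat = vocab_beta("cat", p_model, p_human)  # zero: underused token
```

In training, the model estimate would be refreshed from a sliding window of recent generations, so the scale tracks the current mismatch rather than a fixed one.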

Contradictions
Neural generation models appear fluent, especially when pre-trained on large datasets, but are still poor at understanding the language they produce. That is, they can produce logically or factually inaccurate, or contradicting statements (Welleck et al., 2019b; Zhang et al., 2018; Hayashi et al., 2019; Petroni et al., 2019). Here, we show how the unlikelihood objective can be used to train such models to assign low probability to inconsistent and contradictory utterances.
To do so, we assume the existence of training data of both positive and negative examples of coherent behavior. There is a raft of recent large-scale, high quality data that can be massaged into this form, from natural language inference (NLI) tasks (Bowman et al., 2015; Williams et al., 2018; Welleck et al., 2019b) to commonsense reasoning tasks (Zellers et al., 2019; Qin et al., 2019). Two collections of data can be derived from the labels of such a supervised task:
$$\mathcal{D}^+ = \{(x^{(i)}, y^{(i)+})\}, \qquad \mathcal{D}^- = \{(x^{(i)}, y^{(i)-})\},$$
where $\mathcal{D}^+$ is coherent behavior, e.g. neutral or entailing data in NLI, and $\mathcal{D}^-$ is incoherent behavior, e.g. contradictions. In general, many forms of this type of data can be collected, not just NLI, and it is also not necessary for the contexts $x^{(i)}$ to overlap as we have written here.
Standard likelihood training can then be performed on coherent data $\mathcal{D}^+$, while the unlikelihood objective is applied to $\mathcal{D}^-$ as we wish to push down the probability of generating the incoherent response $y^-$ given a context x. That is, given an incoherent pair $(x, y^-)$ we use the loss
$$\mathcal{L}_{\text{UL}}\big(p_\theta, \mathcal{C}_{1:|y^-|}^{\text{identity}}, x, y^-\big),$$
where we penalize each token in the target ($\mathcal{C}_t^{\text{identity}} = \{y_t^-\}$). Hence, the loss makes generating the contradicting sentences less likely.
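The way an example's label selects its loss can be sketched as below, given the model's teacher-forced probability for each token of the target. The function names, the `is_coherent` flag, and the clamp `eps` are hypothetical choices for this example.

```python
import math

def nll(token_probs):
    """Likelihood loss on a coherent target: -sum_t log p(y_t | x, y_<t)."""
    return -sum(math.log(p) for p in token_probs)

def identity_unlikelihood(token_probs, eps=1e-12):
    """Unlikelihood loss on an incoherent target: every target token is its own
    negative candidate, giving -sum_t log(1 - p(y_t | x, y_<t)).
    eps guards against log(0) and is an assumption of this sketch."""
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)

def coherence_loss(token_probs, is_coherent, alpha=1.0):
    """Examples from D+ get the likelihood loss, examples from D- the scaled unlikelihood loss."""
    if is_coherent:
        return nll(token_probs)
    return alpha * identity_unlikelihood(token_probs)

# A contradicting target that the model currently finds likely incurs a large penalty.
confident = coherence_loss([0.9, 0.8], is_coherent=False)
unsure = coherence_loss([0.1, 0.2], is_coherent=False)
```

The penalty on an incoherent target shrinks as the model assigns its tokens lower probability, which is the direction the training signal pushes.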

Related Work
Our work provides new applications of unlikelihood training (Welleck et al., 2019a), showing that unlikelihood offers a general framework for improving generative models, and in particular dialogue models. Outside of that work, the use of negative training in dialogue retrieval, rather than generation, has been extensively studied, see e.g. (Humeau et al., 2019; Nugmanova et al., 2019). In the area of generative dialogue, a number of works have focused on improving the standard likelihood training approach. Closer to our work is that of He and Glass (2019), which developed the approach of negative training to prevent generic and malicious responses in dialogue models. In terms of improving repetition and specificity, a recent alternative approach is that of control (Fan et al., 2018; Ficler and Goldberg, 2017; Ghazvininejad et al., 2017; See et al., 2019). Nucleus sampling can help to remove generic or repetitive utterances at the expense of accuracy, but was shown to be inferior to beam blocking, which in turn was shown to be inferior to unlikelihood in Welleck et al. (2019a).
In terms of dialogue coherence, Welleck et al. (2019b) showed that retrieval, but not generative models, could be improved with NLI as a rescorer, while Yang et al. (2018) multi-tasked with NLI. The work of Gabriel et al. (2019) has also studied improving narrative flow with a discriminative rescorer, but in that case for generated language. In our work, the improvements are tightly integrated into the training of the model itself.

Experiments
In all of our experiments we employ a large pre-trained seq2seq Transformer (Vaswani et al., 2017) as our base model, which we then fine-tune for particular tasks with the objectives outlined in Section 2 and specified in each experiment below. Following previous work (Humeau et al., 2019), we pre-train our model on dialogue data, using a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io, training to generate a comment conditioned on the full thread leading up to the comment, spanning ∼ 2200M training examples. Our Transformer model consists of an 8 layer encoder, 8 layer decoder with 512-dimensional embeddings and 16 attention heads, and is based on the ParlAI implementation of Miller et al. (2017). The model was trained with a batch size of 3072 sequences for approximately 3M updates using a learning rate of 5e-4, and an inverse square root scheduler. This pre-training took approximately two weeks using 64 NVIDIA V100s.

Repetition and Copying
We use the ConvAI2 persona-based dialogue (Zhang et al., 2018), Wizard of Wikipedia knowledge-grounded dialogue (Dinan et al., 2019) and ELI5 long-form question answering datasets to evaluate the effect of using unlikelihood to reduce copying and repetition in model-generated utterances. On each dataset, we fine-tune the pre-trained pushshift.io Reddit model, then evaluate by generating next utterances for dialogue contexts from the test set (or validation in ConvAI2, as the test set is hidden). We use greedy decoding in our main experiments for simplicity and scalability, but we also obtained similar results with beam search, shown in Appendix A.
To measure label repetition in a sequence y, we use the portion of duplicate n-grams:
$$1.0 - \frac{|\text{unique n-grams}(y)|}{|\text{n-grams}(y)|},$$
and report the metric averaged over the examples. Label repetition increases from zero as the model generates more repeated n-grams. To measure context repetition, we measure the fraction of generated n-grams that appear in the original context:
$$\frac{|\text{n-grams}(y) \cap \text{n-grams}(x)|}{|\text{n-grams}(y)|},$$
and report the metric averaged over the examples. Context repetition increases when the model 'copies' n-grams from the context. To quantify language modeling quality, we use standard perplexity and F1 metrics. We use the pre-trained model fine-tuned with MLE as the baseline, and compare it against the pre-trained model fine-tuned with copy and repetition unlikelihood (§2.1).
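Both metrics can be computed per example along the following lines. This is a sketch under stated assumptions: the function names and the default `n=2` are illustrative, sequences shorter than n are scored as zero, and the context metric counts every generated n-gram occurrence that is found in the context.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def label_repetition(generated, n=2):
    """Portion of duplicate n-grams: 1 - (#unique n-grams / #n-grams)."""
    grams = ngrams(generated, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def context_repetition(generated, context, n=2):
    """Fraction of generated n-grams that also appear in the context."""
    grams = ngrams(generated, n)
    if not grams:
        return 0.0
    context_grams = set(ngrams(context, n))
    return sum(g in context_grams for g in grams) / len(grams)

def mean(values):
    """Average a per-example metric over a collection of examples."""
    values = list(values)
    return sum(values) / len(values)

outputs = ["a b a b".split(), "a b c".split()]
avg_label_rep = mean(label_repetition(y) for y in outputs)
```

A generation with no repeated n-grams scores exactly zero on label repetition, which matches the statement that the metric increases from zero.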

Results
Results for ConvAI2 are shown in Table 1. We see that training unlikelihood using only-contexts or only-labels reduces their corresponding metrics dramatically compared to the MLE baseline. Training with both context- and label-repetition unlikelihood reduced both context repetitions (by 69%, .0352 vs. .1131) and label repetitions (by 89%, .0023 vs. .0210) compared to the MLE baseline, much closer to human levels, while keeping perplexity essentially constant.
Comparatively, the Wizard of Wikipedia MLE baseline experiences a much larger problem with context repetition, due to its tendency to copy grounded knowledge verbatim (Table 2).
Results for ELI5 (Table 3) show that the MLE baseline has an especially large problem with label repetition, and that label-unlikelihood is able to reduce the repetitions by 91% (.055 vs. .617), while significantly boosting F1 (.130 to .182). Figures 2 and 3 show perplexity as a function of label and context repeats respectively using unlikelihood on ELI5. The parameter α can clearly control repeats smoothly, with only very high values resulting in increased perplexity.
Human Evaluation. Finally, we perform a human evaluation using the pairwise evaluation scheme previously used on ELI5, comparing the MLE baseline to UL (Label only). Evaluators are asked: Which response answers the question better? They are asked to consider both the readability and accuracy of the answer. Results are given in Figure 4 (left), showing a statistically significant improvement over the baseline (150 trials, two-tailed binomial test, p < 0.01). Further details are given in Appendix C.

Vocabulary Usage
We evaluate the ability of vocabulary unlikelihood ( §2.2) to reduce the mismatch between model and human token distributions.
We use the ConvAI2 dataset, where our baseline is again trained using maximum likelihood. Starting with the baseline model, we then fine-tune several models using vocab unlikelihood at logarithmically interpolated values of α ∈ [1, 1000].
We partition the vocabulary into 'frequent', 'medium', 'rare', and 'rarest' using the human unigram distribution computed with the ConvAI2 training set, corresponding to the sorted token sets whose cumulative mass accounts for the top 40%, the next 30%, the next 20% and the final 10% of usage, respectively. We evaluate a model by generating utterances given contexts from the ConvAI2 validation set, and compute the fraction of tokens within each class.
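One way to realize this partition is sketched below. The names `partition_vocab` and `class_fractions` are hypothetical, a token is assigned according to the cumulative mass of the more frequent tokens before it (an assumption about how boundary tokens are handled), and tokens unseen in the human data are counted as 'rarest' (also an assumption). Integer percentages are used to avoid floating-point boundary effects.

```python
from collections import Counter

CLASSES = ("frequent", "medium", "rare", "rarest")

def partition_vocab(human_tokens, cutoffs=(40, 70, 90)):
    """Map each token to a frequency class by cumulative human usage (in percent)."""
    counts = Counter(human_tokens)
    total = sum(counts.values())
    classes, seen = {}, 0
    for token, count in counts.most_common():
        # Number of cutoffs already reached by the mass of more frequent tokens.
        idx = sum(100 * seen >= cutoff * total for cutoff in cutoffs)
        classes[token] = CLASSES[idx]
        seen += count
    return classes

def class_fractions(generated_tokens, classes):
    """Fraction of generated tokens falling into each frequency class."""
    tally = Counter(classes.get(tok, "rarest") for tok in generated_tokens)
    return {name: tally[name] / len(generated_tokens) for name in CLASSES}

human = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]
classes = partition_vocab(human)
fractions = class_fractions(["a", "a", "b", "z"], classes)
```

Comparing `fractions` for model generations against the 40/30/20/10 split that human text has by construction shows how far the model's vocabulary usage is from the human one.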
Results. Figure 5 shows how the vocabulary distribution obtained after unlikelihood training is affected by the choice of mixing hyperparameter α (Eq. 1): it can smoothly transition between the human training distribution and the MLE trained distribution ('Baseline'), which is far from the human one. Table 4 compares the MLE baseline with unlikelihood with increasing α values in terms of distribution and F1 score. The vocabulary unlikelihood fine-tuning shifts probability mass from the over-represented frequent words towards under-represented medium and rare words, with the effect strengthening as α increases. At a small cost to perplexity and F1, the unlikelihood tuning reduced the overuse of common tokens by 9 points, matching the human rate, while improving the production of rare tokens by 3 percentage points.
Human Evaluation. Finally, we perform a human evaluation using the ACUTE-EVAL framework (Li et al., 2019), comparing the MLE baseline to UL for various α. First, 252 human-bot conversations (8 turns each) are collected, and then models are compared pairwise by asking the question: Who would you prefer to talk to for a long conversation? For these experiments, both methods generate using beam search with context blocking of trigrams. Results are given in Figure 4 (right), showing a statistically significant improvement over the baseline according to humans (two-tailed binomial test, p < 0.01). Further details are given in Appendix C.

Contradictions
We use the dialogue natural language inference (NLI) task of Welleck et al. (2019b) to obtain labeled non-contradicting and contradicting dialogue sentence pairs to use in unlikelihood training (§2.3). Dialogue NLI contains utterances labeled as entailing (E), neutral (N) or contradiction (C), given a premise that is either a persona sentence (an initial context sentence describing a dialogue agent's personality) or another dialogue utterance from the Persona-Chat dialogue task (Zhang et al., 2018). We show examples from Dialogue NLI in Figure 6. The original data consists of sentence pairs $(s_1, s_2)$ along with a label (E, N, or C), and was constructed by developing a schema and employing crowdworkers to label utterances with relation triples. The labels are then inferred from the triple representation. We first transform the original classification dataset into a form useful for unlikelihood training of a generative dialogue model. We consider two setups: (i) a two utterance generation task; and (ii) a full dialogue generation task.

Table 4: Unlikelihood loss applied to vocabulary distributions. Stronger α terms greatly shift probability mass from the most Frequent words to Medium and Rare words, at a small cost to PPL and F1. Frequent, medium, rare and rarest token classes are defined as the sets of tokens whose cumulative masses account for the top 40%, the next 30%, the next 20% and final 10% of tokens empirically generated by humans, respectively.

Figure 6: Dialogue NLI from (Welleck et al., 2019b).

                    Train   Test   Valid
Entailment          95k     4613   4959
Triple-Entailment   105k    5285   5481
Neutral             110k    5500   5700
Negatives           110k    5500   5700
Table 5: Dialogue NLI two utterance generation task dataset statistics.
Two Utterance Generation Task. We adapt the initial dialogue NLI dataset by using entailing and neutral training sentence pairs as plausible positive utterances, and contradicting pairs as negatives. That is, if a pair $(s_1, s_2)$ from Dialogue NLI has label E or N, the example $(x, y) = (s_1, s_2)$ is added to $\mathcal{D}^+$, otherwise (label C) it is added to $\mathcal{D}^-$.
We consider two types of entailment: entailing sentence pairs that appear together in a dialogue in the original Persona-Chat dataset and are therefore natural ('entailment'), and those that only entail via their triple relations ('triple-entailment'). The latter are more challenging, noisier targets. Evaluation is performed by measuring the test set perplexity over the four target label types, where contradictions should have relatively higher perplexity. We additionally evaluate a selection accuracy task, where for each test example there are two candidate responses: a positive and a negative (contradicting) statement. The candidate response with the lowest perplexity is considered to be the model's selection, and we measure the selection success rate. Evaluation is broken down by positive type (entailment, triple-entailment, neutral). Dataset statistics are given in Table 5.
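The selection accuracy evaluation can be sketched as follows, assuming each candidate response is represented by the model's per-token probabilities under teacher forcing. The function names and the pair representation are hypothetical.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of a response's tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def selection_accuracy(pairs):
    """Fraction of (positive, negative) pairs where the positive candidate
    receives strictly lower perplexity than the contradicting one."""
    wins = sum(perplexity(pos) < perplexity(neg) for pos, neg in pairs)
    return wins / len(pairs)

pairs = [
    ([0.5, 0.5], [0.1, 0.1]),  # positive preferred: lower perplexity
    ([0.2], [0.4]),            # contradiction preferred: a selection failure
]
accuracy = selection_accuracy(pairs)  # 0.5 on this toy data
```

Because perplexity normalizes by length, candidates of different lengths are compared on a per-token basis rather than by total log-probability.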
Full Dialogue Task. To evaluate in a more realistic setup that involves full dialogue rather than a single utterance, we take full Persona-Chat dialogues (Zhang et al., 2018) similar to Figure 6, and map back the dialogue NLI data to provide positive and negative continuations of the dialogue. We consider continuations as either triple-entailing utterances, neutral utterances or contradictions, where the relation triple is used to match the existing persona or dialogue turns by the same speaker to induce the label. That is, an example $(x, y)$ consists of a dialogue history $x = \{p_1, \dots, p_k, u_1, \dots, u_t\}$ and utterance $y = s_2$, where $(s_1, s_2)$ is a sentence pair from Dialogue NLI, and at least one sentence in x has the same relation triple as $s_1$. When the pair $(s_1, s_2)$ is labeled as E or N in Dialogue NLI, the example $(x, y)$ is added to $\mathcal{D}^+$, and otherwise it is added to $\mathcal{D}^-$.

Results
Our MLE baseline obtains a perplexity of 11.4, in line with current best systems on this task. Unfortunately, despite being good on such standard metrics, our baseline models fail at our coherence task. As seen in Table 6 for the two utterance task, the perplexity of contradicting utterances (12.5) is on average lower than for neutral (36.7) or triple-entailing utterances (17.5), although it is higher than entailing utterances. We believe this is due to contradicting utterances having high word overlap with the premise utterance, coupled with an inability to judge incoherence. Viewed as a selection task between utterances, picking the utterance with the lowest perplexity, this means the selection rates of non-contradicting utterances are very low, e.g. picking neutral utterances over contradicting utterances only 18% of the time. Even fully entailing utterances are only picked 73% of the time. Similar results are found on the full dialogue task as well, see Table 7.
Unlikelihood training brings large improvements in coherence metrics, whilst minimally impacting overall dialogue perplexity. After applying unlikelihood, perplexity for contradicting utterances has a clear signature, with very large average values compared to entailing or neutral utterances, e.g. 248.9 vs. 9.1 for contradict vs. entail on the two utterance task. This converts to correspondingly large increases in selection accuracy across all types on both tasks, e.g., an increase from 18% to 78% on neutral statements on the two utterance task, and from 37.4% to 69.8% on the full dialogue task.

Table caption (partial): Selection Accuracy measures how often the model assigns lower perplexity to the positive candidate than to the negative candidate in the pair. Top two rows: for standard maximum likelihood models, the perplexity of contradicting utterances is lower compared to neutral or triple-entailing utterances (albeit higher compared to entailing utterances), showing partial failure at the coherence task. Bottom row: NLI Unlikelihood training yields large improvements on all coherence metrics, while minimally increasing overall perplexity.

Figure 7 (excerpt): Premise: "We did too but working in real estate for 12 years sucked up a lot of time." Hypotheses, with perplexity under the MLE baseline and the unlikelihood model respectively: (E) "I have been working as a real estate agent for the past 12 years." 3.9, 3.8; (C) "We did too but working in real estate for fifteen years sucked up a lot of time." 3.1, 17.6.
Some example model predictions are given in Figure 7, comparing the MLE baseline and unlikelihood model perplexities of generating the given hypotheses. The likelihood model cannot differentiate between contradicting and entailing statements easily, while there are large perplexity differences for the unlikelihood model in these cases.

Conclusion
Generating consistent and coherent human-like dialogue is a core goal of natural language research. We studied several aspects that contribute to that goal, defined metrics to measure them, and proposed algorithms that improve them, mitigating some of the failings of maximum likelihood training, the current dominant approach. Our method defines objective functions under the umbrella of unlikelihood: during training, we wish to make inconsistent dialogue unlikely by lowering the probability of such events occurring. This makes generative models repeat themselves less, copy the context less, and use more rare words from the vocabulary, moving closer to matching human statistics. Further, utilizing supervised datasets with labeled coherent and incoherent utterances and applying unlikelihood yields measurably improved levels of coherence with respect to the aspect measured, in this case contradiction. Future work could apply this same technique with other supervised data, e.g. correcting causal or commonsense reasoning errors (Zellers et al., 2019; Qin et al., 2019).

A Repetition Control with Beam Search
The experiments on repetition and copying in the main paper were carried out with greedy decoding for simplicity. In this section we show that similar results hold with beam decoding as well. Using a beam size of 5, we take the same 4 models from Table 2 and compute metrics with beam search instead. The results are given in Table 8, which shows similar trends to before, except that the baseline model using beam search tends to suffer more from repetition, which is a known result. Note that we simply evaluated the same unlikelihood models as before, but we expect that better results could be obtained by performing sequence-level unlikelihood training with beam search in the training loop, as well as choosing hyperparameters specifically with this kind of decoding being used to measure validation performance.

B Nucleus Sampling for Vocabulary control
Table 9 compares the MLE baseline, unlikelihood with increasing α values, and Nucleus sampling with hyperparameter p in terms of distribution and F1 score. The vocabulary unlikelihood fine-tuning shifts probability mass from the over-represented frequent words towards under-represented medium and rare words, with the effect strengthening as α increases. At a small cost to perplexity and F1, the unlikelihood tuning reduced the overuse of common tokens by 9 points, matching the human rate, while improving the production of rare tokens by 3 percentage points. Nucleus sampling is a popular method that can also produce generations closer to the human vocabulary distribution. It does this by sampling from the model's probability distribution rather than using beam search, where the sampler restricts to the smallest set of tokens with total mass above a threshold $p \in [0, 1]$. Small values of p are similar to greedy sampling. Increasing p yields distributions closer to human, but with large losses in F1 score, e.g. p = 0.5 has a similar distribution to unlikelihood with $\alpha = 10^2$ but the F1 scores are 0.160 vs. 0.190. This can be understood because maximizing likelihood during decoding yields better token accuracy than sampling (Welleck et al., 2019a), so the unlikelihood training approach, which both uses likelihood decoding and matches the human distribution, can obtain the best of both worlds.

Table 9: Unlikelihood loss applied to vocabulary distributions. Stronger α terms greatly shift probability mass from the most Frequent words to Medium and Rare words, at a small cost to PPL and F1. Frequent, medium, rare and rarest token classes are defined as the sets of tokens whose cumulative masses account for the top 40%, the next 30%, the next 20% and final 10% of tokens empirically generated by humans, respectively. Nucleus sampling can also produce a distribution close to human with parameter p close to 1, but with larger losses in F1.
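For reference, the nucleus restriction described in this appendix can be sketched on a single next-token distribution as below. The function names are hypothetical, the distribution is a plain dictionary, and the set is closed as soon as the accumulated mass reaches p (a boundary convention assumed for this example).

```python
import random

def nucleus(dist, p):
    """Smallest set of highest-probability tokens whose total mass reaches p,
    renormalized to sum to one."""
    kept, mass = [], 0.0
    for token, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}

def sample_nucleus(dist, p, rng=random):
    """Draw one token from the renormalized nucleus."""
    restricted = nucleus(dist, p)
    tokens = list(restricted)
    weights = [restricted[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

dist = {"a": 0.5, "b": 0.3, "c": 0.2}
small_p = nucleus(dist, 0.4)   # only the top token survives, as in greedy decoding
larger_p = nucleus(dist, 0.7)  # the two most probable tokens survive
token = sample_nucleus(dist, 0.7, rng=random.Random(0))
```

With a small p only the most probable token remains, which is why small values behave like greedy decoding, while p close to 1 approaches sampling from the full distribution.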

C Human Evaluation
Description of ConvAI2 vocabulary setup. We follow Li et al. (2019) and perform a pairwise comparison with full-length model conversations.
We first collected 252 model-human conversations with each of the models (the MLE baseline and unlikelihood models with several values of α; examples in 8). We then set up a pairwise comparison using the software of Li et al. (2019), using the same question ("Who would you prefer to talk to for a long conversation?") and the exact same quality control question (a baseline greedy model without repetition control, versus a human). We collected approximately 200 preferences per model comparison and filtered out annotators who failed quality control.
Description of ELI5 repetition setup. We follow prior work and perform a pairwise evaluation where human annotators were asked "which response answers the question better?" A screenshot of the UI is shown in Figure 9. Human evaluators were asked to rate a total of 5 questions, two of which were quality control annotations. The quality control examples contained the real human responses, along with model predictions: one question contained a baseline model, and one contained an unlikelihood model. Annotators who did not pick the human responses in the quality controls were removed from the final evaluation. We collected 200 annotations comparing the baseline and the unlikelihood model.

Results
Evaluation results from all evaluated matchups are shown in Figure 10. We find that our repetition-controlled ELI5 model significantly outperforms the MLE baseline. We also find that two of the vocabulary unlikelihood models significantly outperform the MLE baseline. We compute significance with a two-tailed binomial test (p < .01).
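A two-tailed binomial test of this kind can be sketched with the standard library alone. The implementation below uses the common exact definition (summing the probabilities of all outcomes no more likely than the observed one) under a null preference rate of 0.5; the function name is hypothetical and the win count in the usage line is invented for illustration, not a result from the evaluation.

```python
from math import comb

def two_tailed_binomial_p(wins, trials, p0=0.5):
    """Exact two-sided p-value: total probability of outcomes that are
    no more likely than the observed count under the null rate p0."""
    pmf = [comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[wins]
    # Small relative tolerance so symmetric outcomes are treated as ties.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical outcome: one model preferred in 100 of 150 pairwise trials.
p_value = two_tailed_binomial_p(100, 150)
significant = p_value < 0.01
```

Under this definition an even split gives a p-value of 1, and the p-value shrinks as the preference counts move away from parity in either direction.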