Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model

Evaluating the quality of responses generated by open-domain conversation systems is a challenging task, in part because there can be multiple appropriate responses to a given dialogue history. Reference-based metrics that rely on comparisons to a set of known correct responses often fail to account for this variety, and consequently correlate poorly with human judgment. To address this problem, researchers have investigated assessing response quality without using a set of known correct responses. RUBER demonstrated that an automatic response evaluation model can be trained with unsupervised learning on the next-utterance prediction (NUP) task. For the unsupervised learning of such a model, we propose a method of manipulating a golden response to create a new negative response that is inappropriate in the given context while remaining highly similar to the original golden response. In experiments on English datasets, we find that using the negative samples generated by our method alongside random negative samples increases the model's correlation with human evaluations. The process of generating such negative samples is automated and does not rely on human annotation.


Introduction
Automatic evaluation of responses can be difficult because multiple answers can be suitable for a single context. Well-known metrics often used in machine translation or text summarization, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), or ROUGE (Lin, 2004), are based on measuring n-gram overlap with a set of human-annotated golden answers. Compared to machine translation or text summarization systems, conversational systems have a wider range of acceptable responses to a given situation (dialogue history). This could explain the low correlation between n-gram-based evaluations and human-conducted evaluations for responses generated by conversation systems, as reported by Liu et al. (2016). They also suggested calculating the embedding similarities between responses and correct answers, and showed that these metrics had a higher correlation with human evaluations than n-gram-based metrics. As this method only rewards responses similar to ones in the fixed set of answer candidates, however, it still fails to account for other possible answers that are dissimilar to the known answers.
Figure 1: Example of the three different types of responses for a given dialogue history. Our method manipulates the original response "What's wrong with heading out with Mark for vacation?" to generate the negative sample "What's wrong? Go out with Mark for dinner."
To solve this problem, later work proposed a supervised regression model that makes predictions independent of correct answer candidates. Although this approach achieved better correlation with human evaluations, it depends on procuring a human-annotated dataset to learn from. Tao et al. (2018) used the Next-Utterance Prediction (NUP) task to train a model for automatic response evaluation. Their model, which is unsupervised, learned to distinguish an appropriate response from random negative samples (responses randomly taken from the training corpus). The model can evaluate response quality by estimating the probability that the response occurs directly after the dialogue history. They also demonstrated that these probability-based evaluations correlated highly with human evaluations of response quality.
In this paper, we propose a method to create a negative sample by manipulating a golden response. The manipulation is carried out in three steps: (1) scoring each word, (2) selecting words to replace, and (3) replacing the selected words. In the first step, each word is assigned a score designed to determine how dependent the word is on the context. In the second step, we select all the words with a score above a threshold value, where higher scores indicate a stronger dependence on the dialogue history. In the third step, all previously selected words are masked and replaced with words predicted in their place by a pretrained language model (LM). Figure 1 shows an example of a negative sample generated by our method. When "What's wrong with heading out with Mark for vacation?" is the golden response, the tokens "with", "heading", "vacation", and "?" are selected and replaced with "?", "Go", "dinner", and ".", in that order.
In experiments on two datasets (Zhao et al., 2020), we find that a model trained with our negative samples alongside random negative samples shows a higher correlation with human evaluations than models trained only on random negative samples. We also find evidence that automatic evaluation systems trained with the negative samples generated by our proposed method can make decisions closer to human judgment than those trained without them.
The contributions of this paper are as follows: (1) We introduce a method that automatically generates negative samples from the golden responses.
(2) We show, with experimental results, that the negative samples can boost the unsupervised learning of an automatic response evaluation model.
(3) We conducted a crowdsourcing study and used its results to examine whether the negative samples generated by our method are actually negative.


Related Work
Liu et al. (2016) pointed out that traditional n-gram-overlap-based metrics such as BLEU, METEOR, and ROUGE show low correlation with human evaluations when used to evaluate the outputs of an open-domain conversation system. They suggested measuring the similarity by comparing embeddings of a generated response to those of the golden response. Li et al. (2016) explored dialogue systems with textual feedback. Ghandeharioun et al. (2019) suggested the necessity of interactive human evaluation for dialogue systems, and proposed a self-play scenario to reduce the burden of human effort. Hashimoto et al. (2019) proposed a method to combine human assessments with the predictions of an evaluation model. Other work proposed a supervised learning method to predict the quality of a response directly, rather than measuring its similarity to golden responses. Tao et al. (2018) showed that a model trained on the NUP task, in an unsupervised manner, can be used to predict the quality of a system-generated response. Ghazarian et al. (2019) improved this approach by using contextualized word embeddings. Mehri and Eskenazi (2020) proposed two unsupervised evaluation models: one based on masked language modeling (MLM) and another based on the response retrieval task using a pretrained LM. Pang et al. (2020) predicted the coherence and fluency of a response by estimating its likelihood using an LM. Sai et al. (2020) emphasized the importance of adversarial negative samples for learning response evaluation, and released a dataset with human-curated adversarial negative responses. Their negative samples were manually curated, however, a process that can be both time-consuming and expensive. Related work on abstractive summarization attempted to improve evaluation models by corrupting the golden summary and using it as a negative sample. In machine translation, Sellam et al. (2020) created paired data with synthetic examples, through methods such as back-translation and mask-filling with BERT (Devlin et al., 2019), and used the paired data to pretrain evaluation models. Our work introduces a method to create negative samples by manipulating the golden response to the dialogue history, and suggests that the negative samples generated by the proposed method can be used to improve an unsupervised response evaluation model. The proposed method can be performed automatically, without human effort.

Method
In this section, we describe our method to generate negative samples. The proposed method creates a negative sample by selecting and replacing specific word(s) in a golden response. The word selection is based on the difference between (a) the estimated probability that a word would appear in the response considering the dialogue history and (b) the estimated probability that the word would appear in the response when the dialogue history is not considered. An LM that can perform MLM can be used to estimate these probabilities. Words that have large differences in probability are selected and replaced with other words. When replacing a word with another word, an LM that can perform MLM can be used to predict the word that is most likely to appear in the position of the original word when the dialogue history is not given.

Scoring
The proposed method includes a scoring process to determine which words in the golden response are affected the most by the dialogue history. The score of a word is calculated by taking the difference between (a) the estimated probability of the word appearing in its position when the dialogue history is given and (b) the estimated probability of the word appearing in its position when the dialogue history is not given. This scoring process is performed independently for all words in the target response.
Specifically, to calculate the score of the i-th word x_i in the golden response, we first replace x_i with the [mask] token. Then the likelihood that the original word x_i appears in place of the masked token is calculated twice: once with the dialogue history and once without. The difference in log-likelihood is used as the final score of each word, which is defined as

s(x_i) = \log P(x_i \mid [c; r_{/i}]; \theta) - \log P(x_i \mid r_{/i}; \theta),    (1)

where x_i denotes the word to be scored, r_{/i} denotes the sequence of words in the golden response in which x_i is masked, c denotes the dialogue history of the golden response, and [;] denotes the concatenation of two pieces of text. P(x_i | [c; r_{/i}]; θ) denotes the estimated probability that x_i occurs when the dialogue history is considered, P(x_i | r_{/i}; θ) the estimated probability that x_i occurs when the dialogue history is not considered, and θ the parameters of the LM. Figure 2 shows an example of our proposed scoring process. The word "vacation" in the original response received the highest score among the words in the response, and the words "with" and "heading" also scored higher than the other words.
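The scoring step can be sketched with a HuggingFace masked LM as below. This is an illustrative reading of Equation 1, not the authors' released code: the model name and helper functions are our assumptions, and the sketch operates on wordpiece tokens rather than whole words for simplicity.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Illustrative sketch of the scoring step (Equation 1); model name and helpers are assumptions.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_log_prob(token_ids, position, target_id):
    """Log-probability of the original token at `position` when that position is masked."""
    ids = list(token_ids)
    ids[position] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits            # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[0, position], dim=-1)
    return log_probs[target_id].item()

def score_response(history, response):
    """Score each response token by how much the dialogue history raises its log-likelihood."""
    resp_ids = tokenizer.encode(response, add_special_tokens=True)   # [CLS] r [SEP]
    hist_ids = tokenizer.encode(history, add_special_tokens=True)    # [CLS] c [SEP]
    joint_ids = hist_ids + resp_ids[1:]                              # [CLS] c [SEP] r [SEP]
    offset = len(hist_ids) - 1            # shift of response positions inside the joint input
    scores = []
    for i in range(1, len(resp_ids) - 1):                            # skip [CLS] and [SEP]
        target = resp_ids[i]
        with_ctx = masked_log_prob(joint_ids, i + offset, target)    # log P(x_i | [c; r_/i])
        without_ctx = masked_log_prob(resp_ids, i, target)           # log P(x_i | r_/i)
        scores.append((tokenizer.convert_ids_to_tokens([target])[0], with_ctx - without_ctx))
    return scores
```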

Selecting
For each sentence, we select words that scored higher than the threshold t. For example, in the case seen in Figure 2, if the threshold is 0.5, the words "with", "heading", "vacation", and "?" will be selected. If none of the words receive a score higher than the threshold value, no words will be selected, and in this case, a negative sample cannot be generated.
We set the threshold t to 0.5 for our experiments. Using this threshold in our dataset, an average of 27.28% of tokens were selected for each response. Also, 94.89% of the responses contained at least one selected word, which means a negative sample could be generated for 94.89% of the cases.
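Given the per-token scores, the selection step reduces to simple thresholding. The following minimal sketch reuses score_response from the scoring sketch above; the threshold value t = 0.5 is the one used in the paper.

```python
def select_tokens(history, response, t=0.5):
    """Return the indices (into the scored token list) of words whose score exceeds t."""
    scored = score_response(history, response)
    selected = [i for i, (_, s) in enumerate(scored) if s > t]
    return selected          # may be empty, in which case no negative sample is generated
```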

Replacing
The selected words are then replaced using an LM. All selected words are replaced with [mask] tokens in the original response. Then the LM predicts, without considering the dialogue history, the words that are most likely to occur in the location of each masked word. If the LM predicts the original word, the second most likely word is used instead.
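A minimal sketch of the replacement step, continuing the two sketches above: all selected positions are masked at once and each is filled with the LM's top prediction made without the dialogue history, falling back to the second-best word when the top prediction is the original token. The index handling follows our earlier sketches and is an assumption about the exact implementation.

```python
def replace_tokens(response, selected):
    """Replace the selected tokens with LM predictions made without the dialogue history."""
    resp_ids = tokenizer.encode(response, add_special_tokens=True)
    masked = list(resp_ids)
    for i in selected:
        masked[i + 1] = tokenizer.mask_token_id   # +1 maps a score index to its position after [CLS]
    with torch.no_grad():
        logits = model(torch.tensor([masked])).logits[0]
    for i in selected:
        top2 = torch.topk(logits[i + 1], k=2).indices.tolist()
        # Use the second-best prediction if the best one is just the original word.
        resp_ids[i + 1] = top2[1] if top2[0] == resp_ids[i + 1] else top2[0]
    return tokenizer.decode(resp_ids[1:-1])        # drop [CLS] and [SEP]

# Example usage (illustrative):
# negative = replace_tokens(golden, select_tokens(history, golden))
```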

Dataset
To measure the correlation between model predictions and human evaluations, we use the response-evaluation dataset proposed by Zhao et al. (2020). The dataset contains dialogue histories, machine-generated responses, golden responses, and appropriateness scores evaluated by human annotators. The scores were on a 5-point Likert scale, and each response was scored by four annotators. Six generative models, S2S (Sutskever et al., 2014), attentional S2S, HRED (Serban et al., 2016), VHRED (Serban et al., 2017), GPT2-sm, and GPT2-md (Wolf et al., 2018), with three decoding algorithms, greedy decoding, ancestral sampling, and nucleus sampling (Holtzman et al., 2020), were used to generate the responses. They used DailyDialog (Li et al., 2017) and PersonaChat (Zhang et al., 2018), and trained a set of generative conversation models for each dataset. For each dataset, 900 context-response pairs were randomly selected from the test set, and the annotators evaluated the appropriateness of each response to its context, yielding two different evaluation datasets. The Krippendorff's alpha for this dataset was 0.815, suggesting reasonable inter-annotator agreement.
The DailyDialog dataset consists of 13,118 multi-turn open-domain conversations written by human workers, and the PersonaChat dataset consists of 12,875 multi-turn open-domain conversations written by human workers.

Models
The evaluation models used in the experiment are listed below. Among them, BLEU, ROUGE, METEOR, Embedding Average/Extrema/Greedy, and BERTScore are reference-based metrics that evaluate the quality of a response based on its similarity to the golden response. BERT-MLM, GPT2-coherence, BERT-retrieval (random-N), and BERT-retrieval (ours) are unreferenced metrics that do not require golden responses. RUBER can be viewed as a hybrid metric that includes both reference-based and unreferenced approaches. Some of the reference-based metrics are simple comparison methods rather than trainable models, but are presented along with the other models because they can also be used to estimate the quality of responses. It should be noted that we do not compare the unsupervised approaches listed below with supervised approaches, such as the ones proposed in prior work including Zhao et al. (2020), which require human-annotated response-evaluation pairs for training.
BLEU is a widely used metric for machine translation that measures n-gram precision between multiple references and a hypothesis (Papineni et al., 2002).
ROUGE is a widely used metric for text summarization, which measures the n-gram recall (Lin, 2004). We use the F-score of ROUGE-L as an appropriateness score.
METEOR is a metric for the machine translation task, which considers both n-gram precision and n-gram recall of a hypothesis (Banerjee and Lavie, 2005).
Embedding Average/Greedy/Extrema calculate the similarity between golden and generated responses using embedding similarity, to account for the diverse ways in which the golden response could be stated (Liu et al., 2016).
BERTScore is a recently proposed unsupervised metric based on the contextualized BERT embeddings (Zhang et al., 2020).
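As a usage note, BERTScore can be computed with the bert-score package; the strings below are illustrative only.

```python
from bert_score import score

candidates = ["What's wrong? Go out with Mark for dinner."]
references = ["What's wrong with heading out with Mark for vacation?"]
# Returns precision, recall, and F1 tensors; F1 is typically reported as the metric value.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.4f}")
```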
RUBER calculates the scores of reference-based and unreferenced metrics individually, then uses them to predict the final score (Tao et al., 2018). The reference-based metric measures the similarity between golden responses and generated responses based on their embedding similarity. The unreferenced metric is trained on the NUP task.
BERT-MLM masks each token in a response in turn, sums the log-likelihoods assigned to the masked tokens by an LM fine-tuned on the corpus, and uses the aggregated likelihood as the final score of the response (Mehri and Eskenazi, 2020).
GPT2-coherence measures the coherence between the dialogue history and a response by using a fine-tuned GPT2 model (Radford et al., 2019) to compute the averaged log-likelihood of the response (Pang et al., 2020).
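A rough sketch of a GPT2-coherence-style score is given below, assuming a (fine-tuned) GPT2 LM and averaging the log-likelihood of the response tokens given the dialogue history; this is our reading of the metric, not the authors' code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")       # a fine-tuned checkpoint would be used in practice
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_coherence(history, response):
    """Average log-likelihood of the response tokens, conditioned on the dialogue history."""
    hist_ids = gpt2_tok.encode(history)
    resp_ids = gpt2_tok.encode(response)
    input_ids = torch.tensor([hist_ids + resp_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(gpt2(input_ids).logits[0], dim=-1)
    total = 0.0
    for pos in range(len(hist_ids), input_ids.size(1)):
        total += log_probs[pos - 1, input_ids[0, pos]].item()   # token at pos predicted from its prefix
    return total / len(resp_ids)
```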
BERT-retrieval (random-N) is a BERT-based model that is trained to distinguish a golden response from a negative sample given the dialogue history (Mehri and Eskenazi, 2020). We refer to the original model by Mehri and Eskenazi (2020) as BERT-retrieval (random-1), since it uses one random response as a negative sample for each dialogue history. We refer to a variation that uses two random negative samples per dialogue history as BERT-retrieval (random-2). This allows a fair comparison with our model, which also uses two negative samples per dialogue history, as explained below.
BERT-retrieval (ours) is a model that has the same structure as the BERT-retrieval model. The difference is that our model utilizes the negative samples generated by the method that we propose. The model uses both the generated negative samples and the random negative samples. Specifically, during training, the model learns to distinguish a golden response from two negative samples: one generated from our method and one randomly sampled from the corpus.
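A minimal training sketch for the BERT-retrieval (ours) setup described above: for each dialogue history, the golden response is labeled positive and both the generated and the random negative samples are labeled negative. The classification-head formulation, hyperparameters, and helper names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def training_step(history, golden, generated_negative, random_negative):
    """One update: the golden response is labeled 1; both negative samples are labeled 0."""
    responses = [golden, generated_negative, random_negative]
    labels = torch.tensor([1, 0, 0])
    batch = tok([history] * 3, responses, return_tensors="pt", truncation=True, padding=True)
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# At evaluation time, the softmax probability of the positive label for a
# (history, response) pair can serve as the appropriateness score.
```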

Implementation Details
We trained the unreferenced models on the original DailyDialog dataset, and then evaluated them on the two response-evaluation datasets (Section 4.1.1). We split the conversations in the DailyDialog dataset in a sliding-window manner to construct pairs of dialogue histories and corresponding responses. The maximum number of turns in the dialogue history was set to 5, following Zhao et al. (2020).
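The sliding-window construction of (dialogue history, response) pairs can be sketched as follows, with the history capped at 5 turns; the function and variable names are ours.

```python
def make_pairs(conversation, max_history=5):
    """Turn a multi-turn conversation (list of utterances) into (history, response) pairs."""
    pairs = []
    for i in range(1, len(conversation)):
        history = conversation[max(0, i - max_history):i]
        pairs.append((history, conversation[i]))
    return pairs

# Example: a 4-turn conversation yields 3 (history, response) pairs.
turns = ["Hi, how are you?", "Great, thanks.", "Any plans for the weekend?", "A short trip with Mark."]
print(make_pairs(turns))
```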
We use the pretrained BERT and GPT2 released by Wolf et al. (2018) for all of our relevant experiments. A BERT model, fine-tuned on the DailyDialog train set with MLM for 1 epoch, was used for the scoring step of our proposed method (Section 3.1). The same model was used for the replacing step (Section 3.3). We used a threshold of 0.5 for the selecting step (Section 3.2). We used the Adam optimizer (Kingma and Ba, 2015) for training. We searched for hyperparameters for the BERT-retrieval (random-1) model that maximize the (Pearson) correlation between human evaluations and model predictions on the response-evaluation dataset built from DailyDialog (Section 4.1.1). The values found in this search (epoch=3, batch size=64, and learning rate=2e-5) were used for all the BERT-retrieval models (random-N, ours). The random seed was fixed for all experiments.

Results
In Section 4.2.1, we check the correlations between the results of each evaluation model and human evaluations. In Section 4.2.2, we present an in-depth analysis of our proposed method. In Section 4.2.3, we present examples suggesting that automatic evaluation systems trained with the proposed method can make decisions closer to human judgment than models that have not.

Correlation with Human Judgment
Table 1 shows the correlation between model predictions and human evaluations for each model on the two datasets. Pearson correlation (r) and Spearman's rank correlation coefficient (ρ) were used to measure the correlation between human scores and model predictions. It should be noted that we excluded the scores of golden responses from the response-evaluation datasets and extracted 800 and 750 response-evaluation pairs from the DailyDialog and PersonaChat datasets, respectively. The model incorporating our negative sample method made predictions with a higher correlation to human evaluations than BERT-retrieval (random-2), which uses the same number of negative samples for training. Among the baseline models, most of the reference-based metrics showed comparatively low performance. These results support observations from previous studies that using the golden response as the "one and only" correct answer for evaluation can be ineffective. RUBER showed better performance than the other reference-based models on the DailyDialog dataset, but low performance in evaluating PersonaChat responses.
The GPT2-coherence model showed performance similar to the BERT-retrieval (random-1) model on the DailyDialog dataset, but relatively low performance on the PersonaChat dataset. It should also be noted that the hybrid and unreferenced models were trained on the DailyDialog dataset, not on the PersonaChat dataset. Figure 3 shows a scatter plot visualizing the human scores and model predictions for the response-evaluation dataset built from DailyDialog; following previous studies (Bak and Oh, 2020; Pang et al., 2020), we add noise sampled from N(0, 0.09) to the human scores for better visualization. BLEU tended to predict low scores, which may suggest that there were only a few n-gram overlaps between the golden responses and the generated responses. The predictions of the embedding-based metrics (Emb. Greedy and BERTScore) were concentrated in a narrow range and showed low correlation with human scores. The unreferenced and hybrid metrics (RUBER, BERT-MLM, GPT2-coherence, and BERT-retrieval (random-1)) show relatively higher correlations than the reference-based metrics. BERT-retrieval (ours) shows the greatest correlation among the models, with a correlation coefficient of 0.1974. The scatter plots suggest that false-positive predictions, which frequently occurred in the BERT-retrieval (random-1) predictions, occurred less frequently in our model's predictions. However, the scatter plot for our model has a step-function-like appearance: most of the responses received a score near 0 or near 1, which is problematic because an ideal model should be able to match human scores even when the scores are moderate. We consider this tendency a limitation of our model that must be addressed in future work.
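For reference, the correlations reported in Table 1 can be computed as in the following sketch; the score lists are illustrative placeholders, not values from the paper.

```python
from scipy.stats import pearsonr, spearmanr

human_scores = [4.25, 1.50, 3.00, 2.75]   # illustrative human appropriateness scores
model_scores = [0.91, 0.12, 0.64, 0.40]   # illustrative model predictions for the same responses

r, _ = pearsonr(human_scores, model_scores)
rho, _ = spearmanr(human_scores, model_scores)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")
```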

Model Analysis
We analyze our model by performing experiments with some variations in creating the negative samples to be used alongside the random negative sample:
(1) drop-golden: Instead of following the steps of scoring, selecting, and replacing, we randomly drop some of the words in the golden response to create a negative sample, and use it with the random negative sample.
(2) shuffle-golden: Instead of following the three steps, we randomly shuffle the words in the golden response to create a negative sample, and use it with the random negative sample.
(3) score-w/o-history: We use the scoring function in Equation 1 without the first term, so that it only considers the probabilities within the response, without the dialogue history.
(4) select-random: Instead of using the scoring function in Equation 1, we randomly select the words to be replaced.
(5) replace-w-history: When replacing a word, we concatenate the dialogue history with the response so that the LM considers the dialogue history when predicting the masked words.
Table 2 shows the correlations between model predictions and human evaluations for the modified models above. Dropping or shuffling words in the golden response to make a negative sample yields similar or lower performance compared to using random responses (BERT-retrieval (random-1, random-2)). The correlation was lower when the dialogue history was not considered in the scoring process than when it was considered; we speculate that this is because the modified scoring function gives high scores not only to words important for the consistency of the conversation, but also to words with low likelihoods in general. Randomly selecting the tokens shows lower correlation than using our proposed scoring function. Considering the dialogue history in the replacing process also gives lower performance than not considering it; we speculate that providing the dialogue history leads to predictions for the masked words that are more appropriate to the context, making the reconstructed response less effective as a negative sample.
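For concreteness, the drop-golden and shuffle-golden baselines above can be sketched as follows; the drop ratio is an assumption, as the paper does not specify one.

```python
import random

def drop_golden(response_tokens, drop_ratio=0.3):
    """Randomly drop words from the golden response (drop ratio is an assumed value)."""
    kept = [tok for tok in response_tokens if random.random() > drop_ratio]
    return kept if kept else response_tokens[:1]   # avoid returning an empty response

def shuffle_golden(response_tokens):
    """Randomly shuffle the words of the golden response."""
    shuffled = list(response_tokens)
    random.shuffle(shuffled)
    return shuffled
```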

Case Study
In the first two examples, BLEU gave the responses low scores because they share no bi-grams with the golden responses. RUBER and GPT2-coherence also did not recognize the utterances as appropriate responses, whereas BERT-retrieval (random-1) and BERT-retrieval (ours) gave them relatively high scores, evaluating them as appropriate utterances. In the third example, the system response appears somewhat relevant to the given context because it includes some words ("chance", "future") relevant to the phrase "take part in the finals". A repeated phrase in this example ("to get a chance") is believed to have contributed to the low human evaluation score (0.12). The RUBER and BERT-retrieval (random) models appear to lack this intuition and instead evaluate the response as appropriate, possibly because some words appear relevant. Our proposed model scored the response with a relatively low score of 0.15, which was close to the human score. In the fourth example, the response is not coherent, but because it begins with the sentence "Let me get a peek", it can appear superficially coherent with the preceding dialogue about parking tickets. For this case, our proposed model and GPT2-coherence gave scores similar to the human scores.

POS-tag distribution of selected words
We compute the Part-of-Speech (POS) tag distribution of the words selected by our method and compare it with the distribution over the original DailyDialog corpus (Figure 5; original corpus on the left, selected words on the right). The VERB and NOUN tags are the most frequently selected (21.9% and 20.5%, respectively), and their ratios are higher than in the original corpus (18.3% and 16.7%, respectively). Meanwhile, the ratio of the punctuation tag (.) decreases sharply (from 21.3% to 12.1%). We suspect that the likelihood of punctuation tokens is affected more by local information within the response than by the dialogue history.
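The POS-tag distribution can be computed with an off-the-shelf tagger. The small sketch below uses spaCy (the model name is illustrative); restricting the input to the selected tokens gives the right-hand distribution of Figure 5.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")   # requires `python -m spacy download en_core_web_sm`

def pos_distribution(sentences):
    """Percentage of each universal POS tag over all tokens in the given sentences."""
    counts = Counter(tok.pos_ for sent in sentences for tok in nlp(sent))
    total = sum(counts.values())
    return {tag: round(100 * n / total, 1) for tag, n in counts.most_common()}

print(pos_distribution(["What's wrong with heading out with Mark for vacation?"]))
```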

Are the generated samples actually inappropriate?
To see whether the negative samples generated by our method are actually inappropriate, we conducted a survey through Amazon Mechanical Turk (AMT). We selected 40 dialogue history examples and prepared three types of responses for each dialogue: 1) the golden response, 2) a negative sample generated by our method, and 3) a randomly selected negative sample from the corpus. For each dialogue, four annotators were asked to score the quality of the three responses. Following previous work, we asked the question "How appropriate is the response overall?" for each context-response pair, and the evaluation was conducted on a 5-point Likert scale. The Fleiss' kappa and Krippendorff's alpha for the annotations were both 0.63. Figure 6 shows the survey results. The mean scores of the golden and random responses were 4.65 and 1.19, respectively, while the mean score of our negative samples was 2.51. The standard deviations of the scores were 0.67 for the golden responses, 1.27 for our negative samples, and 0.41 for the random responses. These results do not guarantee that every generated negative sample is inappropriate. They do suggest, however, that manipulating a golden response with our method yields a response that is less appropriate than the original. Table 3 shows two examples of the three different types of responses for a given dialogue history with their survey results.
Table 3: Examples of three different types of responses for a given dialogue history with their survey results. The highlighted words are newly generated by our method, and the score of each response is underlined.
For a model learning to find the difference between appropriate and inappropriate responses, we speculate that distinguishing the negative samples generated by our method from the golden responses is more difficult than distinguishing randomly selected negative samples from the golden responses, because the generated negative samples can be inappropriate in more subtle ways than completely unrelated responses are. We suspect that learning in this more challenging setting resulted in the performance gain discussed in Section 4.2.1. However, a more in-depth analysis of individual cases is still needed, such as a more extensive quantitative human study and a closer interpretation of the semantic relationships between the original golden responses and the negative samples produced by our method. We leave this as future work.

Conclusion
In this paper, we proposed an automatic method for generating negative samples that can be used to train an unsupervised, unreferenced response evaluation model. We performed experiments demonstrating that the proposed method can boost the unsupervised training of a response evaluation model, analyzed the experimental results quantitatively, and examined examples that show the distinct characteristics of our proposed method.