Evaluating Dialogue Generation Systems via Response Selection

Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method to construct response selection test sets with well-chosen false candidates. Specifically, we propose to construct test sets filtering out some types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with the test set developed by our method correlates more strongly with human evaluation, compared with widely used automatic evaluation metrics such as BLEU.


Introduction
Automatic evaluation of open-domain dialogue generation systems has the potential to drive their research and development because of its high reproducibility and low cost. However, existing automatic evaluation metrics, such as BLEU (Papineni et al., 2002), correlate poorly with human evaluation (Liu et al., 2016). This poor correlation arises from the nature of dialogue: there are many acceptable responses to an input context, which is known as the one-to-many problem (Zhao et al., 2017).
To tackle this issue, we focus on evaluating response generation systems via response selection. In this task, systems select an appropriate response for a given context from a set of response candidates. Each candidate has a label that indicates whether it is an appropriate response for the given context. Traditionally, response selection has been used to evaluate retrieval-based dialogue systems (Lowe et al., 2015; Wu et al., 2017). We consider applying this task to driving the research for dialogue generation systems. Specifically, we consider using response selection to pick out, from among many candidate systems, the promising ones that should be evaluated more precisely by humans. We regard response selection as a valid option for such a preliminary evaluation on the basis of the following assumption: systems that can generate appropriate responses can also select appropriate responses. One advantage of evaluating generation systems via response selection is that it can remedy the one-to-many problem, because we do not have to consider the appropriate responses that are not included in the sets of response candidates. Another advantage is that it enables a simple and clear comparison between systems in terms of accuracy.
Figure 1: Overview of the construction method of our test set. First, we retrieve only utterances related to the ground-truth response from a repository. Then, we remove acceptable utterances by human evaluation.
Generally, false response candidates are randomly sampled from a repository (Lowe et al., 2015; Gunasekara et al., 2019), which causes two problems: (i) unrelated false candidates and (ii) acceptable utterances labeled as false. The first problem is that randomly sampled false candidates are often too far from the ground-truth responses. Consider the case where, for a given context "Do you have a car?", a response candidate "I play tennis." is randomly sampled. Systems can easily recognize this candidate as a false one because the two share no related content words. Such excessive easiness is not preferable because the performance gap between good and inferior systems tends to be small. The second problem is that there is no guarantee that randomly sampled candidates are always unacceptable. For example, "I don't know." is often sampled as a false response because this phrase occurs frequently in open-domain dialogues, yet it can be regarded as acceptable for various contexts. These two problems make general response selection test sets unreliable.
In this work, we propose a method to construct response selection test sets with well-chosen false candidates (Figure 1). First, we retrieve only utterances related to the ground-truth response. Then we remove acceptable utterances by human evaluation. Through experiments, we demonstrate that automatic evaluation using the test set developed by our method correlates more strongly with human evaluation, compared with widely used automatic evaluation metrics such as BLEU. Our empirical results indicate that response selection with well-chosen false candidates can be a valid option for evaluating response generation systems. We will release the test set used in the experiments.

Related Work
Automatic evaluation metrics Various metrics have been proposed for automatic evaluation of dialogue systems, such as BLEU, METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004), Greedy Matching (Rus and Lintean, 2012), and Vector Extrema (Forgues et al., 2014). These metrics evaluate the quality of the responses generated by systems. However, this is challenging due to the one-to-many problem. For example, ADEM, a metric proposed by Lowe et al. (2017), is easily fooled by adversarial examples (responses) (Sai et al., 2019). To remedy the one-to-many problem, we focus on evaluating systems via response selection.
Response selection test sets with human labels One popular test set for response selection is the Douban Conversation Corpus in Chinese (Wu et al., 2017). In this test set, each response candidate has a manually annotated label that indicates whether or not the candidate is appropriate for the given context. Although this test set is similar to ours, the purpose and the procedure of the test set design differ. Their test set was created to simulate and evaluate retrieval-based dialogue systems. Thus, all the candidates in this corpus are retrieved by using the context as the query, as retrieval-based systems do. In this paper, we develop an English response selection test set with human labels to evaluate dialogue generation systems. One of the salient differences from the Douban Conversation Corpus is the procedure for retrieving false candidates: we retrieve false candidates using the ground-truth responses. With this method, we can more reliably collect false candidates that are related to the ground-truth responses, which also facilitates error analysis as described in Section 4.3.

Construction Method
For each context $c$ and ground-truth response $r_{\text{true}}$, we construct a set of false response candidates $r_{\text{false}} \in R_{\text{false}}$ by retrieving utterances $u \in U$ from an utterance repository $U$. As we mentioned in Section 1, we want to filter out some types of utterance: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. We filter out such utterances as follows:
1. Retrieve $M$ utterances, $\{u_1, \cdots, u_M\}$, related to the ground-truth response $r_{\text{true}}$ from the utterance repository $U$.
2. Remove acceptable ones from the retrieved utterances by human evaluation.
1. Retrieve utterances related to the ground-truth response We assume that utterances related to the ground-truth response share similar content words with it. Here, we retrieve the related utterances on the basis of the similarity of their content words. This process makes it difficult for systems to distinguish between ground-truth and false candidates only by comparing the content words.
2. Remove acceptable utterances Coincidentally, some of the retrieved utterances may be acceptable as an appropriate response. To remove such utterances, we ask human annotators to evaluate each retrieved utterance. Specifically, we instruct five annotators (per candidate) to score each retrieved candidate on a five-point scale from 1 to 5. A score of 5 means that the utterance can clearly be regarded as an appropriate response for the given context, whereas a score of 1 means that it cannot be regarded as an appropriate one at all. In addition to these scores, we instruct annotators to give a score of 0 to ungrammatical utterances. We remove the utterances that are given a score of 3 or higher by three or more annotators because utterances with such high scores can be acceptable. We also remove the utterances that are given a score of 0 by three or more annotators because these are likely to be ungrammatical. Finally, we instruct annotators to score the ground-truth responses, mixed in with the retrieved utterances. We remove a question if the score of its ground-truth response is low, i.e., if three or more annotators give a score of 3 or lower. This is intended to ensure that the ground-truth responses are indeed appropriate for the given context.
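The removal rules above can be illustrated with a minimal sketch. The function names are hypothetical, and the sketch assumes that each item comes with exactly the five annotator scores described in the text and that a score of 0 on a ground-truth response counts as "3 or lower"; the text does not state how that case is handled.

```python
from typing import Sequence

MIN_VOTES = 3  # "three or more annotators" in the text


def is_acceptable(scores: Sequence[int]) -> bool:
    """True if three or more annotators gave a score of 3 or higher."""
    return sum(1 for s in scores if s >= 3) >= MIN_VOTES


def is_ungrammatical(scores: Sequence[int]) -> bool:
    """True if three or more annotators gave the special score 0."""
    return sum(1 for s in scores if s == 0) >= MIN_VOTES


def keep_false_candidate(scores: Sequence[int]) -> bool:
    """A retrieved utterance stays as a false candidate only if it is
    neither acceptable nor ungrammatical."""
    return not is_acceptable(scores) and not is_ungrammatical(scores)


def keep_question(ground_truth_scores: Sequence[int]) -> bool:
    """A question is dropped when three or more annotators scored its
    ground-truth response 3 or lower (assumption: 0 counts as low)."""
    low_votes = sum(1 for s in ground_truth_scores if s <= 3)
    return low_votes < MIN_VOTES
```

For instance, a retrieved utterance scored [1, 2, 2, 3, 4] would be kept as a false candidate under this sketch, whereas one scored [3, 3, 4, 1, 1] would be removed as acceptable.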

Overview of Constructed Test Set
Settings of test set construction We retrieve 10 utterances (per question) from the repository and remove acceptable ones following the method described in Section 3.1. We use crowdsourcing to score the retrieved utterances. After removing acceptable utterances, some questions still have 6 or more available false candidates. From these questions, we develop new questions with the same context but different candidates (both ground-truth responses and false candidates). For each new question, we regard one of the acceptable utterances removed by human evaluation as its ground-truth response. We use the dialogue data from DailyDialog to construct the test set. We extract the first four turns of each dialogue sample from DailyDialog, regarding the fourth utterance as the ground-truth response. We extract the utterances of OpenSubtitles2018 (Lison et al., 2018) to construct the repository used to retrieve false candidates. Note that the repository does not contain the utterances in the dialogue data used to train response generation systems in Section 4.1.

Statistics of our test set
We developed a test set that consists of 1,019 questions with 4 candidates each (1 ground-truth + 3 false candidates). If we regard the scoring as binary classification (scores higher than 3 are regarded as appropriate responses, and the others not), the Fleiss' Kappa of the scoring is 0.63, which is higher than that of the Douban Conversation Corpus (0.41). Table 2 shows an example of our test set. All the false response candidates share the same content word "focus" related to the topic "camera".
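For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Fleiss' Kappa could be computed after binarizing the five annotator scores as described above. The function names and the input layout (one row of per-category counts per item) are assumptions made for illustration, not the authors' code.

```python
from typing import List, Sequence


def binarize(scores: Sequence[int], threshold: int = 3) -> List[int]:
    """Return [n_not_appropriate, n_appropriate] for one item, where a
    score strictly higher than `threshold` counts as appropriate."""
    appropriate = sum(1 for s in scores if s > threshold)
    return [len(scores) - appropriate, appropriate]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same number of annotators.
    Undefined (division by zero) if all ratings fall in one category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    category_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among annotator pairs, per item.
    item_agreement = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(item_agreement) / n_items
    expected = sum(p * p for p in category_share)
    return (observed - expected) / (1 - expected)
```

A table for the whole test set would be built as `[binarize(scores) for scores in all_item_scores]`, where `all_item_scores` is a hypothetical list holding the five scores of each candidate.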

Preliminary experiments
We conducted a simple experiment to investigate whether or not a system that takes only content words into account can recognize the false response candidates in our test set. For the model, we used the TF-IDF model (Lowe et al., 2015), which simply compares the content words of a given context with those of each candidate. The resulting accuracy was 0.461. For comparison, we also replaced all the false candidates in our test set with randomly sampled utterances, and the accuracy of the same TF-IDF model increased to 0.671. These results indicate that it is difficult to recognize the false candidates in our test set only by comparing content words.
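To make the idea of a content-word baseline concrete, the sketch below selects the candidate whose TF-IDF vector is most similar to that of the context. It is only an illustration under simple assumptions (a regular-expression tokenizer, smoothed IDF estimated from a small list of documents, cosine similarity); it is not the exact TF-IDF model of Lowe et al. (2015), and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def build_idf(documents: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency estimated from `documents`."""
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(documents)
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}


def tfidf(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; words unseen in the IDF table get weight 0."""
    return {w: tf * idf.get(w, 0.0)
            for w, tf in Counter(tokenize(text)).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_by_tfidf(context: str, candidates: List[str],
                    idf: Dict[str, float]) -> int:
    """Index of the candidate most similar to the context."""
    ctx = tfidf(context, idf)
    return max(range(len(candidates)),
               key=lambda i: cosine(ctx, tfidf(candidates[i], idf)))
```

A model of this kind is drawn to any candidate that shares content words with the context, which is why false candidates that share such words are harder for it than randomly sampled ones.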

Experiments
We test whether the automatic evaluation of response generation systems on our test set correlates with human evaluation.

Experimental Procedure
We train multiple response generation systems and rank them on the basis of human and automatic evaluation scores. By comparing the system ranking by human scores with the ranking by each automatic score, we verify the correlations.

Response Generation Models
We train 10 different response generation systems to be ranked in the experiments. Each system uses one of three architectures: Seq2Seq with GRU (Cho et al., 2014), Seq2Seq with LSTM (Hochreiter and Schmidhuber, 1997), or Transformer (Vaswani et al., 2017). Some systems share the same architecture but differ in hyper-parameters. We train the models on OpenSubtitles2018. The training data consists of 5M samples and the validation data consists of 0.05M samples, each of which is a four-turn dialogue.

Evaluation Procedure
Ground-truth system ranking by human scores The trained systems generate a response $r_{\text{gen}}$ for each input context $c \in C$. Then, five human annotators (per response) score each generated response $r_{\text{gen}}$ on a five-point scale from 1 to 5. A score of 5 means that the response can clearly be regarded as an appropriate response for the given context, whereas a score of 1 means that it cannot be regarded as an appropriate one at all. As a result, we obtain five scores, $\{s_1, s_2, \cdots, s_5\}$, for each response $r_{\text{gen}}$ and average them: $s_{\text{mean}} = \mathrm{mean}(s_1, s_2, \cdots, s_5)$. We also average $s_{\text{mean}}$ across all the questions given to the systems to obtain the final score $s_{\text{final}}$ for each system. Based on this score, we make a ranking of the systems and regard it as the ground-truth ranking.
Although we developed a test set that consists of 1,019 questions, it is too costly to have humans evaluate the responses of all 10 systems for all 1,019 questions. Thus we give the contexts of 56 randomly sampled questions from our test set to the 10 systems as the inputs $C$.
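The two-level averaging described above (first over the five annotators, then over the questions) and the resulting ranking can be sketched as follows. The data layout and function names are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Dict, List


def final_score(scores_per_response: List[List[int]]) -> float:
    """Average the five annotator scores of each response (s_mean),
    then average s_mean over all evaluated questions (s_final)."""
    return mean(mean(scores) for scores in scores_per_response)


def rank_by_human_scores(system_scores: Dict[str, List[List[int]]]) -> List[str]:
    """System names sorted from the highest to the lowest s_final."""
    finals = {name: final_score(s) for name, s in system_scores.items()}
    return sorted(finals, key=finals.get, reverse=True)
```

Here `system_scores` maps a system name to one list of five scores per evaluated question, i.e., 56 lists per system in the setting described above.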
System ranking by response selection accuracy We rank the systems by response selection accuracy with well-chosen false candidates (CHOSEN). The trained response generation systems (see Appendix B for the model settings) compute the softmax cross-entropy loss $\ell_r$ for each response candidate $r \in R$. We regard the candidate with the lowest loss as the system's selection: $\hat{r} = \operatorname{argmin}_{r \in R} \ell_r$. From the predictions, we calculate accuracy and make a ranking of the systems based on the accuracy. For comparison, we also make a ranking by response selection accuracy with randomly sampled false candidates (RANDOM). We compute the accuracy of CHOSEN and RANDOM using all 1,019 questions from our test set.

Table 3: Correlations between the ground-truth system ranking and the rankings by automatic evaluation (columns: Metrics, Spearman, p-value).
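The selection rule described above (pick the candidate with the lowest loss, then score the system by accuracy) can be sketched as follows. The loss values are assumed to be supplied by a trained generation system; how the loss is aggregated over the tokens of a candidate is not specified here, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_candidate(losses: Sequence[float]) -> int:
    """Index of the candidate with the lowest loss (the argmin)."""
    return min(range(len(losses)), key=losses.__getitem__)


def selection_accuracy(questions: List[Tuple[Sequence[float], int]]) -> float:
    """`questions` holds (losses, gold_index) pairs, where `losses` are the
    per-candidate losses and `gold_index` marks the ground-truth response."""
    correct = sum(1 for losses, gold in questions
                  if select_candidate(losses) == gold)
    return correct / len(questions)
```

Because the rule needs only a loss for each candidate, any generation system that assigns a likelihood to a given response can be evaluated this way without being modified.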

System ranking by other evaluation metrics
For comparison, we also make rankings of the systems by three existing automatic evaluation metrics: BLEU, METEOR, and ROUGE-L. First, the trained systems generate a response for each input context. Then we compute the scores by comparing the generated responses with the ground-truth responses.
These scores can be computed automatically without false candidates. Thus we compute them using all 7,393 available four-turn dialogue samples from DailyDialog, regarding the fourth utterances as the ground-truth responses.

Results
We compare the rankings by Spearman's rank correlation coefficients, shown in Table 3. First, we estimated the human upper bound: we evaluated the correlation between the rankings made by different annotators (HUMAN). We randomly divided the human evaluations into two groups and made two rankings. The correlation coefficient between the two rankings was 0.87. Second, we found that the rankings made using existing automatic evaluation metrics correlate poorly with the ground-truth ranking. BLEU, often used to evaluate generation systems, does not correlate with human evaluation at all. One exception is ROUGE-L. However, its correlation coefficient is lower than 0.4, the value that indicates reasonable correlation. Third, we found that the ranking made by using our test set correlates reasonably well with the ground-truth ranking in comparison with the other metrics, and its correlation coefficient (CHOSEN) is higher than 0.4.
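As a reminder of how the statistic used above is obtained, the following sketch computes Spearman's rank correlation coefficient as the Pearson correlation of the rank-transformed scores, giving tied values their average rank. It is a plain-Python illustration with hypothetical function names; it does not compute the p-value, and it is undefined when either input is constant.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho between two score lists of equal length."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)
```

In the setting of this section, `x` would hold the human final scores of the 10 systems and `y` the scores of the same systems under one automatic metric, in the same system order.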

Discussion
Instability of evaluation with random sampling The correlation coefficient of the ranking by response selection with randomly sampled false candidates (RANDOM) is higher than that of BLEU and slightly lower than that of CHOSEN. However, we observed a serious problem: instability. We make 100 test sets, each of which consists of different false candidates obtained by random sampling with a different seed. For each test set, we make a system ranking and compute its correlation coefficient. Figure 2 shows the box plot of the Spearman's rank correlation coefficients of the trials. The range of the coefficients is very wide (0.06-0.67). This result means that the quality of evaluation with randomly sampled false candidates strongly depends on the sampled candidates, an uncontrollable factor stemming from the randomness.
Interpretable error analysis Our automatic evaluation with well-chosen false candidates brings another benefit: interpretable error analysis. Table 4 shows an example of a question in our test set. The well-chosen false candidate (CHOSEN) is similar to the ground-truth response. However, the grammatical subject of the CHOSEN sentence is "You", which completely mismatches the context. Thus, if systems select this false candidate, they may lack the ability to correctly determine the subject of a sentence. In this way, our test set enables us to analyze systems' predictions from various meaningful perspectives. As a case study, we designed a set of error labels, each of which indicates why a false candidate is false, and tried to assign them to 50 false candidates in our test set. We succeeded in assigning the labels to 22 out of the 50 candidates.
Limitation Our test set is designed to evaluate open-domain dialogue generation systems. Thus, it is not suitable for evaluating other types of dialogue systems, such as task-oriented ones. By contrast, existing automatic evaluation metrics, such as BLEU, do not have this type of restriction.

Conclusion
In this paper, we focused on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we proposed a method to construct response selection test sets with well-chosen false candidates. Specifically, we proposed to construct test sets filtering out some types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. We demonstrated that evaluating systems via response selection with the test sets developed by our method correlates more strongly with human evaluation, compared with that of widely used metrics such as BLEU.
In the future, we will provide labels that indicate "Why this candidate is false" for false candidates in our test set, so that one can easily detect weak points of systems through error analysis.

A Methods to Retrieve False Candidates
To make the false candidates in each pool diverse, we use two retrieval methods: lexical retrieval and embedding-based retrieval. We use Lucene for lexical retrieval, and the cosine similarity of sentence vectors for embedding-based retrieval. Sentence vectors are the SIF-weighted (Arora et al., 2017) average of ELMo word vectors (Peters et al., 2018).
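The embedding-based retrieval described above can be illustrated with the sketch below. It makes several simplifying assumptions that are not taken from the paper: a static word-vector table stands in for the contextual ELMo vectors, the SIF weight is a / (a + p(w)) with a = 1e-3 (a commonly used value), and the principal-component removal step of the full SIF method is omitted. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sif_sentence_vector(tokens: Sequence[str],
                        word_vectors: Dict[str, List[float]],
                        word_prob: Dict[str, float],
                        a: float = 1e-3) -> List[float]:
    """SIF-weighted average of word vectors: frequent words (high p(w))
    receive a small weight a / (a + p(w)). Unknown tokens are skipped."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    n_used = 0
    for tok in tokens:
        if tok not in word_vectors:
            continue
        weight = a / (a + word_prob.get(tok, 0.0))
        for d, value in enumerate(word_vectors[tok]):
            total[d] += weight * value
        n_used += 1
    return [t / n_used for t in total] if n_used else total


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec: Sequence[float],
             repo_vecs: Sequence[Sequence[float]], m: int) -> List[int]:
    """Indices of the m repository utterances most similar to the query,
    e.g., to the sentence vector of the ground-truth response."""
    ranked = sorted(range(len(repo_vecs)),
                    key=lambda i: cosine(query_vec, repo_vecs[i]),
                    reverse=True)
    return ranked[:m]
```

Down-weighting frequent words lets the similarity be driven mainly by content words, which fits the goal of retrieving utterances that share content words with the ground-truth response.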
We tried to assign the error labels to 50 false candidates from our test set, and could eventually assign labels to 22 of them. The types of our error labels and their breakdown are listed in Table 7. Examples of false candidates (CHOSEN) corresponding to the error labels are shown in Table 4 (for the label "Responses that have wrong subjects"), Table 8, Table 9, and Table 10.

Table 7: Error labels and the breakdown of the assigned labels.
Error label                                     Count
Inconsistent responses with the context         8
Responses that have insufficient information    4
Responses that have wrong subjects              9
Responses with wrong tense                      1

Context:
A: 911 emergency. What is the problem?
B: I would like to report a break-in.
A: When was this break-in?
Candidates:
Ground-Truth: I believe it happened last night.
CHOSEN: I thought that would happen last night.

Context:
A: What's the matter with you, Paul?
B: I'm not feeling well. I think I'm having a cold.
A: Looks like it. You need to drink a lot of water and take a good rest.