Measuring the ‘I don’t know’ Problem through the Lens of Gricean Quantity

We consider the intrinsic evaluation of neural generative dialog models through the lens of Grice’s Maxims of Conversation (1975). Based on the maxim of Quantity (be informative), we propose Relative Utterance Quantity (RUQ) to diagnose the ‘I don’t know’ problem, in which a dialog system produces generic responses. The linguistically motivated RUQ diagnostic compares the model score of a generic response to that of the reference response. We find that for reasonable baseline models, ‘I don’t know’ is preferred over the reference the majority of the time, but this can be reduced to less than 5% with hyperparameter tuning. RUQ allows for the direct analysis of the ‘I don’t know’ problem, which has been addressed but not analyzed by prior work.


Introduction
Neural generative dialog models have a tendency to produce generic, safe responses such as 'I don't know' (Serban et al., 2016; Li et al., 2016a). The repetition of such phrases is annoying to users and contributes nothing to the conversation.
Evaluating chatbots is an active area of research, partly due to their open-ended nature (Hashimoto et al., 2019; Sedoc et al., 2019; Li et al., 2019; Mehri and Eskenazi, 2020b; Deriu et al., 2020). To the best of our knowledge, no prior work focuses on analyzing systems for generic, safe responses such as 'I don't know.' While prior work (Li et al., 2016a,b; Csáky et al., 2019; Welleck et al., 2020) addresses the 'I don't know' problem, the lack of analysis leaves it unclear whether a method improves models by mitigating this particular problem or some other one.
One linguistic framework for analyzing conversations is Grice's Cooperative Principle (1975), which consists of Maxims of Conversation that function as guidelines for effective communication. Grice considered conversations between humans, but there has also been some exploration in NLP (Bernsen et al., 1996; Harabagiu et al., 1996; Qwaider et al., 2017; Jwalapuram, 2017).
We discuss each of the categories of maxims and the ways a chatbot might violate them in Table 1.
We propose a novel automatic diagnostic inspired by the Gricean QUANTITY maxim. Relative Utterance Quantity checks whether the model favors a generic response (such as 'I don't know.') over the reference it was trained on for each prompt. We apply our diagnostic to a method designed to address this problem (Csáky et al., 2019), and find that the method does mitigate it, though not by as much as a hyperparameter search.


Relative Utterance Quantity (RUQ)
If a system responds 'I don't know.' when it could have given a better or more informative answer, this is by definition a violation of QUANTITY. Based on this interpretation, we propose a method for diagnosing the problem: we compare the model score of producing 'I don't know.' to the model score of producing the reference response. This can be done on the training data or on the test data. Particularly on the training data, we should expect the model to 'know' the data it was trained on, and therefore to score the reference higher than 'I don't know.'
We propose two diagnostic measures to compute the Relative Utterance Quantity of a model: (1) We plot the average model score for each token across sentences, comparing the original reference, the beam-search output, and two 'I don't know' (IDK) variants: 'I don't know.' and 'I don't know what to do.' This allows for the visualization of the relative gap in scores at different points in the sentence.
(2) We compute the (length-normalized) model score for 'I don't know.' and for the reference of each training prompt, and count how many times the reference is preferred. We denote the latter as the RUQ score. Both measures generalize to other generic responses, as might be appropriate for other corpora or other languages.
If there are multiple references, we recommend comparing against the lowest-likelihood reference when computing the RUQ score, since all valid references should be preferred over 'I don't know.'
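The RUQ score computation can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the toy log-probability lists stand in for real per-token model scores, and the function names are ours.

```python
def length_normalized_score(token_logprobs):
    """Average per-token log-probability of a response under the model."""
    return sum(token_logprobs) / len(token_logprobs)

def ruq_score(examples):
    """Fraction of prompts whose reference outscores the generic response.

    `examples` is a list of (reference_logprobs, idk_logprobs) pairs, one per
    training prompt; each element holds the model's token log-probs for that
    string. With multiple references, pass the lowest-likelihood one, since
    all valid references should beat 'I don't know.'
    """
    preferred = sum(
        length_normalized_score(ref) > length_normalized_score(idk)
        for ref, idk in examples
    )
    return preferred / len(examples)

# Toy scores: two prompts where the reference wins, one where IDK wins.
examples = [
    ([-0.5, -0.7, -0.4], [-1.2, -1.0, -0.9]),
    ([-0.3, -0.6],       [-0.8, -0.8, -0.7]),
    ([-2.0, -2.5, -1.8], [-0.4, -0.5, -0.6]),
]
print(ruq_score(examples))  # 2 of 3 references preferred
```

A higher RUQ score is better; a score below 0.5 means the model prefers the generic response for most of its own training prompts.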
We note that RUQ captures some types of QUANTITY violations, but not all violations of this maxim.

Data
Following Khayrallah and Sedoc (2020), we train and evaluate on DailyDialog (Li et al., 2017), which consists of ∼80,000 turns of English learners practicing 'daily dialogues' in various contexts, e.g., chatting about vacation or food. We use the version released by ParlAI (Miller et al., 2017), which is tokenized and lowercased; following Khayrallah and Sedoc (2020), we detokenize and recase the data for training.
We also use Entropy-Based Data Filtering (Csáky et al., 2019), which filters out high-entropy utterances (prompts that solicit many different responses, and responses that can apply to many different prompts) with the goal of removing generic ones. We use the recommended filtering threshold of 1.0 and 'IDENTITY' clustering. We filter based on their 'source', 'target', and 'both' settings, and consider 'target' as the baseline, as they find it works best. We denote models trained on DailyDialog as DD, and models trained on Csáky et al.'s entropy-filtered version as EF.
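The filtering idea can be sketched as follows. This is our own illustrative reimplementation of the 'target'-side, IDENTITY-clustering case, not Csáky et al.'s code: a response paired with many distinct prompts gets high entropy and is removed.

```python
from collections import Counter, defaultdict
from math import log2

def utterance_entropy(pairs, side="target"):
    """Entropy of each utterance's partner distribution.

    IDENTITY clustering means utterances are grouped only by exact string
    match. For side="target", a response that appears with many different
    prompts (e.g. 'I don't know.') gets high entropy.
    """
    partners = defaultdict(Counter)
    for src, tgt in pairs:
        key, partner = (tgt, src) if side == "target" else (src, tgt)
        partners[key][partner] += 1
    entropy = {}
    for key, counts in partners.items():
        total = sum(counts.values())
        entropy[key] = -sum(
            (c / total) * log2(c / total) for c in counts.values()
        )
    return entropy

def filter_pairs(pairs, threshold=1.0, side="target"):
    """Drop pairs whose utterance on `side` exceeds the entropy threshold."""
    ent = utterance_entropy(pairs, side)
    idx = 1 if side == "target" else 0
    return [p for p in pairs if ent[p[idx]] <= threshold]

pairs = [
    ("How was your trip?", "I don't know."),
    ("What should we eat?", "I don't know."),
    ("Any plans tonight?", "I don't know."),
    ("Where is the station?", "Two blocks north."),
]
# 'I don't know.' pairs with 3 distinct prompts: entropy log2(3) > 1.0,
# so those pairs are filtered; the specific response survives.
print(filter_pairs(pairs))
```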

Standard Automatic Metrics
We use the single-reference and multi-reference automatic evaluation framework for DailyDialog released by Gupta et al. (2019), which is computed using NLG-EVAL (Sharma et al., 2017). We primarily consider multi-reference METEOR (Lavie and Agarwal, 2007); see Appendix A.7 for all metrics.

Human Evaluation
For human evaluation of the different systems, we use crowdworkers on Amazon Mechanical Turk to judge the fluency, coherence, and interestingness of utterances on a 1-5 Likert scale (see Appendix A.4 for full details) for 100 randomly sampled evaluation-set prompts. Four annotators judge the responses from all systems for each prompt in a single-turn context. We remove from the results any annotator with a linearly weighted Cohen's kappa < 0.1.
We denote models trained with the FLORES hyperparameters as BASE, and the best model from the hyperparameter search for each data type (as selected by multi-reference METEOR) as BEST. We report the multi-reference METEOR scores for the BASE and BEST systems in Table 2. For the DailyDialog data we find that hyperparameter tuning can improve multi-reference METEOR from 12.7 (DD-BASE) to 17.8 (DD-BEST).
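Annotator filtering of this kind can be sketched with a pure-Python linearly weighted Cohen's kappa. The rating lists below are toy data, and the paper does not specify how pairwise agreement was aggregated across the four annotators, so this shows only the pairwise statistic itself.

```python
from collections import Counter

def linear_weighted_kappa(a, b, n_labels=5):
    """Linearly weighted Cohen's kappa between two annotators' Likert ratings.

    Ratings are integers in 1..n_labels. Disagreements are penalized in
    proportion to their distance on the scale (the normalization constant
    in the weights cancels between numerator and denominator).
    """
    assert len(a) == len(b)
    n = len(a)
    # Observed weighted disagreement.
    observed = sum(abs(x - y) for x, y in zip(a, b)) / n
    # Expected weighted disagreement from the marginal rating distributions.
    pa, pb = Counter(a), Counter(b)
    expected = sum(
        abs(i - j) * (pa[i] / n) * (pb[j] / n)
        for i in range(1, n_labels + 1)
        for j in range(1, n_labels + 1)
    )
    return 1.0 - observed / expected

ratings_1 = [5, 4, 4, 2, 1, 3, 5, 2]
ratings_2 = [4, 4, 5, 2, 1, 3, 4, 1]
print(round(linear_weighted_kappa(ratings_1, ratings_2), 3))
```

Annotators whose kappa against their peers falls below the 0.1 threshold would then be dropped before averaging the judgments.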

Models
We perform the same hyperparameter sweep after applying entropy filtering (Csáky et al., 2019) to the data, but we find that the best model overall is still DD-BEST. Without hyperparameter tuning, entropy filtering improves performance by ∼0.5 multi-reference METEOR, but the improvement from hyperparameter sweeping is much larger (5.1 points). See Appendix A.1 for more hyperparameter details; we report the hyperparameters of these models and their performance on the full set of automatic metrics in § A.7. We did a very thorough sweep (including values we expected to perform poorly), which led to some general takeaways: (1) Using a subword vocabulary (of 4-8k) is helpful. (2) Label smoothing interacts with subword vocabulary size, but is also helpful; notably, popular toolkits for dialog (e.g., Hugging Face (Wolf et al., 2020) and ParlAI (Miller et al., 2017)) do not implement label smoothing.
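For concreteness, here is a minimal sketch of standard label-smoothed negative log-likelihood for a single token position. The function name and toy distribution are ours; real toolkits apply the same idea over batched logits.

```python
import math

def label_smoothed_nll(logprobs, target, epsilon=0.1):
    """Label-smoothed NLL for one token position.

    `logprobs` are the model's log-probabilities over the vocabulary.
    Probability mass `epsilon` is spread uniformly over all classes instead
    of putting the full weight on the target, which discourages the
    over-confident predictions that generic responses thrive on.
    """
    vocab = len(logprobs)
    smooth = epsilon / vocab
    loss = 0.0
    for i, lp in enumerate(logprobs):
        weight = (1.0 - epsilon) + smooth if i == target else smooth
        loss -= weight * lp
    return loss

logprobs = [math.log(p) for p in (0.7, 0.2, 0.1)]
print(label_smoothed_nll(logprobs, target=0, epsilon=0.0))  # plain NLL
print(label_smoothed_nll(logprobs, target=0, epsilon=0.1))  # smoothed, larger
```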
Relative Utterance Quantity

RUQ Plots
We show plots for the four models in Figure 1. We plot the token-normalized model score for the reference and 'I don't know.' For additional comparison, we also plot the model scores for the beam-search output and 'I don't know what to do.'

We note that Csáky et al. (2019), who proposed entropy filtering and observed a 1 BLEU point improvement from using it (we observed a 0.3 improvement in single-reference BLEU), did not use any subword units; they used a total vocabulary size of 16k. Our 10 best systems all had SentencePiece vocabulary sizes of 2k, 4k, or 8k, so this difference may explain the discrepancy between their results and our replication. For the metrics on which we believe our evaluations are comparable (single-reference Embedding Average Cosine Similarity and single-reference Vector Extrema Cosine Similarity), our baseline outperforms their results. The BLEU scores are not directly comparable because they report sentence BLEU, while we report corpus BLEU, following Gupta et al. (2019).
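The per-position averages behind such plots can be computed as below. This is an illustrative sketch with toy log-probabilities, not the paper's plotting code; the function name is ours.

```python
def per_position_average(score_lists):
    """Average model score at each token position across a set of sentences.

    `score_lists` holds one list of token log-probs per sentence; sentences
    of different lengths contribute only to the positions they reach, so the
    result traces the score curve across the sentence.
    """
    max_len = max(len(s) for s in score_lists)
    averages = []
    for pos in range(max_len):
        scores = [s[pos] for s in score_lists if pos < len(s)]
        averages.append(sum(scores) / len(scores))
    return averages

# Toy data: references get less probable as the sentence goes on,
# while the generic IDK response stays uniformly likely.
reference_scores = [[-0.9, -1.4, -2.0], [-1.1, -1.6]]
idk_scores = [[-0.5, -0.4, -0.6], [-0.5, -0.3, -0.4]]
print(per_position_average(reference_scores))
print(per_position_average(idk_scores))
```

Plotting one such curve per response type (reference, beam output, and each IDK variant) yields the visual gap the RUQ plots are built on.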
Overall, we observe that for the BASE models the IDKs have higher probability than the reference, even on the training data. This is problematic: the model ranks a response that does not provide enough QUANTITY of information higher than the reference, despite the fact that it should 'know' the training data. The relative difference in probabilities is much better in DD-BEST than in DD-BASE, particularly on the training set. Entropy filtering the data alone does not fix the problem.

RUQ scores
We summarize QUANTITY in a single statistic by counting how many times the reference has a higher probability than 'I don't know.' on the training data. Entropy filtering improves how often the reference is preferred to 'I don't know.', but not by as much as the hyperparameter sweep does; see Table 3 for the RUQ scores on the training data. (RUQ scores on the test data are reported in § A.7; the overall trend is the same, but the absolute values are lower.) For both DD-BASE and EF-BASE, IDK is preferred over the reference response the model was trained on over half of the time (71.5% for DD, 62.1% for EF).

Table 4 shows human judgments of fluency, coherence, and interestingness. The models trained on DailyDialog have higher fluency and coherence, while the models trained on the filtered data have higher interestingness. For both kinds of data, hyperparameter tuning (as selected by METEOR) improved interestingness. Fluency did not change. Coherence was reduced for the filtered models and improved for the base model. Improved RUQ may be reflected in either interestingness or coherence, but other factors can influence those judgments; therefore, measuring RUQ directly is important for measuring progress on the IDK problem.

Discussion
The relative RUQ rankings of the four systems we consider in this work are the same as the relative rankings by multi-reference METEOR, and DD-BEST (the single best model according to multi-reference METEOR) is also the one with the highest RUQ score. Among all models in the hyperparameter sweep, RUQ is correlated with METEOR with a Spearman's ρ of 0.9, but this drops to 0.6 when considering only the top 20 systems, demonstrating that RUQ and METEOR do not capture the same phenomenon. We note that RUQ on the training data does not require a particular (multi-reference) test set, unlike most automatic evaluation metrics; RUQ simply diagnoses how well the model learned the training data compared to a generic response.
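The rank correlation used here can be sketched in pure Python for the no-ties case. The model scores below are invented for illustration, not the paper's actual sweep results.

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two metric lists (assumes no ties)."""
    assert len(x) == len(y)
    n = len(x)

    def ranks(v):
        order = sorted(range(n), key=lambda i: v[i])
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Toy sweep: RUQ and METEOR for five hypothetical models whose rankings
# almost, but not exactly, agree.
ruq = [0.95, 0.60, 0.40, 0.85, 0.20]
meteor = [17.8, 12.9, 13.0, 16.1, 12.7]
print(spearman_rho(ruq, meteor))
```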
The model's relative preference of IDK over the (presumably) better reference response is not only a QUANTITY violation, but is also indicative of a fundamental problem with the models themselves, and should be fixed before decoding time (either by correcting the data, or by correcting the model).
Csáky et al. (2019) argue that the IDK problem is due to the one-to-many/many-to-one nature of dialog training data: if a single response applies to many different prompts, it will become the canonical response. Their entropy filtering method therefore removes one-to-many/many-to-one pairs by removing high-entropy responses. While this data filtering reduces the problem, we found that the baseline model trained on the entropy-filtered data (EF-BASE) still preferred IDK over the reference the majority of the time, suggesting opportunities for future research on the IDK problem.

Related Work
Gricean Maxims in NLP Gricean maxims have previously been discussed in NLP. Bernsen et al. (1996) examine a new set of maxims for human-bot dialogs and relate them to the Gricean maxims. They point out that the two sets do not entirely overlap; however, the maxim of Quantity is preserved, since unambiguous, contributing responses are required in conversations in general. Harabagiu et al. (1996) attempt to explicitly create an evaluation methodology using sets of primitive rules and WordNet. Our approach differs in that RUQ is a diagnostic metric.
Jwalapuram (2017) also draws on Grice's maxims for the human evaluation of dialog responses.

Chatbot evaluation Automatic evaluations for dialog typically measure lexical or semantic similarity between a produced response and a reference, under the assumption that the reference is a good response and that responses similar to it will be good as well. Since there are often multiple valid responses to a prompt, this can be extended to multiple references. In contrast, our work compares a model's score of a reference to its score of a generic response, for directed analysis.
HUSE (Hashimoto et al., 2019) uses the model score combined with human judgments to evaluate diversity and quality, classifying a response as human-or machine-generated. Our work does not require human judgments, and compares the model score of a generic response to the reference response.
Mehri and Eskenazi (2020a) also use scoring from a model. Whereas that work uses an external model, we propose an intrinsic diagnostic for a particular phenomenon. Each serves a different purpose; an advantage of our method is that the analysis does not require an external model, which might not be available in all languages and for all types of text. Unlikelihood training penalizes unwanted generations during training (Welleck et al., 2020). In our work, we propose an intrinsic model diagnostic to analyze the problem.

Mitigating the IDK Problem
MMI Maximum Mutual Information was proposed as a 'Diversity-Promoting Objective Function' for dialog (Li et al., 2016a). MMI-bidi encourages the prompt to be predictable from the response by using a reverse-direction model. We argue this is not promoting diversity broadly speaking, but actually tackling a RELEVANCY problem, since it scores how predictable the prompt is from the response.
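A sketch of the MMI-bidi reranking idea follows. The candidate strings, scores, and weight `lam` are invented for illustration; Li et al. (2016a) apply this to n-best lists scored by trained forward and reverse models.

```python
def mmi_bidi_rerank(candidates, lam=0.5):
    """Rerank n-best responses with MMI-bidi-style scoring.

    Each candidate carries a forward score log p(response | prompt) and a
    reverse-model score log p(prompt | response). A generic response tends
    to score poorly on the reverse term because it predicts the prompt badly.
    """
    def score(candidate):
        _, forward, reverse = candidate
        return forward + lam * reverse

    return sorted(candidates, key=score, reverse=True)

# Toy n-best list: (response, forward log-prob, reverse log-prob).
candidates = [
    ("I don't know.", -0.6, -9.0),             # likely, but uninformative
    ("It leaves at 6 pm sharp.", -1.4, -1.0),  # informative, predicts prompt
]
print(mmi_bidi_rerank(candidates)[0][0])  # the informative response wins
```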
Li et al. demonstrate that MMI improves performance, though recent work found that it does not always do so (Khayrallah and Sedoc, 2020).

Ott et al. (2018) found that copying was overrepresented in the output of RNN NMT models. Using an analysis that inspired our RUQ plots, they compare the score of the beam-search output to that of the copied source. They also consider the probability at each position in the output, and find the model is unlikely to start copying; however, once it starts to copy, continuing to copy has high probability. We find that IDK has a relatively high score from the start, though for some models the gap widens towards the end of the sentence.

Conclusion
We reframe the IDK problem as a violation of the Gricean maxim of QUANTITY, and introduce a new measure, Relative Utterance Quantity (RUQ), which allows researchers to diagnose whether their model violates this particular conversational principle, and to analyze methods that aim to address it.
We aim to encourage further discussion and research drawing on linguistic principles about discourse and pragmatics for the analysis of dialog models.


A.4 Human Evaluation Details
Our requirements for inclusion were over 500 approved HITs, an approval rate over 98%, and location set to the US. Each HIT was paid $0.15, with an overlap of 4 annotators per HIT. A screenshot of the HIT is in Figure 2.

A.5 Head-to-Head Human Evaluation
In addition to the point-wise evaluation, we also test head-to-head pairwise performance on the evaluation set of 480 unique prompt/response pairs, as shown in § A.5. Models trained on the DailyDialog data outperform the filtered models, but there is no clear preference between the BASE and BEST models.