USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. Rather than relying on a ground-truth reference, USR trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.


Introduction
The lack of meaningful automatic evaluation metrics is a significant impediment for open-domain dialog generation research. Standard language generation metrics have been shown to be ineffective for dialog evaluation (Deriu et al., 2019; Liu et al., 2016). Without well-accepted, meaningful automatic metrics, open-domain dialog researchers have come to rely on human evaluation. Due to its time- and cost-intensive nature, human evaluation is typically only used for the final dialog model. As such, during development dialog systems are generally optimized for poorly-correlated automatic metrics (e.g., F-1, BLEU, PPL), which can result in sub-par human evaluation scores (Dinan et al., 2019). To facilitate development of open-domain dialog models with meaningful automatic metrics, this paper presents the UnSupervised and Reference-free (USR) evaluation metric for dialog.
Standard automatic metrics for evaluating dialog generation (e.g., BLEU, F-1, METEOR, ROUGE) have several shortcomings that make them unsuitable for dialog evaluation: (1) The one-to-many nature of dialog (Zhao et al., 2017) makes word-overlap metrics ineffective for scoring valid system output that deviates from the ground-truth response (Liu et al., 2016; Gupta et al., 2019). (2) Human evaluation of dialog typically measures multiple properties (e.g., appropriate, interesting, consistent). Automatic metrics, on the other hand, condense the multi-faceted nature of dialog quality to a single uninterpretable metric.
(3) There are many definitions of what a good dialog is and, as such, it is not feasible to construct a "one size fits all" metric. Depending on the task and the data, the desired qualities of a dialog system may differ (Walker et al., 1997; Deriu et al., 2019).
USR is a reference-free metric that consists of several interpretable sub-metrics which are combined in a configurable manner. Rather than relying on a ground-truth reference response, unsupervised models are trained to measure desired qualities of dialog (e.g., interesting, natural). As such, USR (1) alleviates the one-to-many issue of standard metrics, (2) produces interpretable measures for desirable properties of dialog, and (3) provides a configurable mechanism for combining several sub-metrics into an overall quality score.
To evaluate the performance of USR, human quality annotations were collected for models trained on the Topical-Chat (Gopalakrishnan et al., 2019) and the PersonaChat corpora (Zhang et al., 2018). USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level Spearman: 0.42, system-level Spearman: 1.0) and PersonaChat (turn-level Spearman: 0.48 and system-level Spearman: 1.0). The strong correlation with human judgment across two datasets and a variety of model types shows that USR is a valuable tool for the dialog community. Further, since USR does not require any explicit supervision, it has the potential to generalize to several dialog tasks and datasets.
The contributions of this paper are as follows: (1) a strongly-correlated, unsupervised and reference-free metric is proposed for evaluating open-domain dialog systems, (2) a thorough human quality annotation is carried out and is released 1 to facilitate future benchmarking of dialog evaluation metrics.

Related Work
Standard automatic metrics for language generation correlate poorly with human judgement of dialog (Liu et al., 2016; Lowe et al., 2017; Gupta et al., 2019). For example, the F-1 score can be gamed by outputting the most frequent words, regardless of the context (Dinan et al., 2019).
The poor performance of present metrics is largely due to the one-to-many nature of dialog (Zhao et al., 2017). To avoid comparing to a single reference response, several authors have proposed using multiple reference responses. Multiple reference responses can be obtained with retrieval models (Galley et al., 2015; Sordoni et al., 2015) or through data collection (Gupta et al., 2019). These multi-reference metrics show improved performance, but it is infeasible to thoroughly cover the space of potential responses. As such, this paper addresses the one-to-many issue of dialog by presenting a reference-free metric.

Lowe et al. (2017) train ADEM to produce a quality score conditioned on the dialog context, the reference response and the generated response. Venkatesh et al. (2018) present a framework for evaluation of Alexa prize conversations, which attains moderate correlation with user ratings. Both of these methods are trained on explicit quality annotations. In contrast, USR requires no explicit supervision and will more easily generalize to new datasets and tasks.

Li et al. (2017) propose a reference-free dialog evaluator which is trained to discriminate between human and generated responses. This work is similar to USR in that it evaluates the quality of a response without a reference or quality annotation training data. Using the evaluation model as a reward during reinforcement learning exhibited strong performance. However, correlation with human judgement was not evaluated. Intuitively, it appears insufficient to rely on a discriminator as a meaningful evaluation of dialog, since this assumes that all human responses are perfect and all generated responses are imperfect.

1 http://shikib.com/usr

Human Quality Annotation
To evaluate the correlation of automatic metrics with human judgment, human quality annotation was carried out across two open-domain dialog corpora. Generated responses were obtained from several models described in Section 3.3. For each dialog context, an additional human response was also written. Human annotation was then carried out on sixty dialog contexts, with six responses per context for Topical-Chat (four system outputs, one newly-annotated human output, one original ground-truth response) and five for PersonaChat (one less system output). Each response was given six different scores: Understandable (0-1), Natural (1-3), Maintains Context (1-3), Interesting (1-3), Uses Knowledge (0-1), Overall Quality (1-5). Three annotators labeled each response.
The task instructions were very detailed in order to minimize subjectivity in the quality annotations. For example, individuals may differ in their definition of Interesting (e.g., some individuals find football interesting, others do not). Thus, the instructions contained a clear, albeit somewhat rigid definition, of Interesting. The instructions for Overall Quality annotation, however, were less rigid and therefore those annotations contain some amount of annotator-specific subjectivity.
The data collection and experiments with PersonaChat were carried out to assess the generality of the USR metric. As such, the annotation questions used were specifically tailored to Topical-Chat, but are still suitable for PersonaChat.

Topical-Chat Dataset
The Topical-Chat dataset (Gopalakrishnan et al., 2019) is a large collection of human-human knowledge-grounded open-domain conversations that consists of 11,319 dialogs and 248,014 utterances. Following the same experimental setup as Gopalakrishnan et al. (2019), heuristics are employed to identify the most relevant fact for each response. As such, the task is to produce a response conditioned on both a dialog context and a fact.

PersonaChat Dataset
The PersonaChat dataset (Zhang et al., 2018) is a corpus of human-human persona-conditioned conversations that consists of 10,907 dialogs and 162,064 utterances. Each worker is asked to condition their responses on a persona, which we consider to be analogous to the facts in the Topical-Chat corpus.

Figure 1: On the Topical-Chat corpus, six responses are obtained for each dialog context. Four use the trained Transformer model with different decoding strategies. One is a new human-generated response. One is the original ground-truth. A similar setup was employed for PersonaChat, albeit with different models.

Topical-Chat Models
A Transformer (Vaswani et al., 2017) is trained to produce the response, r, conditioned on dialog context, c, and fact, f . The input to the transformer is the concatenation of c and f , similar to Gopalakrishnan et al. (2019). The transformer consists of 6 layers, a hidden size of 512, randomly-initialized word embeddings of size 300, a dropout rate of 0.1 and it is trained for 50 epochs.
A single Transformer model is trained, which matches the automatic metrics reported by Gopalakrishnan et al. (2019). Different decoding strategies are used to obtain four different outputs from this model. In addition to standard greedy (argmax) decoding, nucleus sampling (Holtzman et al., 2019) is used at three different rates: p = {0.3, 0.5, 0.7}. The outputs from these four decoding strategies are listed with the original ground-truth utterance and a new human-generated response, for a total of six responses for each context, as shown in Figure 1.
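The decoding strategies above differ only in how a token is drawn from the model's output distribution. As a rough sketch (not the authors' implementation), nucleus sampling restricts sampling to the smallest set of tokens whose cumulative probability exceeds p; the `nucleus_sample` helper below is hypothetical:

```python
import numpy as np

def nucleus_sample(probs, p, rng):
    """Draw a token id from the smallest set of tokens whose cumulative
    probability exceeds p (nucleus / top-p sampling)."""
    order = np.argsort(probs)[::-1]                    # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # always keep >= 1 token
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()       # renormalize the nucleus
    return int(rng.choice(kept, p=kept_probs))

rng = np.random.default_rng(0)
vocab_probs = np.array([0.5, 0.3, 0.15, 0.05])
token = nucleus_sample(vocab_probs, 0.3, rng)
# with p = 0.3 only the most probable token survives, so token == 0
```

Lower values of p behave closer to argmax decoding, while higher values admit more of the tail of the distribution.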

PersonaChat Models
Three models were used to generate system outputs: a sequence-to-sequence model (Seq2Seq), an LSTM language model (LM) and a Key-Value Profile Memory Network (KV-MemNN). We use the pre-trained models provided in ParlAI 2 for the ConvAI2 competition (Dinan et al., 2019).
A fourth open-source model was also used to produce output for quality annotation, however it was ultimately excluded from the released dataset and experiments due to possible data leakage.

Annotation
Quality annotation was performed by six dialog researchers. Using a crowdsourcing platform, such as Amazon Mechanical Turk (AMT), would have allowed for more efficient and scalable annotation. However, crowdsourcing was not used because (1) the annotation instructions are lengthy, (2) a preliminary annotation pass was carried out, followed by a group discussion, (3) having many annotations from a few annotators allows examination of annotator-specific subjectivity.
Annotators were provided with a set of instructions (Appendix A). A small preliminary annotation pass was carried out, with each individual annotating 5 dialog contexts (for a total of 30 responses). The inter-annotator agreement was computed for each of the questions. The instructions were refined after the preliminary pass and a discussion meeting (e.g., Maintains Context was changed to be a 3-point rating instead of a 2-point rating). After the instructions were modified, the full annotation pass was carried out.
Each response was rated according to the qualities mentioned at the beginning of this section. Instructions for each of the qualities are summarized below:

• Understandable (0 - 1): Is the response understandable given the previous context?
• Natural (1 -3): Does the response seem to be something that a person would naturally say?
• Maintains Context (1 -3): Does the response serve as a valid continuation of the preceding conversation?
• Uses Knowledge (0 -1): Given the fact that the response is conditioned on, how well does the response use that fact?
• Interesting (1 - 3): Is the response dull, or does it provide interesting and engaging information?

• Overall Quality (1 - 5): What is your overall impression of the quality of this response?

These instructions were written to minimize subjectivity in the annotations, which results in clear, agreed-upon definitions. For Topical-Chat, the full annotation consisted of 60 dialog contexts randomly sampled from the frequent test set, for a total of 360 responses scored on six different qualities. For PersonaChat, 60 dialog contexts were sampled from the ConvAI2 validation set, with a total of 300 responses scored on six different qualities. Each response was labeled by three different annotators. Annotators were randomly assigned to each dialog context.

Analysis
Inter-annotator agreements for the different ratings across both datasets are presented in Table 1. The correlation between each pair of annotations is computed and the average correlation over all the pairs is reported. Correlation is used instead of Cohen's Kappa in order to better account for the ordinal nature of the ratings (i.e., 4 should correlate better with 5 than with 1), and to maintain consistency with the evaluation of the automatic metrics. Most inter-annotator correlations are above 0.4, which indicates moderate to strong agreement. The low agreement for Understandable on PersonaChat is likely a consequence of the simple language in the dataset. Most responses are understandable, except for those requiring background knowledge (e.g., that 'cod' is an acronym for 'Call of Duty'). Since the annotators have differing background knowledge, the few occasions where they fail to understand an utterance will differ, hence the lower agreement. The agreement for Overall Quality is relatively high (0.71 for Topical-Chat and 0.66 for PersonaChat), which suggests that any ambiguity in the specific dialog qualities is mitigated when the annotator is asked for an overall impression.

Table 2 presents the scores for the different systems on each of the six qualities. Across both datasets and all qualities, the new human-generated response strongly outperforms all other response types, even the original ground truth. This may be because the new human-generated response was written with this quality annotation in mind, and as such is optimized for turn-level evaluation. On the other hand, the workers who produced the original ground-truth response were more concerned with the quality of the overall dialog than with the quality of each individual response.
On the Topical-Chat corpus, argmax decoding performs moderately better than the nucleus sampling (Holtzman et al., 2019) methods. This should not be taken as an indication that argmax decoding is the superior method, since the hyperparameters (e.g., temperature) were not tuned for nucleus sampling. It should be noted that the objective was not to train and evaluate the best performing models, but instead to produce responses of varying qualities and obtain accurate human judgements of these responses.
A regression was trained to map from the five ratings to the overall score in order to analyze the relationship between them. For better interpretability of the regression weights, the scores were normalized (using z-score) before training the regression, and a softmax was computed over the resulting weights. Since individuals may differ in their definition of a good response, a specific regression is trained for each of the five annotators who labeled responses for the Topical-Chat corpus. Figure 2 displays the weights attributed to each of the five qualities by each of the annotators.
Annotators attributed different weights to the specific features. For example, A3 emphasized naturalness while A2 paid more attention to whether a response was grounded on knowledge. Despite the differences across annotators, a good response was generally expected to be natural, maintain context, and be interesting. These annotator-specific weights demonstrate that individuals define good dialog differently. Future work could explore personalized dialog evaluation, wherein the evaluation metric is tailored to a specific individual.

A potential criticism of this quality annotation could be that certain dialog qualities are missing. To address concerns about the completeness of the set of five qualities, a regression can be trained to produce the overall score conditioned on the quality ratings. The Spearman correlation between the predicted score and the original overall score is 0.9654, which signifies that the set of qualities is thorough and contains enough information to reflect the overall quality of the response.
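The per-annotator regression described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; `annotator_weights` is a hypothetical helper that fits a least-squares regression on z-scored ratings and then softmaxes the coefficients for interpretability:

```python
import numpy as np

def annotator_weights(ratings, overall):
    """Fit a least-squares regression from z-scored quality ratings
    (n_responses x n_qualities) to overall scores, then softmax the
    coefficients to obtain interpretable relative weights."""
    X = np.asarray(ratings, dtype=float)
    y = np.asarray(overall, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)            # z-score each quality
    coef, *_ = np.linalg.lstsq(Xz, y - y.mean(), rcond=None)
    exp = np.exp(coef - coef.max())                      # softmax over weights
    return exp / exp.sum()

# toy data: the overall score is driven mostly by the first quality
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2 * X[:, 0] + X[:, 1]
w = annotator_weights(X, y)
# w sums to 1, and the first quality receives the largest weight
```

The softmax step only aids visualization (as in Figure 2); the raw regression is what maps quality ratings to an overall score.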

Automatic Metrics
This section describes the automatic metrics explored for evaluating generated responses. Section 4.1 describes several existing metrics that were studied. Section 4.2 presents USR, a novel unsupervised and reference-free metric.

Baseline Metrics
Several existing and easily-applicable metrics for dialog evaluation are compared. The list of available metrics is not exhaustive; only the most commonly used and most accessible are addressed.
F-1 score computes the word-overlap between the generated response and the ground-truth by taking the harmonic mean of the precision and recall. It is one of the four metrics used by the creators of the Topical-Chat dataset (Gopalakrishnan et al., 2019), along with perplexity and unique unigram/bigram counts. Dinan et al. (2019) described a simple adversarial example that attains a high F-1 score on PersonaChat. We produce a similar example for the Topical-Chat dataset and find that always outputting a concatenation of the ten most common tokens in the dataset (". i the , that a to it is of") attains an F-1 score of 25.6, a +3.6 improvement over the Transformer presented by Gopalakrishnan et al. (2019).
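As a concrete sketch of the word-overlap computation (the exact tokenization used for the reported numbers may differ), a minimal F-1 implementation:

```python
from collections import Counter

def f1_score(generated, reference):
    """Word-level F-1: harmonic mean of the precision and recall of the
    token overlap between a generated and a reference response."""
    gen, ref = generated.split(), reference.split()
    overlap = sum((Counter(gen) & Counter(ref)).values())  # clipped overlap
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# a bland response can still overlap heavily with the reference:
# 5 of 7 tokens match, so F-1 = 5/7
score = f1_score("i think that is a good movie", "i think it is a great movie")
```

Because the metric counts only token overlap, stuffing the response with frequent tokens inflates it, which is exactly the adversarial behaviour noted above.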
BLEU (Papineni et al., 2002) is a well-known word-overlap metric that computes n-gram precision between the generated sequence and the reference. Because precision favors shorter sequences, BLEU also adds a brevity penalty that punishes overly short generations. BLEU has been found to correlate poorly with human judgment (Liu et al., 2016; Lowe et al., 2017; Gupta et al., 2019).
METEOR (Denkowski and Lavie, 2014) was designed as an improvement on BLEU using a harmonic mean of precision and recall, as well as stemming and synonyms.
ROUGE-L (Lin, 2004) identifies the longest common subsequence between the generated and reference sequences to better account for sentence-level structure when computing word overlap.

Greedy Matching (Rus and Lintean, 2012) is an embedding-based metric that greedily matches each word in the generated sequence to a reference word based on the cosine similarity of their embeddings. The final score is then an average over all the words in the generated sequence.
Embedding Average (Wieting et al., 2015) computes a sentence embedding for both the generated sequence and the ground-truth response by taking an average of word embeddings. The score is then a cosine similarity of the average embedding for both the generated and reference sequence.
Vector Extrema (Forgues et al., 2014) follows a similar setup to Embedding Average, where the score is the cosine similarity between sentence embeddings. Rather than taking an average over word embeddings, this method identifies the maximum value for each dimension of the word embedding. Taking the maximum is motivated by the idea that common words will be de-emphasized as they will be closer to the origin. Vector Extrema has been shown to perform better on dialog tasks than other metrics (Gupta et al., 2019;Liu et al., 2016).
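A minimal sketch of Vector Extrema, assuming pre-computed word embeddings; following the common formulation, the per-dimension extreme is taken as the value with the largest magnitude, sign preserved:

```python
import numpy as np

def vector_extrema(word_vectors):
    """Collapse a bag of word embeddings into one sentence vector by
    keeping, per dimension, the value with the largest magnitude
    (sign preserved), de-emphasizing common words near the origin."""
    m = np.asarray(word_vectors, dtype=float)
    idx = np.abs(m).argmax(axis=0)            # most extreme word per dimension
    return m[idx, np.arange(m.shape[1])]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 2-d embeddings for a generated and a reference response
gen = vector_extrema([[0.2, -0.9], [0.8, 0.1]])   # -> [0.8, -0.9]
ref = vector_extrema([[0.3, -0.8], [0.7, 0.2]])   # -> [0.7, -0.8]
similarity = cosine(gen, ref)                      # close to 1
```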
Skip-Thought (Kiros et al., 2015) uses a recurrent neural network to produce a sentence-level embedding for the generated and reference sequences. A cosine similarity is then computed between the two embeddings. The implementation provided by Sharma et al. (2017) is used.
BERTScore (Zhang et al., 2019) uses a pretrained BERT (Devlin et al., 2018) model to greedily match each word in a reference response with one word in the generated sequence. By doing so, it computes the recall of the generated sequence. BERTScore was shown to have strong system-level and segment-level correlation with human judgment on several machine translation and captioning tasks. However, although it is a more sophisticated metric, it still compares word similarity between a reference and a generated sequence. While this method may work well for tasks where there is a limited space of outputs for each input (e.g., captioning, translation), it is ineffective at dealing with the one-to-many nature of dialog.

Proposed Metric
This section describes the USR metric, an unsupervised, reference-free evaluation metric for dialog. USR leverages pre-trained language models, specifically RoBERTa (Liu et al., 2019), to measure properties of dialog. USR is designed to be reference-free because there is no one right answer due to the inherent one-to-many nature of dialog (Zhao et al., 2017).
Several sub-metrics were developed for the different qualities of dialog (e.g., Natural, Interesting, Uses Knowledge). While USR measures the overall quality of a response, its sub-metrics assess specific dialog qualities and therefore facilitate better understanding of a model's performance.

Masked Language Modelling Metric
The masked language modelling (MLM) metric uses a fine-tuned RoBERTa (Liu et al., 2019) model to estimate the likelihood of a response. RoBERTa is pre-trained on a massive amount of English data and fine-tuned on the corpus being evaluated (either Topical-Chat or PersonaChat), making it capable of identifying unnatural and incorrect responses. The likelihood estimated by the fine-tuned RoBERTa model is used as an automatic metric for evaluating the understandability and naturalness of responses.
The RoBERTa-base model is used. The language model is fine-tuned on only the dialog, without any of the facts, for a single epoch.
RoBERTa uses both past and future context to predict a probability distribution for a masked word. The input sequence to MLM is a concatenation of a dialog context, c, and a response, r. One word at a time, each word in r is masked and its log-likelihood is computed. Given the masked log-likelihood of the i-th word of r as l_i, the value of the metric is the negated sum of the log-likelihoods over all |r| words, − Σ_i l_i. Figure 3 visualizes this process.
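The mask-one-word-at-a-time loop can be sketched as follows. To stay self-contained, `masked_logprob` is a hypothetical stand-in for a forward pass of the fine-tuned RoBERTa model; it returns the log-likelihood of the true word at the masked position:

```python
import math

def mlm_score(context_tokens, response_tokens, masked_logprob):
    """USR's MLM sub-metric, sketched: mask each response word in turn,
    query the model for the log-likelihood of the true word at that
    position, and return the negated sum of the log-likelihoods."""
    total = 0.0
    for i in range(len(response_tokens)):
        masked = (context_tokens + response_tokens[:i]
                  + ["<mask>"] + response_tokens[i + 1:])
        total += masked_logprob(masked, len(context_tokens) + i)
    return -total

# toy model assigning probability 0.5 to every masked word
toy = lambda tokens, i: math.log(0.5)
score = mlm_score(["hello", "there"], ["general", "kenobi", "!"], toy)
# score == 3 * log(2): three masked words, each with log-likelihood log(0.5)
```

Higher likelihood under the fine-tuned model drives the sum of log-likelihoods up and the metric value down, so the value behaves like a negative log-likelihood of the response given its context.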

Dialog Retrieval Metrics
Recent research has highlighted the complementary nature of dialog retrieval and generation with respect to multi-tasking (Wolf et al., 2019b) and pre-training (Mehri et al., 2019). Because of this complementary nature, using dialog retrieval (DR) for evaluating generative models is an intuitive choice, especially for metrics like Maintains Context and Uses Knowledge.
The fine-tuned RoBERTa model described in Section 4.2.1 is further fine-tuned for the retrieval task. This task is set up in the same manner as the Ubuntu dialog corpus (Lowe et al., 2015). The model is trained given a context x, a response r, and a binary label y indicating whether r is the true response or randomly sampled. The context x may consist of the dialog history and the fact, denoted c, or just the fact, denoted f . Two different versions of the dialog retrieval (DR) metric are trained, with different values of x. The DR metric score is defined to be the probability P (y = 1| x, r) a given DR metric model produces.
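Constructing the unsupervised training examples for the DR metric can be sketched as follows. This is a simplified illustration; the actual negative-sampling details of the Ubuntu-style setup may differ:

```python
import random

def retrieval_pairs(dialogs, rng):
    """Build (x, r, y) examples for the DR metric: the true next
    response is labeled y=1, and a response drawn from a different
    dialog is labeled y=0. No human quality annotations are needed."""
    examples = []
    for context, response in dialogs:
        examples.append((context, response, 1))
        # sample a negative response from some other dialog
        _, negative = rng.choice([d for d in dialogs if d[1] != response])
        examples.append((context, negative, 0))
    return examples

dialogs = [("hi there", "hello !"),
           ("how are you ?", "i am fine ."),
           ("what do you do ?", "i teach math .")]
pairs = retrieval_pairs(dialogs, random.Random(0))
# six examples: one positive and one sampled negative per dialog
```

A classifier trained on such pairs yields the score P(y = 1 | x, r) used by the DR sub-metrics.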
Though the DR metric is trained for the task of retrieval, this is done in an unsupervised manner. The retrieval task is an unsupervised task since it requires no additional labels during training (e.g., explicit quality annotations).
The DR metric is appropriate for Maintains Context, Interesting and Uses Knowledge. If a retrieval model predicts that a generated response is contextually relevant to a dialog context, it indicates that the response Maintains Context. Likewise, if a retrieval model predicts that the response r is contextually relevant to fact f , it signifies that r most likely Uses Knowledge.
Interesting is the measure of whether the response is dull/generic or whether it provides some interesting/engaging information. The DR metric is trained to distinguish between a ground-truth response (y = 1) and a randomly sampled response (y = 0). Generic responses are applicable to many contexts, and will often appear as both ground-truth responses and randomly sampled responses. As such, the model will likely learn to be uncertain about these generic responses and will often output P (y = 1| r, x) = 0.5. Consequently, generic responses will generally be scored lower than other contextually relevant, interesting responses. The DR metrics will learn to favor responses that are unique to a given context x, rather than being applicable to many different contexts.

The USR Metric
Given meaningful automatic metrics for each of the five dialog qualities, USR combines the scores into an overall measure that correlates well with Overall Quality ratings.
In Section 3.5, a regression model was trained to reproduce the overall score from each of the specific quality scores. The predictions of this regression model attained a 0.9654 Spearman correlation with the original scores. This same regression is used by USR on top of the automatic metrics presented in Sections 4.2.1 and 4.2.2.
USR combines its sub-metrics into one measure of overall quality. This combination is configurable and adaptable to different datasets or tasks. For example, if a specific application prefers natural responses over interesting ones, the weights of the regression model can be adjusted. Analysis demonstrated that individuals used different weights when producing the overall score (Figure 2). USR could potentially be personalized for specific individuals by adjusting the weights of the regression model.
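The configurable combination can be sketched as a simple weighted sum; the weights below are illustrative placeholders, not the fitted regression coefficients from the paper:

```python
def usr_score(sub_scores, weights):
    """Combine per-quality sub-metric scores into a single USR value
    with a configurable, regression-style linear weighting."""
    assert sub_scores.keys() == weights.keys()
    return sum(weights[q] * sub_scores[q] for q in sub_scores)

# illustrative weights only -- re-tunable per task, dataset, or annotator
weights = {"understandable": 0.1, "natural": 0.3, "maintains_context": 0.3,
           "interesting": 0.2, "uses_knowledge": 0.1}
subs = {"understandable": 1.0, "natural": 0.8, "maintains_context": 0.9,
        "interesting": 0.4, "uses_knowledge": 0.7}
overall = usr_score(subs, weights)   # 0.76 under these toy numbers
```

Swapping in a different weight dictionary is all that is needed to re-target the metric, e.g., to favor Natural over Interesting for a given application.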

Results
This section evaluates all of the automatic metrics described in Section 4 by comparing them to human judgement. The best sub-metrics for each dialog quality are used as input for the regression model of the USR metric. While the best performing sub-metrics are not consistent across the two datasets, the USR metric nonetheless exhibits strong results. The annotations for the original ground-truth are not used for evaluation, in order to accurately compare referenced and reference-free metrics.

Table 3 shows turn-level correlations of the best automatic metrics for each dialog quality on Topical-Chat. USR is shown to strongly outperform both word-overlap and embedding-based metrics across all of the dialog qualities. Interestingly, the best non-USR metric is consistently either METEOR or BERTScore, possibly because both methods are adept at comparing synonyms during evaluation. For some dialog qualities, the overall USR metric outperforms the best sub-metric. For example, USR does better for Maintains Context than USR-DR. This is likely because the information from the other sub-metrics (e.g., Uses Knowledge) is valuable and effectively leveraged by USR.

Table 4 reports the turn-level correlations of the best automatic metrics for each dialog quality on the PersonaChat corpus. Across all dialog qualities, USR strongly outperforms the word-overlap and embedding-based metrics. Conversations in PersonaChat generally consist of individuals communicating facts from their own persona in a relevant and coherent manner. As such, when models trained on PersonaChat produce subpar outputs, it is generally because the outputs either (1) do not effectively use the persona or (2) are not relevant/coherent to the dialog context. This explains why the correlations are significantly higher for Maintains Context and Uses Knowledge.
As a consequence of PersonaChat's strong dependency on both the dialog context and the persona, USR-DR (x = c), which uses both the dialog context and the persona to perform dialog retrieval, generally outperforms all other metrics. This demonstrates (1) the ability of MLM and DR to accurately quantify qualities of a generated response without a reference response, and (2) the ability of USR to effectively combine MLM and DR into a better-correlated overall metric.

USR shows a similar improvement over all other metrics on PersonaChat, as shown in Table 6. However, DR (x = c) outperforms USR despite the fact that four out of the five sub-metrics input into the USR regression are DR (x = c). This result is probably due to PersonaChat's strong dependency on both dialog context and persona, both of which DR (x = c) explicitly leverages.

We compute the system-level correlation between all automatic metrics and the Overall Quality ratings. USR significantly (p < 0.01) outperforms all other metrics with a Spearman correlation of 1.0 on both datasets and Pearson correlations of 0.92 (Topical-Chat) and 0.82 (PersonaChat). The full set of system-level correlations can be found in the appendix.
These results demonstrate USR's effectiveness: it strongly outperforms other metrics on both turn-level and system-level correlations. Gopalakrishnan et al. (2019) use the F-1 score as their primary automatic evaluation metric when presenting Topical-Chat. The results demonstrate a significant difference between USR and the F-1 score, suggesting that USR is a better metric for the Topical-Chat corpus.

Discussion
USR achieves statistically significant correlations with human judgement. The results hold across two datasets, Topical-Chat (Gopalakrishnan et al., 2019) and PersonaChat (Zhang et al., 2018).
USR is configurable. Notably it is composed of several specific dialog quality sub-metrics. These sub-metrics are combined in a configurable manner, using a regression. For other tasks, datasets or even users, this regression can be adjusted, allowing qualities to be removed or re-weighted. Additional sub-metrics could be added.
USR should be used alongside human evaluation. USR was created to facilitate development and tuning of dialog models. As such, USR can be used for model selection and hyperparameter tuning. USR should not be used to claim superior performance over another method.
USR may not work with non-generative models, which were not addressed here. Responses produced by a model that is too similar to the evaluation models (e.g., to DR) are a particular concern.
Conclusions

This paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. To address the shortcomings of standard metrics for language generation, USR (1) is reference-free, (2) is composed of multiple sub-metrics that evaluate specific qualities of dialog, and (3) has a configurable definition of good dialog. Thus the metric may be adapted to different tasks and datasets. USR is shown to strongly correlate with human judgment on Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0).

Acknowledgements
We thank the following individuals for their help with annotation: Evgeniia

Table 3 in the main paper showed turn-level correlations for each specific quality. Due to space limitations, the table included only the best correlated metrics. The full results are shown in Tables 9 - 21.

C Code and Data Release
The code for the metrics can be found at https:// github.com/shikib/usr and the human quality annotations can be found at http://shikib.com/ usr. The human quality annotations will allow benchmarking of additional metrics.
Annotation Instructions

You will be given a conversation between two individuals. You will then be given several potential responses for the next turn in the conversation. These responses all concern an interesting fact, which will be provided as well. Your task is to rate each of the responses on several metrics. The response for one metric should not influence the other metrics. For example, if a response is not understandable or has grammatical errors, you should try to ignore this when considering whether it maintains context or if it is interesting. Please make sure you read and understand these instructions carefully. Feel free to ask if you require clarification. Please keep this document open while reviewing, and refer to it as needed.

Understandable (0-1)

Is the response understandable in the context of the history? (Not whether it is on topic, but, for example, if it uses pronouns they should make sense.)

• A score of 0 (no) means that the response is difficult to understand. You do not know what the person is trying to say.
-Response: i did n't know that . i love to watch the movie inception , it 's also the first racing movie to be a woman haha .
-Response: i guess the movie was originally titled " inception " awesome movie !
-Context: in my religion , there is no star . how about you Response: yeah it was back in 1975 .
• A score of 1 (yes) means that the response is understandable. You know what the person is trying to say.
-Response: my favorite role would have to be quarterback . it is such an interesting role .
-Response: that is true . i think lebron is the highest paid celebrity , i wonder if he will be in the space jam sequel .

Natural (1-3)
Is the response naturally written?
• A score of 1 (bad) means that the response is unnatural.
-Context: A: wow . do you believe in stars of the zodiac ? what is your star ? B: in my religion , there is no star . how about you Response: yeah , it was back in 1975 .
-Response: i think he is , he is a great teacher and he also taught ellie kemper , she is a great teacher
• A score of 2 (ok) means the response is strange, but not entirely unnatural.
-Context: A: wow . do you believe in stars of the zodiac ? what is your star ? B: in my religion , there is no star . how about you Response: i read it sometimes for the fun of it .
• A score of 3 (good) means that the response is natural.
-Response: i think it 's funny that the soviet union sent a spacecraft to venus .

Context (1-3)
Does the response serve as a valid continuation of the conversation history?
• A score of 1 (no) means that the response drastically changes topic or ignores the conversation history.
-Context: A: wow . do you believe in stars of the zodiac ? what is your star ? B: in my religion , there is no star . how about you Response: i think it 's funny that the soviet union sent a spacecraft to venus .
• A score of 2 (somewhat) means the response refers to the conversation history in a limited capacity (e.g., in a generic way) and shifts the conversation topic.
-Context: i do like some drama stuff , yeah he was awesome in that . Response: yeah . do you like jon hamm ?
-Context: i believe that ! he would have played longer i 'm sure if he did the granny style approach to shooting freethrows ! Response: i agree . did you know that space jam is the highest grossing basketball movie of all time ?
• A score of 3 (yes) means the response is on topic and strongly acknowledges the conversation history.
-Context: B: wow , that 's great . especially because more than of nba players go broke 5 years after retirement . A: i believe that ! he would have played longer i 'm sure if he did the granny style approach to shooting freethrows ! Response: a lot of players can make money by starring in movies . did you know space jam is the highest grossing movie of all time ? maybe one of the broke retired players can be in the sequel !
-Context: B: you like drama ? patrick stewart teaches classes now . i loved him in star trek A: i do like some drama stuff , yeah he was awesome in that . Response: jonn hamm was also a drama teacher . he taught erin from the office

Uses Knowledge (0-1)
Given the interesting fact that the response is conditioned on, how well does the response use the fact?
• A score of 0 (no) means the response does not mention or refer to the fact at all.
• A score of 1 (yes) means the response uses the fact well.

Overall Quality (1-5)
Given your answers above, what is your overall impression of this utterance?
• A score of 1 (very bad) means this is a completely invalid response. It would be difficult to recover the conversation after this.
• A score of 2 (bad) means this is a valid response, but otherwise poor in quality.
• A score of 3 (neutral) means this response is neither good nor bad. This response has no negative qualities, but no positive ones either.
• A score of 4 (good) means this is a good response, but falls short of being perfect because of a key flaw.
• A score of 5 (very good) means this response is good and does not have any strong flaws.
Interesting (1-3)
Is the response dull or interesting?
• A score of 1 (dull) means that the response is generic and dull.
-Response: thats true . i agree .
• A score of 2 (somewhat interesting) means the response is somewhat interesting and could engage you in the conversation (e.g., an opinion, thought).
-Response: my favorite role would have to be quarterback . it is such an interesting role .
-Response: i love tom brady . i love tom brady .
• A score of 3 (interesting) means the response is very interesting or presents an interesting fact.
-Response: i agree . did you know that space jam is the highest grossing basketball movie of all time ?
-Response: a lot of players can make money by starring in movies . did you know space jam is the highest grossing movie of all time ? maybe one of the broke retired players can be in the sequel !
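The rubric above fixes a score range for each quality (Understandable 0-1, Natural 1-3, Maintains Context 1-3, Uses Knowledge 0-1, Interesting 1-3, Overall Quality 1-5). As a small sketch of how collected annotations might be sanity-checked (the field names here are illustrative, not taken from the released data):

```python
# Valid (min, max) score ranges per annotation quality, following the rubric.
RUBRIC = {
    "understandable": (0, 1),
    "natural": (1, 3),
    "maintains_context": (1, 3),
    "uses_knowledge": (0, 1),
    "interesting": (1, 3),
    "overall": (1, 5),
}


def validate(annotation):
    """Raise ValueError if any rubric quality is missing or out of range."""
    for quality, (lo, hi) in RUBRIC.items():
        score = annotation.get(quality)
        if score is None:
            raise ValueError(f"missing score for {quality!r}")
        if not (lo <= score <= hi):
            raise ValueError(f"{quality!r} score {score} outside [{lo}, {hi}]")
    return True
```

A check of this kind catches transcription errors (e.g., a 4 entered for a 1-3 quality) before annotations are used to compute correlations.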