Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Weixin Liang, James Zou, Zhou Yu


Abstract
Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-scale self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance across different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), which automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn better dialog feature representations, and then use KNN and Shapley values to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
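To make the data-cleaning step described in the abstract more concrete, the sketch below applies a standard closed-form KNN Shapley recursion to value each training dialog against a small trusted validation set, so that the lowest-valued (likely noisy) self-reported ratings can be dropped before retraining. This is only a minimal illustration under stated assumptions: the function name, the use of Euclidean distance over precomputed dialog embeddings, and the 10% drop threshold are hypothetical choices, not the authors' exact CMADE implementation.

```python
# Hypothetical sketch: KNN-Shapley-style valuation of training dialogs.
# Names, distances, and thresholds are illustrative assumptions, not CMADE's code.
import numpy as np

def knn_shapley(train_emb, train_y, val_emb, val_y, k=10):
    """Closed-form KNN Shapley value of each training dialog w.r.t. a trusted validation set."""
    n = len(train_y)
    values = np.zeros(n)
    for x_val, y_val in zip(val_emb, val_y):
        # Rank training dialogs by Euclidean distance to the validation dialog embedding.
        order = np.argsort(np.linalg.norm(train_emb - x_val, axis=1))
        match = (train_y[order] == y_val).astype(float)
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n                    # farthest training point
        for i in range(n - 2, -1, -1):                 # recurse from far to near
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
        values[order] += s / len(val_y)                # map ranks back to original indices
    return values

# Usage (embeddings assumed to come from the self-supervised dialog encoder):
# scores = knn_shapley(emb_train, ratings_train, emb_val, ratings_val, k=10)
# keep = scores > np.quantile(scores, 0.1)             # drop the bottom 10% (illustrative)
```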
Anthology ID:
2020.acl-main.126
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1363–1374
URL:
https://aclanthology.org/2020.acl-main.126
DOI:
10.18653/v1/2020.acl-main.126
Cite (ACL):
Weixin Liang, James Zou, and Zhou Yu. 2020. Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1363–1374, Online. Association for Computational Linguistics.
Cite (Informal):
Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation (Liang et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.126.pdf
Video:
http://slideslive.com/38928690
Code:
Weixin-Liang/dialog_evaluation_CMADE