Learning the Human Judgment for the Automatic Evaluation of Chatbot

Shih-Hung Wu, Sheng-Lun Chien


Abstract
Evaluating the quality of text generated by a generative dialogue system is hard. Currently, dialogue evaluation relies on human judges to label the quality of the generated text, which is not a reusable mechanism that gives system developers consistent evaluations. We believe it is easier to obtain consistent results when comparing the dialogues generated by two systems than when assigning a consistent quality score to a single system at a time. In this paper, we propose a machine learning approach that reduces the effort of human evaluation by learning human judgments from comparisons of two dialogue systems. Trained on human labels, the evaluation model learns which generative model is better in each dialogue context, so system developers can compare fine-tuned models repeatedly without human labor. In our experiments, the agreement between the learned model and the human judges is 70%. The experiments compare two attention-based GRU-RNN generative models.
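
The learned judge described in the abstract can be read as a pairwise preference classifier: given a dialogue context and the responses produced by two systems, it predicts which response a human would prefer. The following is a minimal sketch in PyTorch; the GRU encoder, feature combination, dimensions, and training loop are illustrative assumptions, not the authors' exact model.

# A minimal sketch of a learned pairwise judge (assumed architecture,
# not the authors' exact model).
import torch
import torch.nn as nn

class PairwiseJudge(nn.Module):
    """Predicts which of two candidate responses better fits a context."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Scores the (context, response A, response B) triple; a logit > 0
        # means response A is judged better than response B.
        self.scorer = nn.Linear(3 * hidden_dim, 1)

    def encode(self, token_ids):
        _, h = self.encoder(self.embed(token_ids))  # h: (1, batch, hidden)
        return h.squeeze(0)

    def forward(self, context, resp_a, resp_b):
        feats = torch.cat(
            [self.encode(context), self.encode(resp_a), self.encode(resp_b)],
            dim=-1,
        )
        return self.scorer(feats).squeeze(-1)  # preference logit

# One training step on human pairwise labels (1 = A preferred, 0 = B preferred),
# using toy random token ids in place of real dialogue data:
model = PairwiseJudge(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

context = torch.randint(1, 10000, (4, 20))
resp_a = torch.randint(1, 10000, (4, 12))
resp_b = torch.randint(1, 10000, (4, 12))
labels = torch.tensor([1.0, 0.0, 1.0, 1.0])  # human judgments

loss = loss_fn(model(context, resp_a, resp_b), labels)
loss.backward()
optimizer.step()

Once such a judge is trained, comparing two fine-tuned generators reduces to running their responses through the model over a shared set of contexts, which is what makes the evaluation reusable.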
Anthology ID:
2020.lrec-1.198
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1598–1602
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.198
Cite (ACL):
Shih-Hung Wu and Sheng-Lun Chien. 2020. Learning the Human Judgment for the Automatic Evaluation of Chatbot. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1598–1602, Marseille, France. European Language Resources Association.
Cite (Informal):
Learning the Human Judgment for the Automatic Evaluation of Chatbot (Wu & Chien, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.198.pdf