History for Visual Dialog: Do we really need it?

Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, Verena Rieser


Abstract
Visual Dialogue involves “understanding” the dialogue history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to generate the correct response. In this paper, we show that co-attention models which explicitly encode dialogue history outperform models that don’t, achieving state-of-the-art performance (72% NDCG on the val set). However, we also expose shortcomings of the crowdsourcing dataset collection procedure, by showing that dialogue history is indeed only required for a small amount of the data, and that the current evaluation metric encourages generic replies. To that end, we propose a challenging subset (VisDialConv) of the VisDial val set and provide a benchmark NDCG of 63%.
Anthology ID:
2020.acl-main.728
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8182–8197
URL:
https://aclanthology.org/2020.acl-main.728
DOI:
10.18653/v1/2020.acl-main.728
Cite (ACL):
Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, and Verena Rieser. 2020. History for Visual Dialog: Do we really need it? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8182–8197, Online. Association for Computational Linguistics.
Cite (Informal):
History for Visual Dialog: Do we really need it? (Agarwal et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.728.pdf
Video:
http://slideslive.com/38928892
Code
shubhamagarwal92/visdial_conv (+ additional community code)
Data
VisDial, VisPro