VD-BERT: A Unified Vision and Dialog Transformer with BERT

Yue Wang, Shafiq Joty, Michael Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi


Abstract
Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need of pretraining on external vision-language data, our model yields new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard. Our code and pretrained models are released at https://github.com/salesforce/VD-BERT.
Anthology ID:
2020.emnlp-main.269
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3325–3338
Language:
URL:
https://aclanthology.org/2020.emnlp-main.269
DOI:
10.18653/v1/2020.emnlp-main.269
Bibkey:
Cite (ACL):
Yue Wang, Shafiq Joty, Michael Lyu, Irwin King, Caiming Xiong, and Steven C.H. Hoi. 2020. VD-BERT: A Unified Vision and Dialog Transformer with BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3325–3338, Online. Association for Computational Linguistics.
Cite (Informal):
VD-BERT: A Unified Vision and Dialog Transformer with BERT (Wang et al., EMNLP 2020)
Copy Citation:
PDF:
https://aclanthology.org/2020.emnlp-main.269.pdf
Video:
 https://slideslive.com/38938690
Code
 salesforce/VD-BERT
Data
Conceptual CaptionsVisDialVisual Question Answering