A Visual Attention Grounding Neural Model for Multimodal Machine Translation

Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, Zhou Yu


Abstract
We introduce a novel multimodal machine translation model that utilizes parallel visual and textual information. Our model jointly optimizes the learning of a shared visual-language embedding and a translator. The model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Our approach achieves competitive state-of-the-art results on the Multi30K and the Ambiguous COCO datasets. We also collected a new multilingual multimodal product description dataset to simulate a real-world international online shopping scenario. On this dataset, our visual attention grounding model outperforms other methods by a large margin.
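The abstract describes a two-part objective: a sequence-to-sequence translator trained jointly with a shared visual-language embedding that ties image semantics to the source sentence. The snippet below is a minimal PyTorch sketch of that joint objective under stated assumptions (mean-pooled encoder states, a max-margin ranking loss over in-batch negatives, illustrative layer sizes); it is not the authors' implementation, and their visual attention grounding mechanism and decoder-side attention are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAGSketch(nn.Module):
    """Illustrative sketch: a seq2seq translator plus a shared
    visual-language embedding, trained together. Sizes, pooling, and
    the margin loss are assumptions, not the paper's configuration."""

    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512,
                 img_feat=2048, joint=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb, padding_idx=0)
        self.encoder = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.bridge = nn.Linear(2 * hid, hid)        # init decoder state
        self.tgt_emb = nn.Embedding(tgt_vocab, emb, padding_idx=0)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)
        # projections into the shared visual-language space
        self.txt_proj = nn.Linear(2 * hid, joint)
        self.img_proj = nn.Linear(img_feat, joint)

    def forward(self, src, tgt_in, img):
        enc, _ = self.encoder(self.src_emb(src))          # (B, S, 2*hid)
        sent = enc.mean(dim=1)                            # pooled sentence vector
        h0 = torch.tanh(self.bridge(sent)).unsqueeze(0)   # (1, B, hid)
        dec, _ = self.decoder(self.tgt_emb(tgt_in), h0)   # teacher forcing
        logits = self.out(dec)                            # (B, T, tgt_vocab)
        txt = F.normalize(self.txt_proj(sent), dim=-1)    # shared embedding space
        vis = F.normalize(self.img_proj(img), dim=-1)
        return logits, txt, vis


def joint_loss(logits, tgt_out, txt, vis, margin=0.1, lam=1.0):
    """Translation cross-entropy plus a max-margin ranking term that pulls
    each sentence toward its own image and away from other images in the batch."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         tgt_out.reshape(-1), ignore_index=0)
    sim = txt @ vis.t()                                   # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)
    rank = F.relu(margin + sim - pos)                     # hinge against in-batch negatives
    rank = rank - torch.diag(rank.diag())                 # zero out the positive pairs
    return ce + lam * rank.mean()
```

The key design point the abstract emphasizes is that the two losses are optimized jointly, so the encoder representation used for translation is also the one grounded against the image.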
Anthology ID:
D18-1400
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
3643–3653
URL:
https://aclanthology.org/D18-1400
DOI:
10.18653/v1/D18-1400
Cite (ACL):
Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, and Zhou Yu. 2018. A Visual Attention Grounding Neural Model for Multimodal Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3643–3653, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
A Visual Attention Grounding Neural Model for Multimodal Machine Translation (Zhou et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-1400.pdf
Attachment:
 D18-1400.Attachment.zip
Video:
 https://aclanthology.org/D18-1400.mp4
Data
MS COCO, Multi30K