Image Position Prediction in Multimodal Documents

Masayasu Muraoka, Ryosuke Kohita, Etsuko Ishii


Abstract
Conventional multimodal tasks, such as caption generation and visual question answering, allow machines to understand an image by describing it or answering questions about it in natural language, typically a single sentence. Datasets for these tasks therefore consist of a large number of image–sentence pairs. However, a real multimodal document, such as a news article or a Wikipedia page, contains multiple sentences and multiple images, and interpreting it requires jointly reasoning over all of them rather than over a single sentence–image pair. Aiming to build a system that can understand such multimodal documents, we propose a task called image position prediction (IPP), in which a system learns plausible positions of images within a given document. To study this task, we automatically constructed a dataset of 66K multimodal documents containing 320K images from Wikipedia articles. We then conducted a preliminary experiment to evaluate a current multimodal system on our task. The results show that the system outperforms simple baselines but still falls far short of human performance, posing new challenges for multimodal research.
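To make the task concrete, below is a minimal Python sketch of one plausible formulation of IPP, assuming the document is represented as per-sentence embeddings and each candidate position is scored against an image embedding. The function and similarity used here are hypothetical illustrations, not the paper's actual model.

```python
# Hypothetical sketch of an IPP-style prediction step (not the paper's method):
# given embeddings of a document's sentences and of one image, score every
# candidate position and return the most plausible one.
from typing import Callable, List, Sequence


def predict_image_position(
    sentence_vecs: Sequence[Sequence[float]],  # one embedding per sentence
    image_vec: Sequence[float],                # embedding of the image
    score: Callable[[Sequence[float], Sequence[float]], float],
) -> int:
    """Return the index of the sentence the image most plausibly accompanies."""
    best_pos, best_score = 0, float("-inf")
    for pos, sent_vec in enumerate(sentence_vecs):
        s = score(sent_vec, image_vec)
        if s > best_score:
            best_pos, best_score = pos, s
    return best_pos


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Toy similarity: dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy usage with hand-made 3-dimensional "embeddings".
sentences: List[List[float]] = [[0.1, 0.0, 0.2], [0.9, 0.8, 0.1], [0.0, 0.3, 0.7]]
image = [0.8, 0.9, 0.0]
print(predict_image_position(sentences, image, dot))  # -> 1
```

In practice the embeddings would come from a multimodal encoder and the scoring function would be learned, but the decision structure (choose the best-scoring position in the document) stays the same.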
Anthology ID:
2020.lrec-1.526
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4265–4274
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.526
Cite (ACL):
Masayasu Muraoka, Ryosuke Kohita, and Etsuko Ishii. 2020. Image Position Prediction in Multimodal Documents. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4265–4274, Marseille, France. European Language Resources Association.
Cite (Informal):
Image Position Prediction in Multimodal Documents (Muraoka et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.526.pdf