Varying image description tasks: spoken versus written descriptions

Emiel van Miltenburg, Ruud Koolen, Emiel Krahmer


Abstract
Automatic image description systems are commonly trained and evaluated on written image descriptions. At the same time, these systems are often used to provide spoken descriptions (e.g. for visually impaired users) through apps like TapTapSee or Seeing AI. This is not a problem, as long as spoken and written descriptions are very similar. However, linguistic research suggests that spoken language often differs from written language. These differences are not regular, and vary from context to context. Therefore, this paper investigates whether there are differences between written and spoken image descriptions, even if they are elicited through similar tasks. We compare descriptions produced in two languages (English and Dutch), and in both languages observe substantial differences between spoken and written descriptions. Future research should see if users prefer the spoken over the written style and, if so, aim to emulate spoken descriptions.
Anthology ID:
W18-3910
Volume:
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
88–100
URL:
https://aclanthology.org/W18-3910
Cite (ACL):
Emiel van Miltenburg, Ruud Koolen, and Emiel Krahmer. 2018. Varying image description tasks: spoken versus written descriptions. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 88–100, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
Varying image description tasks: spoken versus written descriptions (van Miltenburg et al., VarDial 2018)
PDF:
https://aclanthology.org/W18-3910.pdf
Code
cltl/Spoken-versus-Written
Data
Flickr30k, MS COCO, Places205