The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description

Image captioning models are typically trained on data that is collected from people who are asked to describe an image, without being given any further task context. As we argue here, this context independence is likely to cause problems when transferring to settings in which image description is bound by task demands. We demonstrate that careful design of data collection is required to obtain image descriptions which are contextually bounded to a particular meta-level task. As a task, we use MeetUp!, a text-based communication game where two players have the goal of finding each other in a visual environment. To reach this goal, the players need to describe images representing their current location. We analyse a dataset from this domain and show that the image descriptions found in MeetUp! are diverse, dynamic and rich with phenomena that are not present in descriptions obtained through a simple image captioning task, which we ran for comparison.


Introduction
Automatic description generation from real-world images has emerged as a key task in vision & language in recent years (Devlin et al., 2015; Vinyals et al., 2015; Bernardi et al., 2016), and datasets like Flickr8k (Hodosh et al., 2013), Flickr30k (Young et al., 2014) or Microsoft COCO (Lin et al., 2014; Chen et al., 2015) are typically considered to be general benchmarks for visual and linguistic image understanding. By exploiting these sizeable data collections and recent advances in computer vision (e.g. ConvNets, attention mechanisms), image description models have achieved impressive performance, at least for in-domain training and testing on existing benchmarks.
Nevertheless, the actual linguistic definition and foundation of image description as a task remains unclear and is a matter of ongoing debate; see van Miltenburg et al. (2017) for a conceptual discussion of the task from a cross-lingual perspective. According to Bernardi et al. (2016), image description generation involves producing a textual description (typically a sentence) that verbalizes the most salient aspects of the image. In practice, however, researchers have observed that eliciting descriptions from naive subjects (i.e. mostly crowd-workers) at a consistent level of quality is a non-trivial task (Rashtchian et al., 2010), as workers seem to interpret the task in different ways. Previous work has therefore developed relatively elaborate instructions and quality-checking conventions in order to be able to collect image descriptions systematically.
In this paper, we argue that these problems result from the fact that the task is typically put to the workers without any further context. This entirely monological setting essentially suggests that determining the salient aspects of an image (such as highly important objects, object properties, scene properties) can be solved in a general, "neutral" way, by humans and systems alike. We present ongoing work on collecting image descriptions in task-oriented dialogue, where descriptions are generated collaboratively by two players. Importantly, in our setting (which we call the MeetUp! environment), image descriptions serve the purpose of solving a higher-level task (meeting in a room, which in the game translates to determining whether the image that one player sees is the same as the one the partner sees). Hence, our participants need not be instructed explicitly to produce image descriptions. In this collaborative setting, we observe that the notion of saliency is not static throughout a dialogue. Depending on the history of the interaction and the current state, speakers seem to flexibly adjust their descriptions (ranging from short scene descriptions to specific object descriptions) to achieve their common goal. Moreover, the descriptions are more factual than those collected in a monological setting. We believe that this opens up new perspectives for image captioning models, which can be trained on data that is bounded to its contextual use.

Related Work
As described above, the fact that the seemingly simple task of image captioning can be interpreted differently by crowd-workers was already recognised in the original publications describing the datasets (Hodosh et al., 2013; Young et al., 2014; Chen et al., 2015). However, it has been treated as a problem that can be addressed through the design of instructions (e.g., "do not give people names", "do not describe unimportant details"; Chen et al., 2015). van Miltenburg et al. (2016) and van Miltenburg (2017) later investigated the range of pragmatic phenomena to be found in such caption corpora, concluding that the instructions do not sufficiently control for them and leave it to the labellers to make their own decisions. One contribution of the present paper is to show that providing a task context results in more constrained descriptions. Schlangen et al. (2016) similarly noted that referring expressions in a corpus collected in a (pseudo-)interactive setting (Kazemzadeh et al., 2014), where the describers received immediate feedback about whether their expression was understood, were more concise than those collected in a monological setting (Mao et al., 2016).
Similar to MeetUp, various dialogue game set-ups have lately been established for dialogue data collection. Das et al. (2017) designed the "Visual Dialog" task, in which a human asks an agent about the content of an image. De Vries et al. (2017) similarly collected the GuessWhat? corpus of dialogues in which one player has to ask polar questions in order to identify the correct referent in an image. de Vries et al. (2018) also developed a navigation task in which a "tourist" has to reach a target location by communicating with a "guide", given 2D images of various map locations. While similar in some respects, MeetUp is distinguished by being a symmetrical task (no instruction giver/follower) and by being broader in terms of language data (containing more phenomena, such as repairs and strategy negotiation).

Data collection

MeetUp image descriptions
The MeetUp game is a two-player text-based communication game set in a visual 2D environment. The game starts with two players being placed in different 'rooms'. Rooms are represented to the players through images. Each player only sees their own location. The objective of the game is to find each other; that is, to be in the same room. To solve this task, players can communicate via text messages and move freely (but unnoticed by the other player) to adjacent rooms. In the course of the game, the players naturally produce descriptions of what they currently see (and, interestingly, sometimes of what they have previously seen) to determine whether they have reached their goal or not. When they think that they have indeed achieved their goal, they indicate this via a particular command, and the dialogue ends.
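To make the mechanics concrete, the following is a minimal sketch of such an environment's state (our own Python illustration; the class and method names are not taken from the actual MeetUp! implementation): rooms form a graph, each player privately occupies one room, moves are restricted to adjacent rooms, and the game ends once both players have signalled that they believe they have met.

from dataclasses import dataclass, field

@dataclass
class MeetUpState:
    # Illustrative state for a two-player MeetUp-style game (names are ours).
    adjacency: dict    # room_id -> list of adjacent room_ids
    room_types: dict   # room_id -> room type label, e.g. "kitchen"
    positions: dict    # player id ("A"/"B") -> current room_id
    done: set = field(default_factory=set)  # players who declared the meet-up

    def move(self, player, target):
        # Move to an adjacent room; the partner is not notified.
        if target in self.adjacency[self.positions[player]]:
            self.positions[player] = target
            return True
        return False

    def declare_done(self, player):
        # A player signals "we are in the same room"; the game ends
        # once both players have signalled.
        self.done.add(player)
        return self.done == {"A", "B"}

    def success(self):
        # Whether the players actually ended up in the same room.
        return self.positions["A"] == self.positions["B"]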
The corpus we use here consists of 25 MeetUp (MU) games, collected via crowd-sourcing on Amazon Mechanical Turk. Workers were required to be native speakers of English. All dialogues end with a matching phase, in which the players try to establish, by exchanging descriptions, whether they are in the same room, and come to the conclusion that they are (correctly, in fact, in all but one dialogue). In some games, the players had already suspected earlier that they were in the same room and went through such a "matching phase", but concluded that they were not.
The complexity of the game board is likely to have an influence on the shape of the dialogue. For this data collection, we handcrafted a set of game boards to contain a certain degree of room-type redundancy (e.g., more than one bedroom per game board) and varying levels of overall complexity, as indicated in Table 2.
For our investigations here, we take these "matching phase" sub-dialogues and the images that they are about (note that for the non-matching situations, there are two images for one sub-dialogue), to give us a set of 33 images together with corresponding utterances. We will call these utterances dialogical image descriptions (DDs), in contrast to the monological image descriptions (MDs) described in the next subsection.

Figure 1: Four monological descriptions of the same image: 1. Modern kitchen with grey marble accents featuring The popular stainless steel appliances. 2. Modern kitchen with stainless steel appliances well decorated 3. This kitchen looks very beautiful I can eat off the floors that's how clean it looks. 4. A very clean looking kitchen, black and silver are the color theme. Looks like it is in an expensive place.
An example of such a description is shown in Table 1. From left to right, the columns represent the line number in the dialogue, the timestamp of a message, messages private to player A, messages seen by both players, and private messages of player B. Lines 60-72 of the transcript contain the part of the dialogue where the players act on the suspicion that they might be in the same location and start describing the images presented to them individually. In earlier stages of the dialogue (lines 31 and 59), this room had already been referred to. This indicates that the players keep a memory of what has already been mentioned and can refer back to it.

Monological image descriptions
In order to compare dialogical descriptions with data produced in a typical context-free captioning setting, we also collected MDs on Amazon Mechanical Turk (AMT). We presented workers with the 33 images and instructed them to produce captions for them. We adopted the instructions from the MS COCO collection (Chen et al., 2015), which ask workers to "describe the important parts" of the image and, importantly, to provide at least eight (8) words per image description. We collected four captions per image, and thus 132 captions overall. An example of four monological image descriptions for one image is shown in Figure 1.
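As a sketch of how such a collection can be checked against these constraints (a hypothetical helper of our own, not part of the AMT pipeline we used), one can verify the minimum length of eight words and the target of four captions per image:

from collections import defaultdict

MIN_WORDS = 8
CAPTIONS_PER_IMAGE = 4

def check_captions(submissions):
    # submissions: iterable of (image_id, caption) pairs from the crowd task
    per_image = defaultdict(list)
    too_short = []
    for image_id, caption in submissions:
        if len(caption.split()) < MIN_WORDS:
            too_short.append((image_id, caption))  # reject and re-post
        else:
            per_image[image_id].append(caption)
    incomplete = [i for i, caps in per_image.items()
                  if len(caps) < CAPTIONS_PER_IMAGE]
    return per_image, too_short, incomplete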

Analysis
An important task is to determine what types of referring expressions are present in the datasets. In order to identify and analyse referring expressions, we used the brat annotation tool (Stenetorp et al., 2012) to tokenise and annotate both DDs and MDs. The first author of the paper annotated whether an utterance contains a description of the scene with objects (a kitchen with wood floors) or of objects only (a white marble dining table), expresses players' actions (moving north now), or is related to players' beliefs about their current state (I think we are in the same room). For our analysis, we define referring expressions (REXs) as nominal phrases that refer to objects in the scene (four chairs) or to the scene itself (a kitchen). Additionally, we identified parts of speech in both DDs and MDs using the Stanford log-linear part-of-speech tagger (Toutanova et al., 2003). Examples of REXs according to this definition are displayed in bold in Figure 1 for MDs and in Table 1 for DDs.

Table 3: Analysis of image descriptions.

Table 3 gives some basic statistics about the two data sets. The goal is to look at the task dependence of image descriptions. Each MU dialogue can be divided into phases, two of which are exemplified in Table 1. The roaming phase (parts of which are lines 31 and 59) is typically filled with movements and with players informing each other about their location. The matching phase (lines 60-72) ends the dialogue with the determination that the two players are in the same place. In order to demonstrate the dynamics of interactions in MeetUp, we look at all DDs as well as at their statistical characteristics in the two phases. Overall, there were almost twice as many DDs as MDs, with the matching phase alone containing a high number of descriptions. At the same time, MDs tend to be longer than DDs, though both sets have nearly identical type/token ratios.
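The per-set figures reported in Table 3 (number of descriptions, mean length, type/token ratio) can be computed along the following lines; this is a sketch using NLTK purely as a stand-in for the tokenisation pipeline we actually used, and the function name is our own:

import nltk  # assumes the 'punkt' tokeniser models are available

def description_stats(descriptions):
    # descriptions: list of strings (e.g. all MDs, or the DDs of one phase)
    tokenised = [nltk.word_tokenize(d.lower()) for d in descriptions]
    tokens = [t for sent in tokenised for t in sent]
    return {
        "n_descriptions": len(descriptions),
        "mean_length": sum(len(s) for s in tokenised) / len(tokenised),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }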

Referring expressions in MDs and DDs
Looking at the number of REXs in Table 3, players in the MeetUp set-up produced almost four times as many referring expressions overall as the workers who produced the MD set. The majority of these occurred in the matching phases, which indicates that the different subgoals of the phases have an influence. There were also nearly twice as many REXs per individual description in the MeetUp setting as in the monological descriptions. Additionally, given that MeetUp descriptions are generally more condensed than the MDs (8.64 vs. 12.5 words on average), it appears that MDs contain much material that is not directly relevant for reference to the scene or its objects. In particular, we observed that on average 11 words per MD (88%) are not part of a REX and thus not grounded in the image, while there are only about 6 (70%) non-REX words per DD. MeetUp players also produce longer REXs, and this holds across all MeetUp phases. These observations show that the MeetUp descriptions are more focused on the task, less broad, and contain many more referring expressions, which are also longer than those in the non-task-driven set-up. Table 4 displays the most frequent adjectives in both datasets, in the spirit of Baltaretu and Ferreira (2016), who compared type and frequency of adjectives in a similar task design. It clearly shows a trend that seems to be present in the overall data: MDs cover a broader range of object properties or image attributes than the DDs. For example, evaluative adjectives (beautiful, nice) appear very often in MDs, while none are observed in the DDs. The latter seem to concentrate on attributes like colour, size, position, and qualities of objects, while monological captions additionally contain adjectives referring to age, feelings, or the number of objects in the scene. Furthermore, 78 adjectives occur only once among all words in the MDs, while this number is almost half that for the DDs (38). This further supports the idea that the absence of a task leads humans to produce broad image descriptions which are not necessarily grounded in scene objects.
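The two measures discussed above can be sketched as follows, assuming the brat annotations are available as token-index spans (the data layout and function names are our own simplification, and NLTK's tagger here stands in for the Stanford tagger used in the actual analysis):

import nltk
from collections import Counter

def non_rex_words(tokens, rex_spans):
    # tokens: token list of one description; rex_spans: (start, end) token
    # index pairs marking annotated referring expressions
    in_rex = set()
    for start, end in rex_spans:
        in_rex.update(range(start, end))
    outside = len(tokens) - len(in_rex)
    return outside, outside / len(tokens)

def adjective_counts(descriptions):
    # frequency of adjective tokens (JJ* tags) across a set of descriptions
    tokens = [t for d in descriptions for t in nltk.word_tokenize(d.lower())]
    return Counter(w for w, tag in nltk.pos_tag(tokens) if tag.startswith("JJ"))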

Conclusion
The task of collecting appropriate training data for image caption generation systems, and for language & vision research in general, is not a trivial one. We found that in a standard crowdsourcing-based collection procedure, annotators tend to produce interpretative, non-factual descriptions, leading to potentially unsystematic or noisy data. We have presented a task-oriented interactive set-up for data collection in which image descriptions are naturally used by speakers to solve a higher-level task. Our data, collected in a small-scale pilot study, indicates that the dialogical setting consistently yields factual descriptions containing many more relevant referring expressions than monological descriptions. The analysis presented here will be used to further control MeetUp! data collection, in order to avoid data that resembles non-task-driven monological captions.