CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel “crosstalk” evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.


Introduction
Building agents that can interact with humans in natural language while perceiving and taking actions in their environments is one of the fundamental goals in artificial intelligence. To this end, it will be necessary to ground language into perception and action (Harnad, 1990;Barsalou, 1999), * The first two authors contributed equally to this work. † Work performed while the authors were interns at Facebook AI Research. where, e.g., nouns are connected to percepts and verbs relate to actions taken in an environment. Some approaches judge machine understanding of language purely based on the ability to mimic particular human utterances, but this has limitations: there are many ways to express roughly the same meaning, and conveying the correct information is often more important than the particular choice of words. An alternative approach, which has recently gained increased prominence, is to train and evaluate language capabilities in an interactive setting, where the focus is on successfully communicating information that an agent must share in order to achieve its goals.
In this paper, we propose the Collaborative Drawing (CoDraw) task, which combines grounded language understanding and learning effective goal-driven communication into a single, unified testbed. This task involves perception, communication, and actions in a partially observable environment. As shown in Figure 1, our game is grounded in a virtual world constructed from clip art objects (Zitnick et al., 2013;Zitnick and Parikh, 2013). Two players, Teller and Drawer, play the game. The Teller sees an abstract scene consisting of clip art objects in a semantically meaningful configuration, while the Drawer sees a drawing canvas that is initially empty. The goal of the game is to have both players communicate so that the Drawer can reconstruct the image of the Teller, without ever seeing it.
Our task requires effective communication because the two players cannot see each other's scenes. The Teller must describe the scene in sufficient detail for the Drawer to reconstruct it, which will necessitate grounded language. Moreover, the Drawer will need to carry out a series of actions from a rich action space to position, orient, and resize all of the clip art pieces required for the reconstruction. Note how clip art pieces form a representation that is perceived visually by humans but is easy to manipulate in a structured manner, in contrast to lower-level pixel-based image representations. The performance of a pair of agents is judged based on the quality of reconstructed scenes, where high-quality reconstructions result from successful communication.
We collect a CoDraw dataset 1 of ∼10K variable-length dialogs consisting of ∼138K messages with the drawing history at each step of the dialog. We also define a similarity metric for clip art scenes, which allows us to automatically evaluate the effectiveness of agent communication at the end of a dialog and at intermediate states. We evaluate several Drawer and Teller models 2 automatically as well as by pairing them with humans, and show that long-term planning and contextual reasoning are key challenges of the CoDraw task.
As we developed models and protocols for CoDraw, we found it critical to train the Teller and the Drawer separately on disjoint subsets of the training data. Otherwise, the two machine agents may conspire to successfully achieve the goal while communicating using a shared "codebook" that bears little resemblance to natural language. We call this separate-training, joint-evaluation protocol crosstalk, which prevents learning of mutually agreed upon codebooks, while still checking for goal completion at test time. We highlight crosstalk as one of our contributions, and believe it can be generally applicable to other related tasks (Sukhbaatar et al., 2016;Foerster et al., 2016;de Vries et al., 2016;Das et al., 2017b;Lewis et al., 2017).

Related work
Language grounded in environments. Learning language games in a grounded environment has been studied recently (Wang et al., 2016(Wang et al., , 2017. While language in these works is tied to actions that modify the environment, the tasks do not involve multiple agents that need to cooperate. Other work on grounded instruction following relies on datasets of pre-generated action sequences annotated with human descriptions, rather than using a single end goal (Long et al., 2016). Generation models for these tasks are only evaluated based on their ability to describe an action sequence that is given to them (Fried et al., 2018a), whereas Teller models for CoDraw also need to select in a goal-driven manner the action sequence to describe to the Drawer. Language grounding has been studied for robot navigation, manipulation, and environment mapping (Tellex et al., 2011;Mei et al., 2015;Daniele et al., 2016). However, these works manually pair each command with robot actions and lack end-to-end training (Tellex et al., 2011), dialog (Mei et al., 2015Daniele et al., 2016), or both (Walter et al., 2014. Compared to work on navigation (Vogel and Jurafsky, 2010; Anderson et al., 2018;Fried et al., 2018b) where an agent must follow instructions to move itself in a static environment, CoDraw involves a structured action space for manipulating clip art pieces to form a semantically meaningful configuration. End-to-end goal-driven dialog.
Traditional goal-driven agents are often based on 'slot filling' (Lemon et al., 2006;Wang and Lemon, 2013;Yu et al., 2015), in which the structure of the dialog is pre-specified but the individual slots are replaced by relevant information. Recently, end-to-end neural models are also proposed for goal-driven dialog (Bordes et al., 2017;Li et al., 2017a,b;He et al., 2017), as well as goal-free dialog or 'chit-chat ' (Shang et al., 2015;Sordoni et al., 2015;Vinyals and Le, 2015;Li et al., 2016;Dodge et al., 2016). Unlike CoDraw, in these approaches, symbols in the dialog are not grounded into visual objects.
Emergent communication. Building on the seminal works by Lewis (1969Lewis ( , 1975, a number of recent works study cooperative games between agents where communication protocols emerge as a consequence of training the agents to accomplish shared goals (Sukhbaatar et al., 2016;Foerster et al., 2016). These methods have typically been applied to learn to communicate small amounts of information, rather than the complete, semantically meaningful scenes used in the CoDraw task. In addition, the learned communication protocols are usually not natural (Kottur et al., 2017) or interpretable, whereas the CoDraw task is designed to develop agents that use human language.
Language and vision. The proposed CoDraw game is related to several well-known language and vision tasks that study grounded language understanding (Karpathy and Fei-Fei, 2015;Donahue et al., 2015;de Vries et al., 2016). For instance, in contrast to image captioning (Vinyals et al., 2017;Xu et al., 2015;Chen and Zitnick, 2015;Lu et al., 2017), visual question answering (Antol et al., 2015;Zhang et al., 2016;Goyal et al., 2016;Gao et al., 2015;Krishna et al., 2017;Malinowski and Fritz, 2014;Ren et al., 2015;Tapaswi et al., 2016;Yu et al., 2015;Zhu et al., 2016) and recent embodied extensions (Das et al., 2018), CoDraw involves multiple rounds of interactions between two agents. Both agents hold their own partially observable states and may need to build a model of their partner's state to collaborate effectively. Compared to past work on generating abstract scenes from single captions (Zitnick et al., 2013), scenes in CoDraw are reconstructed over multiple rounds, and the task requires Teller models to generate coherent and precise descriptions over the course of a full dialog. Compared to visual dialog (Das et al., 2017a,b;Strub et al., 2017;Mostafazadeh et al., 2017) tasks, agents need to additionally cooperate to change the environment with actions (e.g., move pieces around). Thus, the agents have to possess the ability to adapt and hold a dialog about partially-constructed scenes that will occur over the course of their interactions. In addition, we also want to highlight that CoDraw has a well-defined communication goal, which facilitates objective measurement of success and enables end-to-end goal-driven learning.

CoDraw task and dataset
In this section, we first detail our task, then present the CoDraw dataset, and finally propose a scene similarity metric which allows automatic evaluation of the reconstructed and original scene.

Task
Abstract scenes. To enable people to easily draw semantically rich scenes on a canvas, we leverage the Abstract Scenes dataset of Zitnick et al. (2013) and Zitnick and Parikh (2013). This dataset consists of 10,020 semantically consistent scenes created by human annotators. An example scene is shown in the left portion of Figure 1. Most scenes contain 6 objects (min 6, max 17, mean 6.67). These scenes depict children playing in a park, and are made from a library of 58 clip arts, including a boy (Mike) and a girl (Jenny) in one of 7 poses and 5 expressions, and various other objects including trees, toys, hats, animals, food, etc. An abstract scene is created by dragging and dropping multiple clip art objects to any (x, y) position on the canvas. Spatial transformations can be applied to each clip art, including sizes (small, normal, large) and two orientations (facing left or right). The clip art serve simultaneously as a highlevel visual representation and as a mechanism by which rich drawing actions can be carried out. Interface. We built a drag-and-drop interface based on the Visual Dialog chat interface (Das et al., 2017a) (see Figures 5 and 6 in Appendix A for screen shots of the interface). The interface allows real-time interaction between two people. During the conversation, the Teller describes the scene and answers any questions from the Drawer on the chat interface, while Drawer "draws" or reconstructs the scene based on the Teller's descriptions and instructions. Each side is only allowed to send one message at a time, and must wait for a reply before continuing. The maximum length of a single message is capped at 140 characters: this prevents excessively verbose descriptions and gives the Drawer more chances to participate in the dialog by encouraging the Teller to pause more frequently. Both participants were asked to submit the task when they are both confident that Drawer has accurately reconstructed the scene of Teller.  To focus the natural language on the high-level semantics of the scene rather than instructions calling for the execution of low-level clip art manipulation actions, the Teller is not able to observe the Drawer's canvas while communicating.

Dataset
We collect 9,993 3 dialogs where pairs of people complete the CoDraw task, consisting of one dialog per scene in the Abstract Scenes dataset. The dialogs contain of a total of 138K utterances and include snapshots of the intermediate state of the Drawer's canvas after each round of each conversation. See Section 5 for a description of how we split the data into training, validation, and test sets.
Messages. Figure 2a shows the distribution of message lengths for both Drawers and Tellers. The message length distribution for the Drawer is skewed toward 1 with passive replies like "ok", "done", etc. There does exist a heavy tail, which shows that Drawers ask clarifying questions about the scene like "where is trunk of second tree, low or high". On the other hand, Teller utterances have a median length of 16 tokens and a vocabulary size of 4,555. Due to the limited number of clip arts, the vocabulary is smaller than it would be for real images. However, humans still use compositional language to describe clip art configurations and attributes, and make references to previous discourse elements in their messages.
Rounds. Figure 2b shows the distribution of the numbers of conversational rounds for dialog sessions. Most interactions are shorter than 20 rounds; the median number of rounds is 7.
Durations. In Figure 2c we see that the median session duration is 6 minutes. We had placed a 20-minute maximum limit on each session. 3 Excluding 27 empty scenes from the original dataset.

Scene similarity metric
The goal-driven nature of the CoDraw task naturally lends itself to evaluation by comparing the reconstructed scene to the original. For this purpose we define a scene similarity metric, which allows us to automatically evaluate communication effectiveness both at the end of a dialog and at intermediate states. We use the metric to compare how well different machine-machine, human-machine, and human-human pairs can complete the task.
We represent a scene C as a set of clip art objects c ∈ C, each of which consists of an identifier id(c) that denotes its type, and additional features such as size and x, y position. We denote by ids(C) the set of clip art types that occur in the scene. Given two scenes, the intersection-overunion measure computed over clip art types is: (1) where n int (n union ) is the numbers of clip art types in the intersection (union).
To also incorporate features such as size and position, we replace the indicator function in the numerator with a term g(i, C,Ĉ) that measures attribute similarity for shared clip art types. We also introduce a pairwise similarity term h(i, j, C,Ĉ). Overall, scene similarity is defined as: (2) The denominator terms normalize the metric to penalize missing or extra clip art, and we set g and h such that our metric is on a 0-5 scale. The exact terms g and h are described in Appendix B.

Models
We model both the Teller and the Drawer, and evaluate the agents using the metric described in the previous section. Informed by our analysis of the collected dataset (see Section 3.2), we make several modeling assumptions compared to the full generality of the setup that humans were presented with during data collection. These assumptions hold for all models studied in this paper. Assumption 1: Silent Drawer. We choose to omit the Drawer's ability to ask clarification questions: our Drawer models will not generate any messages and our Teller models will not condition on the text of the Drawer replies. This is consistent with typical human replies such as "ok" or "done" (around 62% of human Drawer replies only use a single token) and the fact that the Drawer talking is not strictly required to resolve the information asymmetry inherent in the task. We note that this assumption does not reduce the number of modalities needed to solve the task: there is still language generation on the Teller side, in addition to language understanding, scene perception, and scene generation on the Drawer side. Drawer models that can detect when a clarification is required, and then generate a natural language clarification question is interesting future work. Assumption 2: Full clip art library. The other assumption is that our drawer models can select from the full clip art library. Humans are only given access to a smaller set so that it can easily fit in the user interface (Zitnick and Parikh, 2013), while ensuring that all pieces needed to reconstruct the target scene are available. We choose to adopt the full-library condition as the standard for models because it is a stricter evaluation of whether the models are able to make correct grounding decisions.

Rule-based nearest-neighbor methods
Simple methods can be quite effective even for what appear to be challenging tasks, so we begin by building models based on nearest-neighbors and rule-based approaches. We split the recorded human conversations available for training into a set of conversation rounds R (possibly from different dialogs), where at each round r ∈ R: • Teller sends a message m r • Drawer removes clip art pieces C (−) r • Drawer adds clip art pieces C (+) r • Drawer replies or ends the conversation Rule-based nearest-neighbor Teller. Our first Teller model uses a rule-based dialog policy where the Teller describes exactly one clip art each time it talks. The rule-based system determines which clip art to describe during each round of conversation, following a fixed order that roughly starts with objects in the sky (sun, clouds), followed by objects in the scene (trees, Mike, Jenny), ending with small objects (sunglasses, baseball bat). The message for each object c is then copied from a nearest neighbor in the data: where s is the scene similarity metric from Section 3.3. This baseline approach is based on the assumptions that the Drawer's action was elicited by the Teller utterance immediately prior, and that the Teller's utterance will have a similar meaning when copied verbatim into a new conversation and scene context. Rule-based nearest-neighbor Drawer. This Drawer model is the complement to the rule-based nearest-neighbor Teller. It likewise follows a fixed rule that the response to each Teller utterance should be the addition of a single clip art, and uses a character-level string edit distance d to select which clip art object to add to the canvas:

Neural Drawer
Our second Drawer model is based on the neural network architecture shown in the left portion of Figure 3. At each round of conversation, the Drawer conditions on the Teller's last message, which is encoded into a vector using a bidirectional LSTM. The Drawer also uses as input a vector that represents the current state of the canvas. These vectors are then processed by a dense feed-forward neural network to produce a vector that represents the Drawer's action, which consists of adding a (possibly empty) set of clip art pieces to the drawing. It is trained using a combination of cross-entropy losses (for categorical decisions  Figure 3: A sketch of our model architectures for the neural Drawer and Teller. The Drawer (left) conditions on the current state of the canvas and a BiLSTM encoding of the previous utterance to decide which clip art pieces to add to a scene. The Teller (right) uses an LSTM language model with attention to the scene (in blue) taking place before and after the LSTM. The "thought bubbles" represent intermediate supervision using an auxiliary task of predicting which clip art have not been described yet. In reinforcement learning, the intermediate scenes produced by the drawer are used to calculate rewards. Note that the language used here was constructed for illustrative purposes, and that the messages in our dataset are more detailed and precise.

Teller Drawer
such as which clip art pieces to add and what orientation to use) and L 2 losses that penalizes placing pieces at distant (x, y) coordinates; see Appendix C for details.

Neural Teller: scene2seq
For our neural Teller models, we adopt an architecture that we call scene2seq (right portion of Figure 3). This architecture is a conditional language model over the Teller's side of the conversation with special next-utterance tokens to indicate when the Teller ends its current utterance and waits for a reply from the Drawer. 4 The language model is implemented using an LSTM, where information about the ground-truth scene is incorporated at both the input and output of each LSTM cell through the use of an attention mechanism. Attention occurs over individual clip art pieces: each clip art in the ground-truth scene is represented using a vector that is the sum of learned embeddings for different clip art attributes (e.g. e type=Mike , e size=small , etc.) At test time, the Teller's messages are constructed by decoding from the language model using greedy word selection.
To communicate effectively, the Teller must keep track of which parts of the scene it has and has not described, and also generate language that is likely to accomplish the task objective when interpreted by the Drawer. We found that training the scene2seq model using a maximum likeli-4 Though none of the models in this paper handle language in the Drawer replies, these can be incorporated into the scene2seq framework similar to the approach of Lewis et al. (2017). hood objective did not result in long-term coherent dialogs for novel scenes. Rather than introducing a new architecture to address these deficiencies, we explore reducing them by using alternative training objectives. To better ensure that the model keeps track of which pieces of information it has already communicated, we take advantage of the availability of drawings at each round of the recorded human dialogs and introduce an auxiliary loss based on predicting these drawings. To select language that is more likely to lead to successful task completion, we further fine-tune our Teller models to directly optimize the end-task goal using reinforcement learning.

Intermediate supervision
We incorporate state tracking into the scene2seq architecture through the use of an auxiliary loss. This formulation maintains the end-to-end training procedure and keeps test-time decoding exactly the same; the only change is that during training, at each utterance separator token, the output from the LSTM is used to classify whether each clip art in the ground truth has been drawn already or not. Here we make use of the fact that the CoDraw dataset records human drawer actions at each round of the conversation, not just at the end. The network outputs a score for each clip art type, which is connected to a softmax loss for the clip art in the ground truth scene (the scores for absent clip arts do not contribute to the auxiliary loss). We find that adding such a supervisory signal reduces the Teller's propensity for repeating itself or omitting objects.

Reinforcement learning
The auxiliary loss helps the agent be more coherent throughout the dialog, but it is still an indirect proxy for the end goal of having the Drawer successfully reconstruct the scene. By training the agents using reinforcement learning (RL), it is possible to more directly optimize for the goal of the task. In this work we only train the Teller with RL, because the Teller has challenges maintaining a long-term strategy throughout a long dialog, whereas preliminary results showed that making local decisions is less detrimental for Drawers. The scene2seq Teller architecture remains unchanged, and each action from the agent is to output a word or one of two special tokens: a nextutterance token and a stop token. After each nextutterance token, our neural Drawer model is used to take an action in the scene and the resulting change in scene similarity metric is used as a reward. However, this reward scheme alone has an issue: once all objects in the scene are described, any further messages will not result in a change in the scene and have a reward of zero. As a result, there is no incentive to end the conversation. We address this by applying a penalty of 0.3 to the reward whenever the Drawer makes no changes to the scene. We train our Teller with RE-INFORCE (Williams, 1992), while the parameters of the Drawer are held fixed.

Training protocol and evaluation
To evaluate our models, we pair our models with other models, as well as with a human. Human-machine pairs. We modified the interface used for data collection to have each trained model to play one game with a human per scene in the test set. We then compare the scene reconstruction quality between human-model pairs for various models and with human-human pairs. Script-based Drawer evaluation. In addition to human evaluation, we would like to have automated evaluation protocols that can quickly estimate the quality of different models. Drawer models can be evaluated against a recorded human conversation from a script (a recorded dialog from the dataset) by measuring scene similarity at the end of the dialog. While this setup does not capture the full interactive nature of the task, the Drawer model still receives human descriptions of the scene and should be able to reconstruct it. Our modeling assumptions include not  Figure 4: A rule-based nearest-neighbor Teller and Drawer pair "trained" on the same data outperforms humans for this scene according to the similarity metric, but the language used by the models doesn't always correspond in meaning to the actions taken. The top row shows a scene from the test set and corresponding human/model reconstructions. The bottom row shows the Teller message and Drawer action from two rounds of conversation by the machine agents.
giving Drawer models the ability to ask clarifying questions, which further suggests that script-based evaluation can reasonably measure model quality.
Machine-machine evaluation. To evaluate Teller models in a goal-driven manner, a "script" from the dataset is not sufficient. We instead consider an evaluation where a Teller model and Drawer model are paired, and their joint performance is evaluated using the scene similarity metric.

Crosstalk training protocol
Automatically evaluating agents, especially in the machine-machine paired setting, requires some care because a pair of agents can achieve a perfect score while communicating in a shared code that bears no resemblance to natural language. There are several ways such co-adaptation can develop. One is by overfitting to the training data to the extent that it's used as a codebook -we see this with the rule-based nearest-neighbor agents described in Section 4.1, where a Drawer-Teller pair "trained" on the same data outperforms humans on the CoDraw task. An examination of the language, however, reveals that only limited generalization has taken place (see Figure 4). Another way that agents can co-adapt is if they are trained jointly, for example using reinforcement learning.
To limit these sources of co-adaptation, we propose a training protocol we call "crosstalk." In this setting, the training data is split in half, and the Teller and Drawer are trained separately on disjoint halves of the training data. When joint training of a Teller-Drawer pair is required (as with reinforcement learning), the training process is run separately for both halves of the training data, but evaluation pairs a Teller trained on the first partition with a Drawer trained on the second. This ensures that models can succeed only by communicating in a way that generalizes to new conversation partners, and not via a highly specialized codebook specific to model instances.

Results
Results for our models are shown in Table 1. All numbers are scene similarities, averaged across scenes in the test set. Neural Drawer is the best Drawer model. In the script setting, our neural Drawer is able to outperform the rule-based nearest-neighbor baseline (3.39 vs. 0.94) and close most of the gap between baseline (0.94) and human performance (4.17). Validity of script-based Drawer evaluation. To test the validity of script-based Drawer evaluation -where a Drawer is paired with a Teller that recites the human script from the dataset corresponding to the test scenes -we include results from interactively pairing human Drawers with a Teller that recites the scripted messages. While average scene similarity is lower than when using live human Tellers (3.83 vs. 4.17), the scripts are sufficient to achieve over 91% of the effectiveness of the same Teller utterances when they were communicated live (according to our metric). The drop in similarity may be in part because the Teller can't answer clarifying questions specific to the Drawer's personal understanding of the instructions. Note that a human Drawer with a script-based Teller still outperforms our best Drawer model paired with a script-based Teller. Benefits of intermediate supervision and goaldriven training. Pairing our models with humans shows that the scene2seq Teller model trained with imitation learning is worse than the rule-based nearest-neighbor baseline (2.69 vs. 3.21), but that the addition of an auxiliary loss followed by finetuning with reinforcement learning allow it to outperform the baseline (3.65 vs. 3.21). However, there is still a gap compared to human Tellers (3.65 vs. 4.17). Many participants in our human study noted that they received unclear instructions from the models they were paired with, or expressed frustration that their partners could not answer clarifying questions as a way of resolving such situations. Recall that our Teller models currently ignore any utterances from the Drawer. Correlation between fully-automated and human-machine evaluation. We also report the result of paired evaluation for different Teller models and our best Drawer, showing that the relative rankings of the different Teller types match those we see when models are paired with humans. This shows that automated evaluation while following the crosstalk training protocol is a suitable automated proxy for human-evaluation.

Typical errors
The errors made by Teller reflect two key challenges posed by the CoDraw task: reasoning about the context of the conversation and what has already been drawn so far, and planning ahead to fully and effectively communicate the required information. A common mistake the rule-based nearest-neighbor Teller makes is to reference objects that are not present in the current scene. Figure 4 shows an example (bottom left) where the Teller has copied a message referencing a "swing" that does not exist in the current scene. In a sample of 5 scenes from the test set, the rule-based nearest-neighbor Teller describes a non-existent object 11 times, compared to just 1 time for the scene2seq Teller trained with imitation learning. The scene2seq Teller, on the other hand, frequently describes clip art pieces multiple times or forgets to mention some of them: in the same sample of scenes, it re-describes an object 10 times (vs. 2 for the baseline) and fails to mention 11 objects (vs. 2.) The addition of an auxiliary loss and RL fine-tuning reduces these classes of errors while avoiding frequent descriptions of irrelevant objects (0 references to non-existent objects, 3 instances of re-describing an object, and 4 objects omitted.) On the Drawer side, the most salient class of mistakes made by the neural network model is semantically inconsistent placement of multiple clip art pieces. Several instances of this can be seen in Figure 9 in Appendix D, where the Drawer places a hat in the air instead of on a person's head, or where the drawn clip art pieces overlap in a visually unnatural way.
Qualitative examples of both human and model behavior are provided in Appendix D.

Conclusion
In this paper, we introduce CoDraw: a collaborative task designed to facilitate learning of effective natural language communication in a grounded context. The task combines language, perception, and actions while permitting automated goal-driven evaluation both at the end and as a measure of intermediate progress. We introduce a dataset and models for this task, and propose a crosstalk training + evaluation protocol that is more generally applicable to studying emergent communication. The models we present in this paper show levels of task performance that are still far from what humans can achieve. Long-term planning and contextual reasoning as two key challenges for this task that our models only begin to address. We hope that the grounded, goal-driven communication setting that CoDraw is a testbed for can lead to future progress in building agents that can speak more naturally and better maintain coherency over a long dialog, while being grounded in perception and actions. A Interface and data collection A.1 Interface Figure 5 shows the interface for the Teller, and Figure 6 shows the interface for the Drawer. Following previous works (Zitnick et al., 2013;Zitnick and Parikh, 2013), Drawers are given 20 clip art objects selected randomly from the 58 clip art objects in the library, while ensuring that all objects required to reconstruct the Teller's scene are available.

A.2 Additional interaction: a chance to peek
To make sure that the natural language focused on the high-level semantics of the scene rather than instructions calling for the execution of low-level clip art manipulation actions, we did not allow Teller to continuously observe Drawer's canvas. However, direct visual feedback may be necessary to get the all the details right. We also hypothesize that such feedback would help human participants calibrate themselves when they are new to the task (models do not have this issue because of the rich supervisory signal available in the collected dataset.) To capture this idea, we give one chance for the Teller to look at the Drawer's canvas using a 'peek' button in the interface. Communication is only allowed after the peek window is closed.
Although we consider the ability to peek to be a part of the CoDraw task, we leave for future work the creation of models that can strategically reason about when to use this chance in a way that maximizes task effectiveness. We note that omitting this behavior from the Teller models described in this paper does not decrease the number of modalities needed to complete the task -our models still incorporate language understanding, language generation, perception, and action.

A.3 Participant statistics
We found that approximately 13.6% of human participants disconnected early, prior to fully completing the task with their partner. We paid participants who stayed in the conversation and had posted at least three messages. However, we exclude those incomplete sessions in the dataset, and only use the completed sessions.
There are 616 unique participants represented in our collected data. Among these workers, the 5 most active have done 26.63% of all finished tasks (1,419, 1,358, 1,112, 1,110, and 1,068 tasks).

Instructions for Teller
Your fellow Turker will ask you questions about your secret scene. 1 Your objective is to help the fellow Turker recreate the scene. You typically describe the details of the image and/or answer their questions. 2 You have to help the fellow Turker to draw the image by answering given questions or describe the details of the image.
Fellow Turker connected. Now you can send messages.

Use Chance
Message Send Type Message Here: Finish HIT! Figure 5: User interface for a Teller. The left image is an abstract scene from Zitnick and Parikh (2013). The Teller sends messages using an input box. The Teller has a single chance to peek at the Drawer's canvas to correct mistakes. The Teller can decide when to finish the session.

Instructions for Drawer
Your objective is to create a scene that matches the Teller's secret scene. 1 Feel free to raise questions about the scene, which your fellow Turker will answer. They can see their secret scene.  Figure 6: User interface for a Drawer. The Drawer sees an empty canvas and a randomly generated drawing palette of Mike, Jenny, and 18 other objects, chosen from a library of 58 clip arts. We ensure that using the available objects, the Drawer can fully reproduce the scene. Using the library, the Drawer can draw on the canvas in a drag-and-drop fashion. The Drawer can also send messages using the input box. However, the peek button is disabled: only the Teller can use it.
Across all workers, the maximum, median, and minimum numbers of tasks finished by a worker are 1,419, 3, and 1, respectively.

A.4 Pre-processing
We pre-process all collected Teller and Drawer utterances using the Bing Spell Check API 5 . The text is then tokenized using the Python Natural Language Toolkit, nltk (Bird et al., 2009). We release the token sequences after pre-processing as part of the CoDraw dataset, so that different mod- In (c), conversations that end early are padded to 35 rounds through the addition of empty messages/actions. els may be compared in a standard set of data conditions. At the same time, raw (unprocessed) text is also made available, to allow revisiting the preprocessing decisions should the need arise.

B Scene similarity metric
The clip art library consists of 58 base clip art types (e.g. the sun, a cloud, Mike, Jenny, soccer ball, etc.) Each clip art object c consists of an identifier id(c) that denotes its type, an indicator feature vector f (c) that determines properties such such as size and orientation (e.g. 1 size=small , 1 size=medium , etc. for a total of 41 binary features), and two real-valued features x(c) and y(c) that encode the x and y position on the canvas, normalized to the 0-1 range.
We represent a scene C as a set of individual clip art objects c ∈ C. We denote by ids(C) the set of clip art types that occurs in the scene. Following Zitnick et al. (2013), a given clip art type may occur at most once in the scene; let C[i] be the clip art c ∈ C such that id(c) = i.
Given a ground-truth scene C and a predicted sceneĈ scene similarity s is defined as: and similarity for pairs of clip art pieces is captured by the term We use parameters w = [5, 1, 0.5, 0.5, 1, 1, 1, 1] which provides a balance between the different components and ensures that scene similarities are constrained to be between 0 and 5. Figure 7a shows the distribution of scene similarity scores throughout the dataset. Figure 7b-c shows the progress of scene similarity scores over the rounds of a conversation. An average conversation is done improving the scene similarity after about 5 rounds, but for longer conversations that continue to 23 rounds, there is still room for improvement.

C Neural Drawer architecture
In this section, we describe in greater detail our neural network architecture approach for the Drawer. Contextual reasoning is an important part of the CoDraw task: each message from the Teller can relate back to what the Drawer has previously heard or drawn, and the clip art pieces it places on the canvas must form a semantically coherent scene. To capture these effects, our model should condition on the past history of the conversation and use an action representation that is conducive to generating coherent scenes.
When considering past history, we make the Markovian assumption that the current state of the Drawer's canvas captures all information from the previous rounds of dialog. Thus, the Drawer need only consider the most recent utterance from the Teller and the current canvas to decide what to draw next. We experimented with incorporating additional context -such as previous messages from the Teller or the action sequence by which the Drawer arrived at its current canvas configuration -but did not observe any gains in performance.
We represent the state of the canvas with a vector v canvas that is the concatenation of feature vectors for each of the 58 possible clip art types: The most recent Teller utterance is encoded into a vector v msg using a bi-directional LSTM. A vector representing the Drawer's action is then computed using a feed-forward network with a rectified linear unit (ReLU) nonlinearity: v action = W out relu(W canvas v canvas The action representation v action has the form: v action = [a 0 ; a 1 ; . . . ; a 57 ] . . . The valuesx(C[i]) andŷ(C[i]) are the predicted location for clipart C[i] if it is placed on the canvas, and each quantity q(event) is a logit corresponding to a particular event. The probability of adding a clip art piece to the scene is calculated using the sigmoid function: while all other probabilities are calculated by applying softmax to each set of mutually-exclusive outcomes, e.g.: where S = small, medium, large . At inference time, the Drawer's action is chosen using greedy decoding. A clip art of type i is added to the canvas if p(i ∈ ids(C)) > 0.5, in which case it is placed at location (x(C[i]),ŷ(C[i])) with its orientation, size, and other attributes set to their most probable values (as determined by the vector a i .) The model is trained using a combination of cross-entropy losses (that maximize the probability of the categorical decisions present in the human action) and an L 2 loss that compares the locations where the human placed each clip art piece with the model's estimate. Figure 8 shows some examples of scenes and dialogs from the CoDraw dataset. The behavior of our Drawer and Teller models on a few randomly-selected scenes is illustrated in Figures 9, 10, and 11.  Figure 8: Examples from the Collaborative Drawing (CoDraw) dataset, chosen at random from the test set. The images depict the Drawer's canvas after each round of conversation. From left to right, we show rounds one through four, then the last round, followed by the ground truth scene. The corresponding conversations between the Teller (T) and Drawer (D) are shown below the images. Note that there is no restriction on which of the two participants begins or ends the dialog. Script: Girl has teeth at horizon and arms up with shovel in center hand, pointing away from her.

Script:
Pink pail covers all her center foot but heel, non of her leg is hidden. front of handle is on left side.
Script: Big sun is jammed with 1/4 off right side and 1/3 off the top.
Script: That is it. I will check.

Neural Network
Figure 9: Drawer model behavior where the Teller is simulated by replaying messages associated with three randomly-selected scenes from the test set. The images depict the Drawer's canvas after each round of conversation. For each of the three scenes, the behavior of the Rule-Based Nearest-Neighbor Drawer is shown in the upper row and the behavior of the Neural Network Drawer is shown in the lower row.  Figure 10: A comparison of the descriptions generated by each of our Teller models for two randomly-sampled scenes from the test set.