Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic “zero-shot” scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel “imagination” module based on Regularized Auto-Encoders, that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark, when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.


Introduction
Humans do not learn conceptual representations from language alone, but from a wide range of situational information (Beinborn et al., 2018; Bisk et al., 2020), as highlighted also by property-listing experiments (McRae et al., 2005). When humans experience the concept of "boat", they simulate a new representation by reactivating and aggregating multi-modal representations that reside in their memory and are associated with that concept (e.g., what a boat looks like, the action of sailing, etc.) (Barsalou, 2008). This process is called perceptual simulation. It is therefore no wonder that recent trends in learning conceptual representations adopt multi-modal and holistic approaches (Bruni et al., 2014) wherein abstract distributional lexical representations (Landauer and Dumais, 1997; Laurence and Margolis, 1999) learned from text corpora are augmented or refined with perceptual information, yielding concrete and context-aware representations built from the visual (Kiela et al., 2018; Lazaridou et al., 2015), olfactory, or auditory modalities.
Language games between AI agents, inspired by Wittgenstein's language games among humans (Wittgenstein et al., 1953), are an excellent test bed for such approaches, since concepts are expected to emerge when agents are required to communicate to solve specific tasks in specific environments. GuessWhat?! is a prototypical language game of this kind: a Guesser has to identify a target object in a scene, represented as an image, by asking questions to an Oracle. Learning to ground pixels of the scene into object representations that are relevant for the object category they belong to (category-aware), but are also particularized for the specific scene (context-aware), is fundamental for the Guesser to effectively converse with the Oracle, and vice versa.

Figure 1: Models such as De Vries et al. (2017) and Zhuang et al. (2018) rely on gold category labels at test time, thereby failing to ground novel objects from categories not seen during training (e.g., a "pasticciotto", top right) or to properly encode known categories with unseen visual features (like a "frosted donut", bottom right), since they employ category embeddings c from a predefined set that are fixed for each object. Instead, the embeddings z learned by our imagination module can be flexibly category-aware, allowing them to generalize to unseen categories.

We consider a model truly multi-modal if it always uses all the modalities to make decisions. However, existing approaches (Shekhar et al., 2019) rely instead on gold category labels that are assumed to be available also at inference time, thus making these models depend on this modality and discard the others. This not only poses an unnatural performance advantage for players in controlled benchmark scenarios like the GuessWhat?! game, where categories at inference time match those at training time, but also causes them to fail in more realistic zero-shot scenarios (Suglia et al., 2020) where players are required to generalize to out-of-domain object categories.
For example, consider an agent that during training has only seen glazed donuts, associated with the fixed "donut" category embedding (cf. Figure 1). At inference time, the model cannot ground visual representations for objects belonging to the "pasticciotto" (an Italian pastry) category, since such a category was not in its repertoire. Similarly, it will likely represent frosted donuts with a generic "donut" embedding, despite the perceptual differences among different types of donut.
In this paper, we tackle the above limitations by introducing a novel imagination module based on Regularized Auto-Encoders (Ghosh et al., 2019), which derives imagination embeddings directly from perceptual information in the form of the object crop. Our formulation of the reconstruction loss allows the model to learn context-aware and category-aware imagination embeddings, thus removing the need for gold category labels at inference time and greatly improving zero-shot generalization. In Section 4.2, we integrate our imagination component into the Oracle model of De Vries et al. (2017) and the Guesser model of Shekhar et al. (2019). We show that the new imagination models are state-of-the-art on the recently introduced CompGuessWhat?! benchmark (Suglia et al., 2020), outperforming current models by 8.26%. They also improve the Oracle's and Guesser's accuracy (by 2.08% and 12.86%, respectively) on the standard GuessWhat?! task when no gold category labels are available. Lastly, we show that imagining latent object representations greatly helps to reason about object visual properties (i.e., color, shape, etc.), qualifying our module as a generic perceptual simulation component à la Barsalou (2008).

Background: Guessing Games and Concept Representations
GuessWhat?! is an instance of a multi-word guessing game (Steels, 2015). Every game involves two players: an Oracle and a Guesser conversing about a scene S (a natural image). A scene S can be abstracted into a collection of objects O, each of which is associated with a category c_i ∈ C, i ∈ {1, ..., K}. The aim of the Guesser is to identify a target object o* ∈ O by asking questions about S to the Oracle. The gameplay of GuessWhat?! thus comprises three tasks: i) question generation, where the Guesser inquires about an object in the scene S given the dialogue generated so far; ii) answer prediction, where the Oracle answers a ∈ A = {Yes, No, N/A} given the scene S, the question, and the target object o*; and iii) target prediction, where the Guesser selects the candidate object with the highest relevance score r(o_i).
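The target-prediction step can be sketched as a dot-product scorer between a dialogue-state vector and the candidate object encodings. This is an illustrative sketch only: the function name, dimensions, and random inputs are assumptions, not the paper's implementation.

```python
import numpy as np

def guess(dialogue_state, object_encodings):
    """Pick the candidate o_i with the highest relevance score r(o_i),
    computed here as a dot product with the dialogue-state vector."""
    scores = object_encodings @ dialogue_state   # r(o_i) for each candidate
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
h = rng.normal(size=32)            # dialogue-state vector (assumption: 32-dim)
objs = rng.normal(size=(5, 32))    # 5 candidate object encodings
target = guess(h, objs)            # index of the predicted target object
```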
Several architectural variants have been proposed to tackle GuessWhat?! (cf. Section 5 for related work). In this work we adopt the recent GDSE model (Shekhar et al., 2019), which learns a visually grounded dialogue state used for both question generation and target object prediction. As shown below, GDSE does not deliver the multi-modality we seek; we therefore extend it with our imagination component to obtain more effective multi-modal object representations.
For successful gameplay, both the Guesser and Oracle must build representations of the scene that contain specific perceptual information about objects (object-aware), are relevant for the object category they belong to (category-aware), and are specialized to the scene in which the game is played (context-aware). As the scene S is an image, it is natural to associate each object o_i ∈ O with a perceptual embedding, i.e., a vector v_i ∈ R^{d_O} extracted from the penultimate layer of a pretrained vision model (e.g., ResNet-152 (Shekhar et al., 2019)) based on its bounding box. 1 However, these representations are not sufficient, as they are neither context-aware nor category-aware, i.e., they ignore other objects in the scene and do not leverage category information. GDSE and other recent approaches (Shekhar et al., 2019; Zhuang et al., 2018; Shukla et al., 2019) cope with the second issue by introducing category embeddings, i.e., d_C-dimensional continuous representations c_k ∈ R^{d_C} for k = 1, ..., K. Once learned, a category embedding c is concatenated to an 8-dimensional feature vector s_i derived from the object bounding box (cf. De Vries et al. (2017)). While these embeddings partially solve category-awareness, they are not object-aware. For instance, the embedding for the object category "apple" will be the same regardless of whether a particular object is a red or a green apple, i.e., it is most likely a centroid representation of the objects seen during training. Moreover, if during training we only see red apples, at inference time we will likely fail to detect green apples as belonging to the same category (Figure 2(a)). These issues have gone unnoticed because category embeddings usually boost performance on the original GuessWhat?! task, given that gold category labels are also available at inference time. However, this boost is illusory: models relying on this symbolic information being always available are not learning to exploit all modalities.
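As an illustration, the 8-dimensional spatial vector s_i can be computed from a bounding box as in the sketch below. The [-1, 1] coordinate normalization follows the scheme commonly attributed to De Vries et al. (2017), and the (x, y, w, h) box convention is an assumption.

```python
import numpy as np

def spatial_features(bbox, img_w, img_h):
    """8-dim spatial vector s_i for an object bounding box, with coordinates
    scaled to [-1, 1]. bbox = (x, y, w, h) in pixels (assumed convention)."""
    x, y, w, h = bbox
    x_min = 2.0 * x / img_w - 1.0
    y_min = 2.0 * y / img_h - 1.0
    x_max = 2.0 * (x + w) / img_w - 1.0
    y_max = 2.0 * (y + h) / img_h - 1.0
    x_c = (x_min + x_max) / 2.0          # box center
    y_c = (y_min + y_max) / 2.0
    return np.array([x_min, y_min, x_max, y_max, x_c, y_c,
                     2.0 * w / img_w, 2.0 * h / img_h])
```

The perceptual embedding v_i would then come from the penultimate layer of the vision model applied to the crop, and s_i is concatenated to whichever object embedding the model uses.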
In fact, a 20% drop in the Guesser accuracy if gold category labels are not provided has been reported in Zhuang et al. (2018) for GuessWhat?! and analogous poor results in more realistic benchmarks measuring zero-shot generalization such as CompGuessWhat?! (Suglia et al., 2020).

Imagination Module: Learning Context- and Category-aware Object Representations
To overcome the limitations of GDSE and competitors and realize a form of perceptual simulation in a learning system, we introduce a generic component-named the imagination module-which learns latent concept representations that are both context- and category-aware, without relying on category labels at inference time. Our imagination model can be understood in the context of representation learning via deep generative models (Bengio et al., 2013), which have been popularized by variational autoencoders (VAEs) (Kingma and Welling, 2013) and GANs (Goodfellow et al., 2014). Specifically, we substantially extend the recently introduced regularized autoencoder (RAE) framework (Ghosh et al., 2019). RAEs are simplified VAEs in which stochasticity in the encoder and decoder is dropped in favor of more stable training and more informative embedding learning. In fact, RAEs do not suffer from several issues known to affect VAEs, such as poor convergence and the possibility of learning embeddings that are independent of the input images (cf. Ghosh et al. (2019) for a detailed discussion). More crucially for our purposes, RAEs do not have to compromise the informativeness of the learned embeddings by imposing a fixed a-priori structure on the latent space that enables simple sampling (e.g., an isotropic Gaussian prior). VAEs, which need such a fixed prior, instead tend to learn embeddings that are less informative w.r.t. objects, categories, and context information.
Module architecture. Figure 2(a) summarizes our imagination module. Its aim is to distill a context- and category-aware embedding z_i ∈ R^{d_Z} per object o_i in scene S. To this end, we adopt an encoder E_φ, parameterized by φ, that maps a perceptual embedding v_i of object o_i to its imagined counterpart z_i, and a decoder D_θ, parameterized by θ, that maps z_i back to ṽ_i = D_θ(z_i), also called the reconstruction of the input v_i. As in RAEs, our per-object loss L_IMG comprises a reconstruction loss (L_REC), weighting how good the reconstructions of D_θ are w.r.t. the representations encoded by E_φ, and a regularization term (L_REG) enhancing generalization by smoothing the decoder D_θ. This leads to the following composite loss:

L_IMG = L_REC + α · L_REG,

where α is a hyperparameter controlling regularization. 2 As in L2-RAE (Ghosh et al., 2019), the regularization component is defined as L_REG := ||z_i||^2 + ||θ||^2: the first term bounds the latent embedding space learned by E_φ, easing optimization; the second enforces smoothing over D_θ, improving generalization over regions of the latent space unseen during training. Differently from RAEs, we devise a specific reconstruction loss tailored to learning contextual and category-aware representations. In conventional RAEs, in fact, the reconstruction loss is the Mean Squared Error (MSE) between v_i and its reconstruction ṽ_i:

L_REC = ||v_i − ṽ_i||^2.

This loss is purely unsupervised and as such agnostic to object categories and to the scene context. To our aims, we define a custom imagination reconstruction loss L_REC^IMG as an instance of a max-margin triplet loss (Wang et al., 2014; Schroff et al., 2015), as follows. Let c_i be the category of object o_i with perceptual embedding v_i in scene S, and let O_¬c_i = {o_j | o_j ∈ O ∧ c_j ≠ c_i} be the set of all objects in S belonging to a different category than c_i.
Our per-object L_REC^IMG term is defined as:

L_REC^IMG = max(0, ||v_i − D_θ(z_i)||^2 − ||v_j − D_θ(z_i)||^2 + η),

where η is the minimum margin between the two components: i) the distance between the perceptual embedding v_i and its reconstruction D_θ(z_i), and ii) the distance between the perceptual embedding v_j of a randomly sampled object o_j ∈ O_¬c_i and the reconstruction D_θ(z_i). By doing so, we enforce each object representation to be representative of its category given a specific context, by locally contrasting it to another object of a different category in the same scene. Note that this is strikingly different from previous approaches employing a max-margin loss (Elliott and Kádár, 2017; Kiros et al., 2018), where "negative" objects are arbitrarily sampled from other scenes in the same batch.
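To make the composite objective concrete, here is a minimal NumPy sketch of the per-object loss. The toy single-layer encoder/decoder and all dimensions are illustrative assumptions (the actual model uses 2-layer MLPs, cf. Appendix A.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for E_phi and D_theta (the paper uses 2-layer MLPs).
d_v, d_z = 16, 8
W_enc = rng.normal(scale=0.1, size=(d_z, d_v))
W_dec = rng.normal(scale=0.1, size=(d_v, d_z))

def encode(v):
    return W_enc @ v   # z_i = E_phi(v_i)

def decode(z):
    return W_dec @ z   # reconstruction D_theta(z_i)

def imagination_loss(v_i, v_j, alpha=1e-5, eta=1.0):
    """L_IMG = L_REC^IMG + alpha * L_REG for a single object o_i, where v_j is
    the perceptual embedding of a same-scene object of a different category."""
    z_i = encode(v_i)
    v_rec = decode(z_i)
    d_pos = np.sum((v_i - v_rec) ** 2)       # distance to the anchor v_i
    d_neg = np.sum((v_j - v_rec) ** 2)       # distance to the negative v_j
    l_rec = max(0.0, d_pos - d_neg + eta)    # max-margin triplet term
    l_reg = np.sum(z_i ** 2) + np.sum(W_dec ** 2)  # L2-RAE regularizer
    return l_rec + alpha * l_reg
```

Sampling v_j from the same scene S, rather than from arbitrary scenes in the batch, is precisely what makes the learned embeddings context-aware.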
Imagining at inference time. Differently from the category embeddings c employed by all previous work, our imagination embeddings z do not depend on gold category labels at inference time, while still being context-aware and category-aware. In fact, once the parameters φ have been learned, the encoder E_φ contains all the information needed to distill the embeddings z, independently of L_IMG, which is necessary only at training time. We consider imagination the ability of the model to generate latent representations on the fly. Therefore, for both the Guesser and Oracle models, we consider an object representation for object o_i that replaces c_i with z_i and concatenates it with its spatial information s_i (see Figures 2(b) and 2(c) and Appendix A.1 for details). By doing so, we treat every gameplay situated in a reference scene as an experience in which our imagination module derives a latent conceptual representation simply by "looking" at objects, realizing a perceptual simulator (Barsalou, 2008). We plan to investigate how to combine label-dependent category embeddings c with our imagination embeddings z, similarly to how some VAE variants tackle semi-supervised classification scenarios.
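At inference time, building the object encoding then reduces to one encoder pass and a concatenation; a sketch under the same toy assumptions as above (d_Z = 8 here for brevity; the paper uses 512):

```python
import numpy as np

rng = np.random.default_rng(1)
d_v, d_z = 16, 8
W_enc = rng.normal(scale=0.1, size=(d_z, d_v))  # stands in for the trained E_phi

def object_representation(v_i, s_i):
    """Replace the gold category embedding c_i with the imagined z_i = E_phi(v_i),
    concatenated with the 8-dim spatial vector s_i. No category label needed."""
    z_i = W_enc @ v_i
    return np.concatenate([z_i, s_i])

v = rng.normal(size=d_v)   # crop features from the pretrained vision model
s = rng.normal(size=8)     # spatial features of the bounding box
obj = object_representation(v, s)   # (d_z + 8)-dimensional object encoding
```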

Experimental Investigation
To assess the impact of using the imagination embeddings against the category embeddings, we use two evaluation benchmarks: GuessWhat?! and CompGuessWhat?!. More information about the training procedure can be found in Appendix A.2.

GuessWhat?! Evaluation
In this experiment, we evaluate the accuracy of the Oracle in answering questions and the accuracy of the Guesser in selecting the target object. We consider as both training and evaluation data all the gold dialogues (and questions) that have been labeled as successful in the dataset. We highlight that, in this evaluation phase, the models using label-aware object encodings have gold information both at training and test time. This is true for both the Oracle and Guesser models, but it does not hold for the models using the imagination component.

Guesser task. Similarly, we compare the GDSE model using imagination embeddings (GDSE+IMAGINATION) with the following label-aware baselines: 1) text-only baselines using an LSTM encoder (LSTM) and the Hierarchical Recurrent Encoder-Decoder architecture (Serban et al., 2017) (HRED), as well as their corresponding multi-modal models LSTM+IMAGE and HRED+IMAGE; 2) PARALLELATTENTION (Zhuang et al., 2018) and GDSE (Shekhar et al., 2019). We also compare with variants of the above that do not use any category embeddings or gold category labels (*-NOCAT), as well as models with predicted category labels (*-PREDCAT).

However, by relying on symbolic information in the form of category labels, a model is inevitably no longer truly multi-modal, because the heavy lifting is done by these embeddings. As shown in the results, other multi-modal models, such as QUESTION+SPATIAL+CROP and QUESTION+CROP, are not able to learn representations effective enough to bridge the gap between category-aware and category-free models. On the other hand, the proposed imagination model is able to reduce this gap without relying on gold information as input. Indeed, we are able to learn category-aware and context-aware latent codes by using category information only in our loss function. We investigate this argument further by using a rule-based question classifier (Shekhar et al., 2019) to partition the test questions according to their type.
Table 2 summarizes this analysis; we include models considered truly multi-modal and the best Oracle model, QUESTION+SPATIAL+CATEGORY. The latter can answer with high accuracy questions about specific object instances (e.g., "is it the dog?") or super-categories (e.g., "is it an animal?"), since it uses category embeddings as input. However, when it comes to answering questions about perceptual properties of the target object, it loses some accuracy points, because the perceptual information is missing from the category embedding, which represents a centroid of typical instances seen only at training time. On the other hand, the IMAGINATION model brings improvements of 1.34%, 5.81%, and 2.52% for location, color, and shape questions, respectively. On questions related to perceptual information, models using crop information seem to be on a par with the IMAGINATION model. However, our model obtains an improvement over +CROP of 1.84% on object questions and of 1.11% on super-category questions, solely by relying on the imagination embeddings.
Guesser task. Table 3 compares several category-aware and multi-modal models; PARALLELATTENTION and GDSE-SL are the two best-performing configurations. However, when PARALLELATTENTION does not have access to category information (PARALLELATTENTION-NOCAT), its performance drops by 3.7% (as also noted by Zhuang et al. (2018)). We confirmed the same behavior for GDSE-SL (GDSE-SL-NOCAT), observing an even more significant drop in performance of 16.95%, which puts it in line with the simpler LSTM+IMAGE model. On the other hand, GDSE-SL with our imagination component (GDSE-SL+IMAGINATION) performs comparably with the category-aware model and better than all multi-modal models. We therefore argue that it is possible to learn object representations that, given a representation for the current dialogue state, allow for discriminating the target object among other candidates without relying on symbolic information.

Table 3: Guesser accuracy on successful gold dialogues: we compare GDSE-SL+IMAGINATION with i) models that are truly multi-modal (MM) and ii) models that use category information (CATEGORY).

Table 4: Results on the CompGuessWhat?! benchmark (Suglia et al., 2020). We assess model quality in terms of gameplay accuracy; attribute prediction quality, measured in terms of F1 for the abstract (A-F1), situated (S-F1), abstract+situated (AS-F1), and location (L-F1) prediction scenarios; and zero-shot gameplay. GROLLA is a macro-average of the individual scores.
CompGuessWhat?! is a benchmark proposed to assess the quality of models' representations and their out-of-domain generalization. It includes the following tasks: a) in-domain gameplay accuracy: selecting the target object with model-generated dialogues as input; b) attribute prediction: assessing the ability of the dialogue representation to recover the target object's attributes; and c) zero-shot gameplay accuracy: selecting the target object among objects belonging to categories never seen by the model during training. In contrast to GuessWhat?!, the attribute prediction and zero-shot tasks give us more insight into the quality of the learned representations and the model's generalization ability.

Experimental Setup
We compare imagination-based models with the baselines used in Suglia et al. (2020).

Results
In-domain gameplay. Table 4 presents the results on the CompGuessWhat?! benchmark. Models are tasked to play the game by generating up to 10 questions and corresponding answers. Firstly, we note that the results for GDSE-CL+IMAGINATION, the collaborative version of the model with imagination, are still in the same ballpark as those of more complex models, such as DEVRIES-RL, which uses category embeddings as input. At the same time, we notice that overall both imagination models perform worse than the GDSE-* models. We attribute this drop to the introduction of additional loss terms, which probably changed the dynamics of the cumbersome modulo-n multi-task training (Shekhar et al., 2019). This downside calls for a more principled way of handling tasks of different complexity (i.e., question generation and target prediction) in a multi-task learning system; we leave this for future work.
Attribute prediction. Table 4 reports the attribute prediction results. In this scenario, the dialogue state representation generated by the Guesser model is used to recover several types of attributes associated with the target object. In this work, we use the same dialogue state representation as Shekhar et al. (2019) and only focus on improving the object representations using the imagination component. Indeed, the best imagination model, GDSE-SL+IMAGINATION, is in line with GDSE-SL, currently the best model in terms of attribute prediction. In particular, even though the dialogue state representation is only indirectly affected by the imagination embeddings (via a dot-product operation to score the candidate objects), we still see an improvement in terms of F1 for location attributes (L-F1) and similar performance for situated attribute prediction (S-F1). Both can be considered, to some extent, a result of better-situated object representations.
Zero-shot gameplay. As underlined in Section 3, the imagination module's main strength is its ability to distill imagination embeddings from perceptual information only, without relying on externally provided category labels. The zero-shot gameplay scenario from CompGuessWhat?! (Table 4) sheds some light on the ability of the model to generalize to out-of-distribution examples. In the out-of-domain gameplay scenario, where candidate objects belong to categories never seen before, both imagination-based models GDSE-SL+IMAGINATION and GDSE-CL+IMAGINATION outperform the previous best-performing system DEVRIES-RL by 1.2% and 8.26%, respectively, in terms of OD accuracy (OD-ACC). By analyzing their output, we notice that the best imagination model achieves higher accuracy by learning a better gameplay strategy, involving half the number of location questions generated by DEVRIES-RL (39.68% vs. 75.84%; see Appendix A.3 for more details). A further improvement in the near-domain scenario (ND-ACC) confirms the effectiveness of the imagination component in generating category-aware embeddings for objects on the fly using only perceptual information.
Out-of-domain error analysis. Lastly, we report an error analysis comprising 50 dialogues selected at random from out-of-domain games (for more details refer to Appendix A.3). First, we manually annotated the Oracle answers and partitioned them according to their type using the same question classifier used for the Oracle task (Section 4.1.2). 83% of super-category questions (out of a total of 80) and 63.36% of color-related questions (out of a total of 88) were correctly answered by the model. For instance, as shown in Figure 3, GDSE-CL is not able to correctly answer the question "is it a person?" because it has category information only for the label "person", not for the label "girl". On the other hand, GDSE-CL+IMAGINATION is able to a) categorize the object as a member of the super-category "person", and b) correctly ground the expression "kid on the bike" to the target object. The same behavior can be observed when the "antelope" is the target object. Antelopes are not part of the MSCOCO classes and therefore have not been seen by the model during training. First, the model refers to it as an "animal", hence the Oracle is able to correctly answer the question even though "antelope" was never involved in training. Secondly, we found that the percentage of No answers for GDSE-CL is considerably higher (88.06%) than for GDSE-CL+IMAGINATION (51.02%), validating our hypothesis that the Oracle does not know how to deal with unseen instances. Finally, in the imagination dialogue of the first example, even though the generated questions and answers were probably referring to the correct object, the Guesser model is eventually unable to guess correctly. More work is required to better fuse the language modality and the object representations to improve its performance.

Figure 3: Qualitative examples in the zero-shot gameplay scenario: the categories "girl" and "antelope" are not present in MSCOCO and therefore cannot be encoded by the GDSE-CL model. On the other hand, the imagination model is able to distill imagination embeddings by using the crop features only (for the sake of presentation quality, we remove consecutive repeated questions).

Related Work
Concerning unsupervised learning of concept representations, Bruni et al. (2014) first learn modality-specific representations and then fuse them into a unified representation for each concept. However, they rely on hand-crafted bags of visual features, making the approach laborious to extend to new domains and games. Kiela et al. (2018) cope with this issue by relying on CNN models to extract latent features from images for instances of specific objects. Lazaridou et al. (2015) use a margin loss, but in the context of maximizing the similarity between the visual representation of a noun phrase and its corresponding text representation. Similarly, Collell et al. (2017) learn a mapping between the ResNet features and the word embeddings of a concept. As discussed in Section 2, unlike our imagination embeddings, these purely perceptual representations are neither category-aware nor context-aware. Silberer et al. (2016) present a multi-modal model that uses a denoising auto-encoder framework. Unlike us, they do not use perceptual information as input but rely on an attribute-based representation derived from an additional attribute predictor. However, they do use a reconstruction loss (a cross-entropy loss for attribute prediction) and an auxiliary category loss during training. Their training scheme is more complex, as they first separately train the AE for each modality and then fuse them, which we avoid by adopting a single end-to-end architecture. Ebert and Pavlick (2019) used VAEs to learn grounded representations for lexical concepts. However, as discussed in Section 3, VAEs are not as well suited as RAEs to representation learning for our imagination module. In the context of guessing games, all previous approaches rely on category embeddings (Shekhar et al., 2019; Zhuang et al., 2018; Shukla et al., 2019) (see Section 2). Our imagination component can be flexibly integrated into any of them by replacing the category embeddings with imagination embeddings.

Conclusions
We argued that existing models for learning grounded conceptual representations fail to learn compositional and generalizable multi-modal representations, relying instead on the use of category labels for every object in the scene both at training and inference time. To address this, we introduced a novel "imagination" module based on Regularized Auto-Encoders, which learns a context-aware and category-aware latent embedding for every object directly from its image crop, without using category labels. We showed state-of-the-art performance in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), outperforming current models by 8.26% in gameplay accuracy, while performing comparably on the other tasks to models which use category labels at training time. The imagination-based models also show improvements of 2.08% and 12.86% in Oracle and Guesser accuracy, respectively. Finally, we conducted an extensive error analysis and showed that imagination embeddings help to reason about object visual properties and attributes. For future work, we plan to 1) integrate category labels at training time in a more principled way, following advances in semi-supervised learning; and 2) improve the multi-task learning procedure of Shekhar et al. (2019) to optimize multiple tasks of different complexities at the same time.

A Appendix
A.1 Model details

As described in Section 3 of the main paper, we extend both the Oracle and Guesser model with an imagination component. For both roles, we keep the same model structure for the imagination component.
In this paper we implement E_φ as a 2-layer feed-forward neural network with ReLU (Dahl et al., 2013) activation functions. We acknowledge that many other implementations are possible, and we leave more complex designs for future work. Given the latent code z_i generated by E_φ, we use a decoder D_θ to generate the reconstructed (imagined) perceptual input of the object o_i, D_θ(z_i) = ṽ_i. As is common practice, we define the decoder D_θ as symmetric to the architecture of the encoder E_φ. For the category embedding size d_C, as in Shekhar et al. (2019), we use 256 and 512 for the Oracle and Guesser, respectively. For the imagination component, we run a grid search over the latent code size d_Z ∈ {16, 32, 64, 128, 256, 512}. For both roles, we choose 512, as it was the value that led to the highest accuracy on the validation set. We also experimented with several values for the coefficient α of the regularization term L_REG: {1e-3, 1e-5, 1e-6, 1e-7}. The best value was 1e-7 for the Oracle and 1e-5 for the Guesser. When training the imagination component with the object category loss, due to class imbalance, we apply loss weighting; we compute the class weights using the method reported in King and Zeng (2001). For the margin value η we opted for 1.0, after experimenting with a less effective dynamic margin that changed depending on the distance between the concepts in the WordNet hierarchy.
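For the loss weighting, a simple inverse-frequency scheme conveys the idea; note this is a simplified stand-in for illustration, not the exact method of King and Zeng (2001) that the paper uses.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency class weights for an imbalanced classification loss:
    rare classes are upweighted so each class contributes equally in
    expectation. `labels` is an integer array of gold class indices."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0   # avoid division by zero for unseen classes
    return len(labels) / (n_classes * counts)
```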

A.2 Training details
For both roles, we train the models using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001 for both the Oracle and Guesser. In both cases, we use the original GuessWhat?! validation set to select the best model, which is then used in the evaluation on the test set. As described in Shekhar et al. (2019), we use a modulo-n training procedure to jointly optimize both the Guesser and the Questioner. In our experimental evaluation we run a grid search over n ∈ {3, 5, 7} and selected 5 as the best-performing value on the validation set. For a fair comparison with all the GDSE model variants trained with Supervised Learning and Collaborative Learning, we made the same architectural choices and used the same hyperparameter values; please refer to the original codebase implementation available on GitHub 4. Another point of difference is the Collaborative Learning fine-tuning phase for the Guesser model. During this phase, only the Questioner and Guesser models are fine-tuned whereas the Oracle model is fixed (Shekhar et al., 2019); we therefore decided to use the best-performing Oracle, both so that the Guesser model is not negatively affected by a weaker Oracle and to remain comparable with the original implementation.

A.3 Error analysis
In order to provide a more fine-grained evaluation of the generated dialogues, we adapt the quality evaluation script presented by Suglia et al. (2020) and extend it with additional metrics. First of all, it relies on a rule-based question classifier that assigns a given question to one of eight classes: 1) super-category (e.g., "person", "utensil", etc.), 2) inanimate object (e.g., "car", "oven", etc.), 3) animate object (e.g., "dog", "cat", etc.), 4) color, 5) size, 6) texture, 7) shape, and 8) location. The question classifier is useful to evaluate the dialogue strategy learned by the models. In particular, we look at two types of turn transitions: 1) super-category → object/attr, which measures how many times a super-category question with an affirmative answer from the Oracle is followed by either an object or an attribute question (where "attribute" represents the set {color, size, texture, shape, location}); 2) object → attr, which measures how many times an object question with an affirmative answer from the Oracle is followed by either an object or an attribute question. We compute lexical diversity as the type/token ratio among all games, question diversity, and the percentage of games with repeated questions. We also evaluate the percentage of dialogue turns involving location questions. Tables 5 and 6 show the results of this analysis for the GDSE-CL and GDSE-CL+IMAGINATION models analyzed in this paper.
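The lexical-diversity and turn-transition metrics can be sketched as follows; the rule-based question classifier itself is assumed, and the question-type names and "Yes" answer label are illustrative.

```python
def type_token_ratio(dialogues):
    """Lexical diversity: distinct tokens / total tokens across all games.
    `dialogues` is a list of games, each a list of question strings."""
    tokens = [tok.lower() for game in dialogues
              for question in game for tok in question.split()]
    return len(set(tokens)) / len(tokens)

def count_transitions(turns, from_type, to_types):
    """Count turns where a `from_type` question answered 'Yes' by the Oracle
    is followed by a question whose type is in `to_types`.
    `turns` is a list of (question_type, answer) pairs."""
    return sum(1 for (t1, a1), (t2, _) in zip(turns, turns[1:])
               if t1 == from_type and a1 == "Yes" and t2 in to_types)

# Attribute question types (assumed labels from the question classifier).
ATTRIBUTES = {"color", "size", "texture", "shape", "location"}
```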
Using the above-mentioned question classifier, we completed an error analysis to understand the quality of the generated gameplay in the zero-shot scenario, from the point of view of both answer prediction performance and Guesser accuracy. In particular, we randomly sampled a pool of 50 reference games from the out-of-domain zero-shot scenario and manually annotated whether each answer generated by the Oracle model was correct. Table 7 shows the results of the manual annotation step. The model confirms high performance in answering questions about super-category information, demonstrating that it is able to correctly categorize objects into macro-categories even though it has not seen them before.

Table 7: Error analysis results for the out-of-domain zero-shot scenario, for the models GDSE-CL+IMAGINATION (left) and GDSE-CL (right).