A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions

Recent models achieve promising results in visually grounded dialogues. However, existing datasets often contain undesirable biases and lack sophisticated linguistic analyses, which make it difficult to understand how well current models recognize their precise linguistic structures. To address this problem, we make two design choices: first, we focus on OneCommon Corpus (CITATION), a simple yet challenging common grounding dataset which contains minimal bias by design. Second, we analyze the linguistic structures of its dialogues based on spatial expressions and provide comprehensive and reliable annotation for 600 dialogues. We show that our annotation captures important linguistic structures including predicate-argument structure, modification and ellipsis. In our experiments, we assess the model’s understanding of these structures through reference resolution. We demonstrate that our annotation can reveal both the strengths and weaknesses of baseline models at essential levels of detail. Overall, we propose a novel framework and resource for investigating fine-grained language understanding in visually grounded dialogues.


Introduction
Visual dialogue is the task of holding natural, often goal-oriented conversation in a visual context (Das et al., 2017a; De Vries et al., 2017). This typically involves two types of advanced grounding: symbol grounding (Harnad, 1990), which bridges symbolic natural language and continuous visual perception, and common grounding (Clark, 1996), which refers to the process of developing mutual understandings through successive dialogues. As noted by Monroe et al. (2017) and Udagawa and Aizawa (2019), the continuous nature of the visual context introduces challenging symbol grounding of nuanced and pragmatic expressions. Some settings further incorporate partial observability, where the agents do not share the same context, which introduces complex misunderstandings that need to be resolved through advanced common grounding (Udagawa and Aizawa, 2019; Haber et al., 2019).
Despite the recent progress on these tasks, it remains unclear what types of linguistic structures can (or cannot) be properly recognized by existing models, for two reasons. First, existing datasets often contain undesirable biases which make it possible to make correct predictions without recognizing the precise linguistic structures (Goyal et al., 2017; Cirik et al., 2018; Agarwal et al., 2020). Second, existing datasets severely lack sophisticated linguistic analyses, which makes it difficult to understand what types of linguistic structures exist or how they affect model performance.
To address this problem, we make the following design choices in this work: • We focus on OneCommon Corpus (Udagawa and Aizawa, 2019, 2020), a simple yet challenging collaborative referring task under continuous and partially observable context. In this dataset, the visual contexts are kept simple and controllable to remove undesirable biases while enhancing linguistic variety. In total, 5,191 natural dialogues are collected and fully annotated with referring expressions (which they call markables) and their referents, which can be leveraged for further linguistic analysis.
• To capture the linguistic structures in these dialogues, we propose to annotate spatial expressions, which play a central role in visually grounded dialogues. We take inspiration from existing annotation frameworks (Pustejovsky et al., 2011a,b; Petruck and Ellsworth, 2018; Ulinski et al., 2019) but make several simplifications and modifications to improve coverage, efficiency and reliability. As shown in Figure 1, we consider spatial expressions as predicates with existing markables as their arguments. We distinguish the argument roles based on subjects and objects and annotate modifications based on nuanced expressions (such as slightly). By allowing the arguments to be in previous utterances, our annotation also captures argument ellipsis in a natural way.

Figure 1: Example dialogue from OneCommon Corpus with reference resolution annotation (left) and our spatial expression annotation (right). We consider spatial expressions as predicates and annotate their arguments as well as modifiers. For further details of the original dataset and our annotation schema, see Section 3.
In our experiments, we focus on reference resolution to study the model's comprehension of these linguistic structures. Since we found the existing baseline to perform relatively poorly, we propose a simple method of incorporating numerical constraints in model predictions, which significantly improved its prediction quality.
Based on our annotation, we conduct a series of analyses to investigate whether the model predictions are consistent with the spatial expressions. Our main finding is that the model is adept at recognizing entity-level attributes (such as color and size), but mostly fails in capturing inter-entity relations (especially placements): using the terminologies from Landau and Jackendoff (1993), the model can recognize the what but not the where in spatial language. We also conduct further analyses to investigate the effect of other linguistic factors.
Overall, we propose a novel framework and resource for conducting fine-grained linguistic analyses in visually grounded dialogues. All materials in this work will be publicly available at https://github.com/Alab-NII/onecommon to facilitate future model development and analyses.

Related Work
Linguistic structure plays a critical role in dialogue research. From theoretical aspects, various dialogue structures have been studied, including discourse structure (Stent, 2000; Asher et al., 2003), speech acts (Austin, 1962; Searle, 1969) and common grounding (Clark, 1996; Lascarides and Asher, 2009; Hansen and Søgaard, 2020), as well as tasks such as intent recognition (Silva et al., 2011; Shi et al., 2016), semantic representation/parsing (Mesnil et al., 2013; Gupta et al., 2018) and frame-based dialogue state tracking (Williams et al., 2016; El Asri et al., 2017). However, most prior work focuses on dialogues where information is not grounded in an external, perceptual modality such as vision. In this work, we propose an effective method of analyzing linguistic structures in visually grounded dialogues.

Recent years have witnessed increasing attention to visually grounded dialogues (Zarrieß et al., 2016; de Vries et al., 2018; Alamri et al., 2019; Narayan-Chen et al., 2019). Despite the impressive progress on benchmark scores and model architectures (Das et al., 2017b; Wu et al., 2018; Kottur et al., 2018; Gan et al., 2019; Shukla et al., 2019; Niu et al., 2019; Zheng et al., 2019; Kang et al., 2019; Murahari et al., 2019; Pang and Wang, 2020), critical problems have been pointed out in terms of dataset biases (Goyal et al., 2017; Chattopadhyay et al., 2017; Massiceti et al., 2018; Chen et al., 2018; Kottur et al., 2019; Kim et al., 2020; Agarwal et al., 2020) which obscure such contributions. For instance, Cirik et al. (2018) point out that existing datasets for reference resolution may be largely solvable without recognizing the full referring expressions (e.g. based on object categories only). To circumvent these issues, we focus on OneCommon Corpus, where the visual contexts are simple (exploitable categories are removed) and well-balanced (attributes are sampled from uniform distributions) to minimize dataset biases.
Although various probing methods have been proposed for models and datasets in NLP (Belinkov and Glass, 2019; Geva et al., 2019; Kaushik et al., 2020; Gardner et al., 2020; Ribeiro et al., 2020), fine-grained analyses of visually grounded dialogues have been relatively limited. For instance, Kottur et al. (2019) proposed a diagnostic dataset to investigate models' language understanding; however, their dialogues are generated artificially and may not reflect the true nature of visual dialogues. Shekhar et al. (2019) also acknowledge the importance of linguistic analysis but only deal with coarse-level features that can be computed automatically (such as dialogue topic and diversity). Most similar to our research are Yu et al. (2019) and Udagawa and Aizawa (2020), who conducted additional annotation of reference resolution in visual dialogues; however, they still do not capture more sophisticated linguistic structures such as predicate-argument structure (PAS), modification and ellipsis.
Finally, spatial language and cognition have a long history of research (Talmy, 1983; Herskovits, 1987). In computational linguistics, Kordjamshidi et al. (2010) and Pustejovsky et al. (2015) developed the task of spatial role labeling to capture spatial information in text; however, they do not fully address annotation reliability or grounding in an external visual modality. In computer vision, the VisualGenome dataset (Krishna et al., 2017) provides rich annotation of spatial scene graphs constructed from raw images, but not from raw dialogues. Ramisa et al. (2015) and Platonov and Schubert (2018) also worked on modelling spatial prepositions in single sentences. To the best of our knowledge, our work is the first to apply, model and analyze spatial expressions in visually grounded dialogues at full scale.

Dataset
Our work extends OneCommon Corpus, originally proposed in Udagawa and Aizawa (2019). In this task, two players A and B are given slightly different, overlapping perspectives of a 2-dimensional grid with 7 entities in each view (Figure 1, left). Since only some (4, 5 or 6) of them are in common, this setting is partially observable, which introduces complex misunderstandings and partial understandings. In addition, each entity only has continuous attributes (x-value, y-value, color and size), which introduce various nuanced and pragmatic expressions. Note that all entity attributes are generated randomly to enhance linguistic diversity and reduce dataset biases. Under this setting, the two players were instructed to converse freely in natural language to coordinate attention on one of the same, common entities. Basic statistics of the dialogues are shown at the top of Table 1.

More recently, Udagawa and Aizawa (2020) curated all successful dialogues from the corpus and additionally conducted reference resolution annotation. Specifically, trained annotators detected all referring expressions (markables) based on minimal noun phrases, and multiple crowdworkers identified their referents (Figure 1 left, highlighted). Both annotations were shown to be reliable, with high overall agreement. We show their dataset statistics at the bottom of Table 1.
In this work, we randomly sample 600 dialogues from the latest corpus (5,191 dialogues annotated with reference resolution) to conduct further annotation of spatial expressions.

Annotation Schema
Our annotation procedure consists of three steps: spatial expression detection, argument identification and canonicalization. Based on these annotations, we conduct fine-grained analyses of the dataset (Subsection 3.3) as well as the baseline models (Subsection 4.2). For further details and examples of our annotation, see Appendix A.

Spatial Expression Detection
Based on the definition from Pustejovsky et al. (2011a,b), spatial expressions are "constructions that make explicit reference to the spatial attributes of an object or spatial relations between objects". We generally follow this definition and detect all spans of spatial attributes and relations in the dialogue. To make the distinction clear, we consider entity-level information like color and size as spatial attributes, and other information such as location and explicit attribute comparison as spatial relations. Spatial attributes could be annotated as adjectives ("dark"), prepositional phrases ("of light color") or noun phrases ("a black dot"), while spatial relations could be adjectives ("lighter"), prepositions ("near"), and so on. We also detect modifiers of spatial expressions based on nuanced expressions (cf. Table 2).
Although we allow certain flexibility in determining their spans, holistic/dependent expressions (such as "all shades of gray", "sloping up to the right", "very slightly") were instructed to be annotated as a single span. Independent expressions (e.g. connected by conjunctions) could be annotated separately or jointly if they had the same structure (e.g. same arguments and modifiers).
For the sake of efficiency, we do not annotate spatial attributes and their modifiers inside markables (see Figure 1), since their spans and arguments are easy to detect automatically.

Argument Identification
Secondly, we consider the detected spatial expressions as predicates and annotate referring expressions (markables) as their arguments. This approach has several advantages: first, it has broad coverage since referring expressions are prevalent in visual dialogues. In addition, by leveraging exophoric references which directly bridge natural language and the visual context, we can conduct essential analyses related to symbol grounding across the two modalities (Subsection 4.2).
To be specific, we distinguish the argument roles based on subjects and objects. We allow arguments to be in previous utterances only if they are unavailable in the present utterance. Multiple markables can be annotated for the subject/object roles, and no object needs to be annotated in the case of spatial attributes, nominal/verbal expressions ("triangle", "clustered") or implicit global objects as in superlatives ("darkest (of all)"). If the arguments are indeterminable based on these roles (as in enumeration, e.g. "From left to right, there are ..."), they were marked as unannotatable. Modificands of the modifiers (which could be either spatial attributes or relations) were also identified in this step.
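As an illustration, an annotated spatial relation with its argument roles might be represented as follows (a minimal sketch; the field names are ours, not the official corpus schema):

```python
# Hypothetical record for the relation "to the left of" in
# "[a small dot] to the left of [the black one]".
# Field names are illustrative, not the actual corpus schema.
relation = {
    "span": "to the left of",
    "type": "spatial_relation",
    "subjects": ["markable_12"],   # "a small dot"
    "objects": ["markable_13"],    # "the black one"
    "modifiers": [],               # e.g. "slightly" would appear here
    "utterance_id": 4,             # arguments may live in earlier utterances
}

def has_object(rel):
    """Spatial attributes and some relations (e.g. superlatives with
    implicit global objects) may have no annotated object."""
    return len(rel["objects"]) > 0
```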

Canonicalization
Finally, we conduct canonicalization of the spatial expressions and modifiers. Since developing a complete ontology for this domain is infeasible or too expensive, we focus on canonicalizing the central spatial relations in this work: we do not canonicalize spatial attributes manually, since we can canonicalize the central spatial attributes automatically (cf. Subsubsection 4.2.1).
According to Landau (2017), there are two classes of relations in spatial language: a functional class whose core meanings engage force-dynamic relationships (such as on, in) and a geometric class whose core meanings engage geometry (such as left, above). Since functional relations are less common in this dataset and more difficult to define due to their vagueness and context dependence (Platonov and Schubert, 2018), we focus on the following five categories of geometric relations and attribute comparisons, comprising a total of 24 canonical relations which can be defined explicitly.
Direction requires the subjects and objects to be placed in certain orientation: left, right, above, below, horizontal, vertical, diagonal.
Proximity is related to distance between subjects, objects or other entities: near, far, alone.
Region restricts the subjects to be in a certain region specified by the objects: interior, exterior.
Color comparison is related to comparison of color between subjects and objects: lighter, lightest, darker, darkest, same color, different color.
Size comparison is related to comparison of size between subjects and objects: smaller, smallest, larger, largest, same size, different size.
To be specific, we annotate whether each detected spatial relation implies any of the 24 canonical relations. Each spatial relation can imply multiple canonical relations (e.g. "on the upper right" implies right and above) or none (e.g. "triangle" does not imply any of the above relations).
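The many-to-many mapping from surface expressions to canonical relations can be sketched as a simple lookup (the examples follow the text above; the function and table names are ours):

```python
# Hypothetical mapping from surface spatial relations to the
# canonical relations they imply (category names follow Section 3.2).
CANONICAL = {
    "on the upper right": {"right", "above"},   # implies two relations
    "close together":     {"near"},
    "darker than":        {"darker"},
    "triangle":           set(),                # implies no canonical relation
}

def implied_relations(expression):
    """Return the set of canonical relations implied by an expression
    (empty set when the expression has no canonical counterpart)."""
    return CANONICAL.get(expression, set())
```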
In addition, we define 6 modification types (subtlety, extremity, uncertainty, certainty, neutrality and negation) and canonicalize each modifier into one type. For example, "very slightly" is considered to have the overall type of subtlety.

Annotation Reliability

To test the reliability of our annotation, two trained annotators (the authors) independently detected the spatial expressions and modifiers in 50 dialogues. Then, using the 50 dialogues from one of the annotators, the two annotators independently conducted argument identification and canonicalization. We show the observed agreement and Cohen's κ (Cohen, 1968) in Table 3.
For span detection, we computed the token level agreement of spatial expressions and modifiers. Despite having certain freedom for determining their spans, we observed very high agreement (including their starting positions, see Appendix B).
For argument identification, we computed the exact match rate of the arguments and modificands. As a result, we observed near-perfect agreement for subject/modificand identification. For object identification, the agreement was relatively low; however, upon further inspection, we verified that 73.5% of the disagreements were essentially based on the same markables (e.g. coreferences).
Finally, we observed reasonably high agreement for relation/modifier canonicalization as well. Overall, we conclude that all steps of our annotation can be conducted with high reliability.

Table 4: Statistics of our spatial expression annotation in 600 randomly sampled dialogues.
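For reference, the agreement statistic can be computed as follows. This is a minimal, unweighted sketch (note that Cohen (1968) describes the weighted variant of κ):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators' parallel label
    sequences: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from the marginals."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:          # degenerate case: both annotators constant
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```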

Annotation Statistics
The basic statistics of our annotation are summarized in Table 4. Note that there are relatively few spatial attributes annotated, since most of them appeared inside the markables (hence not detected manually). However, a large number of spatial relations with non-obvious structures were identified.
In both expression types, we found over 1% of the subjects and 14% of the objects to be present only in previous utterances, which indicates that argument-level ellipsis is common and needs to be resolved in visual dialogues. For spatial relations, about 30% did not have any explicit objects.
Our annotation also verified that a large portion of the spatial expressions (37% for spatial attributes and 17% for relations) accompanied modifiers.
Finally, less than 1% of spatial expressions were unannotatable based on our schema, which verifies its broad coverage. Overall, our annotation can capture important linguistic structures of visually grounded dialogues, and it is straightforward to conduct even further analyses (e.g. by focusing on specific canonical relations or modifications).

Reference Resolution
Reference resolution is an important subtask of visual dialogue that can be used for probing a model's understanding of the intermediate dialogue process (Udagawa and Aizawa, 2020). As illustrated in Figure 1 (left), this is a simple task of predicting the referents for each markable based on the speaker's perspective. To collect model predictions for all dialogues, we split the whole dataset into 10 equal-sized bins and use each bin as the test set in 10 rounds of experiments. For a more detailed setup of our experiments, see Appendix C.

Models

As a baseline, we use the REF model proposed in Udagawa and Aizawa (2020). As shown in Figure 2, this model has two encoders: a dialogue encoder based on a simple GRU (Cho et al., 2014) and an entity encoder which outputs entity-level representations of the observation based on an MLP and a relational network (Santoro et al., 2017). To predict the referents, REF takes the GRU outputs at the start position of the markable, the end position of the markable and the end position of the utterance to compute entity-level scores, and judges whether each entity is a referent based on logistic regression.

However, since the predictions are made independently for each entity, this model often predicts the wrong number of referents, leading to low performance in terms of exact match rate. To address this issue, we trained a separate module to track the number of referents in each markable. We formulate this as a simple classification task over 0, 1, ..., 7, which can be predicted reliably with an average accuracy of 92%. Based on this module's prediction k, we simply take the top k entities with the highest scores as the referents. We refer to this numerically constrained model as NUMREF. Furthermore, we conduct feature-level ablations to study the importance of each feature: for instance, we remove the xy-values from the structured input to ablate the location feature.

We report the mean and standard deviation of the entity-level accuracy and markable-level exact match rate in Table 5. Compared to REF, our NUMREF model slightly improves the entity-level accuracy and significantly outperforms it in terms of exact match rate, which validates our motivation. From the ablation studies, we can see that all features contribute to the overall performance, but color and size seem to have the largest impact.
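The NUMREF decoding rule described above (take the top k entities under the predicted count k) can be sketched as follows (function and variable names are ours):

```python
def constrained_referents(entity_scores, k):
    """Select the top-k scoring entities as referents, given the
    number-of-referents module's prediction k (a sketch of the
    numerically constrained decoding rule; names are illustrative).
    Returns a boolean referent mask over the entities."""
    top = sorted(range(len(entity_scores)),
                 key=lambda i: entity_scores[i], reverse=True)[:k]
    return [i in top for i in range(len(entity_scores))]
```

Unlike independent per-entity thresholding, this guarantees that exactly k entities are predicted, which directly targets the exact-match metric.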

However, it is difficult to see how and where these models struggle based on mere accuracy. For further investigation, we need more sophisticated behavioral testing (namely black-box testing) to verify whether each model has the capability of recognizing certain concepts or linguistic structures (Ribeiro et al., 2020).

Model Analysis
To study the current model's strengths and weaknesses in detail, we investigate whether their predictions are consistent with the central spatial expressions.

Spatial Attributes
First, we analyze whether the model predictions are consistent with the entity-level spatial attributes. Since most of them were confirmed to appear inside the markables (Subsection 3.3), we automatically detect all expressions of color in the markables, plot the distributions of the actual referent color, and compare the results between gold human annotation and model predictions (Figure 3).
From the figure, we can verify that the two distributions look almost identical for the common color expressions, and our NUMREF model seems to capture important characteristics of pragmatic expressions (the same expression being used for a wide range of colors) and modifications such as neutrality (medium) and extremity (very dark, very light). We observed very similar results for the size distributions, which are available in Appendix D.
Based on these results, we argue that the current model can capture entity-level attributes very well, including basic modification.
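The distribution comparison above can be sketched as follows, assuming hypothetical containers mapping markable ids to their text and to the color values of their (gold or predicted) referents:

```python
from collections import defaultdict

def color_distributions(markables, referents):
    """Group the colors of (gold or predicted) referents by the color
    word appearing in each markable, so that human and model
    distributions can be compared side by side. `markables` maps
    markable ids to text; `referents` maps markable ids to lists of
    entity color values (both containers are illustrative)."""
    color_words = ("light", "dark", "grey", "gray", "black")
    dist = defaultdict(list)
    for mid, text in markables.items():
        for word in color_words:
            if word in text.lower():
                dist[word].extend(referents.get(mid, []))
    return dist
```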

Spatial Relations
Next, we investigate whether the model predictions are consistent with the central spatial relations. Based on our annotation (Subsection 3.2), we conduct simple tests to check whether the predicted referents satisfy each canonical relation. To be specific, our tests check two conditions: whether the predictions are valid (satisfying minimal requirements, e.g. at least 2 referents predicted for the near relation), and, if they are valid, whether the predictions actually satisfy the canonical relation (e.g. the referents are closer than a certain threshold).
Algorithm 1 shows our test for the canonical left relation. (Spatial attributes with nuances of subtlety, such as slightly dark, were relatively rare and omitted in Figure 3.)
The results of our tests are summarized in Table 6. We also compare with the feature-ablated models to estimate the test cases which can be satisfied without using the corresponding features, i.e. location for the direction/proximity/region categories, color for color comparison, and size for size comparison.
First, we can verify that the human annotation passes most of our tests, which is important evidence for the validity of our annotations and tests. We also confirmed that the REF models often make invalid predictions with overall poor performance, which is consistent with our expectations.
In the direction, proximity and region categories, we found that the NUMREF model performs on par with or only marginally better than its ablated version (and even underperforms it for simple relations like right and above): these results indicate that the current model is still incapable of leveraging locational features to make consistent predictions. In color/size comparison, NUMREF performs reasonably well, outperforming all other models: this indicates that the model can not only capture but also compare entity-level attributes to a certain extent. However, there is still room for improvement in almost all relations. It is also worth noting that size comparison may be easier, as the range of size values is limited (only 6, compared to 150 for color).
Overall, we conclude that current models still struggle in capturing most of the inter-entity relations, especially those related to placements.

Further Analyses
Finally, we conduct further analyses to study other linguistic factors that affect model performance.

Table 7 shows the results of our relation tests classified by notable linguistic structures.
In terms of modification, we can confirm that human performance is consistently high, while the model performs best for strong modification (modification types of extremity or certainty), decently for neutral cases (neutrality or no modification), and worst for weak modification (subtlety or uncertainty). This indicates that large, conspicuous features are easier for the model to capture than small or more ambiguous features.
In terms of subject/object properties, human performance is also consistently high. In contrast, model performance is significantly worse for subject ellipsis (inter-utterance subject), while remaining high for object ellipsis and no object cases.
We also hypothesize that a large portion of the relations can actually be satisfied without considering the objects, e.g. by simply predicting very dark dots as the subjects when the relation is darker or darkest. To distinguish such easy cases, we consider a relation's object ignorable if the relation can be satisfied even when we ignore the objects (i.e. remove all object relations) based on the gold referents. Our results verify that there are indeed many cases of ignorable objects, and they seem slightly easier for the model to satisfy.

In Table 8, we study the effect of modification based on the absolute difference between subject and object features in comparative relations. In the human annotation, the absolute difference naturally increases as the modification gets stronger. While the model predictions also show this tendency, their results seem less sensitive to modification (particularly for locational features, i.e. xy-values) and may not reflect its full effect.
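The Table 8-style analysis can be sketched as follows (field names are illustrative, not the actual annotation schema):

```python
from statistics import mean

def mean_abs_diff(pairs):
    """Mean absolute feature difference between subject and object
    referents of comparative relations (e.g. color values for `darker`).
    `pairs` is a list of (subject_value, object_value) tuples."""
    return mean(abs(s - o) for s, o in pairs)

def diff_by_modification(relations):
    """Group comparative relations by their modification type and
    report the mean absolute difference for each group
    (a sketch; the record fields are hypothetical)."""
    groups = {}
    for rel in relations:
        groups.setdefault(rel["modification"], []).append(
            (rel["subject_value"], rel["object_value"]))
    return {mod: mean_abs_diff(pairs) for mod, pairs in groups.items()}
```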

Discussion and Conclusion
In this work, we focused on the recently proposed OneCommon Corpus as a suitable testbed for fine-grained language understanding in visually grounded dialogues. To analyze its linguistic structures, we proposed a novel framework of annotating spatial expressions in visual dialogues. We showed that our annotation can be conducted reliably and efficiently by leveraging referring expressions prevalent in visual dialogues, while capturing important linguistic structures such as PAS, modification and ellipsis. Although our current analysis is limited to this domain, we expect that upon appropriate definition of spatial expressions, argument roles and canonicalization, the general approach can be applied to a wider variety of domains: adapting and validating our approach in different domains (especially with more realistic visual contexts) are left as future work.
Secondly, we proposed a simple idea of incorporating numerical constraints to improve exophoric reference resolution. We expect that a similar approach of identifying and incorporating semantic constraints (e.g. coreferences and spatial constraints) is a promising direction to improve the model's performance even further.
Finally, we demonstrated the advantages of our annotation for investigating the model's understanding of visually grounded dialogues. Our tests are completely agnostic to the models and only require referent predictions made by each model. By designing simple tests like ours (Subsubsection 4.2.1/4.2.2), we can diagnose the model's performance at the granularity of canonical attributes/relations under consideration: such analyses are easy to extend (by adding more tests) and critical for verifying what capabilities current models have (or do not have). Based on further analyses (Subsubsection 4.2.3), we also revealed various linguistic structures that affect model performance: we expect that capturing and studying such effects will be essential for advanced model probing in visual dialogue research.
Overall, we expect our framework and resource to be fundamental for conducting sophisticated linguistic analyses of visually grounded dialogues.

A Annotation Examples and Details

Here, we show additional examples of our spatial expression annotation. In Figure 4, we show an example dialogue annotated with spatial attributes (colored in red). Since our goal is not to achieve strict inter-annotator agreement but to conduct efficient and useful analysis, we allow certain flexibility in determining the spans of spatial expressions: for instance, the coordinated spatial expression ("small and light") can be annotated as a single expression or as two separate expressions ("small", "light"). Copulas (is, being), articles (a, the), particles (to, with) and modifiers were allowed to be either omitted or included in spatial expressions. Spans were allowed to be non-contiguous, but had to be annotated at the token level and restricted to a single utterance. Note that the spatial attributes (tiny, light) in the first markable ("a lonely tiny light dot") are not annotated, since they are inside the markable and their spans and subjects are relatively obvious.
In terms of argument identification, we prioritize markables in the following manner: 1. Markables in the present utterance (i.e. same utterance as the spatial expression).
2. Markables in the closest previous utterance of the same speaker.
3. Markables in the closest previous utterance of different speakers.
As long as these priorities are satisfied, we did not distinguish between coreferences. Furthermore, for object identification, we did not distinguish between markables which include/exclude the subject referents: for example, the object markable for lighter in "I have [three dots], [two] dark and [one] lighter" could be either three dots or two.

In Figure 5, we show an example dialogue where the subject markable appears only in the previous utterance ("smaller?" in B's utterance), which demonstrates the case of subject ellipsis. Note that since we only detect expressions that contain specific spatial information about the visual context, we do not annotate black dots in the first interrogative utterance ("how many black dots do u see?").
In Figure 6, we show an example dialogue with an unannotatable relation ("going [small], [medium], [large]") which cannot be captured based on the simple argument roles of subjects and objects. In general, similar strategies of enumeration are difficult to capture, as are predications with exceptions (such as "[All dots] are dark except [one]").

Finally, we only annotate explicit spatial attributes and relations: therefore, we do not annotate implicit relations such as darker in "One is dark and the other is light gray", although it is inferable. When the spans are difficult to annotate, annotators were encouraged to make their best effort to capture the constructions which refer to specific spatial information.

Table 9: Additional results of our reliability analysis.

B Annotation Results
In Table 9, we show the results of token-level agreement for the starting positions of spatial expressions and modifiers. Despite having certain freedom as discussed in Appendix A, we can verify that these also show reasonably high agreement.

In Table 10, we show the frequency of each modification type. Based on these results, we can see that neutrality is the most common type of modification for spatial attributes (as in medium gray, medium sized), while subtlety and uncertainty are the most common types for spatial relations. It is interesting to note that the frequencies of the modification types vary significantly between spatial attributes and relations, except for negation.
In Tables 11 and 12, we show the statistics and examples of canonical relations and modification types annotated for our analyses. Note that a single expression can imply multiple canonical relations (e.g. "identical looking" implies same color and same size) or no canonical relation at all (e.g. "forms a triangle"). In contrast, a modifier can have only one modification type: for instance, almost exactly is considered to have the overall modification type of certainty.
C Experimental Setup

In order to collect model predictions for all dialogues and markables, we randomly split the whole dataset into 10 equal-sized bins z_i (i ∈ {0, 1, ..., 9}), and at each round r ∈ {0, 1, ..., 9} we use z_{r mod 10}, z_{(r+1) mod 10}, ..., z_{(r+7) mod 10} for model training, z_{(r+8) mod 10} for validation, and z_{(r+9) mod 10} for testing. We report the mean and standard deviation of the entity-level accuracy and markable-level exact match rate over these 10 rounds of experiments. In our NUMREF model, we train a separate module for predicting the number of referents based on a simple MLP (single layer, 256 hidden units). Reference resolution and number prediction are trained jointly, with the losses weighted 32:1. We conducted minimal hyperparameter tuning, since the results did not change dramatically.

D Size Distributions

Figure 7 shows the referent size distributions based on human annotation (top) and NUMREF predictions (bottom). We can verify that the two distributions look almost identical for all common expressions, as observed for the color distributions.
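As a minimal sketch (function and variable names are ours), the 10-round round-robin rotation described in Appendix C can be generated as:

```python
def round_robin_splits(bins, r):
    """Return (train, valid, test) bins for round r of the rotation:
    bins r..r+7 for training, r+8 for validation and r+9 for testing,
    with all indices taken mod the number of bins (10 in our setup)."""
    n = len(bins)
    train = [bins[(r + i) % n] for i in range(n - 2)]
    valid = bins[(r + n - 2) % n]
    test = bins[(r + n - 1) % n]
    return train, valid, test
```

Over the 10 rounds, every bin serves exactly once as the test set, so predictions are collected for all dialogues.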

E Canonical Relation Tests
For canonical relation tests, we only use relations that are not negated and have all arguments in the same speaker's utterances (so that referent predictions are based on the same player's observation).