A case study on context-bound referring expression generation

In recent years, Bayesian models of referring expression generation have gained prominence as a way to produce situationally more adequate referring expressions. These models integrate different parameters into the decision to use a specific referring expression, such as the cardinality of the object set, the configuration and complexity of the visual field, and the discriminatory power of the available attributes, all of which need to be combined with visual salience and personal preference. This paper describes and discusses the results of an empirical study on the production of referring expressions in visual fields with object configurations of varying complexity and with different contextual premises for using a referring expression. The visual fields are set up using data from the TUNA experiment, with either plain random configurations or pragmatically enriched configurations that allow for target inference. The situational contexts in which the referring expressions are produced fall into categories with different degrees of cooperativeness, so that generation quality and its relation to contextual user intention can be observed. The results of the study suggest that Bayesian approaches must integrate individual generation preference and the cooperativeness of the situational task in order to model the broad variance between speakers more adequately.


Introduction
In the past, experiments on the production of referring expressions (REs) have produced corpora on domains of different complexity, among them the TUNA corpus (van der Sluis, Gatt, van Deemter, 2006; 2006 online manual), GRE3D3 and GRE3D7 (Viethen & Dale, 2008; 2011), ReferIT (Kazemzadeh et al., 2014), Wally (Clarke et al., 2013), and some interlingual experiments revealing that the basic concepts of reference are independent of language expertise (e.g. Khan & Siddiqui, 2015). Da Silva Rocha & Paraboni (2018) distinguish two general experimental designs in the referring expression generation (REG) task, related to the speaker-listener configuration: monologue and dialogue. The authors remark that "both dialogue and monologue are of course instances of real language use but, at least from these studies, it is not entirely clear whether the two situations are truly comparable" (p. 2994). It remains an open question whether content determination and the resulting generation quality, i.e. underspecification, minimality or overgeneration, differ not only according to the speaker-listener configuration but also according to the context in which the REG task is situated. This question also includes variance between speakers. The experiment described in this paper builds on its predecessors, focusing on the technical and contextual parameters that may trigger differences in generation quality and content determination during production. The goal is to provide empirical data clarifying the influence of the situational context on the generation quality of referring expressions.

Methods
The experiment is designed using the TUNA furniture corpus and a subset of the TUNA people corpus that has been selected in a balanced way, making each feature value combination unique. It is conducted as a web-based experiment. Data from native speakers of English is collected using the crowdsourcing platform Amazon Mechanical Turk (MTurk). The compiled corpus consists of 1029 production sessions from 50 participants. Production sessions are associated with different contexts that represent different communicative intents of the dialogue. The contexts are given in table 1. Either no contextual text is given and the participants are asked to generate expressions to their liking, or the context type is chosen randomly according to the domain type of the production session.
The contexts marked with + are designed with a focus on the speaker's interest. In these contexts, the speaker has some personal intention for which it is important to convey to the listener which object he or she is referring to. Correct identification is important to the speaker. The contexts marked with O are designed as rather neutral, where correct identification is of equal importance for both speaker and hearer in a collaborative task. The - marker indicates that these contexts focus on the hearer's interest, implied by the fact that the production task is the answer to the hearer's question. Correct identification is more important to the hearer than to the speaker.
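The three generation quality categories evaluated in this study can be made concrete with a minimal sketch. It assumes that objects are flat attribute-value records and that an expression counts as minimal when no distinguishing attribute set with fewer attributes exists; both the function names and this operationalisation are illustrative assumptions, not the annotation procedure used in the study.

```python
from itertools import combinations

def distractors_left(attrs, target, distractors):
    """Return the distractors that match the target on every attribute in attrs."""
    return [d for d in distractors
            if all(d.get(a) == target.get(a) for a in attrs)]

def classify(used_attrs, target, distractors):
    """Label an expression by the attributes it mentions (hypothetical helper).

    Assumption: 'minimality' means no distinguishing attribute set with
    fewer attributes exists for this target in this visual field.
    """
    used = tuple(used_attrs)
    # The expression still fits at least one other object.
    if distractors_left(used, target, distractors):
        return "underspecification"
    # The expression identifies the target; look for a shorter alternative.
    for size in range(len(used)):
        for subset in combinations(sorted(target), size):
            if not distractors_left(subset, target, distractors):
                return "overgeneration"
    return "minimality"

# Illustrative visual field, not taken from the corpus.
target = {"type": "chair", "colour": "red", "size": "large"}
distractors = [
    {"type": "chair", "colour": "blue", "size": "large"},
    {"type": "sofa", "colour": "red", "size": "small"},
]

for attrs in (["type"], ["type", "colour"], ["type", "colour", "size"]):
    print(attrs, classify(attrs, target, distractors))
```

A stricter or looser notion of minimality (for example, requiring only that no proper subset of the mentioned attributes distinguishes the target) would change the boundary between minimality and overgeneration, so the choice should be stated alongside any reported ratios.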

Results
In this experiment, the main parameter of potential influence on the generation quality is the situational context. Consequently, all context conditions need to be evaluated with regard to overgeneration, minimality and underspecification. Examples of referring expressions produced by the participants, together with the corresponding context condition and the general session configuration, are given in table 2. The absence of a situational context (NONE condition) results in a nearly equal distribution of overgeneration and minimality (36.3% and 34.2%), while underspecification is slightly lower at 29.5% (compare figure 2).
In contrast to this, the neutral context marked with O has a significantly higher ratio of minimal expressions, while the ratio of overgeneration is nearly the same. Underspecification occurs much less frequently (19.8%) than in sessions without situational context. The resulting difference between O and NONE is significant (χ² = 5.66, p < 0.05). The + marked context shows a nearly equal distribution of overgeneration and minimality (32.0% and 31.0%), while there is a slight tendency towards underspecification (36.9%). The results for sessions with - marked contexts are diametrically opposed to those for the + contexts (though not significantly), revealing an approximately mirrored distribution of underspecification and minimality (32.4% and 30.0%), while overgeneration is slightly ahead with a ratio of 37.6%. Neither + nor - is significantly different from the sessions without context (NONE condition), but both are significantly different from the neutral O context (χ² = 11.37 and 10.15, p = 0.003 and 0.006, respectively).
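A comparison of two context conditions over the three quality categories can be sketched as a Pearson chi-square test on a 2x3 contingency table. The sketch below is an assumption about the form of the test, not a reproduction of the analysis: the function names are hypothetical, and the counts are invented for illustration because the per-condition session counts are not given here. It uses only the standard library and relies on the fact that, for two degrees of freedom, the chi-square p-value equals exp(-χ²/2).

```python
import math

def p_value_df2(stat):
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-stat / 2)

def chi_square_2x3(row_a, row_b):
    """Pearson chi-square test of independence for two conditions x three categories."""
    table = [row_a, row_b]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of condition and category.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, p_value_df2(stat)

# Invented counts per condition, in the order
# (overgeneration, minimality, underspecification). Not study data.
condition_a = [30, 50, 20]
condition_b = [36, 34, 30]

stat, p = chi_square_2x3(condition_a, condition_b)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

Under the two-degrees-of-freedom assumption, a statistic of 11.37 maps to p ≈ 0.003 and 10.15 to p ≈ 0.006, in line with the values reported above for the comparisons of + and - with O.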
The positive and negative contexts show tendencies towards overgeneration and underspecification respectively, but in opposite relation to the prior expectation. Contexts marked with +, in contradiction to intuitive assumptions, trigger more underspecification. A possible explanation for this is that the speaker may pay less attention to unique identification because it is only important to him-