Categorisation, Typicality & Object-Specific Features in Spatial Referring Expressions

Various accounts of cognition and semantic representations have highlighted that, for some concepts, different factors may influence category and typicality judgements. In particular, some features may be more salient in categorisation tasks while other features are more salient when assessing typicality. In this paper we explore the extent to which this is the case for English spatial prepositions and discuss the implications for pragmatic strategies and semantic models. We hypothesise that object-specific features — related to object properties and affordances — are more salient in categorisation, while geometric and physical relationships between objects are more salient in typicality judgements. In order to test this hypothesis we conducted a study using virtual environments to collect both category and typicality judgements in 3D scenes. Based on the collected data we cannot verify the hypothesis and conclude that object-specific features appear to be salient in both category and typicality judgements, further evidencing the need to include these types of features in semantic models.


Introduction
Various accounts of cognition and semantic representations have highlighted that, for some concepts, different factors may influence category and typicality judgements (Smith et al., 1974; Rips, 1989). In particular, some features may be more salient in categorisation tasks while other features are more salient when assessing typicality. In this paper we explore the extent to which this is the case for English spatial prepositions and discuss the implications for pragmatic strategies and semantic models. We hypothesise that object-specific features, related to object properties and affordances, are more salient in categorisation, while geometric and physical relationships between objects are more salient in typicality judgements. In order to test this hypothesis we conducted a study using virtual environments to collect both category and typicality judgements in 3D scenes. Based on the collected data we cannot verify the hypothesis and conclude that object-specific features appear to be salient in both category and typicality judgements, further evidencing the need to include these types of features in semantic models.
The spatial prepositions we analyse are those considered to have a functional component as well as those prepositions that seem to act as their geometric counterpart. For the 'functional' prepositions, object affordances and functional relationships, such as support and location control, appear to be salient (Garrod et al., 1999; Coventry et al., 2001), whereas for the geometric counterparts geometric features and relative positions of objects appear to be more salient. In English, we consider the functional prepositions to be: 'in', 'on', 'over' and 'under'; and their respective geometric counterparts to be: 'inside', 'on top of', 'above' and 'below'. We also consider 'against' to be a functional preposition (Talmy, 1988), without a clear geometric counterpart.
To clarify the distinction between category and typicality judgements, we suppose that a category decision is when an entity is labelled with a category or concept and, though priming and context may certainly be factors, the judgement is not made in direct comparison with another entity. For example, a categorisation judgement occurs when an agent is asked whether 'the apple is in the bowl'. In order to reply, the agent judges the membership of the instance in the relevant category and the wider context plays a relatively minor role.
Typicality usually refers to the extent to which an entity is a good example of a concept, that is, how similar it is to some ideal conceptual representation. In our study we ground the notion of typicality in comparison and preference: an entity, x, is more typical of a category than entity y if, when x is compared with y, people in general pick x as a better category member. For example, when requested to pick 'the apple in the bowl', an agent must compare a set of candidate objects for how well they fit the description.
By assessing typicality in this way we distinguish the notion of typicality from graded category membership. The typicality data that we collect does not arise from graded membership judgements where study participants are asked to assign a value of how well the concept fits the category, e.g. in (McCloskey and Glucksberg, 1978; Hampton, 1997), but instead from tasks in which participants select the best fitting instance for a given description.
In existing models of spatial language (and semantic models more generally), it is generally assumed that the underlying semantics of categorisation and typicality are essentially the same. However, as we will discuss in Section 4, appropriately distinguishing categorisation and typicality judgements is important when generating and processing referring expressions.

Background
Following various criticisms of definitional representations of concepts in human cognition (Wittgenstein, 1953), Rosch's Prototype Theory (Rosch, 1978) provided an approach based on family resemblance which does not presuppose that concepts have necessary and sufficient conditions for making category judgements. By relying on a degree of resemblance to some prototypical notion of a concept, category membership in this account may be treated as a matter of degree. With such an account it becomes appealing to conflate the notions of categorisation and typicality: the more an entity resembles a prototype, the more likely it is to be labelled with the category and the more typical it is.
There are, however, various accounts of concept analysis which suggest that category and typicality judgements are fundamentally different. Smith et al. (1974) consider the influences on category decisions and propose a model to account for experimental findings. Central to the model is the 'characteristic feature assumption', which supposes that features vary in the extent to which they define a concept. Smith et al. suppose that there is a distinction in types of features ('defining' features, which strongly influence category judgements, and 'characteristic' features, which strongly influence typicality judgements) and give the example of 'robin' to illustrate this. For the concept 'robin', 'have wings' is an important defining feature relating to the categorisation of an entity as 'a robin', while 'perches in trees' is a characteristic feature which relates to how typical an entity is of 'a robin'.
Rips (1989) argued that categorisation of some entity is more than simply a judgement of how similar the entity is to some typical representation of the category.
Rips provides support for this in an experiment where participants are asked to imagine an object of a given size and are asked which of two concepts, A, B say, the object is more similar to and which the object is most likely to be. For the given concept pairs, e.g. 'pizza' and 'quarter', the hypothetical object may be considered more similar to one of the concepts yet more likely to be the other. In the case of the pizza and quarter, a round object with a three inch diameter is regarded as more similar to a quarter as pizzas are rarely so small, but more likely to be a pizza as the size of a quarter is generally fixed and is less than 3 inches.
This issue is explored further in (Osherson and Smith, 1997) where it is again argued that concept membership and typicality are distinct phenomena. Based on a study in (Berlin and Kay, 1991), Osherson and Smith argue that the notions differ using the seemingly extreme example of the concept 'red', even though it may be hard to imagine distinct defining and characteristic features for a concept with such simple semantics. However, the noted difference in judgements arises as people recognise particular wavelengths of light as being unambiguously red yet less typical than prototypical red. This difference in judgements preserves a monotonic relationship between category and typicality judgements i.e. if an entity, x, is a better category instance than y, then it is not less typical than y.
We believe that such monotonicity results offer a trivial case which may be explained without any fundamental modification of the underlying semantics or how they are processed. In the simple case of the colour red, we may represent the semantics in both categorisation and typicality judgements by considering the dominant wavelength: an instance is more or less typical, and a better or worse category instance, based on the similarity of the dominant wavelength to prototypical red, and the distinction in category and typicality judgements is explained via a threshold which is applied in category judgements.
However, in the case of spatial language, the semantics are more complex and are influenced by a variety of geometric, functional and object-specific features. We believe that, as a result, the relationship between typicality and categorisation may in fact be non-monotonic: there may be instances of a spatial preposition, i_1 and i_2, such that i_1 is a better category member but less typical than i_2.

A note on terminology
Regarding the names of the objects being discussed, we use figure (also known as: target, trajector, referent) to denote the entity whose location is important, e.g. 'the bike' in 'the bike next to the house', and ground (also known as: reference, landmark, relatum) to denote the entity used as a reference point in order to locate the figure, e.g. 'the house' in 'the bike next to the house'. We call potential figure-ground pairs configurations.

Object-Specific Features
It is apparent that various object properties and affordances influence the categorisation and usage of spatial prepositions (Coventry et al., 1994; Feist and Gentner, 1998). For example, the animacy of the figure object may influence a decision to use 'in' or 'on' (Feist and Gentner, 1998). Following (Coventry et al., 1994), we call these features 'object-specific' features.
We believe that these types of features provide a source of disagreement between category and typicality judgements and, for each of the functional prepositions, we describe below the object-specific features that are explored in this paper.
In: As 'in' expresses a notion of containment, the ability of the ground to contain the figure is often salient whether or not the ground does in fact contain the figure in a geometric sense. Therefore, whether the ground is a type of container appears to be salient for 'in' (Coventry et al., 1994; Feist and Gentner, 1998; Richard-Bollans et al., 2019) and this may be considered a salient object-specific feature.
Over/Under: 'Over' and 'under' appear to have a 'covering' sense which is closely related to the functions of the figure and ground (Coventry et al., 2001; Tyler and Evans, 2001; Mori, 2019). For example, a covering object like a lid may exhibit this sense of 'over' when covering a container. Therefore, whether or not the figure is a covering object or the ground is a type of container may be salient object-specific features.
There is also a non-covering sense where a specific functional interaction exists between part of the figure and ground. For example, a tap may be 'over' a sink if only the spout of the tap is above the sink. Similarly, an object may be 'under' a lamp when the object is not under the lamp in a geometric sense but the light from the lamp shines on the object. These specific functional interactions rely on particular properties of the figure or ground and so we consider them to be object-specific features.
Relating to the functional interactions of the figure and ground, an intermediary object between the figure and ground may serve to block any functional interaction, as studied in (Coventry et al., 1994), and diminish the effect of any object-specific features which are present.
Against: 'Against' is commonly used to denote contact between two objects and, as argued in (Herskovits, 1987), is more applicable in situations where the ground object is fixed and the figure is mobile. For example, one may describe a chair as being 'against a wall' but it would be odd to describe a wall as being 'against a chair'.
On: 'On' is ubiquitous in the English language and is applied to many situations where usually at least one of the following holds: the figure is supported by the ground, the figure is above the ground, or the figure is in contact with the ground. As a result, it is not clear that there are particular properties of figure or ground objects at table-top scales which create strong preferences for 'on'.
As discussed above, the preposition 'in' is often preferred when the ground object is a container. 'on' is therefore used less frequently in these scenarios (Feist and Gentner, 1998), even though the physical relationships between the objects often fulfil the requirements for 'on'. As a result, whether or not the ground is a container appears to be a salient object-specific feature for 'on'.
Finally, 'on' may be used to denote attachment of the figure to the ground. It is therefore plausible that, similarly to 'against', 'on' is more applicable in situations where the ground object is fixed relative to the figure.

Hypothesis
Our main hypothesis is that categorisation and typicality judgements may differ for spatial prepositions in the following manner: given configurations c_1, c_2 and preposition P, participants may be more likely to categorise c_1 with P yet more likely to select c_2 as a better instance of P. We hypothesise that this may arise in part because particular features are salient in category judgements which become less salient in typicality judgements. Note that under this hypothesis the relationship between categorisation and typicality is non-monotonic, making this in a sense stronger than the findings related to 'red' discussed by (Osherson and Smith, 1997).
In general, we expect that object-specific features will be more salient in categorisation while geometric and physical relationships, such as containment or support, will be more salient in typicality judgements.
Furthermore, we believe that this particular distinction may be more pronounced for the functional prepositions than for their geometric counterparts. This would be somewhat a corollary of the assumption that functionality in general is more salient for the former, for which tentative evidence is provided in (Richard-Bollans et al., 2020a) for these prepositions and (Coventry et al., 2001) in the case of 'over/above' and 'under/below'.

Generating & Understanding Referring Expressions
In this section we discuss the motivation for this investigation and the possible implications of a significant distinction between category and typicality judgements. In particular, we consider the ramifications for the field of Referring Expression Generation and Comprehension (REG/C), where noun phrases are used to identify entities e.g. 'the box under the table'. Humans often prefer brief ambiguous descriptions over lengthy unambiguous descriptions (Rohde et al., 2012), and expressions involving spatial prepositions often fulfil this desire for brevity. For example, rather than referring to objects based on elaborate visual attributes like 'the yellow cup with two pink dots on it', humans often refer to objects using simple locative expressions, say 'the cup next to the stapler'. We also see many examples of expressions containing spatial prepositions in the SemEval-2014 corpus (Dukes, 2014) and the HuRIC corpus (Bastianelli et al., 2014), both of which consider natural language commands given to robots.
To explain why a distinction in category and typicality judgements is important in these scenarios, suppose we have a speaker and a listener, an intended referent, r, and a set of distractor objects, O, and suppose that the speaker is generating a description, D.
When the speaker generates an utterance in order to refer to r, there are various semantic and pragmatic considerations they must make. A naive model of such a speaker may simply find a concept within its vocabulary which is most suitable for r. This would clearly be a flawed strategy, however, as there may be other entities in the scene which better fit the concept: an expression may be true but not satisfy the speaker's communicative goals, and so pragmatics must be considered. A more refined speaker model may find a description which best distinguishes r from the distractor objects in O, similar to the algorithm of (Dale, 1989) which aims to maximise the discriminatory power of a description while minimising superfluous information.
It appears that the speaker must model how well an object fits a description compared to other objects (typicality judgement) and how well a description fits an object (category judgement). Further, it is apparent that humans will reason recursively about possible intentions of speakers and possible interpretations of listeners (Goodman and Frank, 2016), making these judgements also necessary for listeners.
Let us consider a more concrete example. Suppose we have a scenario as in Figure 1 where a bowl, b, is on a table and there is one cube, c_red, in it and one cube, c_blue, next to and not touching it. It seems plausible that humans are more likely to categorise the configuration (c_blue, b) with the preposition 'near' than the configuration (c_red, b), even though when comparing the configurations humans may agree that c_red is more 'near' the bowl than c_blue.
Suppose a speaker gives an utterance 'the cube near the bowl' in order to refer to c_blue, which an agent must interpret. As may be expected, semantic models, e.g. (Platonov and Schubert, 2018), are likely to assign a better score for 'near' to (c_red, b) than (c_blue, b). If the system has a crude strategy for interpretation which simply selects the configuration with the highest semantic score, then such a system would erroneously select c_red.
Many systems with more sophisticated pragmatic strategies, e.g. (Golland et al., 2010), have been developed which aim to take into account and reason with the possible utterances available to the speaker. In this case, such a system may correctly select c_blue if it recognises that other better utterances would have been available to the speaker if the intended referent was c_red. 'The cube in the bowl' would be a clear example of such an utterance, as it seems to clearly identify c_red over c_blue. However, supposing that our hypothesis is correct, we contend that this marked distinction is not simply a matter of typicality and the fact that it would be unusual to categorise (c_red, b) with 'near' is more salient than any distinction in typicality between the two configurations. As c_red is not 'in' the bowl in an ideal sense we can imagine that a semantic model based simply on the physical relationships between the objects may provide a more marked distinction between the configurations for 'near' than for 'in'. In this case, 'the cube in the bowl' wouldn't necessarily seem like a better utterance to identify c_red than 'the cube near the bowl'.
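The contrast between the crude and the pragmatic interpretation strategies can be illustrated with a minimal sketch. The code below is not the system of (Golland et al., 2010); it is a toy listener in the spirit of the recursive reasoning described in (Goodman and Frank, 2016), and the semantic scores, object names and function names are all hypothetical values chosen purely for illustration.

```python
# Hypothetical semantic scores: how well each (cube, bowl) configuration
# fits each preposition according to some purely geometric model.
SCORES = {
    "near": {"c_red": 0.9, "c_blue": 0.7},
    "in": {"c_red": 0.8, "c_blue": 0.05},
}
REFERENTS = ["c_red", "c_blue"]


def literal_listener(utterance):
    """Normalise the semantic scores of one utterance over the referents."""
    total = sum(SCORES[utterance][r] for r in REFERENTS)
    return {r: SCORES[utterance][r] / total for r in REFERENTS}


def speaker(referent):
    """Probability of each utterance, assuming the speaker wants a
    literal listener to pick out `referent`."""
    fit = {u: literal_listener(u)[referent] for u in SCORES}
    total = sum(fit.values())
    return {u: fit[u] / total for u in SCORES}


def pragmatic_listener(utterance):
    """Invert the speaker model, assuming a uniform prior over referents."""
    weight = {r: speaker(r)[utterance] for r in REFERENTS}
    total = sum(weight.values())
    return {r: weight[r] / total for r in REFERENTS}


def best(distribution):
    """Return the referent with the highest probability."""
    return max(distribution, key=distribution.get)


# Crude strategy: take the configuration with the highest semantic score.
crude_choice = best(literal_listener("near"))
# Pragmatic strategy: account for the alternative utterance 'in'.
pragmatic_choice = best(pragmatic_listener("near"))
print(crude_choice, pragmatic_choice)
```

With these assumed scores the crude strategy selects c_red for 'the cube near the bowl', while the pragmatic listener selects c_blue because 'in' would have been a better utterance for c_red. The outcome depends entirely on the model assigning 'in' a much higher relative score for c_red than for c_blue.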
Moreover, even if typicality on its own works well for REC, understanding and modelling the differences between categorisation and typicality is important for producing natural utterances in REG. For instance, suppose the speaker creates an utterance where r is more typical for the concept, C say, in D than any of the distractor objects. If categorisation is not aligned with typicality, such a strategy may produce unusual utterances where, though r is typical for C, it is uncommon to categorise r with C. Such unnatural utterances may trigger unwanted conversational implicatures and be a source of confusion. For example, the utterance 'the ball on the bowl' in the context of Figure 1 would be an unusual way to describe the ball as the preposition 'in' is often used with objects such as bowls. From this unconventional usage of 'on' a listener may infer that for some reason 'in' was not suitable, e.g. if the speaker is actually referring to another unseen ball. The issue of producing natural utterances has been recognised by others in the field (Krishnaswamy and Pustejovsky, 2019) and is an important challenge to overcome if we are to develop more sophisticated REG systems.
With regards to existing models of spatial language, it is generally assumed that the underlying semantics of categorisation and typicality are essentially the same. For example, in the PRAGR mechanism proposed in (Mast et al., 2016) a pragmatic strategy is presented which aims to maximise both the acceptability and discriminatory power of a description. Acceptability is calculated using similarity to a prototype based on physical relationships while the discriminatory power is calculated considering the acceptability of the description for the referent compared to other distractor objects.

Data Collection
In order to investigate typicality measures and categorisation judgements in detail, we conducted a study which is described below. Collected data, details of the framework and results of the analysis can be found in the Leeds research data repository. The latest version of the data collection environment and code for analysis can be found in the GitHub repository.

Environment & Tasks
The data collection framework is built on the Unity3D game development software, which provides ample functionality for the kind of tasks we implement. Two tasks were created for our study: a Categorisation Task and a Typicality Task. The former allows for the collection of categorical data while the latter provides typicality judgements.
In the Categorisation Task participants are shown a figure-ground pair (highlighted and with text description, see Figure 2) and asked to select all prepositions in the list which fit the configuration. Participants may select 'None of the above' if they deem none of the prepositions to be appropriate. Often categories and concepts are viewed as antagonistic entities; for example work on Conceptual Spaces is often concerned with comparison of categories, e.g. partitioning a feature space (Douven et al., 2013). We believe however that the vagueness present in spatial language is so severe that it is not clear that a meaningful model distinguishing the categories is possible. For this reason, participants in our studies are asked to select all possible prepositions for a configuration rather than a single best-fitting preposition.

In the Typicality Task participants are given a description and shown two configurations, see Figure 3. Participants are asked to select the configuration which best fits the description. Again, participants can select none if they deem none of the configurations to be appropriate.
In order to minimise differences in the tasks that may elicit different conceptualisations of objects in the scenes, the phrasing of the descriptions is the same in both the Categorisation Task and Typicality Task e.g. both tasks use the definite determiner 'the' and objects are referred to by their colour.

Scenes
We hypothesise that object-specific features strongly influence category decisions while the geometric ideals associated to the prepositions are more salient in typicality decisions. Scenes are therefore created for each preposition which vary the degree to which these object-specific features are present and also vary how similar the relational aspects of the configurations are to the geometric ideals associated with the given preposition. For example for the preposition 'in', we have a scene where the ground is a container and the figure is not very well contained in it and a scene where the ground is not a type of container but the figure is well contained in it. In this case, if the hypothesis is correct, we would expect a preference for categorisation in the former and a preference for typicality in the latter.
We have created 18 virtual 3D scenes each containing a single highlighted figure-ground pair. Four scenes each were created for 'in', 'on', 'over' and 'under' and these scenes were also shared with their respective geometric counterparts: 'inside', 'on top of', 'above' & 'below'. Two scenes were created for 'against'. In the Typicality Task, participants compare scenes/configurations associated with the preposition given in the description.

Study
The study was conducted online and participants from the university were recruited via internal mailing lists along with recruitment of friends and family. Each participant performed first the Categorisation Task on 6 randomly selected scenes and then the Typicality Task on 15 randomly selected scenes, which took participants roughly 5 minutes. 30 native English speakers participated, providing 180 annotations in the Categorisation Task and 447 annotations in the Typicality Task.
As the study was hosted online, we first asked participants to show basic competence. This was assessed by showing participants two simple scenes with an unambiguous description of an object. Participants were asked to select the object which best fits the description. If the participant made an incorrect guess in either scene they were taken back to the start menu.

Results
In this section we use the collected data to test the hypothesis and conclude that we cannot verify it. We propose that this is because object-specific features are in fact salient in both category and typicality judgements, and provide examples to support this.

Comparing Categorisation & Typicality
To analyse the collected data, we consider pairs of tested configurations for each preposition and evaluate the degree to which category and typicality judgements differ. For each pair, (c_1, c_2), we first decide whether c_1 is a genuinely better category instance than c_2 or is more typical than c_2. To do this we simply use a hypothesis test with significance level 10% and null hypothesis that the given configurations are equally likely to be labelled with the preposition (in the category case) or equally likely to be selected (in the typicality case). In the category case the p-value is calculated using the one-tailed version of Fisher's exact test. In the typicality case the p-value is calculated from N, the number of times the pair is tested, and C_{1,2}, the number of times c_1 is selected over c_2. In 22 out of the 49 given pairs, one of the configurations is a significantly better category instance or is more typical than the other.
Consider first the somewhat trivial case, similar to the case of 'red' discussed in Section 2, where two entities are both unambiguous cases of a concept but one of the entities is more typical than the other. There is one instance of this in our dataset. For the preposition 'under' the configurations shown in Figures 4 & 5 were both always labelled with the preposition, out of seven tests in the former and ten tests in the latter, but the (bin, table) configuration in Figure 4 is significantly more typical than the (notepad, lamp) configuration in Figure 5. As previously discussed, however, this is an unsurprising result.
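The two significance tests can be sketched as follows. The category test is a one-tailed Fisher's exact test computed directly from the hypergeometric distribution. For the typicality case the sketch assumes a one-tailed binomial test with success probability 0.5 under the null hypothesis; this is an assumption made for illustration and may not match the formula used in the study. The function names and the example counts in the typicality call are hypothetical.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    a, b: times c_1 was / was not labelled with the preposition.
    c, d: times c_2 was / was not labelled with the preposition.
    Returns the probability, with all margins fixed, of a table in which
    c_1 is labelled at least `a` times.
    """
    row1 = a + b          # number of tests of c_1
    col1 = a + c          # total number of positive labels
    n = a + b + c + d     # total number of tests
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


def binomial_one_tailed(n_tests, n_selected, p=0.5):
    """ASSUMED typicality test: probability of c_1 being selected at least
    `n_selected` times in `n_tests` comparisons if both configurations
    were equally likely to be chosen."""
    return sum(
        comb(n_tests, i) * p**i * (1 - p) ** (n_tests - i)
        for i in range(n_selected, n_tests + 1)
    )


ALPHA = 0.10  # significance level used in the study

# Both configurations always labelled (7 of 7 and 10 of 10 tests):
# the category test finds no evidence of a difference.
p_category = fisher_one_tailed(7, 0, 10, 0)

# Hypothetical typicality counts: c_1 selected in 9 of 10 comparisons.
p_typicality = binomial_one_tailed(10, 9)

print(p_category < ALPHA, p_typicality < ALPHA)
```

A pair would count towards the main hypothesis only if the category test favoured one configuration and the typicality test significantly favoured the other.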
Figure 5: The notepad under the lamp
Regarding the main hypothesis of this paper, there are no pairs of configurations in our dataset where one of the configurations is a significantly better category member and the other is significantly more typical. Moreover, in only nine pairs is there any possible disagreement where one of the configurations is more often labelled with the preposition and the other configuration is more often selected in the Typicality Task; in all but one of these cases neither configuration is a significantly better category member or is significantly more typical. We therefore cannot conclude that our hypothesis is correct and it appears that the notions of categorisation and typicality do not significantly vary due to object-specific features: these appear to be both defining and characteristic features.
Clearly we have only tested a small number of features and there is a vast array of salient features for each preposition for which the hypothesis may still be correct. However, our results suggest that object-specific features are salient in both categorisation and typicality judgements, and in some cases the object-specific features appear to have a stronger influence than the physical relationships. Interestingly, there is some tentative evidence that this extends in general to the geometric counterparts, which suggests that these prepositions are not purely spatial and supports findings in (Dobnik and Ghanimifard, 2020).

Importance of Object-Specific Features
In the following we provide some examples from our dataset which highlight the importance of object-specific features.
On: For the preposition 'on', the (mug, pencil) configuration in Figure 6 is both a significantly better category member and is significantly more typical than the (pear, bowl) configuration in Figure 7. Regarding the physical relationships, (pear, bowl) appears to be a better example of 'on' than (mug, pencil). If we consider the usual salient features for 'on':
• The pear is fully supported by the bowl, while the mug is leaning on both the pencil and the table
• There is a high degree of contact between the pear and the bowl compared to the mug and pencil
• The entirety of the pear is above some part of the bowl, while the bottom of the mug is level with the bottom of the pencil
We therefore believe this result is not due to the physical relationships of (mug, pencil) better representing 'on' than (pear, bowl) and that this result arises primarily because 'on' is generally not used for containers. One may have expected this result if the objects in the experiments were named: 'on the bowl' sounds strange while 'on the pencil' seems more plausible. It is therefore even more surprising given that the objects were not named in a way that influences the decisions.
In/Inside: For the prepositions 'in' and 'inside', the (pear, bowl) configuration in Figure 8 was more likely to be selected in the Typicality Task than the (cube, shelf) configuration in Figure 9. If we measure containment simply as the degree to which one object is contained in the bounding box or convex hull of another, as is common in grounded semantic models, e.g. (Chang et al., 2014; Platonov and Schubert, 2018), then the cube would be fully contained by the shelf whereas the pear is not even partially contained by the bowl.
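A bounding-box measure of containment of the kind just described can be sketched as below. This is a minimal illustration using axis-aligned boxes, not the implementation of (Chang et al., 2014) or (Platonov and Schubert, 2018); the Box class, the containment function and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its minimum and maximum corners."""
    min_corner: Point
    max_corner: Point

    def volume(self) -> float:
        result = 1.0
        for low, high in zip(self.min_corner, self.max_corner):
            result *= max(0.0, high - low)
        return result


def containment(figure: Box, ground: Box) -> float:
    """Fraction of the figure's box that lies inside the ground's box."""
    overlap = 1.0
    for f_low, f_high, g_low, g_high in zip(
        figure.min_corner, figure.max_corner, ground.min_corner, ground.max_corner
    ):
        # Overlap along one axis; zero if the boxes are disjoint on it.
        overlap *= max(0.0, min(f_high, g_high) - max(f_low, g_low))
    figure_volume = figure.volume()
    return overlap / figure_volume if figure_volume > 0 else 0.0


# Hypothetical coordinates: a cube sitting within the extent of a shelf,
# and a pear resting above the rim of a bowl.
shelf = Box((0.0, 0.0, 0.0), (4.0, 4.0, 4.0))
cube = Box((1.0, 1.0, 1.0), (2.0, 2.0, 2.0))
bowl = Box((0.0, 0.0, 0.0), (2.0, 2.0, 1.0))
pear = Box((0.5, 0.5, 1.2), (1.5, 1.5, 2.2))

print(containment(cube, shelf), containment(pear, bowl))
```

Under such a measure the cube scores maximally for containment and the pear scores zero, so a model using this feature alone would rank the configurations in the opposite order to the participants.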
It therefore appears that the role of the bowl as [...]
Over/Above: For the preposition 'over', the (tap, sink) configuration in Figure 10 is both a significantly better category member and is significantly more typical than the (lid, pan) configuration in Figure 11. The same is true for the preposition 'above', though in this case the results are not significant. Again, the physical relationships of (lid, pan) appear to better capture the geometric meanings of 'over' and 'above' than (tap, sink). There is also some functional interaction between the objects in both cases: lids are used to cover pans and sinks are placed below taps to catch the water. The preference for the (tap, sink) configuration is therefore [...] However, the lid does not appear to be properly fulfilling its functional role, as it is not fully covering the container part of the pan. This may explain the preference for (tap, sink) and further highlight the importance of considering functional interactions based on usual object usages.

Discussion
Regardless of the hypothesis, these results highlight the importance of including object-specific features in semantic models of spatial language. With the possible exception of (Platonov and Schubert, 2018), these types of features are rarely included in semantic models and many systems are developed in block-world type environments, e.g. (Spranger, 2013;Mast et al., 2016;Perera et al., 2018), where these types of features are not needed.
One approach to improving semantic models may be to incorporate information from knowledge bases such as ConceptNet (Speer and Havasi, 2012) or AfNet (Varadarajan and Vincze, 2012). For example, from ConceptNet one can determine that lids are used for covering and that bowls are containers. Another approach is to leverage affordance detection systems, e.g. (Do et al., 2018), which use information from the scene to predict object affordances.
In (Rodrigues et al., 2020) some object-specific features, e.g. whether or not the ground is a type of container, are taken to distinguish separate polysemes (a word is said to exhibit polysemy where the word has multiple related senses; each of these senses is called a polyseme). By leveraging previous work on modelling the polysemy of spatial prepositions (Richard-Bollans et al., 2020b), it may be possible to incorporate object-specific features into a semantic model by using these features to distinguish polysemes. In order to carry this out, further work identifying salient object-specific features for each preposition would be beneficial and a larger dataset is needed which provides more instances of prepositions representing a greater variety of object-specific features.