Pragmatic factors in image description: the case of negations

We provide a qualitative analysis of the descriptions containing negations (no, not, n't, nobody, etc) in the Flickr30K corpus, and a categorization of negation uses. Based on this analysis, we provide a set of requirements that an image description system should have in order to generate negation sentences. As a pilot experiment, we used our categorization to manually annotate sentences containing negations in the Flickr30K corpus, with an agreement score of K=0.67. With this paper, we hope to open up a broader discussion of subjective language in image descriptions.


Introduction
Descriptions of images are typically collected from untrained workers via crowdsourcing platforms, such as Mechanical Turk 1 . The workers are explicitly instructed to describe only what they can see in the image, in an attempt to control content selection (Young et al., 2014;Chen et al., 2015). However, workers are still free to project their world view when writing the descriptions and they make linguistic choices, such as using negation structures (van Miltenburg, 2016).
In this paper we study the use of negations in image descriptions. A negation is a word that communicates that something is not the case. Negations are often used when there is a mismatch between what speakers expect to be the case and what is actually the case (see e.g. (Leech, 1983;Beukeboom et al., 2010)). For example, if Queen Elizabeth of England were to appear in public wearing jeans instead of a dress, (1a) would be acceptable because she is known to wear dresses in public. But if she were to show up wearing a dress, (1b) would be unexpected.
(1) a. Queen Elizabeth isn't wearing a dress b. ?? Queen Elizabeth isn't wearing jeans Thus the correct use of negations often requires background knowledge, or at least some sense of what is expected and what is not.
We focus on two kinds of negations: nonaffixal negations (not, n't, never, no, none, nothing, nobody, nowhere, nor, neither) (Tottie, 1980); and implicit negations in the form of prepositions (without, sans, and minus), and the verbs lack, omit, miss and fail. Horn (1989) calls this second category 'inherent negatives'. Affixal negations (words starting with a-, dis-, un-, non-, unor ending with -less) are beyond the scope of this paper, but we hope to address them in future work.
The main contributions of this paper are an overview of different uses of negations in image description corpora, analysing the background knowledge required to generate negations, and the implications for image description models. 2

Data
We focus on negations on the Flickr30K dataset (Young et al., 2014). The negations were detected by lexical string-matching using regular expressions, except for the verbs. For the verbs, we checked if any of the tokens starts with lack, omit, miss or fail. Our search yielded 896 sentences, of which 892 unique, and 31 false positives. Table 1 shows frequency counts for each negation term.
We carried out the same analysis for the Microsoft COCO dataset (Chen et al., 2015) to see if the proportion of negations is a constant. Our approach yielded 3339 sentences on the training and no 371 nothing 16 neither 2 not 198 lack 9 sans 1 without 141 fail 9 none 1 miss 69 never 5 nobody 1 n't 68 nowhere 3 validation splits, of which 3232 unique. The presence of negations appears to be a linear function of dataset size: 0.56% in the Flickr30K dataset, and 0.54% in the MS COCO dataset. This suggests that the use of negations is not particular to either dataset, but rather it is a robust phenomenon across datasets. Table 2 shows the distribution of descriptions containing negations across images. In the majority of cases only one of the five descriptions contains a negation (86.25% in Flickr30K and 72.05% in MS COCO). Only in very exceptional cases do the five descriptions contain negations. This indicates that the use of negation is a subjective choice.

Negation uses in image descriptions
In this section, we provide a categorization of negation uses and assess the amount of required background knowledge for each use. Our categorization is the result of manually inspecting all the data twice: the first time to develop a taxonomy, and the second time to apply this taxonomy to all 892 sentences. Note that our categorization is meant as a practical guide to be of use for natural language generation. There is already a unifying explanation for why people use negations (unexpectedness, see (Leech, 1983;Beukeboom et al., 2010)). The question here is how people use negations, what they negate, and what kind of knowledge is required to produce those negations. Salient absence: The first use of negation is to indicate that something is absent: (2) a. A man without a shirt playing tennis. b. A woman at graduation without a cap on.
Shirts and shoes are most commonly mentioned as being absent in the Flickr30K dataset. From examples like (2a) speaks the norm that people are supposed to be fully dressed. These examples seem common enough for a machine to learn the association between exposed chests and the phrase without a shirt. But there are also more difficult cases, such as (2b). To describe an image like this, one should know that students (in the USA) typically wear caps at their graduation. This example shows the importance of background knowledge for the full description of an image.

Example 2a (Image 2883099128)
Negation of action/behavior: The second category is the use of negation to deny that an action or some kind of behavior is occurring: (3) a. A kid eating out of a plate without using his hands. b. A woman in the picture has fallen down and no one is stopping to help her up.
Examples like these require an understanding of what is likely or supposed to happen, or how people are expected to behave.

Example 3a (Image 39397486)
Negation of property: The next use of negation is to note that an entity in the image lacks a property. In (4a), the negation does two things: it highlights that the buildings are not finished, but in its combination with yet suggests that they will be finished.
(4) a. A man wearing a hard hat stands in front of buildings not yet finished being built. b. There are four boys playing soccer, but not all of them are on the same team [. . . ].
In (4b), the negated phrase also performs two roles: it communicates that there are (at least) two teams, and it denies that the four boys are all in the same team. For both examples, the negated parts (being finished and being on the same team) are properties associated with the concepts of BUILD-ING and PLAYING TOGETHER, and could reasonably be expected to be true of buildings and groups of boys playing soccer. The negations ensure that these expectations are cancelled.
Example 4a (Image 261883591) Example (5) shows a completely different effect of negating a property. Here, the negation is used to compare the depicted situation with a particular reference point. The implication here is that the picture is not taken in the USA. Example 6a (Image 2313609814) Outside the frame: The most image-specific use of negation is to note that particular entities are not depicted or out of focus: (7) a. A woman is taking a picture of something not in the shot with her phone. b. Several people sitting in front of a building taking pictures of a landmark not seen.
The use of negation in this category requires an understanding of the events taking place in the image, and what entities might be involved in such events. (7b) is a particularly interesting case, where the annotator specifically says that there is a landmark outside the frame. This raises the question: how does she know and how could a computer algorithm recognise this?
Example 7a (Image 4895028664) (Preventing) future events: The sixth use of negation concerns future events, generally with people preventing something from happening. Here are two examples: (8) a. A man is riding a bucking horse trying to hold on and not get thrown off. b. A girl tries holding onto a vine so she won't fall into the water.
What is interesting about these sentences is that the ability to produce them does not only require an understanding of the depicted situation (someone is holding on to a horse/vine), but also of the possibilities within that situation (they may or may not fall off/into the water), depending on the actions taken.

Example 8a (Image 263428541)
Quotes and Idioms: Some instances of negations are mentions rather than uses as shown in (9).
(9) A girl with a tattoo on her wrist that reads "no regrets" has her hand outstretched.
Other times, the use of a negation isn't concerned with the image as much as it is with the English language. The examples in (10) illustrate this idiomatic or conventional use of negation.
b. Three young boys are engaged in a game of don't drop the melon.

Example 10a (Image 4870785283)
Other: Several sentences do not fit in any of the above categories, but there aren't enough similar examples to merit a category of their own. Two examples are given below. In (11), the negation is used to convey that it is atypical to be holding an umbrella when it is not raining.
(11) The little boy [. . . ] is smiling under the blue umbrella even though it is not raining.
Example 11 (Image 371522748) In (12), the annotator recognized the intention of the toddler, and is using the negation to contrast the goals with the ability of the toddler. Though there are many other sentences where the negation is used to contrast two parts of the sentence (see Section 4), there is just one example where an ability is negated.
(12) A little toddler trying to look through a scope but can't reach it.
We expect have no doubt that there are still other kinds of examples in the Flickr30K and the MS COCO datasets. Future research should assess the degree to which the current taxonomy is sufficient to systematically study the production of negations in image descriptions.

Annotating the Flickr30K corpus
Two of the authors annotated the Flickr30K corpus using the categories listed above with two goals: to validate the categories, and to develop annotation guidelines for future work. By going through all sentences with negations, we were able to identify borderline cases that could serve as examples in the final guidelines.
Using the categories defined in Section 3, we achieved an inter-annotator agreement of Cohen's κ=0.67, with an agreement of 77%. We then looked at sentences with disagreement, and settled on categories for those sentences. Table 3 shows the final counts for each category, including a Meta-category for cases like I don't see a picture, commenting on the original annotation task, or on the images without describing them.  Such examples show how negations can be used to structure an image. Sometimes this leads to a scalar implicature (Horn, 1972), like in (14).
(14) Three teenagers, two without shoes having a water gun fight with various types of guns trying to spray each other. ⇒ One teenager is wearing shoes.
A striking observation is that many negations pertain to pieces of clothing; for example: 282 (32%) of the negations are about people being shirtless, while 59 (7%) are about people not wearing shoes. It is unclear whether this is due to selection bias, or whether the world just contains many shirtless people. But we expect that this distribution will make it difficult for systems to learn how to use negations that aren't clothing-related.

Discussion
The negations used by crowdworkers are likely to have required some form of "world knowledge". We now discuss potential sources of evidence for recognising a candidate for negation in the description of an image: (a) The Outside the frame category requires an understanding of human gaze within an image, which is a challenging problem in computer vision (Valenti et al., 2012). Additionally, we also need to understand the differences between scene types, both from a computational- (Oliva and Torralba, 2001) and a human perspective (Torralba et al., 2006). (b) The Salient absence category provides evidence for two kinds of expectations that play a role in the use of negations: general expectations (people are supposed to wear shirts, cf. 2a) and situation-specific expectations (students at graduation ceremonies typically wear caps, cf. 2b). (c) Finally, the Negation of action/behavior category requires action recognition, which is a challenging problem in still images (Poppe, 2010). The ability to automatically recognise what people are doing in an image, and how this contrasts with what they would typically do in similar images, would greatly help with generating this use of negation.
From a linguistic perspective, background knowledge could be represented by frames (Fillmore, 1976) and scripts (Schank and Abelson, 1977). There are some hand-crafted resources that contain this kind of knowledge, e.g. FrameNet (Baker et al., 1998), but they only have limited coverage. Recent work has shown, however, that it is possible to automatically learn frames (Pennacchiotti et al., 2008) and script knowledge (Chambers and Jurafsky, 2009) from text corpora. Fast et al. (2016) show how such knowledge, as well as knowledge about object affordances (Gibson, 1977), can be used to reason about visual scenes.

Conclusion
We studied the use of negations in the Flickr30K dataset. The use of negations imply that the descriptions contain a combination of objective and subjective interpretations of the images. But negations are only one type of subjective language in image description datasets. We expect that different subjective language use (e.g. discourse markers such as yet or even though) can be observed with relative ease in this and other datasets. Additionally it would be interesting to study the use of negations in different languages, such as the German-English Multi30K dataset (Elliott et al., 2016). We encourage further research to discover other types of subjective language in vision and language datasets, and studies of how subjective language may affect language generation.