Talking about other people: an endless range of possibilities

Image description datasets, such as Flickr30K and MS COCO, show a high degree of variation in the ways that crowd-workers talk about the world. Although this gives us a rich and diverse collection of data to work with, it also introduces uncertainty about how the world should be described. This paper shows the extent of this uncertainty in the PEOPLE-domain. We present a taxonomy of different ways to talk about other people. This taxonomy serves as a reference point to think about how other people should be described, and can be used to classify and compute statistics about labels applied to people.


Introduction
There are currently two major datasets used to train and evaluate automatic description systems: Flickr30K and MS COCO (Young et al., 2014; Lin et al., 2014). Both datasets contain images with multiple crowd-sourced descriptions per image. They are typically used to train data-driven natural language generation systems to associate visual features with natural language descriptions (Bernardi et al., 2016). Following the training phase, image description systems are evaluated by comparing their output with human-generated descriptions for the same image (using textual similarity metrics such as BLEU or METEOR; Papineni et al. 2002; Denkowski and Lavie 2014). The standard for what the image descriptions should look like is implicit in the corpus. The only point at which any explicit guidelines are provided is during the crowd-sourcing task, where annotators are given general instructions about what their description should look like.
Here are the Flickr30K instructions (the MS COCO instructions are similar):
1. Describe the image in one complete but simple sentence.
2. Provide an explicit description of prominent entities.
3. Do not make unfounded assumptions about what is occurring.
4. Only talk about entities that appear in the image.
5. Provide an accurate description of the activities, people, animals and objects you see depicted in the image.
6. Each description must be a single sentence under 100 characters.
(Hodosh et al., 2013, edited for brevity)
These guidelines leave much of the task open for interpretation by the annotator. For example, it is unclear how the descriptions will be used, or what the target audience is (as pointed out by van Miltenburg et al. 2017). Thus, the underspecified nature of the task invites variation and creativity. It is important for us to understand the extent of this variation because image description corpora currently set the standard for what an image description should look like.
Earlier work has looked at stereotyping behavior, reporting bias, and the use of negations in image descriptions (van Miltenburg, 2016; Misra et al., 2016; van Miltenburg et al., 2017), and recently van Miltenburg et al. (2018) provided an overview of measures to quantify diversity. This paper looks at the variation in the labels used to refer to other people, and presents a taxonomy (based on the Flickr30K dataset) that shows the range of properties that crowd-workers consider in the description process. This taxonomy ranges from physical attributes, such as hair color, to attributes concerning socio-economic status (e.g. unemployed).
After discussing related work (§2), we present our method to select person-labels and to categorize (partial) labels into semantic categories (§3). Following this, Section 4 shows the resulting taxonomy, with examples from the Flickr30K dataset. Section 5 discusses our taxonomy in light of the recently published Face2Text dataset (Gatt et al., 2018), and considers the reliability of perceived attributes. We believe these contributions will be useful for practitioners interested in the generation of person-descriptions. Our code and data are publicly available online.

Related work
Natural Language Generation researchers tasked with describing other people have mostly been concerned with generating referring expressions without visual context, usually for well-known entities (e.g. Castro Ferreira et al. 2016; Kutlak et al. 2016). The closest related work comes from Gatt et al. (2018) and Van Miltenburg (2016). Gatt et al. (2018) present a dataset of images of human faces, with multiple elicited descriptions per image. They annotated a part of their dataset to estimate how many of the descriptions refer to physical (85%), emotional (44%), or inferred (46%) properties of the subjects depicted in the images. In contrast, the present paper presents a more precise taxonomy, and discusses Gatt et al.'s Face2Text dataset in light of this taxonomy.
Van Miltenburg (2016) used the Flickr30K-Entities dataset (Plummer et al., 2015) to cluster entity mentions based on their co-reference to the same entities. We refer to these mentions as labels. Their clustering approach yields hundreds of groups of labels referring to similar entities. Here is one of those clusters, relating to FACIAL HAIR:
beard, goatee, beard and mustache, gray beard, black beard, white beard, red beard, braided beard, gray braided beard, long, white beard, long brown beard, flaming red beard, big beard, short beard, bubble beard, large white beard, thick beard, neatly trimmed beard, scruffy beard, red facial hair
Van Miltenburg (2016) uses this example as anecdotal evidence for the richness of image description data, without further analysis. Our paper aims to provide a deeper analysis of the labels used to refer to other people, by manually categorizing the labels into semantically coherent sub-groups. For example, if we look more closely at the FACIAL HAIR cluster, we can see that these terms include references to the KIND OF HAIR (beard, goatee, mustache), the COLOR (gray, black, white), LENGTH (long, short), SIZE (big, large), ORDERLINESS (neatly trimmed, scruffy), and PRESENTATION (braided). This means that, when asked to talk about an image, people consider at least six different variables just to describe facial hair.

Method
We created a taxonomy of the labels used to refer to other people by manually sorting the entity labels into different semantic categories, instead of clustering the labels (as in Van Miltenburg 2016). The advantage of manually sorting the labels is that we have full control over the categories. This makes it possible to make more fine-grained distinctions, and to show the breadth of the label distribution. In this paper, we use the English Flickr30K corpus, focusing on the different ways that crowd-workers describe other people.

Initial selection
The starting point for our categorization is a list of labels. We compiled this list using the Flickr30K-Entities annotations provided by Plummer et al. (2015), and listed all labels that were classed as PEOPLE. After normalization, we found 19,634 unique labels, which is too many to categorize by hand. (It is not possible to crowd-source our categorization task, because the categories are not known beforehand.) Hence we focus our efforts only on the 5,526 labels that end with any of the nouns girl, boy, woman, man, female, male, or any of their plural forms. This makes the task more manageable, but it also potentially reduces the variation in the data because the selected labels are more homogeneous. Nevertheless, as we will see in Section 4, we still found a broad range of variation in the labels.
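The selection step can be illustrated with a minimal sketch. This is not our released code: the normalization shown (lowercasing and collapsing whitespace) is an assumption, and the function names are hypothetical. Only the list of head nouns comes from the description above.

```python
# Head nouns named in the text; plural forms are written out explicitly.
HEAD_NOUNS = {
    "girl", "girls", "boy", "boys", "woman", "women", "man", "men",
    "female", "females", "male", "males",
}


def normalize(label):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(label.lower().split())


def select_labels(labels):
    """Return the sorted unique normalized labels whose last token is a head noun."""
    unique = {normalize(label) for label in labels}
    return sorted(lab for lab in unique if lab and lab.split()[-1] in HEAD_NOUNS)


# Toy input, invented for illustration.
example = ["A young man", "a  young man", "two women", "the crowd"]
print(select_labels(example))
```

On the toy input, the two spellings of "a young man" collapse into one label, and "the crowd" is dropped because it does not end in one of the selected nouns.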
During the categorization task, we found several typing errors, and words unrelated to people-labeling. We addressed these issues by semi-automatically correcting the typing errors, and creating a list of stopwords that were automatically removed from the labels. This further reduced the number of unique labels to be categorized from 5,526 to 3,401.
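To see why removing stopwords shrinks the set of unique labels, consider the following sketch. The stopword list and the toy labels are hypothetical and chosen only for illustration; the phrase "a picture of" is the one example of an unrelated phrase that we mention in the sorting procedure.

```python
import re

# Hypothetical task-specific stopwords; the actual list is not reproduced here.
STOPWORDS = {"a", "an", "the"}
# Unrelated phrases to strip; "a picture of" is the example given in the paper.
UNRELATED_PHRASES = ("a picture of",)


def strip_label(label):
    """Remove unrelated phrases and stopwords from a lowercase label."""
    for phrase in UNRELATED_PHRASES:
        # Word boundaries prevent matches inside longer words.
        label = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", label)
    return " ".join(tok for tok in label.split() if tok not in STOPWORDS)


# Toy labels, invented for illustration.
labels = ["a picture of a young man", "the young man", "young man", "a tall woman"]
cleaned = {strip_label(label) for label in labels}
print(len(labels), "->", len(cleaned))
```

Three of the four toy labels reduce to the same string, so the set of unique labels becomes smaller without any information about the person being lost.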

Sorting procedure
We manually sorted (partial) labels into semantic categories, shown in Table 1. Nothing crucially hinges on these specific categories, but from our experience with image description datasets, we believe they provide a good first approximation, capturing the breadth of the labels used by the crowd. Our sorting procedure works as follows.
1. Start with a set of labels to be categorized.
2. Remove task-specific stopwords and unrelated phrases (e.g. a picture of) from the labels. This reduces the number of unique labels.
3. Select (partial) labels from the list, add them to an existing category file, or create a new category file with those labels.
4. Match the labels with the categories. We use a context-free grammar (CFG, see Figure 1; implemented using the NLTK, Bird et al. 2009) because each label may consist of multiple modifiers from different categories. For example: African-American young man has both ETHNICITY and AGE modifiers.
5. Remove matches from the set of labels to be categorized.
6. Either stop categorization, or go to 3.
Our goal is to get an overview of the different kinds of labels used by the crowd-workers, not to achieve a perfect categorization of all labels. Thus, our stopping criterion is as follows. The sorting task is finished whenever there are no more examples matching existing categories, or warranting new categories. New categories are warranted if there are multiple (partial) labels that clearly fall under the same umbrella, but do not fit into any of the existing categories.
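Steps 4 and 5 can be sketched as follows. Note that this is a simplified stand-in and not the NLTK grammar itself: it assumes every label has the shape MODIFIER* HEAD and replaces CFG parsing with a greedy longest-match lookup in the category lexicons. The category names ETHNICITY and AGE come from the example above; the lexicon entries, the head-noun list, and all function names are hypothetical.

```python
# Toy category lexicons (entries invented for illustration).
CATEGORIES = {
    "ETHNICITY": {"african-american", "asian"},
    "AGE": {"young", "old", "middle aged"},
}
HEADS = {"man", "woman", "boy", "girl"}

# Invert the lexicons: tuple of tokens -> category name.
PHRASE_TO_CATEGORY = {
    tuple(phrase.split()): category
    for category, phrases in CATEGORIES.items()
    for phrase in phrases
}
MAX_LEN = max(len(phrase) for phrase in PHRASE_TO_CATEGORY)


def match_label(label):
    """Return [(phrase, category), ...] if the label is fully covered, else None."""
    tokens = label.lower().split()
    if not tokens or tokens[-1] not in HEADS:
        return None
    modifiers, head = tokens[:-1], tokens[-1]
    matches, i = [], 0
    while i < len(modifiers):
        # Try the longest candidate phrase first, so that multi-word
        # modifiers such as "middle aged" are matched as one unit.
        for n in range(min(MAX_LEN, len(modifiers) - i), 0, -1):
            phrase = tuple(modifiers[i:i + n])
            if phrase in PHRASE_TO_CATEGORY:
                matches.append((" ".join(phrase), PHRASE_TO_CATEGORY[phrase]))
                i += n
                break
        else:
            return None  # a modifier that no category covers yet
    matches.append((head, "HEAD"))
    return matches


def remaining(labels):
    """Step 5: keep only the labels that could not be fully matched."""
    return [label for label in labels if match_label(label) is None]


print(match_label("African-American young man"))
print(remaining(["middle aged woman", "birthday girl"]))
```

Labels returned by remaining() are the ones that go back into step 3, where they either extend an existing category or motivate a new one. A full CFG is more flexible than this lookup, since it can also express constraints on the order and combination of modifiers.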

Results
We sorted the (partial) labels into 20 different categories, until we were left with only 341 labels (10%) that could not be fully matched with our categories by the CFG matcher. Examples of uncategorized labels are birthday girl and blood pressure of a man. The former could be classed as a role associated with an event, but we did not find many such examples. The latter is an artifact of the automated label categorization process for the Flickr30K-Entities dataset. Table 1 shows the 20 different label categories, with examples for each category. With this table, we have an empirically derived taxonomy that provides an overview of the choices that crowd-workers make in order to describe other people. The different categories show the diversity and breadth of the label distribution. In future work, we hope to extend the coverage of our taxonomy (ideally to all 19,634 person-labels in Flickr30K-Entities), and present statistics about the proportion of person-labels from the Flickr30K dataset that fall into each category.
Our taxonomy also provides a reference point to think about the characteristics that we would and would not like image description systems to describe. For example, the automatic description of features like RELIGION, WEIGHT, or SOCIAL GROUP would probably do more harm than good. Table 1 also shows us what makes image description difficult. For this domain alone, to produce human-like descriptions, systems need to be able to predict 20 different kinds of features, and decide which feature values are relevant to mention. A further complication is that even after deciding which characteristics to describe, there are still within-category choices to be made. For example, when describing a game of basketball, one might choose to talk about a man playing basketball (seeing basketball-playing as a transient property), or a male basketball player (seeing basketball-playing as an inherent property). These choices go beyond the scope of this paper, but see Beukeboom (2014) and Fokkens et al. (2018) for a discussion.

Extending the taxonomy to Face2Text
We obtained the Face2Text corpus (Gatt et al., 2018, v0.1) from the authors to see to what extent our taxonomy could be applied to their data. The main difference between the Flickr30K-Entities labels and the Face2Text descriptions is that the former are part of a larger description, whereas the latter are full-blown descriptions themselves. As a result, the Face2Text descriptions are much longer (a mean of 26.9 tokens versus 2.4 for the Flickr30K-Entities labels). This leads to crowd-workers providing much more (and seemingly more specific) information about the people in the images. For example, there are 24 occurrences of 'jaw' in Face2Text, with modifiers such as angular, pointy, and traditional square to denote the specific shape of the jaw. Such details do not seem relevant enough to mention in a short label, as in the Flickr30K-Entities dataset.
In future work, we hope to extend our taxonomy to cover the Face2Text data. This would make users more aware of the contents of the corpus, and enable them to make a conscious choice about the kinds of features they would like their face description systems to generate.

Consistency is no substitute for truth
In earlier research, Song et al. (2017) present a system that is able to predict (with varying degrees of success) perceived social attributes from faces. Human participants rated faces from a large database for their attractiveness, friendliness, and familiarity, but also for the extent to which they thought the subjects were egotistical, emotionally stable, or responsible. It is important to stress that these ratings only indicate perceived characteristics, and do not necessarily reflect the actual characters of the individuals in the dataset. More generally, even though people may be able to consistently ascribe a particular property to an individual, this alone does not entail that the property actually applies (see Todorov et al. 2013; Agüera y Arcas et al. 2017 for a discussion). When considering different ways to label other people, we should ask ourselves: is it reasonable to predict this label category based on visual information alone?

Limitations
The approach taken in this paper has three main limitations, which we will discuss in turn.
First, our taxonomy is based on a subset of the person-labels in the Flickr30K-Entities dataset, and thus may overlook other relevant label categories.
We emphasize that our work is not meant to provide an exhaustive categorization of the labels used in the Flickr30K data. Rather, our goal is to highlight the breadth of the label distribution. The fact that the broad taxonomy developed in this paper is based on a subset of all the labels (less than a third of the Flickr30K data) only supports the main point of this paper, which is that humans use a wide array of terms to refer to other people.
Second, our taxonomy is constructed manually, and it is unclear whether replication would yield similar results. This is a natural result of a manual categorization of the person labels, and it would be interesting to see if we could automatically induce a similar taxonomy from the corpus data (for example using LDA; Blei et al. 2003). To facilitate future research in this area, we made all our code and data available online.
Finally, our taxonomy is exclusively based on English, without any input from other languages. It may be the case that speakers of other languages highlight other features when making reference to other people. This idea opens up another avenue of research, asking two related questions:
1. Do speakers of the same language tend to mention the same person-attributes for the same images?
2. Are there any cross-linguistic differences in what features are mentioned in reference to other people?
Although some work has mentioned cross-linguistic differences in how annotators refer to other people (e.g. Li et al. 2016; van Miltenburg et al. 2017), we are not aware of any systematic study that specifically looks at how speakers of different languages make reference to other people, and what features they tend to mention.

Conclusion
We have looked at the variation in the ways crowd-workers talk about other people in the Flickr30K dataset. Our main result is that this variation covers a wide range of variables, from appearance to socio-economic status. We formalized this variation in a taxonomy of person-labels, which should help us reflect on the image description task, and the kinds of descriptions that image description systems should produce. Future research should be aware that, even though crowd-workers may systematically produce particular labels, this does not mean that those labels are true. We encourage the development of standards and guidelines that tell us which kinds of labels to use in which kinds of situations. Such guidelines may benefit system evaluation and help us avoid the inappropriate labeling of other people.