Extracting Possessions from Social Media: Images Complement Language

This paper describes a new dataset and experiments to determine whether authors of tweets possess the objects they tweet about. We work with 5,000 tweets and show that both humans and neural networks benefit from images in addition to text. We also introduce a simple yet effective strategy to incorporate visual information into any neural network that goes beyond incorporating weights from pretrained networks. Specifically, we consider the tags identified in an image as an additional textual input, and leverage pretrained word embeddings as is usually done with regular text. Experimental results show this novel strategy is beneficial.


Introduction
Social media are platforms for sharing information online. Social media posts and online behavior in general (e.g., Facebook likes, following other users on Twitter) have been shown to predict human traits (Burke et al., 2010; Schwartz et al., 2013). Many social media posts include an image alongside text, and the percentage keeps growing because doing so boosts user engagement (Patel, 2016).
Pictures and text in social media usually complement each other. Thus, even if the information of interest can be understood from either communication modality, considering both is beneficial. Consider the tweet in Figure 1. The text indicates that Arnold (the author of the tweet) goes on bike rides when he travels. The image shows him riding a bike, indicating that he was riding a bike when he tweeted and was thus in possession of a bike. On the other hand, if the picture were a screenshot of his Twitter posting statistics, Arnold most likely would not be in possession of a bike when tweeting, but rather sharing a log of his previous trips with his followers.

In this paper, we extract possession relations from social media posts containing both text and images. Possession is an asymmetric semantic relation between two entities, where one entity (the possessee) belongs to the other entity (the possessor) (Stassen, 2009). Following the literature, we consider not only ownership but also control possessions. In control possessions, the possessor has temporary control of the possessee but not necessarily ownership (Tham, 2004), e.g., Bill borrowed the ozone generator from John.
While we do not explore any, extracting possessions has many potential applications. For example, possessions could help reveal hobbies and find people with similar interests. Possessions could also improve recommender systems. For example, people without cars are unlikely to be interested in oil changes and auto mechanics. Similarly, people who recently purchased a home may be interested in moving and remodeling services. Extracting possessions could also be useful to identify skills. For example, people who possess a bike are likely to be able to ride bikes, and those who have control possession of an 18-wheeler are typically able to drive large trucks.

The main contributions of this paper are: (a) a corpus of 5,000 tweets (text and images) annotated with possession relations including type (alienable or control), temporal anchors with respect to the tweet timestamp, and interest in something (available at dhivyachinnappa.com), (b) detailed corpus analysis showing, among others, that humans understand more possession relations when they have access to both the text and images, and (c) experimental results showing that the task can be automated and that features extracted from the images improve results. Regarding visual features, we show that incorporating weights from pretrained networks (a common practice in previous work) is beneficial, but we obtain more substantial improvements by incorporating the objects and events identified in an image as an additional textual input and leveraging word embeddings.

Previous Work
Possession relations have primarily been studied in efforts targeting large relation repositories between arguments connected with some lexicosyntactic pattern. Tratz and Hovy (2013) work with 17 semantic relations realized by possessive constructions, Badulescu and Moldovan (2009) with 36 relations realized by genitives, and Nakov and Hearst (2013) and Tratz and Hovy (2010) target noun compounds. Blodgett and Schneider (2018) annotate 50 supersenses (including roles and relations between entities) for possessives. These efforts extract possessions from text, and target possessors and possessees connected by specific patterns. Unlike them, we extract possessions using both text and images. In addition to possession existence, we also extract types, temporal anchors, and interest in the possessee.
Two recent efforts target possession relation extraction from text without strict syntactic constraints. In our previous work, we extract intrasentential possessions from OntoNotes (Chinnappa and Blanco, 2018). In the work described here, we use the list of synsets from our previous work to select possessees (Section 3). Banea and Mihalcea (2018) work with blogs and annotate possession existence at the time of utterance. Unlike these previous works, we (a) leverage both text and images, (b) work with informal tweets (instead of standard English), (c) temporally anchor possessions before, during and after the tweet timestamp, and (d) also extract whether somebody has an interest in a concrete object regardless of possession existence.

Using multiple modalities (e.g., text and images) to better solve some task is not new. Among many others, Specia et al. (2016) propose multimodal machine translation, and Moon et al. (2018) show that named entity recognition benefits from taking into account both text and images. Our innovation is twofold. First, we show that humans understand more possession information when they have access to the image accompanying a text, as opposed to only reporting improvements on (automatically) solving some task. Second, our neural image component includes two subcomponents. The first one (weights from InceptionNet) is common in previous work, but the second one is novel. Specifically, the second component considers the objects and events identified in an image as an additional textual input. This allows us to leverage pretrained word embeddings and recurrent neural networks, a strategy that we show to be beneficial.

A Corpus of Possession Relations
We start with a collection of English tweets consisting of text and images (Hu et al., 2018). First, we discard tweets that do not contain I, me, my, or mine in order to maximize the number of tweets published by individuals and avoid tweets by organizations as well as advertisements. Second, we select as potential possessors the authors of tweets, and as potential possessees the nouns subsumed by the WordNet synsets (Miller, 1995) proposed in previous work (Section 2), except the following nouns: fan, filter, launch, mini, release and safe. We eliminate them because they almost never yield possession relations in social media. For example, fan almost always refers to a person (e.g., This Bucks fan put on a show) instead of an "apparatus with rotating blades," and filter almost always refers to a photo effect (e.g., Bare face plus a snap filter) instead of a "porous device for removing impurities." Finally, we randomly select 5,000 possessor-possessee pairs.
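The filtering steps above can be sketched as follows. This is a minimal sketch, not the authors' pipeline: the regular expression stands in for whatever tokenization was actually used, and the excluded-noun check is the only part taken directly from the text.

```python
import re

# First-person tokens that signal a tweet authored by an individual
FIRST_PERSON = re.compile(r"\b(i|me|my|mine)\b", re.IGNORECASE)

# Nouns excluded because they almost never yield possessions in social media
EXCLUDED_NOUNS = {"fan", "filter", "launch", "mini", "release", "safe"}

def keep_tweet(text):
    """Keep tweets that contain I, me, my, or mine."""
    return bool(FIRST_PERSON.search(text))

def candidate_possessees(nouns):
    """Drop excluded nouns; the remaining nouns are potential possessees."""
    return [n for n in nouns if n.lower() not in EXCLUDED_NOUNS]
```

In the actual corpus construction, the nouns passed to `candidate_possessees` would additionally be restricted to those subsumed by the WordNet synsets from previous work.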

Annotation Tasks And Guidelines
In addition to possession existence (i.e., whether the potential possessor possesses the potential possessee), we also annotate possession type (alienable or control), temporal anchors with respect to the tweet timestamp (before, during and after), and whether the potential possessor has an interest in the potential possessee regardless of possession existence and possession type.

Possession Existence. The first annotation task is to determine whether the potential possessor (x) possesses the potential possessee (y). Annotators choose between the following labels:
• yes if a possession exists (i.e., x possesses y) at some point of time;
• never if a possession does not exist (i.e., x does not possess y) at any point of time; or
• unk (unknown) if it is sound to ask whether x possesses y, but there is not enough information to choose yes or never.

Possession Type. If a possession relation exists (existence: yes), annotators also indicate the type:
• alienable: if x can be separated from y and x is the owner of y, regardless of spatial proximity or other variables; or
• control: if x can be separated from y and x has control over y, regardless of ownership, spatial proximity or other variables.
Note that according to these definitions, control possessions, unlike alienable possessions, do not require ownership. For example, people driving a rental car have control possession of the car but not alienable possession. Control possession and alienable possession are mutually exclusive labels. We study possession types (alienable and control) to understand the strength of the possession relation between the possessor and the possessee. We do not consider inalienable possessions because they are uncommon in social media.

Temporal Anchors. If a possession relation exists (existence: yes), annotators also indicate when it is true with respect to the tweet timestamp:
• before yes or no: whether x possesses y the day before tweeting or earlier;
• during yes or no: whether x possesses y the day he tweeted; and
• after yes or no: whether x possesses y the day after tweeting or later.

Interest in the Possessee. Finally, annotators also indicate whether x has an interest in y (interest yes or interest no), regardless of the labels for possession existence and type. Interest does not entail past, current or future possession existence. It indicates that x shows curiosity or excitement about y. Let us consider John Doe and a tweet about eating more vegetables (with a fork) because of a doctor's recommendation. In this context, John would have possession of the fork but no interest in it.
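The dependencies between the annotation tasks (type and temporal anchors are only defined when a possession exists; interest is always annotated) can be made explicit with a small validation function. The record layout and field names below are hypothetical, chosen only to illustrate the scheme.

```python
EXISTENCE_LABELS = {"yes", "never", "unk"}
TYPE_LABELS = {"alienable", "control"}
ANCHORS = ("before", "during", "after")

def validate_annotation(ann):
    """Check one possessor-possessee annotation against the scheme."""
    if ann.get("existence") not in EXISTENCE_LABELS:
        return False
    if ann["existence"] == "yes":
        # Type and temporal anchors are only defined for existing possessions
        if ann.get("type") not in TYPE_LABELS:
            return False
        if not all(isinstance(ann.get(a), bool) for a in ANCHORS):
            return False
    # Interest is annotated regardless of existence and type
    return isinstance(ann.get("interest"), bool)
```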

Annotation Set-Up and Quality
Annotations were done in-house by two graduate students. We developed a simple interface that showed one tweet at a time, and annotators were instructed to use world knowledge and common sense. In order to minimize biases, the interface did not show the Twitter handle, profile picture or any other user information. In addition, we sampled 200 tweets and found that in only 7% of them did the image possibly provide information about the tweet author (e.g., gender, age, race, ethnicity).
In a first round of annotations, annotators had access only to the text in the tweet. In a second round, they had access to both the text and the image. Our rationale is that we are interested in comparing human judgments depending on whether the image is available or not (Section 4.1).

Inter-Annotator Agreement. Table 1 shows inter-annotator agreements (observed and Cohen's κ) when annotators have access to (a) only the text and (b) the text and image. Agreements are very similar regardless of whether the image is available, and as we shall see (Section 4.1), having access to the image results in more possession information. Cohen's κ for possession existence and type (first row) is 0.82 with text and images, it ranges from 0.78 to 0.81 for temporal anchors (rows 2-4), and it is 0.78 for interest in the possessee (row 5). κ values between 0.60 and 0.80 are substantial, and above 0.80 nearly perfect (Artstein and Poesio, 2008).
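The κ values above follow the standard definition: observed agreement corrected for the agreement expected by chance given each annotator's label frequencies. A self-contained sketch of the computation:

```python
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    # Agreement expected if both annotators labeled independently at random,
    # each with their own observed label frequencies
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # degenerate case: both annotators always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```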

Corpus Analysis
We start by describing the final corpus, which was annotated with both text and images. Then, we compare the corpora obtained when annotators have access to (a) only the text and (b) the text and image.
Label Distributions. Table 2 shows the label distributions for the three annotation tasks: possession existence, possession type (yes: alienable or control) and interest in the possessee. Overall, the percentage of the unknown label (unk) is low (15.5%), indicating that possession existence can almost always be determined. More importantly, the procedure described in Section 3 allows us to reveal useful knowledge in 84.5% of all generated possessor-possessee pairs (alienable, control and never). Most possessions are alienable (83.6%; 38.4% of all possessor-possessee pairs) and the percentage of control possessions is low. The never label is somewhat common (38.6%), indicating that people often tweet about objects that they have never possessed. Regarding interest, the possessor has an interest in the possessee in 40.2% of the 5,000 generated pairs.

Figure 2 provides insights regarding possession existence and interest for the generated pairs. First, the possessor has an interest in the possessee in (a) 35% of pairs for which possession could not be determined (unk) and (b) 15% of pairs for which no possession exists (never). Second, regardless of possession type, the percentage of interest yes remains at approximately 60%.
The distributions of temporal anchor labels (before, during and after) per possession type (Table 3) show that possession type substantially influences when the possession is true with respect to the tweet timestamp. Regarding alienable possessions, people tweet more about what they own or will own in the future than about what they owned in the past (95.9% and 90.5% vs. 79.5%). Control possessions show a roughly uniform distribution for the before anchor, and are unlikely to be true the day after tweeting (29.5%).
Finally, Figure 3 presents the distribution of possession existence, possession type and interest in the possessee depending on the WordNet synset of the possessee. Regarding possession existence and possession type labels (left columns), we observe similar distributions across synsets, although devices (e.g., watch, guitar, cell phone) yield more possessions (alienable and control labels), and most of those possessions are alienable. In other words, people tend to tweet about devices they own, and rarely about devices they only have control over. Regarding interest in the possessee (right columns), people are slightly more likely to have an interest if the possessee is a device.
Annotation Examples. We present annotation examples in Table 4. Note that unlike in these examples, the annotation interface did not show the Twitter handle and profile picture in an effort to minimize potential biases (Section 3.2).
In Example (a), annotators understood that the author of the tweet was a competitive biker (bike stunt, racing flag), and world knowledge tells us that competitive bikers own their bikes. Thus, the author had an alienable possession with the bike. The text clearly indicates that the possession was true in the past (2 years ago), and hints that the author has an interest in bikes (miss being on X).
Example (b) is a straightforward example of control possession: the author does not own the jacket. While weekends last for two days, it is unknown when the author tweeted, so annotators chose only the during temporal anchor. Additionally, neither the text nor the image indicates that the author has any interest in the jacket.
Example (c) illustrates an alienable possession in which the author possesses the possessee (i.e., the bag) before, during and after tweeting. While there is no specific cue indicating that the author will own the bag for an extended period of time, common sense indicates so. Additionally, the text (my cutest bag ever) indicates that the author is excited and has an interest in the bag.
Example (d) illustrates the never label. In this case, the author is talking about the baby's sunglasses. Additionally, there is no indication of the author having an interest in the sunglasses.

Table 6: Examples of tweets that are annotated differently depending on whether annotators have access to only the text (first row of labels, T) or the text and image (second row of labels, T+I). Note that the image makes it possible to annotate more possessions (Examples (a) and (b)) as well as to fix mistakes (Example (c)).
Examples (e, f) illustrate the unk and never labels. The author of tweet (e) is sharing her favorite boots, and there is not enough information to determine whether she owns any. The author is, however, interested in boots, as she went through the task of choosing her favorite boot picks. Finally, in Example (f), pants cannot be a possessee as it is part of the name of a movie character.

Text vs. Text and Image
Annotators chose different labels depending on whether they had access to (a) only the text or (b) the text and image. Table 5 summarizes the changes in annotations. Most labels remain the same (alienable: 80.5%, control: 82.8%, never: 83.3%); however, most instances labeled unk when annotators have access only to the text (74.3%) become alienable (38.7%), control (12.2%), or never (23.4%) when they also have access to the image (last column).

Table 6 shows examples of changes in annotation. Given only the text in Tweet (a), it appears that the author dislikes denim. Looking at the image, however, it becomes clear that the author is being sarcastic and not only owns denim clothes but actually has a strong interest in denim. The text in Tweet (b) is basically two quotes, and looking only at the text one cannot determine whether the author owns any candles. Taking into account the picture, however, one can conclude that the author does own a candle (the picture illustrates the advice from the quote), although she does not have an interest in it. The last example, Tweet (c), illustrates how images also help discard possessions that appear obvious from the text. The hat is not an actual object (it is a drawing on top of the picture), thus no alienable possession exists.

Experiments and Results
We experiment primarily with neural networks. Regarding libraries, we use Keras (Chollet et al., 2015) with TensorFlow as a backend (Abadi et al., 2015). Each possessor-possessee pair (and corresponding tweet) becomes an instance, and we create stratified training (80%) and test (20%) sets. We train the neural network for up to 200 epochs using the Adam optimizer (Kingma and Ba, 2014), categorical cross entropy, and batch size 32. We stop the training process before 200 epochs if no improvement occurs in the validation set (15% of the training set) for 10 epochs.

More specifically, we train six classifiers. The first classifier predicts possession existence (yes, never or unk). The second classifier predicts possession types, i.e., it classifies pairs between which a possession exists (yes) into alienable or control. The third, fourth and fifth classifiers predict temporal anchors, i.e., they classify pairs between which a possession holds (either alienable or control) into before yes or before no, during yes or during no, and after yes or after no. Finally, the sixth classifier predicts interest in the possessee (interest yes or interest no; code available at dhivyachinnappa.com).

Figure 4 shows the neural network architecture, which includes components for the text and image (above and below the dotted line, respectively).

Figure 4: Neural network architecture to predict possession existence, type, temporal anchors and interest. We include a text component (above dotted line) and two image components (below dotted line). Note that the top 5 tags from the Vision API become a textual input, and we use pretrained word embeddings and an LSTM for them.

Text Component. The text component is an LSTM (Hochreiter and Schmidhuber, 1997). Each token is represented with the concatenation of three embeddings. The first two are GloVe word embeddings pretrained with Common Crawl and Twitter (Pennington et al., 2014).
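The stratified 80/20 split described above can be sketched in plain Python as follows. This is a minimal stand-in for the library utilities the authors presumably used; the function name and seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(instances, labels, test_fraction=0.2, seed=0):
    """Split instances into train/test while preserving the label distribution."""
    by_label = defaultdict(list)
    for instance, label in zip(instances, labels):
        by_label[label].append(instance)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):  # sort for determinism across runs
        items = by_label[label]
        rng.shuffle(items)
        cutoff = int(round(len(items) * test_fraction))
        test.extend(items[:cutoff])
        train.extend(items[cutoff:])
    return train, test
```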
The third embedding only takes two possible values (dark grey: possessee, white: non-possessee) and is used to indicate the potential possessee. Only the additional embeddings are tuned along with other network parameters. Intuitively, the additional embedding allows the LSTM to focus on the context surrounding the potential possessee.

Image Component. The image component leverages two pretrained neural networks: InceptionNet (Szegedy et al., 2015) and the Cloud Vision API. Generally speaking, InceptionNet is pretrained to identify objects, and the Vision API outputs tags describing images, including not only objects but also events (e.g., cycling, recreation, travel for the image in the tweet in Figure 4). Regarding InceptionNet, we follow previous work (Section 2) and include the weights of the average pooling layer (second to last layer). This incorporates features (real numbers) capturing characteristics of the image into the output Softmax layer, where the features are useful for object prediction. More interestingly, we also incorporate the top 5 tags identified in the image by the Cloud Vision API. The main novelty of our architecture is the strategy to incorporate them. Rather than using one-hot encodings or training special-purpose embeddings, we consider the top 5 tags as an additional textual input and leverage pretrained GloVe word embeddings and an LSTM. Word embeddings allow us to bring meaning to the tags. Intuitively, this is more beneficial than incorporating weights from InceptionNet because the embeddings are a distributed representation of word meaning, and they are useful for, among others, determining word similarity and solving analogies (Pennington et al., 2014). The LSTM is useful because tags may be more than one token (e.g., whipped cream, electronic device), thus the top 5 tags form a short sequence of tokens.
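The token representation in the text component, and the treatment of Vision API tags as an additional textual input, can be sketched with NumPy as below. The toy vocabulary, dimension, and random vectors stand in for pretrained GloVe embeddings; only the shape logic reflects the description above.

```python
import numpy as np

EMB_DIM = 4  # toy dimension; the paper concatenates two pretrained GloVe embeddings
rng = np.random.default_rng(0)
vocab = {"i": 0, "miss": 1, "my": 2, "bike": 3, "cycling": 4, "travel": 5}
glove = rng.normal(size=(len(vocab), EMB_DIM))  # stand-in for pretrained vectors

# Third embedding: a 2-value indicator marking the potential possessee
indicator = np.array([[0.0], [1.0]])

def encode_text(tokens, possessee_index):
    """Represent each token as [embedding vector ; possessee indicator]."""
    rows = []
    for i, token in enumerate(tokens):
        flag = indicator[1 if i == possessee_index else 0]
        rows.append(np.concatenate([glove[vocab[token]], flag]))
    return np.stack(rows)

def encode_tags(tags):
    """Treat the top image tags as a token sequence and reuse the same embeddings."""
    tokens = [t for tag in tags for t in tag.split()]  # multi-word tags become several tokens
    return np.stack([glove[vocab[t]] for t in tokens])

X_text = encode_text(["i", "miss", "my", "bike"], possessee_index=3)
X_tags = encode_tags(["cycling", "travel"])
```

Both sequences would then feed their respective LSTMs; the extra indicator column is the only trainable embedding in the text component.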

Experimental Results
Tables 7 and 8 present the experimental results (precision, recall and F1-score). We present results per label and the macro average.

Baselines. We use the majority baseline and logistic regression with bag-of-words features (not shown; F1-scores are 0.48 (existence), 0.56 (types) and 0.58 (interest)). The full network (NN text + img) outperforms the baselines when predicting possession existence and types (existence F1: 0.66 vs. 0.21-0.64; types F1: 0.83 vs. 0.46-0.52), but all models except the majority baseline perform roughly the same when predicting interest in the possessee (F1: 0.58-0.59).

Table 7 presents the results obtained with four versions of the neural network, using (a) only the text component, (b) the text component and the weights from InceptionNet (text + IN), (c) the text component and the tags from the Vision API as an additional textual input (text + Itags), and (d) the full network (text + img). We also obtained results with only the image components, but do not report them because they were much worse.

Possession Existence. All variations of the neural network outperform the baselines (logistic regression obtains 0.48 F1, not shown). Weights from InceptionNet do not bring any improvement by themselves, but the tags from the Vision API used as an additional textual input do (F1: 0.64 vs. 0.60). More importantly, combining both of them yields a 10% improvement (F1: 0.66 vs. 0.60). Further examination revealed that this is due to leveraging pretrained word embeddings with the tags from the Vision API; using one-hot encodings does not bring improvements (not shown).

Possession Type and Interest in the Possessee. Regarding possession type, we observe a similar trend to possession existence. This time, however, the differences in results are larger (F1: 0.50 vs. 0.83) and the network with both image components (NN text + img) is the only model that predicts control reliably (0.77 vs. 0.10-0.14).

Regarding interest in the possessee, all models but the majority baseline (including logistic regression) obtain similar F1-scores (0.58-0.59). While there is certainly room for improvement, the current results lead to the conclusion that a few keywords are sufficient to obtain 0.58 F1: neither images nor word embeddings bring improvements.

Temporal Anchors. Table 8 presents the results obtained with the neural network when predicting temporal anchors. The image components are beneficial with all anchors, especially before (F1: 0.47 vs. 0.59, +25%) and after (0.55 vs. 0.67, +22%), and to a lesser degree during (0.48 vs. 0.52, +8%). F1-scores are higher for the yes label than the no label across all temporal anchors.
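The per-label and macro-averaged scores reported above follow the standard precision/recall/F1 definitions; a self-contained sketch (not the authors' evaluation script):

```python
from collections import Counter

def precision_recall_f1(gold, predicted):
    """Per-label precision, recall and F1, plus the macro average."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for label in set(gold) | set(predicted):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    macro = tuple(sum(s[i] for s in scores.values()) / len(scores) for i in range(3))
    return scores, macro
```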

Conclusions
We have presented a corpus of 5,000 tweets and experimental results to extract possession relations. Specifically, we work with text and images in order to reveal the possessees of the author of a tweet. Beyond possession existence, we also consider possession type, temporal anchors with respect to the tweet timestamp, and whether the author has an interest in the potential possessee regardless of possession existence.
The corpus analysis shows that humans understand more possessions when they have access to both the text and images. Authors of tweets often have an interest in potential possessees even when there is no possession relation or there is not enough information to determine whether a possession exists (never and unk labels). Finally, experimental results show that incorporating pretrained networks for object identification and image understanding complements neural components that consider text. Crucially, we show that considering the top 5 tags identified in images (objects and events) as an additional textual input and leveraging word embeddings and recurrent neural networks yields better results than incorporating only weights from intermediate layers, as previous work does.