Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts

Text in social media posts is frequently accompanied by images in order to provide content, supply context, or to express feelings. This paper studies how the meaning of the entire tweet is composed through the relationship between its textual content and its image. We build and release a data set of image tweets annotated with four classes which express whether the text or the image provides additional information to the other modality. We show that by combining the text and image information, we can build a machine learning approach that accurately distinguishes between the relationship types. Further, we derive insights into how these relationships are materialized through text and image content analysis and how they are impacted by user demographic traits. These methods can be used in several downstream applications including pre-training image tagging models, collecting distantly supervised data for image captioning, and can be directly used in end-user applications to optimize screen estate.


Introduction
Social media sites have traditionally been centered around publishing textual content. Recently, posting images on social media has become a very popular way of expressing content and feelings especially due to the wide availability of mobile devices and connectivity. Images are currently present in a significant fraction of tweets and tweets with images get double the engagement of those without (Buffer, 2016). Thus, in addition to text, images have become key components of tweets.
However, little is known about how textual content is related to the images with which they appear. For example, concepts or feelings mentioned in text could be illustrated or strengthened by images, text can point to the content of an image or This is what happens when you lock your bike to a sign Awesome! can just provide commentary on the image content. Formalizing and understanding the relationship between the two modalities -text and images -is useful in several areas: a) for NLP and computer vision research, where image and text data from tweets are used to developing data sets and methods for image captioning (Mitchell et al., 2012) or object recognition (Mahajan et al., 2018); b) for social scientists and psychologists trying to understand social media use; c) in browsers or apps where images that may not contain additional content in addition to the text would be replaced by a placeholder and displayed if the end-user desires to in order to op-timize screen space (see Figure 2). Figure 1 illustrates four different ways in which the text and image of the same tweet can be related: • Figures 1(a,b) show how the image can add to the semantics of the tweet, by either providing more information than the text (Figure 1a) or by providing the context for understanding the text ( Figure 1b); • In Figures 1(c,d), the image only illustrates what is expressed through text, without providing any additional information. Hence, in both of these cases, the text alone is sufficient to understanding the tweet's key message; • Figures 1(a,c) show examples of tweets where there is a semantic overlap between the content of the text and image: bike and sign in Figure 1a and tacos in Figure 1c; • In Figures 1(b,d), the textual content is not represented in the image, with the text being either a comment on the image's content (Figure 1b) or the image illustrating a feeling related to the text's content.
In this paper, we present a comprehensive analysis that focuses on the types of relationships between the text and image in a tweet. Our contributions include: • Defining the types of relationships between the text and the image of a social media post; • Building a data set of tweets annotated with text -image relationship type; 1 • Machine learning methods that use both text and image content to predict the relationship between the two modalities; • An analysis into the author's demographic traits that are related to usage preference of textimage relationship types; • An analysis of the textual features which characterize each relationship type.

Related Work
Task. The relationship between a text and its associated image was researched in a few prior studies. For general web pages, Marsh and Domas White (2003) propose a taxonomy of 49 relationship grouped in three major categories based on how similar is the image to the text ranging from little relation to going beyond the text, which forms the basis of one of our relationship dimen-sions. Martinec and Salway (2005) aim to categorize text-image relationships in scientific articles from two perspectives: the relative importance of one modality compared to the other and the logico-semantic overlap. Alikhani and Stone (2018) argue that understanding multimodal textimage presentation requires studying the coherence relations that organize the content. Even when a single relationship is used, such as captioning, it can be expressed in multiple forms such as telic, atelic or stative . Wang et al. (2014) use the intuition that text and images from microposts can be associated or not or depend on one another and use this intuition in a topic model that learns topics and image tags jointly. Jas and Parikh (2015) study the concept of image specificity through how similar to each other are multiple descriptions of that image. However, none of these studies propose any predictive methods for text-image relationship types.  annotate and train models on a recipe data set (Yagcioglu et al., 2018) for the relationships between instructional text and images around the following dimensions: temporal, logical and incidental detail. Chen et al. (2013) study text-image relationships using social media data focusing on the distinction between images that are overall visually relevant or non-relevant to the textual content. They build models using the text and image content that predict the relationship type (Chen et al., 2015). We build on this research and define an annotation scheme that focuses on each of the two modalities separately and look at both their semantic overlap and contribution to the meaning of the whole tweet.
Applications. Several applications require to be able to automatically predict the semantic textimage relationship in the data. Models for automatically generating image descriptions (Feng and Lapata, 2010;Ordonez et al., 2011;Mitchell et al., 2012;Vinyals et al., 2015;Lu et al., 2017) or predicting tags (Mahajan et al., 2018) are built using large training data sets of noisy imagetext pairs from sources such as tweets. Multimodal named entity disambiguation leverages visual context vectors from social media images to aid named entity disambiguation (Moon et al., 2018). Multimodal topic labeling focuses on generating candidate labels (text or images) for a given topic and ranks them by relevance (Sorodoc et al., 2017). Several resources of images paired with descriptive captions are available, which can be used to build similarity metrics and joint semantic spaces for text and images (Young et al., 2014). However, all these assume that the text an image represent similar concepts which, as we show in this paper, is not true in Twitter. Being able to classify this relationship can be useful for all above-mentioned applications.

Categorizing Text-Image Relationships
We define the types of semantic relationships that can exist between the content of the text and the image by splitting them into two tasks for simplicity. The first task is centered on the role of the text to the tweet's semantics, while the second focuses on the image's role. The first task -referred to as the text task in the rest of the paper -focuses on identifying if there is semantic overlap between the context of the text and the image.
This task is the defined using the following guidelines: 1. Some or all of the content words in the text are represented in the image (Text is represented) 2. None of the content words in the text are represented in the image (Text is not represented): • None of the content words are represented in the image, or • The text is only a comment about the content of the image, or • The text expresses a feeling or emotion about the content of the image, or • The text only makes a reference to something shown in the image, or • The text is unrelated to the image Examples for this task can be seen in Figure 1 by comparing Figures 1(a,c) The second task -referred to as the image task in the rest of the paper -focuses on the role of the image to the semantics of the tweet and aims to identify if the image's content contributes with additional information to the meaning of the tweet beyond the text, as judged by an independent third party. This task is defined and annotated using the following guidelines: 1. Image has additional content that represents the meaning of the text and the image (Image adds): • Image contains other text that adds additional meaning to the text, or • Image depicts something that adds information to the text or • Image contains other entities that are referenced by the text. 2. Image does not add additional content that represents the meaning of text+image (Image does not add). Examples for the image task can be seen in Combining the labels of the two binary tasks described above gives rise to four types of text-image relationships (Image+Text Task). All of the four relationship types are exemplified in Figure 1.

Data Set
To study the relationship between the text and image in the same social media post, we define a new annotation schema and collect a new annotated corpus. To the best of our knowledge, no such corpus exists in prior research.

Data Sampling
We use Twitter as the source of our data, as this source contains a high level of expression of thoughts, opinions and emotions (Java et al., 2007;Kouloumpis et al., 2011). It represents a platform for observing written interactions and conversations between users (Ritter et al., 2010).
The tweets were deliberately randomly sampled tweets from a list of users for which several of their socio-demographic traits are known, introduced in past research . This will enable us to explore if the frequency of posting tweets with a certain text-image relationship is different across socio-demographic groups.
We downloaded as many tweets as we could from these users using the Twitter API (up to 3,200 tweets/user per API limits). We decided to annotate only tweets from within the same time range (2016) in order to reduce the influence of potential platform usage changes with time. We filter out tweets that are not written in English using the langid.py tool (Lui and Baldwin, 2012).
In total, 2,263 users (out of the initial 4,132) have posted tweets with at least one image in the year 2016 and were included in our analysis. Our final data set contains 4,471 tweets.

Demographic Variables
The Twitter users from the data set we sampled have self-reported the following demographic variables through a survey: gender, age, education level and annual income. All users solicited for data collection were from the United States in order to limit cultural variation.
• Gender was considered binary 2 and coded with Female -1 and Male -0. All other variables are treated as ordinal variables. • Age is represented as a integer value in the 13-90 year old interval.
• Education level is coded as an ordinal variable with 6 values representing the highest degree obtained, with the lowest being 'No high school degree' (coded as 1) and the highest being 'Advanced Degree (e.g., PhD)' (coded as 6). • Income level is coded as on ordinal variable with 8 values representing the annual income of the person, ranging from '< $20,000' to '> $200,000').
For a full description of the user recruitment and quality control processes, we refer the interested reader to .

Annotation
We have collected annotations for text-image pairs from 4,471 tweets using the Figure Eight platform (formerly CrowdFlower). We annotate all tweets containing both text and image using two independent annotation tasks in order to simplify the task and not to prime annotators use the outcome of one task as a indicator for the outcome of the other.
For quality control, 10% of annotations were test questions annotated by the authors. Annotators had to maintain a minimum accuracy on test questions of 85% for the image task and 75% for the text task for their annotations to be valid.
Inter-annotator agreement is measured using Krippendorf's Alpha. The overall Krippendorfs Alpha is 0.71 for the image task, which is in the upper part of the substantial agreement band (Artstein and Poesio, 2008). We collect 3 judgments and use majority vote to obtain the final label to further remove noise. For the text task, we collected and aggregated 5 judgments as the Krippendorf's Alpha is 0.46, which is considered moderate agreement (Artstein and Poesio, 2008). The latter task was more difficult due to requiring specific world knowledge (e.g. a singer mentioned in a text also present in an image) or contained information encoded in hashtags or usernames which the annotators sometimes overlooked. The aggregated judgments for each task were combined to obtain the four class labels. The label distributions of the aggregated annotations are: a) Text is represented & Image adds: 18.5%; b) Text is represented & Image does not add: 21.9%; c) Text is not represented & Image adds: 25.6%; d) Text is not represented & Image does not add: 33.8%.

Methods
Our goal is to develop methods that are capable of automatically classifying the text-image relationship in tweets. We experiment with several methods which use information of four different types: demographics of the user posting the tweet, metadata from the tweet, the text of the tweet or the image of the tweet; plus a combination of them. The methods we use are described in this section.

User Demographics
User demographic features are the survey-based demographic information we have available for all users that posted the annotated tweets. The use of these traits is based on the intuition that different demographic groups have different posting preferences (Pennacchiotti and Popescu, 2011;. We use this approach for comparison reasons only, as in practical use cases we would normally not have access to the author's demographic traits.
We code the gender, age, education level and income level of the user as features and use them in a logistic regression classifier to classify the textimage relationship.

Tweet Metadata
We experiment with using the tweet metadata as features. These code if a tweet is a reply, tweet, like or neither. We also add as features the tweet like count, the number of followers, friends and posts of the post's author and include them all in a logistic regression classifier.
These features are all available at tweet publishing time and we build a model using them to establish a more solid baseline for content based approaches.

Text-based Methods
We use the textual content of the tweet alone to build models for predicting the text-image relationship. We expect that certain textual cues will be specific to relationships even without considering the image content. For example, tweets ending in an ellipsis or short comments will likely be predictive of the text not being represented in the image. Surface Features. We first use a range of surface features which capture more of the shallow stylistic content of the tweet. We extract number of tokens, uppercase tokens, exclamations, questions, ellipsis, hashtags, @ mentions, quotes and URLs from the tweet and use them as features in a logistic regression classifier. Bag of Words. The most common approach for building a text-based model is using bag-ofwords features. Here, we extract unigram and bigram features and use them in a logistic regression classifier with elastic net regularization (Zou and Hastie, 2005). LSTM. Finally, based on recent results in text classification, we also experiment with a neural network approach which uses a Long-Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) network. The LSTM network processes the tweet sequentially, where each word is represented by its embedding (E = 200) followed by a dense hidden layer (D = 64) and by a a ReLU activation function and dropout (0.4) The model is trained by minimizing cross entropy using the Adam optimizer (Kingma and Ba, 2014). The network uses in-domain Twitter GloVe embeddings pre-trained on 2 billion tweets (Pennington et al., 2014).

Image-based Methods
We use the content of the tweet image alone to build models for predicting the text-image relationship. Similar to text, we expect that certain image content will be predictive of text-image relationships even without considering the text content. For example, images of people may be more likely to have in the text the names of those persons.
To analyze image content, we make use of large pre-trained neural networks for the task of object recognition on the ImageNet data set. ImageNet (Deng et al., 2009) is a visual database developed for object recognition research and consists of 1000 object types. In particular, we use the popular pre-trained InceptionNet model (Szegedy et al., 2015), which achieved the best performance at the ImageNet Large Scale Visual Recognition Challenge 2014 to build the following two imagebased models. ImageNet Classes. First, we represent each image in a tweet with the probability distribution over the 1,000 ImageNet classes obtained from Inception-Net. Then, we pass those features to a logistic regression classifier which is trained on our task. In this setup, the network parameters remain fixed, while only the ImageNet class weights are learned in the logistic regression classifier. Tuned InceptionNet. Additionally, we tailored the InceptionNet network to directly predict our tasks by using the multinomial logistic loss with softmax as the final layer for our task to replace the 1,000 ImageNet classes. Then, we loaded the pretrained network from (Szegedy et al., 2015) and fine-tuned the final fully-connected layer with the modified loss layers. We perform this in order to directly predict our task, while also overcoming the necessity of re-extracting the entire model weights from our restricted set of images.
The two approaches to classification using image content based on pre-trained model on Im-ageNet have been used successfully in past research (Cinar et al., 2015).

Joint Text-Image Methods
Finally, we combine the textual and image information in a single model to classify the text-image relationship type, as we expect both types of content and their interaction to be useful to the task. Ensemble. A simple method for combining the information from both modalities is to build an ensemble classifier. This is done with a logistic regression model with two features: the Bag of Words text model's predicted class probability and the Tuned InceptionNet model's predicted class probability. The parameters of the model are tuned by cross validation on the training data and similar splits as the individual models. LSTM + InceptionNet. We also build a joint approach by concatenating the features from the final layers of our LSTM and InceptionNet models and passing them through a fully-connected (FC) feed forward neural network with one hidden layer (64 nodes). The final output is our text-image relationship type. We use the Adam optimizer to fine tune this network. The LSTM model has the same parameters as in the text-only approach, while the InceptionNet model is initialized with the pre-trained model on the ImageNet data set.

Predicting Text-Image Relationship
We split our data into a 80% train (3,576 tweets) and 20% test (895 tweets) stratified sample for all of our experiments. Parameters were tuned using 10-fold cross-validation with the training set, and results are reported on the test set. Table 1 presents the weighted F1-scores for the text task, the image task and the image+text task with all the methods described in Section 5. The weighted F1 score is the weighted average of the class-level F1 scores, where the weight is the number of items in each class.
The majority baseline always predicts the most frequent class in each task, namely: Image does not add for the image task, Text is not represented for the text task and Image does not add & Text is not represented for the Image + Text task.
The models using user demographics and tweet metadata show minor improvements over the majority class baseline for both tasks. When the two tasks are combined, both feature types offer only a slight increase over the baseline. This shows that user factors mildly impact the frequency with which relationship types are used, which will be explored further in the analysis section.
The models that use tweet text as features show consistent improvements over the baseline for all three tasks. The two models that use the tweet's topical content (Bag of Words and LSTM) obtain higher predictive performance over the surface features. Both content based models obtain relatively similar performance, with the LSTM performing better on the image task.
The models which use information extracted from the image alone also consistently outperform the baseline on all three tasks. Re-tuning the neural network performs substantially better than building a model directly from the ImageNet classes on the image task and narrowly outperforms the other method on the text task. This is somewhat expected, as the retuning is performed on this domain specific task.
When comparing text and image based models across tasks, we observe that using image features obtains substantially better performance on the image task, while the text models obtain bet-ter performance on the text task. This is somewhat natural, as the focus of each annotation task is on one modality and methods relying on content from that modality are more predictive alone as to what ultimately represents the text-image relationship type.
Our naive ensemble approach does not yield substantially better results than the best performing methods using a single modality. However, by jointly modelling both modalities, we are able to obtain improvements -especially on the image task. This shows that both types of information and their interaction are important to this task. Methods that exploit more heavily the interaction and semantic similarity between the text and the image are left for future work.
We also observe that the predictive methods we described are better at classifying the image task. The analysis section below will allow us to uncover more about what type of content characterizes each relationship type.

Analysis
In this section, we aim to gain a better understanding of the type of content specific of the four textimage relationship types and about user type preferences in their usage.
To measure this, we use partial Pearson correlation where the dependent variables are one of four socio-demographic traits described in Section 4.2. The independent variables indicate the average times with which the user employed a certain relationship type. We code this using six different variables: two representing the two broader tasks -the percentage of tweets where image adds information and the percentage of tweets where the text is represented in the image -and four encoding each combination between the two tasks.
In addition, for all analyses we consider gender and age as basic human traits and control for data skew by introducing both variables as controls in partial correlation, as done in prior work (Schwartz et al., 2013;Holgate et al., 2018). When studying age and gender, we only use the other trait as the control. Because we are running several statistical tests at once (24) without predefined hypotheses, we use Bonferroni correction to counteract the problem of multiple comparisons. The results are presented in Table 2.
We observe that age is the only user demographic trait that is significantly correlated to text-image relationship preference after controlling for multiple comparisons and other demographic traits. The text-image relationship where the text is represented in the image, at least partially, is positively correlated with age (r = 0.117).
Further analyzing the four individual text-image relationship types reveals that older users especially prefer tweets where there is a semantic overlap between the concepts present in the text and the image, but the image contributes with additional information to the meaning of the tweet. This is arguably the most conventional usage of images, where they illustrate the text and provide more details than the text could.
Younger users prefer most tweets where the image adds information to the meaning of the tweet, but this has no semantic overlap with the text. These are usually tweets where the text represents merely a comment or a feeling expressed with the image providing the context. This represents a more image-centric approach to the meaning of the tweet that is specific to younger users. These correlations are controlled for gender.
Education was also correlated with images where the text was represented in the image (r = 0.076, p < .01, Bonferroni corrected), but this correlation did not meet the significance criteria when controlled for age to which education is moderately correlated (r = 0.302). This demonstrates the importance of controlling for such factors in this type of analysis. No effects were found with respect to gender or income.  Table 2: Pearson correlation between user demographic traits and usage of the different text-image relationship types. All correlations in bold are significant at p < .01, two-tailed t-test, Bonferroni corrected for multiple comparisons. Results for gender are controlled for age and vice versa. Results for education and income are controlled for age and gender.

Tweet Metadata Analysis
We adapt a similar approach to uncover potential relationships between the text-image relationship expressed in the tweet and tweet metadata features described in Section 5.2. However, after controlling for multiple comparisons, we are left with no significant correlations at p < 0.01 level. Hence, we refrain from presenting and discussing any results using this feature group as significant.

Text Analysis
Finally, we aim to identify the text and image features that characterize the four types of text-image relationship.
We use univariate Pearson correlation where the independent variable is each feature's normalized value in a tweet and the dependent variables are two binary indicators for the text and image tasks respectively. When performed using text features, this technique was coined Differential Language Analysis (Schwartz et al., 2013(Schwartz et al., , 2017. The results when using unigrams as features are presented in Figure 3, 4 and 5. Results for the image task ( Figure 3) show that the image adds to the meaning of the tweet if words such as this, it, why, needs or want are used. These words can appear in texts with the role of referencing or pointing to an entity which is only present in the image.
Conversely, the image does not add to the meaning of the tweet when words indicative of objects that are also described in the image are present (cat, baby, eyes or face), thus resulting in the image not adding to the meaning of the tweet. A special case are tweets with birthday wishes, where a person is mentioned in text and also displayed in an image. Finally, the tbt keyword and hashtag is a popular social media trend where users post nostalgic pictures of their past accompanied by their textual description.
The comparison between the two outcomes of the text task is presented in Figure 4. When the text and image semantically overlap, we observe words indicative of actions (i've), possessions (your) or qualitative statements (congrats, loved, excited, tried), usually about objects or persons also present in the image. We also observe a few nouns (cats, hillary) indicating frequent content that is also depicted in images (NB: the tweets were collected in 2016 when the U.S. presiden-tial elections took place). Analyzing this outcome jointly with the text task, we uncover a prominent theme consisting of words describing first person actions (congrats, thank, i've, saw, tell) present when the image provides facets not covered by text (Figure 5d). Several keywords from text (cat, game, winter) show types of content which are present in both image and text, but the image is merely an illustrating these concepts without adding additional information (Figure 5a).
In contrast, the text is not represented in the image when it contains words specific of comments (when, lmao), questions (do, was), references (this) or ellipsis ('...'), all often referencing the content of the image as identified through data inspection. References to self, objects and personal states (i, me) and feelings (miss) are also expressed in text about items or things not appearing the image from the same tweet. Further exploring this result though the image task outcome, we see that the latter category of feelings about persons of objects ( Figure 5a) -miss, happy, lit, like) are specific of when the image does not add additional information. Through manual inspection of these images, they often display a meme (as in Figure 1d) or unrelated expressions to the text's content. The image adds information when the text is not represented (Figure 5c) if the latter includes personal feelings, (me, i, i'm, want), comments (lol, lmao) and references (this, it), usually related to the image content as identified through an analysis of the data.

Conclusions
We defined and analyzed quantitatively and qualitatively the semantic relationships between the text and the image of the same tweet using a novel annotated data set. The frequency of use is influenced by the age of the poster, with younger users employing images with a more prominent role in the tweet, rather than just being redundant to the text or as a means of illustrating it. We studied the correlation between the content in the text and relation with the image, highlighting a differentiation between relationship types, even if only using the text of the tweet alone. We developed models that use both text and image features to classify the text-image relationship, with especially high performance (F1 = 0.81) in identifying if the image is redundant, which is immediately useful for downstream applications that maximize screen es-tate for users.
Future work will look deeper into using the similarity between the content of the text and image (Leong and Mihalcea, 2011), as the text task results showed room for improvements. We envision that our data, task and classifiers will be useful as a preprocessing step in collecting data for training large scale models for image captioning (Feng and Lapata, 2010) or tagging (Mahajan et al., 2018) or for improving recommendations (Chen et al., 2016) by filtering out tweets where the text and image have no semantic overlap or can enable new tasks such as identifying tweets that contain creative descriptions for images.