Multimodal Emoji Prediction

Emojis are small images that are commonly included in social media text messages. The combination of visual and textual content in the same message builds up a modern way of communication that automatic systems are not designed to deal with. In this paper we extend recent advances in emoji prediction by putting forward a multimodal approach that is able to predict emojis in Instagram posts. Instagram posts are composed of pictures together with texts, which sometimes include emojis. We show that these emojis can be predicted not only from the text, but also from the picture. Our main finding is that incorporating the two synergistic modalities in a combined model improves accuracy in an emoji prediction task. This result demonstrates that the two modalities (text and images) encode different information on the use of emojis and can therefore complement each other.


Introduction
In the past few years the use of emojis in social media has increased exponentially, changing the way we communicate. The combination of visual and textual content poses new challenges for information systems, which need to deal not only with the semantics of text but also with that of images. Recent work (Barbieri et al., 2017) has shown that textual information can be used to predict emojis associated with text. In this paper we show that, in the current context of multimodal communication where texts and images are combined in social networks, visual information should be combined with texts in order to obtain more accurate emoji prediction models.
We explore the use of emojis in the social media platform Instagram. We put forward a multimodal approach to predict the emojis associated with an Instagram post, given its picture and text. Our task and experimental framework are similar to those of Barbieri et al. (2017); however, we use different data (Instagram instead of Twitter) and, in addition, we rely on images to improve the selection of the most likely emojis to associate with a post. We show that a multimodal approach (using the textual and visual content of the posts) increases the emoji prediction accuracy compared to one that only uses textual information. This suggests that textual and visual content embed different but complementary features of the use of emojis.
In general, an effective approach to predict the emoji to be associated to a piece of content may help to improve natural language processing tasks (Novak et al., 2015), such as information retrieval, generation of emoji-enriched social media content, suggestion of emojis when writing text messages or sharing pictures online. Given that emojis may also mislead humans (Miller et al., 2017), the automated prediction of emojis may help to achieve better language understanding. As a consequence, by modeling the semantics of emojis, we can improve highly-subjective tasks like sentiment analysis, emotion recognition and irony detection (Felbo et al., 2017).

Dataset and Task
Dataset: We gathered Instagram posts published between July 2016 and October 2016, and geolocalized in the United States of America. We considered only posts that contained a photo together with the related user description of at least 4 words and exactly one emoji.
Moreover, as done by Barbieri et al. (2017), we considered only the posts which include one and only one of the 20 most frequent emojis (the most frequent emojis are shown in Table 3). Our dataset is composed of 299,809 posts, each containing a picture, the text associated to it and only one emoji. In the experiments we also considered the subsets of the 10 (238,646 posts) and 5 most frequent emojis (184,044 posts) (similarly to the approach followed by Barbieri et al. (2017)).
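The filtering criteria above (a caption of at least 4 words and exactly one emoji, drawn from the frequent-emoji set) can be sketched as follows; the emoji set shown is a tiny placeholder, not the actual top-20 list, and the function name is ours:

```python
# Sketch of the post-filtering criteria described above (hypothetical
# helper; the original collection pipeline is not public).
TOP_EMOJIS = {"\u2764", "\U0001F602"}  # placeholder subset of the 20 most frequent emojis

def keep_post(text: str, emoji_chars: set = TOP_EMOJIS) -> bool:
    """Keep a post only if its caption has at least 4 non-emoji words
    and exactly one emoji from the frequent-emoji set."""
    emojis = [c for c in text if c in emoji_chars]
    words = [t for t in text.split() if not any(c in emoji_chars for c in t)]
    return len(emojis) == 1 and len(words) >= 4
```

For example, a caption with five words and one heart emoji is kept, while a two-word caption or a caption with two emojis is discarded.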
Task: We extend the experimental scheme of Barbieri et al. (2017), by considering also visual information when modeling posts. We cast the emoji prediction problem as a classification task: given an image or a text (or both inputs in the multimodal scenario) we select the most likely emoji that could be added to (thus used to label) such contents. The task for our machine learning models is, given the visual and textual content of a post, to predict the single emoji that appears in the input comment.

Models
We present and motivate the models that we use to predict an emoji given an Instagram post composed of a picture and the associated comment.

ResNets
Deep Residual Networks (ResNets) (He et al., 2016) are Convolutional Neural Networks that achieved competitive results in several image classification tasks (Russakovsky et al., 2015; Lin et al., 2014) and have proved to be among the best CNN architectures for image recognition. ResNet is a feedforward CNN that exploits "residual learning" by bypassing two or more convolution layers (like similar previous approaches (Sermanet and LeCun, 2011)). We use an implementation of the original ResNet where the scale and aspect ratio augmentation are from (Szegedy et al., 2015), the photometric distortions from (Howard, 2013), and weight decay is applied to all weights and biases (instead of only the weights of the convolution layers). The network we used is composed of 101 layers (ResNet-101), initialized with pretrained parameters learned on ImageNet (Deng et al., 2009). We use this model as a starting point and later fine-tune it on our emoji classification task. The learning rate was set to 0.0001 and we stopped training early when there was no improvement on the validation set.

FastText
FastText (Joulin et al., 2017) is a linear model for text classification. We decided to employ FastText as it has been shown that, on specific classification tasks, it can achieve results comparable to complex neural classifiers (RNNs and CNNs) while being much faster. FastText represents a valid approach when dealing with social media content classification, where huge amounts of data need to be processed and new and relevant information is continuously generated. The FastText algorithm is similar to the CBOW algorithm (Mikolov et al., 2013), where the middle word is replaced by the label, in our case the emoji. Given a set of N documents, the loss that the model attempts to minimize is the negative log-likelihood over the labels (in our case, the emojis):

$-\frac{1}{N}\sum_{n=1}^{N} e_n \log(f(BAx_n))$

where $e_n$ is the emoji included in the n-th Instagram post, represented as a one-hot vector and used as label, $f$ is the softmax function, A and B are affine transformations (weight matrices), and $x_n$ is the unit vector of the bag of features of the n-th document (comment). The bag of features is the average of the input words, represented as vectors via a look-up table.
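A NumPy sketch of this loss under illustrative dimensions (vocabulary, embedding and label sizes, and all variable names, are ours):

```python
import numpy as np

# Minimal sketch of the FastText-style classifier described above:
# bag-of-features average -> linear map -> softmax negative log-likelihood.
rng = np.random.default_rng(0)
V, D, K = 50, 16, 20          # vocab size, embedding dim, 20 emoji classes
A = rng.normal(size=(V, D))   # input look-up table (word embeddings)
B = rng.normal(size=(D, K))   # output projection to emoji logits

def nll(word_ids, label):
    x = A[word_ids].mean(axis=0)     # x_n: average of the input word vectors
    logits = x @ B
    p = np.exp(logits - logits.max())
    p /= p.sum()                     # f: softmax over the 20 emojis
    return -np.log(p[label])         # negative log-likelihood of the gold emoji

loss = nll([3, 11, 42], label=5)
```

Training minimizes the average of this quantity over all N documents.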

B-LSTM Baseline
Barbieri et al. (2017) propose a recurrent neural network approach for the emoji prediction task. We use this model as a baseline to verify whether FastText achieves comparable performance. They used a Bidirectional LSTM with character-based representations of the words (Ling et al., 2015; Ballesteros et al., 2015) to handle orthographic variants (or even spelling errors) of the same word that occur in social media (e.g. cooooool vs cool).

Experiments and Evaluation
In order to study the relation between Instagram posts and emojis, we performed two different experiments. In the first experiment (Section 4.2) we compare the FastText model with the state of the art on emoji classification (B-LSTM) by Barbieri et al. (2017). Our second experiment (Section 4.3) evaluates the visual (ResNet) and textual (FastText) models on the emoji prediction task. Moreover, we evaluate a multimodal combination of both models, respectively based on visual and textual inputs. Finally we discuss the contribution of each modality to the prediction task.

Table 1: Comparison with Barbieri et al. (2017), using the same Twitter dataset.
We use 80% of our dataset (introduced in Section 2) for training, 10% to tune our models, and 10% for testing (selecting the sets randomly).
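The random 80/10/10 split can be sketched as follows (the seed and function name are our own illustrative choices):

```python
import random

# Sketch of the random train/dev/test split described above.
def split_dataset(posts, seed=13):
    posts = list(posts)
    random.Random(seed).shuffle(posts)   # select the sets randomly
    n = len(posts)
    train = posts[: int(0.8 * n)]        # 80% for training
    dev = posts[int(0.8 * n): int(0.9 * n)]  # 10% to tune the models
    test = posts[int(0.9 * n):]          # 10% for testing
    return train, dev, test

train, dev, test = split_dataset(range(1000))
```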

Feature Extraction and Classifier
To model visual features we first finetune the ResNet (process described in Section 3.1) on the emoji prediction task, then extract the vectors from the input of the last fully connected layer (before the softmax). The textual embeddings are the bag of features shown in Section 3.2 (the x n vectors), extracted after training the FastText model on the emoji prediction task.
With respect to the combination of textual and visual modalities, we adopt a middle fusion approach (Kiela and Clark, 2015): we associate to each Instagram post a multimodal embedding obtained by concatenating the previously learned unimodal representations of the same post (i.e. the visual and textual embeddings). Then, we feed a classifier with visual (ResNet), textual (FastText), or multimodal feature embeddings, and test the accuracy of the three systems.
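A minimal sketch of this middle-fusion step, assuming illustrative embedding sizes (2048-d for the ResNet penultimate layer; the 300-d textual size is our placeholder, since the FastText dimensionality is not specified above):

```python
import numpy as np

# Middle fusion: concatenate the previously learned unimodal embeddings
# of the same post into one multimodal vector (dummy vectors here).
visual = np.random.default_rng(1).normal(size=2048)   # ResNet-101 penultimate layer
textual = np.random.default_rng(2).normal(size=300)   # FastText bag of features (size assumed)

multimodal = np.concatenate([visual, textual])  # fed to the final classifier
```

The same downstream classifier can then be trained on the visual, textual, or concatenated vectors to compare the three systems.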

B-LSTM / FastText Comparison
To compare the FastText model with the word and character based B-LSTMs presented by Barbieri et al. (2017), we consider the same three emoji prediction tasks they proposed: the top-5, top-10 and top-20 emojis most frequently used in their Twitter datasets. In this comparison we used the same Twitter datasets. As we can see in Table 1, the FastText model is competitive, and it is also able to outperform the character based B-LSTM in one of the emoji prediction tasks (top-20 emojis).

Table 2: Prediction results of top-5, top-10 and top-20 most frequent emojis in the Instagram dataset: Precision (P), Recall (R), F-measure (F1). Experimental settings: majority baseline, weighted random, visual, textual and multimodal systems. In the last line we report the percentage improvement of the multimodal over the textual system.

Multimodal Emoji Prediction
We present the results of the three emoji classification tasks, using the visual, textual and multimodal features (see Table 2). The emoji prediction task seems difficult when using only the image of the Instagram post (Visual), even if this model largely outperforms the majority baseline 3 and the weighted random baseline 4. We achieve better performance when we use feature embeddings extracted from the text. The most interesting finding is that when we use a multimodal combination of visual and textual features, we get a non-negligible improvement. This suggests that the two modalities embed different representations of the posts, and when used in combination they are synergistic. It is also interesting to note that the more emojis there are to predict, the larger the improvement the multimodal system provides over the text-only system (3.28% for the top-5 emojis, 7.31% for the top-10 emojis, and 13.42% for the top-20 emojis task).

Qualitative Analysis
In Table 3 we show the results for each class in the top-20 emojis task.
The emojis with the highest F1 using the textual features are the most frequent one (0.62) and the US flag (0.52). The latter seems easy to predict since it appears in specific contexts: when the word USA/America is used (or when American cities are referred to, like #NYC).
The hardest emojis to predict with the text-only system are the two gestures (0.12) and (0.13). The first one is often selected when the gold standard emoji is the second one, or is often mispredicted by wrongly selecting or .

3 Always predict since it is the most frequent emoji.
4 Random predictions keeping the label distribution of the training set.

Table 3: F-measure in the test set of the 20 most frequent emojis using the three different models. "%" indicates the percentage of the class in the test set.
Another relevant confusion scenario related to emoji prediction has been spotted by Barbieri et al. (2017): relying on Twitter textual data they showed that the emoji was hard to predict as it was used similarly to . Instead when we consider Instagram data, the emoji is easier to predict (0.23), even if it is often confused with .
When we rely on visual contents (Instagram picture), the emojis which are easily predicted are the ones in which the associated photos are similar. For instance, most of the pictures associated to are dog/pet pictures. Similarly, is predicted along with very bright pictures taken outside. is correctly predicted along with pictures related to gym and fitness. The accuracy of is also high since most posts including this emoji are related to fitness (and the pictures are simply either selfies at the gym, weight lifting images, or protein food).
Employing a multimodal approach improves performance. This means that the two modalities are somehow complementary, and adding visual information helps to solve potential ambiguities that arise when relying only on textual content. In Figure 1 we report the confusion matrix of the multimodal model. The emojis are plotted from the most frequent to the least, and we can see that the model tends to mispredict emojis selecting more frequent emojis (the left part of the matrix is brighter).

Saliency Maps
In order to show the parts of the image most relevant for each class, we analyze the global average pooling (Lin et al., 2013) on the convolutional feature maps (Zhou et al., 2016).

Figure 1: Confusion matrix of the multimodal model. The gold labels are plotted on the y-axis and the predicted labels on the x-axis. The matrix is normalized by rows.

By visually observing the image heatmaps of the set of Instagram post pictures, we note that in most cases it is quite difficult to determine a clear association between the emoji used by the user and some particular portion of the image. Detecting the correct emoji given an image is harder than a simple object recognition task, as the emoji choice depends on the subjective emotions of the user who posted the image. In Figure 2 we show the first four predictions of the CNN for three pictures, and where the network focuses (in red). We can see that in the first example the network selects the smile with sunglasses because of the legs at the bottom of the image, the dog emoji is selected while focusing on the dog in the image, and the smiling emoji while focusing on the person in the back, who is lying on a hammock. In the second example the network again selects the due to the water and part of the kayak, the heart emoji focusing on the city landscape, and the praying emoji focusing on the sky. The same "praying" emoji is also selected when focusing on the luxury car in the third example, probably because the same emoji is used to express desire, i.e. "please, I want this awesome car".
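The class activation mapping technique used above (global average pooling over the last convolutional feature maps, Zhou et al., 2016) can be sketched as follows; array sizes and variable names are our own illustrative choices, not the authors':

```python
import numpy as np

# Sketch of class activation mapping: the saliency map for a class is the
# sum of the last convolutional feature maps, weighted by that class's
# weights in the final linear layer (which follows global average pooling).
rng = np.random.default_rng(0)
C, H, W, K = 512, 7, 7, 20                  # channels, spatial size, emoji classes
feature_maps = rng.normal(size=(C, H, W))   # last conv-layer activations (dummy)
fc_weights = rng.normal(size=(K, C))        # final linear-layer weights (dummy)

def cam(class_idx):
    m = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    m -= m.min()
    return m / (m.max() + 1e-8)             # normalize to [0, 1] for display

heatmap = cam(class_idx=3)
```

In practice the low-resolution map is upsampled to the input image size and overlaid as a heatmap (the red regions mentioned above).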
It is interesting to note that images can give context to textual messages, as in the following Instagram posts: (1) "Love my new home " (associated with a picture of a bright garden, outside) and (2) "I can't believe it's the first day of school!!! I love being these boys' mommy!!!! #myboys #mommy " (associated with a picture of two boys wearing two blue shirts). In both examples the textual system predicts , while the multimodal system correctly predicts both of them: the blue color in the picture associated with (2) helps to change the color of the heart, and the sunny/bright picture of the garden in (1) helps to correctly predict .

Related Work
Modeling the semantics of emojis, and their applications, is a relatively novel research problem with direct applications in any social media task. Since emojis do not have a clear grammar, their role in text messages is not clear. Emojis are considered function words or even affective markers (Na'aman et al., 2017) that can potentially affect the overall semantics of a message (Donato and Paggio, 2017).
Emoji semantics and usage have been studied with distributional semantics, with models trained on Twitter data (Barbieri et al., 2016c), Twitter data together with the official Unicode description (Eisner et al., 2016), or using text from a popular keyboard app. In the same context, Wijeratne et al. (2017a) propose a platform for exploring emoji semantics. In order to further study emoji semantics, two datasets with pairwise emoji similarity, with human annotations, have been proposed: EmoTwi50 (Barbieri et al., 2016c) and EmoSim508 (Wijeratne et al., 2017b). Emoji similarity has also been used for proposing efficient keyboard emoji organization (Pohl et al., 2017). Recently, Barbieri and Camacho-Collados (2018) show that emoji modifiers (skin tones and gender) can affect the semantic vector representation of emojis.
Emojis play an important role in the emotional content of a message. Several sentiment lexicons for emojis have been proposed (Novak et al., 2015; Kimura and Katsurai, 2017; Rodrigues et al., 2018), and studies in the context of emotion and emojis have also been published recently (Wood and Ruder, 2016; Hu et al., 2017).
During the last decade several studies have shown how sentiment analysis improves when we jointly leverage information coming from different modalities (e.g. text, images, audio, video) (Morency et al., 2011; Poria et al., 2015; Tran and Cambria, 2018). In particular, when dealing with social media posts, the presence of both textual and visual content has promoted a number of investigations on sentiment or emotions (Baecchi et al., 2016; You et al., 2016a,b; Yu et al., 2016; Chen et al., 2015) or emojis (Cappallo et al., 2015, 2018).

Conclusions
In this work we explored the use of emojis in a multimodal context (Instagram posts). We have shown that with a synergistic approach, relying on both the textual and visual content of social media posts, we can outperform state-of-the-art unimodal approaches (based only on textual content). As future work, we plan to extend our models by considering the prediction of more than one emoji per social media post and a larger number of labels.