Are Emojis Predictable?

Emojis are ideograms that are naturally combined with plain text to visually complement or condense the meaning of a message. Despite being widely used in social media, their underlying semantics have received little attention from a Natural Language Processing standpoint. In this paper, we investigate the relation between words and emojis, studying the novel task of predicting which emojis are evoked by text-based tweet messages. We train several models based on Long Short-Term Memory networks (LSTMs) on this task. Our experimental results show that our neural model outperforms a baseline as well as humans solving the same task, suggesting that computational models are able to better capture the underlying semantics of emojis.


Introduction
The advent of social media has brought along a novel way of communication, where meaning is composed by combining short text messages with visual enhancements, the so-called emojis. This visual language is by now a de facto standard for online communication, available not only on Twitter but also on other large online platforms such as Facebook, Whatsapp, and Instagram.
Despite their status as a language form, emojis have so far been scarcely studied from a Natural Language Processing (NLP) standpoint. Notable exceptions include studies focused on emojis' semantics and usage (Aoki and Uchida, 2011; Barbieri et al., 2016a; Barbieri et al., 2016b; Barbieri et al., 2016c; Eisner et al., 2016; Ljubešić and Fišer, 2016) or sentiment (Novak et al., 2015). However, the interplay between text-based messages and emojis remains virtually unexplored. This paper aims to fill this gap by investigating the relation between words and emojis, studying the problem of predicting which emojis are evoked by text-based tweet messages. Miller et al. (2016) performed an evaluation asking human annotators for the meaning of emojis and the sentiment they evoke. People do not always have the same understanding of emojis; indeed, there seem to exist multiple interpretations of their meaning beyond their designer's intent or the physical object they evoke. Their main conclusion was that emojis can lead to misunderstandings. The ambiguity of emojis raises an interesting question in human-computer interaction: how can we teach an artificial agent to correctly interpret and recognise emojis' use in spontaneous conversation? The main motivation of our research is that an artificial intelligence system that is able to predict emojis could contribute to better natural language understanding (Novak et al., 2015) and thus to different natural language processing tasks, such as generating emoji-enriched social media content, enhancing emotion/sentiment analysis systems, and improving the retrieval of social network material.
In this work, we employ a state-of-the-art classification framework to automatically predict the most likely emoji a Twitter message evokes. The model is based on Bidirectional Long Short-term Memory Networks (BLSTMs) with both standard lookup word representations and character-based representations of tokens. We will show that the BLSTMs outperform a bag-of-words baseline, a baseline based on semantic vectors, and human annotators on this task.

Table 1: The 20 most frequent emojis that we use in our experiments and the number of tweets, in thousands, in which each appears: 100.7, 89.9, 59, 33.8, 28.6, 27.9, 22.5, 21.5, 21, 20.8, 19.5, 18.6, 18.5, 17.5, 17, 16.1, 15.9, 15.2, 14.2, 10.9.

Dataset and Task
Dataset: We retrieved 40 million tweets with the Twitter APIs. The tweets were posted between October 2015 and May 2016 and geo-localized in the United States of America. We removed all hyperlinks from each tweet and lowercased all textual content in order to reduce noise and sparsity. From this collection, we selected the tweets that include one and only one of the 20 most frequent emojis, resulting in a final dataset composed of 584,600 tweets. In the experiments we also consider the subsets restricted to the 10 (502,700 tweets) and 5 (341,500 tweets) most frequent emojis. See Table 1 for the 20 most frequent emojis that we consider in this work.

Task: We remove the emoji from the sequence of tokens and use it as a label both for training and testing. The task for our machine learning models is to predict the single emoji that appears in the input tweet.
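As an illustration, the filtering step described above can be sketched as follows; the emoji set, helper name, and regular expression are our own simplifications, not the paper's actual pipeline (which uses the full 20-emoji list).

```python
import re

# Hypothetical subset of the top emojis; the paper uses the 20 most
# frequent emojis in its corpus.
TOP_EMOJIS = {"😂", "❤", "😍", "😭", "😊"}

def preprocess(tweet: str):
    """Return (clean_text, label) if the tweet contains exactly one
    top emoji, else None."""
    text = re.sub(r"https?://\S+", "", tweet).lower()  # strip hyperlinks, lowercase
    found = [c for c in text if c in TOP_EMOJIS]
    if len(found) != 1:  # keep only tweets with one and only one top emoji
        return None
    label = found[0]
    clean = text.replace(label, "").strip()  # the emoji becomes the label
    return clean, label

result = preprocess("I love this song 😍 https://t.co/xyz")
```

The emoji is removed from the token sequence, so the model never sees the label it has to predict.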

Models
In this Section, we present and motivate the models that we use to predict an emoji given a tweet. The first model is an architecture based on Recurrent Neural Networks (Section 3.1), and the second and third are the two baselines (Sections 3.2.1 and 3.2.2). The major difference between the RNNs and the baselines is that the RNNs take into account sequences of words, and thus the entire context.

Bi-Directional LSTMs
Given the proven effectiveness and impact of recurrent neural networks in different tasks (Chung et al., 2014; Vinyals et al., 2015; Dzmitry et al., 2014; Lample et al., 2016; Wang et al., 2016, inter alia), including the modeling of tweets (Dhingra et al., 2016), our emoji prediction model is based on bi-directional Long Short-term Memory Networks (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). The B-LSTM can be formalized as follows:

s = max{0, W[fw; bw] + d}

where W is a learned parameter matrix, fw is the forward LSTM encoding of the message, bw is the backward LSTM encoding of the message, and d is a bias term; the result is passed through a component-wise ReLU. The vector s is then used to compute the probability distribution of the emojis given the message as:

p(e | s) = exp(g_e · s + q_e) / Σ_{e' ∈ E} exp(g_{e'} · s + q_{e'})

where g_e is a column vector representing the (output) embedding of the emoji e, and q_e is a bias term for the emoji e. The set E represents the list of emojis. The loss/objective function the network aims to minimize is:

loss(m) = − log p(e_m | s)

where m is a tweet of the training set T, s is the encoded vector representation of the tweet, and e_m is the emoji contained in the tweet m. The inputs of the LSTMs are word embeddings, for which we explore two alternatives in the experiments presented in this paper.

Word Representations: We generate word embeddings which are learned together with the updates to the model. We stochastically replace (with p = 0.5) each word that occurs only once in the training data with a fixed representation (an out-of-vocabulary word vector). When we use pretrained word embeddings, these are concatenated with the learned vector representations, obtaining a final representation for each word type. This is similar to the treatment of word embeddings in prior work.

Character-based Representations: We compute character-based continuous-space vector embeddings (Ling et al., 2015b) of the tokens in each tweet using, again, bidirectional LSTMs.
The character-based approach learns representations for words that are orthographically similar; thus, it should be able to handle the different spellings of a word type that occur in social media.
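A minimal numerical sketch of the scoring equations above, with toy dimensions and random values standing in for the trained LSTM encodings and parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative only, not the paper's actual dimensions).
H, S, E = 4, 6, 5        # LSTM hidden size, message-vector size, number of emojis

fw = rng.normal(size=H)  # forward LSTM encoding of the message
bw = rng.normal(size=H)  # backward LSTM encoding of the message
W = rng.normal(size=(S, 2 * H))
d = rng.normal(size=S)

# s = max{0, W [fw; bw] + d}, a component-wise ReLU
s = np.maximum(0.0, W @ np.concatenate([fw, bw]) + d)

G = rng.normal(size=(E, S))  # output embeddings g_e (one row per emoji)
q = rng.normal(size=E)       # per-emoji biases q_e

# p(e | s): softmax over the scores g_e . s + q_e
logits = G @ s + q
p = np.exp(logits - logits.max())
p /= p.sum()

predicted = int(np.argmax(p))  # most probable emoji index
gold = 2                       # index of the emoji actually in the tweet (toy value)
loss = -np.log(p[gold])        # the negative log-likelihood the network minimizes
```

Subtracting `logits.max()` before exponentiating is the standard numerically stable softmax; it leaves the distribution unchanged.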

Baselines
In this Section we describe the two baselines. Unlike the previous model, the baselines do not take word order into account. However, the second baseline (Section 3.2.2) abstracts over the plain word representation by using semantic vectors previously trained on Twitter data.

Bag of Words
We applied a bag-of-words classifier as a baseline, since it has been successfully employed in several classification tasks, like sentiment analysis and topic modeling (Wallach, 2006; Blei, 2012; Titov and McDonald, 2008; Maas et al., 2011; Davidov et al., 2010). We represent each message with a vector of the most informative tokens (punctuation marks included) selected using term frequency-inverse document frequency (TF-IDF). We employ an L2-regularized logistic regression classifier to make the predictions.
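The TF-IDF weighting can be sketched as follows; the toy corpus stands in for tokenized tweets, and the paper does not specify its exact TF-IDF variant, so this is the textbook formulation.

```python
import math
from collections import Counter

# Toy corpus standing in for tokenized tweets.
docs = [
    ["i", "love", "you"],
    ["love", "this", "song"],
    ["so", "sad", "today"],
]

def tfidf(doc, corpus):
    """Weight each token in `doc` by tf * idf (textbook variant;
    the paper does not specify which one it uses)."""
    n = len(corpus)
    weights = {}
    for tok, count in Counter(doc).items():
        df = sum(1 for d in corpus if tok in d)  # document frequency
        weights[tok] = (count / len(doc)) * math.log(n / df)
    return weights

w = tfidf(docs[0], docs)
```

Tokens that appear in many documents (here, "love") receive lower weights than tokens unique to one document, which is what makes the selected features "informative".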

Skip-Gram Vector Average
We train a Skip-gram model (Mikolov et al., 2013) on 65M tweets (from which the testing instances have been removed) to learn Twitter semantic vectors. We then build a model (henceforth, AVG) which represents each message as the average of the vectors corresponding to the tokens of the tweet. Formally, each message m is represented with the vector:

V_m = (1 / |T_m|) Σ_{t ∈ T_m} S_t

where T_m is the set of tokens included in the message m, S_t is the vector of token t in the Skip-gram model, and |T_m| is the number of tokens in m. After obtaining a representation of each message, we train an L2-regularized logistic regression (with ε equal to 0.001).
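The averaging step can be sketched directly from the formula above; the lookup table here is a toy stand-in for the Skip-gram vectors trained on 65M tweets.

```python
import numpy as np

# Toy skip-gram lookup table; in the paper these vectors are trained
# on 65M tweets.
skipgram = {
    "love": np.array([1.0, 0.0]),
    "this": np.array([0.0, 1.0]),
    "song": np.array([1.0, 1.0]),
}

def message_vector(tokens):
    """V_m = (1 / |T_m|) * sum over t in T_m of S_t, skipping
    out-of-vocabulary tokens (our assumption for unseen words)."""
    vecs = [skipgram[t] for t in tokens if t in skipgram]
    return np.mean(vecs, axis=0)

v = message_vector(["love", "this", "song"])
```

The resulting fixed-length vector is what the logistic regression consumes; word order is lost by construction, which is the key difference from the B-LSTM.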

Experiments and Evaluation
In order to study the relation between words and emojis, we performed two different experiments.
In the first experiment, we compare our machine learning models, and in the second experiment, we pick the best performing system and compare it against humans.

First Experiment
This experiment is a classification task, where in each tweet the unique emoji is removed and used as a label for the entire tweet. We use three datasets, containing respectively the 5, 10, and 20 most frequent emojis (see Section 2). We analyze the performance of the five models described in Section 3: a bag of words model, a Bidirectional LSTM model with character-based representations (char-BLSTM), and a Bidirectional LSTM model with standard lookup word representations (word-BLSTM); the latter two were each trained with and without pretrained word vectors. To pretrain the word vectors, we use a modified skip-gram model (Ling et al., 2015a) trained on the English Gigaword corpus, version 5.

Table 2: Results for 5, 10, and 20 emojis: Precision, Recall, F-measure. BOW is bag of words, AVG is the Skip-gram Average model, C refers to char-BLSTM and W refers to word-BLSTM. +P refers to pretrained embeddings.
We divide each dataset into three parts: training (80%), development (10%), and testing (10%). The three subsets are selected in temporal sequence starting from the oldest tweets, since automatic systems are usually trained on past tweets and need to be robust to future topic variations.

Table 2 reports the results of the five models and the baseline. All neural models outperform the baselines in all the experimental setups. However, BOW and AVG are quite competitive, suggesting that most emojis come along with specific words (like the word love and the heart emoji). Still, considering sequences of words seems important for encoding the meaning of the tweet and thereby contextualizing the emojis used: the B-LSTM models always outperform BOW and AVG. The character-based model with pretrained vectors is the most accurate at predicting emojis. The character-based model seems to capture orthographic variants of the same word in social media; similarly, pretrained vectors allow us to initialize the system with unsupervised pre-trained semantic knowledge (Ling et al., 2015a), which helps to achieve better results (the pretraining corpus, English Gigaword, is available at https://catalog.ldc.upenn.edu/LDC2003T05).

Table 3: Precision, Recall, F-measure, Ranking, and occurrences in the test set of the 20 most frequent emojis using char-BLSTM + Pre.
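The chronological 80/10/10 split can be sketched as follows; the function name and toy data are ours.

```python
def chronological_split(tweets, train=0.8, dev=0.1):
    """Split a time-ordered tweet list so the test set always
    contains the most recent tweets."""
    n = len(tweets)
    i, j = int(n * train), int(n * (train + dev))
    return tweets[:i], tweets[i:j], tweets[j:]

tweets = [f"tweet_{k}" for k in range(10)]  # assumed sorted oldest -> newest
tr, dv, te = chronological_split(tweets)
```

Splitting in time order, rather than randomly, is what forces the evaluation to measure robustness to future topic variation.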
Qualitative Analysis of Best System: We analyze the performance of the char-BLSTM with pretrained vectors on the 20-emojis dataset, as it was the best system in the experiment presented above. In Table 3 we report Precision, Recall, F-measure, and Ranking of each emoji; the Ranking is a number between 1 and 20 that represents the average number of emojis with a higher probability than the gold emoji in the probability distribution of the classifier. We also report in the last column the occurrences of each emoji in the test set. Frequency seems to be very relevant: the Ranking of the most frequent emojis is lower than the Ranking of the rare emojis. This means that if an emoji is frequent, it is more likely to be near the top of the possible choices even when it is a mistake. On the other hand, the F-measure does not seem to depend on frequency, as the highest F-measures are scored by a mix of common and uncommon emojis, which are respectively the first, second, sixth, and second-to-last emoji in terms of frequency.
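Our reading of the Ranking measure just described can be sketched as follows; the function name and toy probabilities are ours.

```python
import numpy as np

def ranking(prob_rows, gold_labels):
    """Average rank (1 = best) of the gold emoji in each predicted
    probability distribution."""
    # rank = 1 + number of emojis scored strictly higher than the gold one
    ranks = [1 + int(np.sum(p > p[g])) for p, g in zip(prob_rows, gold_labels)]
    return float(np.mean(ranks))

probs = np.array([
    [0.6, 0.3, 0.1],  # gold = 0 -> rank 1
    [0.5, 0.2, 0.3],  # gold = 1 -> rank 3
])
avg = ranking(probs, [0, 1])
```

A low average Ranking means the classifier places the gold emoji near the top of its distribution even when its argmax prediction is wrong.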
The frequency of an emoji is not the only variable that matters for predicting it properly; it is also important whether the set contains other emojis with similar semantics. If it does, the model prefers to predict the most frequent of them. This is the case of one emoji that is almost never predicted, even though its Ranking is not too high (4.69): the model prefers a similar but more frequent emoji instead. The same behavior is observed for the blue heart emoji, but in this case the performance is a bit better due to some specific words used along with it: "blue", "sea", and words related to childhood (e.g., "little" or "Disney").
Another interesting case is the Christmas tree emoji, which is present only three times in the test set (the test set includes the most recent tweets, and Christmas was already over; this emoji is commonly used in tweets about Christmas). The model recognizes it twice but misses it once. The correctly predicted cases include the word "Christmas"; the model fails on "getting into the holiday spirit with this gorgeous pair of leggings today ! #festiveleggings", since there are no obvious clues: it chooses another emoji instead, probably because of the meaning of "holiday" and "gorgeous".
In general the model tends to confuse similar emojis with the most frequent ones, probably because of their higher frequency and because they are used in multiple contexts. An interesting phenomenon is that the crying face is often confused with the laughing face: the first represents a small face crying and the second a small face laughing, but the results suggest that they appear in similar tweets. The punctuation and tone used are often similar (many exclamation marks and words like "omg" and "hahaha"). Irony may also play a role in explaining the confusion, e.g., "I studied journalism and communications , I'll be an awesome speller! Wrong. haha so much fun".

Second Experiment
Given that Miller et al. (2016) pointed out that people tend to give multiple interpretations to emojis, we carried out an experiment in which we evaluated human and machine performance on the same task. We randomly selected 1,000 tweets from our test set of the 5 most frequent emojis used in the previous experiment, and asked humans to predict, after reading a tweet (with the emoji removed), the emoji the text evoked. We opted for the 5-emoji task to reduce annotation effort. After displaying the text of the tweet, we asked the human annotators "What is the emoji you would include in the tweet?", giving them the possibility to pick one of the 5 possible emojis. Using the crowdsourcing platform CrowdFlower, we designed an experiment in which the same tweet was presented to four annotators (selecting the final label by majority agreement). Each annotator assessed a maximum of 200 tweets. The annotators were located in the United States of America and of high quality (level 3 of CrowdFlower). One in every ten tweets was an obvious test question, and annotations from subjects who missed more than 20% of the test questions were discarded. The overall inter-annotator agreement was 73% (in line with previous findings (Miller et al., 2016)). After creating the manually annotated dataset, we compared the human annotations and the char-BLSTM model with the gold standard (i.e., the emoji used in the tweet).

Table 4: Precision, Recall and F-Measure of human evaluation and the character-based B-LSTM for the 5 most frequent emojis and 1,000 tweets.
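The majority-agreement step above can be sketched as follows; the function name and toy labels are ours, and the paper does not specify how four-way ties are broken.

```python
from collections import Counter

def majority_label(annotations):
    """Label chosen by most annotators; ties fall back to Counter's
    ordering, since the paper does not specify tie handling."""
    return Counter(annotations).most_common(1)[0][0]

label = majority_label(["joy", "joy", "heart", "joy"])
```

With four annotators per tweet, a strict majority requires at least three agreeing votes; the 73% inter-annotator agreement suggests this was usually reached.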
We can see in Table 4, where the results of the comparison are presented, that the char-BLSTM performs better than humans, with an F1 of 0.65 versus 0.50. The char-BLSTM and the human annotators each struggle with particular emojis; the confusion matrices of Figure 1 show the misclassification patterns of both. An interesting result is the number of times one emoji was chosen by the human annotators: it occurred 100 times (by chance) in the test set, but it was chosen 208 times, mostly when the correct label was the laughing emoji. We do not observe the same behavior in the char-BLSTM, perhaps because it encoded information about the relative probability of these two emojis and, when in doubt, chose the laughing emoji as more probable.

Figure 1: Confusion matrix of the second experiment. On the left the human evaluation and on the right the char-BLSTM model.
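A confusion matrix like the one in Figure 1 can be built as follows; the function name and toy label indices are ours.

```python
import numpy as np

def confusion_matrix(gold, pred, n_labels):
    """Rows index the gold emoji, columns the predicted emoji."""
    cm = np.zeros((n_labels, n_labels), dtype=int)
    for g, p in zip(gold, pred):
        cm[g, p] += 1
    return cm

cm = confusion_matrix([0, 0, 1, 2], [0, 1, 1, 2], 3)
```

Off-diagonal mass in a row shows which emoji a gold label is being confused with, which is how the crying/laughing confusion discussed above is read off the figure.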

Conclusions
Emojis are used extensively in social media; however, little is known about their use and semantics, especially because emojis are used differently across different communities (Barbieri et al., 2016a; Barbieri et al., 2016b). In this paper, we provide a neural architecture to model the semantics of emojis, exploring the relation between words and emojis. We proposed for the first time an automatic method to predict, given a tweet, the most probable emoji associated with it. We showed that the LSTMs outperform humans on the same emoji prediction task, suggesting that automatic systems are better at generalizing the usage of emojis than humans. Moreover, the good accuracy of the LSTMs suggests that there is an important and unique relation between sequences of words and emojis.
As future work, we plan to make the model able to predict more than one emoji per tweet, and to explore the position of the emoji within the tweet, as nearby words can be an important clue for the emoji prediction task.