Interpretable Emoji Prediction via Label-Wise Attention LSTMs

Human language has evolved towards newer forms of communication such as social media, where emojis (i.e., ideograms bearing a visual meaning) play a key role. While there is an increasing body of work aimed at the computational modeling of emoji semantics, there is currently little understanding about what makes a computational model represent or predict a given emoji in a certain way. In this paper we propose a label-wise attention mechanism with which we attempt to better understand the nuances underlying emoji prediction. In addition to advantages in terms of interpretability, we show that our proposed architecture improves over standard baselines in emoji prediction, and does particularly well when predicting infrequent emojis.


Introduction
Communication in social media differs from more standard linguistic interactions across a wide range of dimensions. Immediacy, short text length, the use of pseudowords like #hashtags or @mentions, and even metadata such as user information or geolocalization are essential components of social media messages. In addition, the use of emojis, small ideograms depicting objects, people and scenes (Cappallo et al., 2015), is becoming increasingly important for fully modeling the underlying semantics of a social media message, be it a product review, a tweet or an Instagram post. Emojis are the evolution of character-based emoticons (Pavalanathan and Eisenstein, 2015), and are extensively used, not only as sentiment carriers or boosters, but more importantly, to express ideas about a myriad of topics, e.g., mood ( ), food ( ), sports ( ) or scenery ( ).
Emoji modeling and prediction is, therefore, an important problem towards the end goal of properly capturing the intended meaning of a social media message. In fact, emoji prediction, i.e., predicting the most likely emoji(s) associated with a given (usually short) message, may help to improve different NLP tasks (Novak et al., 2015), such as information retrieval, generation of emoji-enriched social media content or suggestion of emojis when writing text messages or sharing pictures online. It has furthermore proven to be useful for sentiment analysis, emotion recognition and irony detection (Felbo et al., 2017). The problem of emoji prediction, albeit recent, has already seen important developments. For example, Barbieri et al. (2017) describe an LSTM model which outperforms a logistic regression baseline based on word vector averaging, and even human judgement in some scenarios.
The above contributions, in addition to emoji similarity datasets (Barbieri et al., 2016; Wijeratne et al., 2017) or emoji sentiment lexicons (Novak et al., 2015; Wijeratne et al., 2016; Kimura and Katsurai, 2017; Rodrigues et al., 2018), have paved the way for better understanding the semantics of emojis. However, our understanding of what exactly the neural models for emoji prediction are capturing is currently very limited. What is a model prioritizing when associating a message with, for example, positive ( ), negative ( ) or patriotic ( ) intents? A natural way of assessing this would be to implement an attention mechanism over the hidden states of LSTM layers. Attentive architectures in NLP have, in fact, recently received substantial interest, mostly for sequence-to-sequence models (which are useful for machine translation, summarization or language modeling), and a myriad of modifications have been proposed, including additive (Bahdanau et al., 2015), multiplicative (Luong et al., 2015) or self (Lin et al., 2017) attention. However, standard attention mechanisms only indicate which text fragments are important for the overall prediction distribution. While emoji prediction has predominantly been treated as a multi-class classification problem in the literature, it would be more informative to analyze which text fragments are considered important for each individual emoji. With this motivation in mind, in this paper we put forward a label-wise mechanism that operates over each label during training. The resulting architecture intuitively behaves like a batch of binary mini-classifiers, which make decisions over one single emoji at a time, but without the computational burden and risk of overfitting associated with learning separate LSTM-based classifiers for each emoji. Our contribution in this paper is twofold. First, we use the proposed label-wise mechanism to analyze the behavior of neural emoji classifiers, exploiting the attention weights to uncover and interpret emoji usages.
Second, we experimentally compare the effect of the label-wise mechanism on the performance of an emoji classifier. We observed a performance improvement over competitive baselines such as FastText (FT) (Joulin et al., 2017) and Deepmoji (Felbo et al., 2017), which is most noticeable in the case of infrequent emojis. This suggests that an attentive mechanism can be leveraged to make neural architectures more sensitive to instances of underrepresented classes.

Methodology
Our base architecture is the Deepmoji model (Felbo et al., 2017), which is based on two stacked word-based bi-directional LSTM recurrent neural networks with skip connections between the first and the second LSTM. The model also includes an attention module to increase its sensitivity to individual words during prediction. In general, attention mechanisms allow the model to focus on specific words of the input (Yang et al., 2016), instead of having to memorize all the important features in a fixed-length vector. The main architectural difference with respect to the typical attention is illustrated in Figure 1.
In Felbo et al. (2017), attention is computed as follows:

$$z_i = w_a h_i + b_a$$
$$\alpha_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$
$$s = \sum_{j=1}^{N} \alpha_j h_j$$

Here $h_i \in \mathbb{R}^d$ is the hidden representation of the LSTM corresponding to the $i$-th word, with $N$ the total number of words in the sentence. The weight vector $w_a \in \mathbb{R}^d$ and bias term $b_a \in \mathbb{R}$ map this hidden representation to a value that reflects the importance of this state for the considered classification problem. The values $z_1, \ldots, z_N$ are then normalized using a softmax function, yielding the attention weights $\alpha_i$. The sentence representation $s$ is defined as a weighted average of the vectors $h_i$. The final prediction distribution is then defined as follows:

$$\beta_l = w_{f,l} \, s + b_{f,l}$$
$$p_l = \frac{e^{\beta_l}}{\sum_{r=1}^{L} e^{\beta_r}}$$

where $w_{f,l} \in \mathbb{R}^d$ and $b_{f,l}$ define a label-specific linear transformation, with $\beta_l$ reflecting our confidence in the $l$-th label, and $L$ is the total number of labels. The confidence scores $\beta_l$ are then normalized to probabilities using another softmax operation. However, while the above design has contributed to better emoji prediction, in our case we are interested in understanding the contribution of the words of a sentence for each label (i.e., emoji), and not in the whole distribution of the target labels. To this end, we propose a label-wise attention mechanism. Specifically, we apply the same type of attention, but repeat it $L$ times (once per label), where each attention module is reserved for a specific label $l$:

$$z_{i,l} = w_{a,l} \, h_i + b_{a,l}$$
$$\alpha_{i,l} = \frac{e^{z_{i,l}}}{\sum_{j=1}^{N} e^{z_{j,l}}}$$
$$s_l = \sum_{j=1}^{N} \alpha_{j,l} \, h_j$$
$$\beta_l = w_{f,l} \, s_l + b_{f,l}$$
$$p_l = \frac{e^{\beta_l}}{\sum_{r=1}^{L} e^{\beta_r}}$$
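The label-wise attention described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and parameter names (`label_wise_attention`, `W_a`, `b_a`, `W_f`, `b_f`) are hypothetical, the hidden states are random vectors standing in for BiLSTM outputs, and the weights are random rather than trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def label_wise_attention(H, W_a, b_a, W_f, b_f):
    """Apply one attention module per label (hypothetical helper).

    H        : N hidden states, each a list of d floats
    W_a, b_a : per-label attention vector and bias
    W_f, b_f : per-label output vector and bias
    Returns (label probabilities, per-label attention weights).
    """
    d = len(H[0])
    alphas, betas = [], []
    for w_a, bias_a, w_f, bias_f in zip(W_a, b_a, W_f, b_f):
        # Importance score of every word for this label only.
        z = [dot(w_a, h) + bias_a for h in H]
        alpha = softmax(z)
        # Label-specific sentence vector: weighted average of hidden states.
        s_l = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(d)]
        # Confidence score for this label.
        betas.append(dot(w_f, s_l) + bias_f)
        alphas.append(alpha)
    # Normalize the per-label confidences into a distribution.
    return softmax(betas), alphas


def rand_vec(n):
    """Random stand-in for trained parameters or LSTM states."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


random.seed(0)
N, d, L = 5, 4, 3  # words, hidden size, labels (toy sizes)
H = [rand_vec(d) for _ in range(N)]
W_a = [rand_vec(d) for _ in range(L)]
b_a = rand_vec(L)
W_f = [rand_vec(d) for _ in range(L)]
b_f = rand_vec(L)

probs, alphas = label_wise_attention(H, W_a, b_a, W_f, b_f)
```

Here `alphas[l][i]` is the weight that the attention module reserved for label `l` assigns to word `i`, which is the quantity one would inspect to see which words drive the prediction of a particular emoji.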

Evaluation
This section describes the main experiment on the performance of our proposed attention mechanism, in comparison with existing emoji prediction systems. We use the data made available in the context of the SemEval 2018 Shared Task on Emoji Prediction (Barbieri et al., 2018). Given a tweet, the task consists of predicting an associated emoji from a predefined set of 20 emoji labels. We evaluate our model on the English split of the official task dataset. We also show results from additional experiments in which the label space ranged from 20 to 200 emojis. These extended experiments are performed on a corpus of around 100M tweets geolocalized in the United States and posted between October 2015 and May 2018.
Models. In order to put our proposed label-wise attention mechanism in context, we compare its performance with a set of baselines: (1) FastText (Joulin et al., 2017) (FT), which was the official baseline in the SemEval task; (2) 2 [...]
Results. Table 1 shows the results of our model and the baselines in the emoji prediction task for the different evaluation splits. The evaluation metrics used are: F1, Accuracy@k (A@k, where k ∈ {1, 5}), and Coverage Error (CE) (Tsoumakas et al., 2009). We note that the latter metric is not normally used in emoji prediction settings. However, with many emojis being "near synonyms" (in the sense of being often used almost interchangeably), it seems natural to evaluate the performance of an emoji prediction system in terms of how far we would need to go through the predicted emojis to recover the true label. The results show that our proposed 2-BiLSTMs$_l$ method outperforms all baselines for F1 in three out of four settings, and for CE in all of them. In the following section we shed light on the reasons behind this performance, and we try to understand how these predictions were made.
Figure 2 (caption): The x-axis represents emoji labels, ranked from most to least frequent. Lower scores indicate a higher average rank predicted by our proposed label-wise attention mechanism.
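To make the two ranking-based metrics concrete, the following sketch computes Accuracy@k and Coverage Error for the single-gold-label case. It is an illustration only: the helper names and the toy scores are hypothetical, and it assumes that ties count against the gold label (every label scored at least as high as the gold label is counted). Some definitions of coverage subtract one from the rank, so absolute values may differ from those reported in the paper.

```python
def gold_rank(scores, gold):
    """1-based position of the gold label when labels are sorted by
    descending score; tied labels are counted against the gold label."""
    return sum(1 for s in scores if s >= scores[gold])


def accuracy_at_k(all_scores, golds, k):
    """Fraction of examples whose gold label is among the top k predictions."""
    hits = sum(1 for s, g in zip(all_scores, golds) if gold_rank(s, g) <= k)
    return hits / len(golds)


def coverage_error(all_scores, golds):
    """Average number of ranked predictions needed to reach the gold label."""
    ranks = [gold_rank(s, g) for s, g in zip(all_scores, golds)]
    return sum(ranks) / len(ranks)


# Toy scores for three tweets over three emoji labels (made-up numbers).
toy_scores = [
    [0.1, 0.6, 0.3],  # gold label 1 is ranked first
    [0.5, 0.2, 0.3],  # gold label 2 is ranked second
    [0.2, 0.3, 0.5],  # gold label 0 is ranked third
]
toy_golds = [1, 2, 0]

a_at_1 = accuracy_at_k(toy_scores, toy_golds, 1)
a_at_2 = accuracy_at_k(toy_scores, toy_golds, 2)
ce = coverage_error(toy_scores, toy_golds)
```

On this toy input only the first tweet counts for A@1 and the first two for A@2, while the coverage error averages the gold ranks 1, 2 and 3. A model that places the gold emoji just outside the top position is penalized fully by A@1 but only mildly by CE, which is why CE suits a label set with near-synonymous emojis.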

Analysis
By inspecting the predictions of our model, we found that the label-wise attention mechanism tends to be less heavily biased towards the most frequent emojis. This is reflected in the lower coverage error results in all settings, and becomes more noticeable as the number of labels grows. We verified this by computing the average difference between ranked predictions of the two attentive models in the 200-label setting (Figure 2). We can observe a sudden switch at roughly the median emoji, after which the label-wise attention model becomes increasingly accurate (relative to the standard attention model). This can be explained by the fact that infrequent emojis tend to be more situational (used in specific contexts and leaving less room for ambiguity or interchangeability), which the label-wise attention mechanism can take advantage of, as it explicitly links emojis with highly informative words. Let us illustrate this claim with a case in which the label-wise attention model predicts the correct emoji, unlike its single-attention counterpart:
a friendship is built over time, but sisterhood is given automatically. Gold:
(Footnote: The highlights show the $\alpha_l$ attention weights of .)
For the above example, the predictions of the single attention model were all linked to the general meaning of the message, that is, love and friendship, leading it to predict associated emojis ( , and ) and to miss the most relevant bit of information. On the other hand, our proposed model picks up on the word sisterhood and, with the added context of the surrounding words, ranks the gold label in 4th position, which would be a true positive as per A@5.
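The per-emoji comparison between the two attentive models can be sketched as follows. This is a hypothetical reconstruction of the kind of computation described, not the authors' script: the function names, the sign convention (label-wise rank minus single-attention rank, so negative values favour the label-wise model) and the toy scores are all assumptions.

```python
from collections import defaultdict


def gold_rank(scores, gold):
    """1-based position of the gold label in the descending ranking."""
    return sum(1 for s in scores if s >= scores[gold])


def mean_rank_difference(scores_label_wise, scores_single, golds):
    """For each gold label, average (label-wise rank - single-attention rank).

    Negative values mean the label-wise model ranks that label higher
    on average (assumed sign convention).
    """
    diffs = defaultdict(list)
    for s_lw, s_single, gold in zip(scores_label_wise, scores_single, golds):
        diffs[gold].append(gold_rank(s_lw, gold) - gold_rank(s_single, gold))
    return {label: sum(v) / len(v) for label, v in diffs.items()}


# Toy example with two labels: 0 (frequent) and 1 (infrequent).
toy_label_wise = [[0.6, 0.4], [0.3, 0.7], [0.45, 0.55]]
toy_single = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]
toy_golds = [0, 1, 1]

rank_diff = mean_rank_difference(toy_label_wise, toy_single, toy_golds)
```

In this made-up input both models rank the frequent label first, while only the label-wise scores rank the infrequent label first, so the difference is zero for label 0 and negative for label 1. Plotting such values with labels ordered by frequency would give the kind of curve in which a switch around the median emoji could be observed.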
Let us explore what we argue are interesting cases of emoji usage (ranging from highly explicit to figurative or situational intent). Figure 3 shows how the word (praying) and emojis such as and are strongly correlated. In addition, the bond between the word snow and the emoji is also indisputable. However, a perhaps more surprising example is displayed in Figure 4, which is a negative example. Here, the emoji was predicted with rank 1, and we see it being strongly associated with the ordinal second, suggesting that the model assumed this was some kind of "ticked enumeration" of completed tasks, which is indeed regular practice on Twitter. Finally, we found it remarkable that the ambiguous nature of the word boarding is also reflected in two different emojis being predicted with high probability ( and ), each of them showcasing one of the word's senses.
As an additional exploratory analysis, we computed statistics on those words with the highest average attention weights associated with one single emoji. One interesting example is the emoji, which shows two clear usage patterns: one literal (a tree) and one figurative (Christmas and holidays). As a final (and perhaps thought-provoking) finding, the highest attention weights associated with the emoji were given to the words game, boys and football, in that order. In other words, the model relies more on the word boys than on the actual description of the emoji. This is in line with a previous study that showed how the current usage of emojis on Twitter is in some cases associated with gender stereotypes (Barbieri and Camacho-Collados, 2018).
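One simple way to obtain such word statistics is to average, for a fixed emoji, the attention weight each word receives across all tweets in which it occurs. The sketch below assumes exactly that; the paper does not spell out the aggregation, so the helper name `top_attended_words`, the `min_count` filter and the toy tweets and weights are all hypothetical.

```python
from collections import defaultdict


def top_attended_words(examples, top_n=3, min_count=1):
    """Rank words by their average attention weight for ONE emoji.

    examples  : iterable of (tokens, weights) pairs, where weights[i] is the
                attention weight that the emoji's module gave to tokens[i]
    min_count : ignore words seen fewer than this many times
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, weights in examples:
        for token, weight in zip(tokens, weights):
            totals[token] += weight
            counts[token] += 1
    averages = {
        token: totals[token] / counts[token]
        for token in totals
        if counts[token] >= min_count
    }
    # Highest average weight first.
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up attention weights for a tree-like emoji over three toy tweets.
toy_examples = [
    (["merry", "christmas", "everyone"], [0.2, 0.7, 0.1]),
    (["decorating", "the", "tree"], [0.3, 0.1, 0.6]),
    (["christmas", "tree", "lights"], [0.5, 0.4, 0.1]),
]

top_words = top_attended_words(toy_examples, top_n=2)
```

With these toy weights the two highest-ranked words are the figurative cue and the literal one, mirroring the two usage patterns discussed above. Raising `min_count` would avoid rare words dominating the ranking on the strength of a single tweet.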

Conclusion
In this paper we have presented a neural architecture for emoji prediction based on a label-wise attention mechanism, which, in addition to improving performance, provides a degree of interpretability about how different features are used for predictions, a topic of increasing interest in NLP (Linzen et al., 2016; Palangi et al., 2017). As we experimented with sets of emoji labels of different sizes, our proposed label-wise attention architecture proved especially well-suited for emojis which were infrequent in the training data, making the system less biased towards the most frequent ones. We see this as a first step to improve the robustness of recurrent neural networks in datasets with unbalanced distributions, as they were shown not to perform better than well-tuned SVMs on the emoji prediction task (Çöltekin and Rama, 2018).
As for future work, we plan to apply our label-wise attention mechanism to understand other interesting linguistic properties of human-generated text in social media, and other multi-class or multi-label classification problems.
Finally, code to reproduce our experiments and additional examples of label-wise attention weights from input tweets can be downloaded at https://fvancesco.github.io/label_wise_attention/.