SemEval 2018 Task 2: Multilingual Emoji Prediction

This paper describes the results of the first Shared Task on Multilingual Emoji Prediction, organized as part of SemEval 2018. Given the text of a tweet, the task consists of predicting the most likely emoji to be used along with that tweet. Two subtasks were proposed, one for English and one for Spanish, and participants were allowed to submit a system run to one or both subtasks. In total, 49 teams participated in the English subtask and 22 teams submitted a system run to the Spanish subtask. Evaluation was carried out emoji-wise, and the final ranking was based on macro F-Score. Data and further information about this task can be found at https://competitions.codalab.org/competitions/17344.


Introduction
Emojis are small ideograms depicting objects, people, and scenes (Cappallo et al., 2015). Emojis are one of the main components of a novel way of communication emerging from the advent of social media. They complement (usually) short text messages with a visual enhancement which is, as of now, a de-facto standard for online communication (Barbieri et al., 2017). Figure 1 shows an example of a social media message displaying an emoji.
Figure 1: Example of a social media message displaying an emoji: "Sometimes I think I wanna change the world... and I forget it just starts with changing me."
Emojis can be considered, to some extent, an evolution of character-based emoticons (Pavalanathan and Eisenstein, 2015), and they currently represent a widespread and pervasive global communication device adopted by almost all social media services and instant messaging platforms.
Any system targeting the task of modeling social media communication is expected to tackle the usage of emojis. In fact, their semantic load is sufficiently rich that reducing them to sentiment carriers or boosters would neglect the semantic richness of these ideograms, which in addition to mood include in their vocabulary references to food, sports, scenery, etc. More generally, effectively predicting the emoji associated with a piece of content may help to improve different NLP tasks (Novak et al., 2015), such as information retrieval, generation of emoji-enriched social media content, and suggestion of emojis when writing text messages or sharing pictures online. Given that emojis may also mislead humans (Barbieri et al., 2017; Miller et al., 2017), the automated prediction of emojis may help to achieve better language understanding. As a consequence, by modeling the semantics of emojis, we can improve highly subjective tasks like sentiment analysis, emotion recognition and irony detection (Felbo et al., 2017).
In this context, Barbieri et al. (2017) introduced the task of emoji prediction in Twitter by training several models based on bidirectional Long Short-Term Memory networks (LSTMs) (Graves, 2012), and showing that they can outperform humans in solving the same task. These promising results motivated us to propose the first shared task on Multilingual Emoji Prediction. Following the experimental setting proposed by Barbieri et al. (2017), the task consists of predicting the most likely emoji associated with a given text-only Twitter message. Only tweets with a single emoji are included in the task datasets (trial, train and test sets), so that the challenge can be cast as a single-label classification problem.
In this paper, we first motivate and describe the main elements of this shared task (Sections 2 and 3). Then, we cover the dataset compilation, curation and release process (Section 4). In Section 5 we detail the evaluation metrics and describe the overall results obtained by participating systems. Finally, in Section 6, we close this task description paper with the main conclusions drawn from the organization of this challenge and outline potential avenues for future work.

Related Work
Modeling the semantics of emojis, and the applications thereof, is a relatively novel research problem with direct applications in any social media task. By explicitly modeling emojis as self-contained semantic units, the goal is to alleviate the lack of an associated grammar. This context, which makes it difficult to encode a clear and unambiguous single meaning for each emoji, has given rise to work considering emojis as function words or even affective markers (Na'aman et al., 2017), potentially affecting the overall semantics of longer utterances like sentences (Monti et al., 2016; Donato and Paggio, 2017).
The polysemy of emoji has been explored user-wise (Miller et al., 2017), location-wise, specifically in countries (Barbieri et al., 2016b) and cities (Barbieri et al., 2016a), gender-wise, time-wise (Barbieri et al., 2018b; Chen et al., 2017), and even device-wise, due to the fact that emojis may have different pictorial characteristics (and therefore, different interpretations), depending on the device (e.g., iPhone, Android, Samsung, etc.) or app (WhatsApp, Twitter, Facebook, and so forth) 3 (Tigwell and Flatla, 2016; Miller et al., 2016).
3 The image that represents the same emoji can vary; e.g., for the emoji U+1F40F, renderings differ by platform in Unicode v11 (up to April 2018): Apple, Google, Twitter, EmojiOne, Facebook, Samsung, Windows.
Today, modeling emoji semantics via vector representations is a well-defined avenue of work. Contributions in this respect include models trained on Twitter data (Barbieri et al., 2016c), Twitter data together with the official Unicode description (Eisner et al., 2016), or text from a popular keyboard app (Ai et al., 2017). In the latter contribution it is argued that emojis used in an affective context are more likely to become popular, and that, in general, the most important factor for an emoji to become popular is to have a clear meaning. The area of emoji vector evaluation has also experienced significant growth recently. For instance, Wijeratne et al. (2017a) propose a platform for exploring emoji semantics. Further studies on evaluating emoji semantics may now be carried out by leveraging two recently introduced datasets of pairwise emoji similarity with human annotations, namely EmoTwi50 (Barbieri et al., 2016c) and EmoSim508 (Wijeratne et al., 2017b). On the application side, emoji similarity has been studied for proposing efficient keyboard emoji organization, essentially by placing similar emojis close together on the keyboard (Pohl et al., 2017).
An aspect related to emoji semantic modeling for which awareness is increasing dramatically is the inherent bias existing in these representations. For example, Barbieri and Camacho-Collados (2018) show that emoji modifiers can affect the semantics of emojis (they looked specifically into skin tones and gender). This recent line of research has also been explored in Robertson et al. (2018), who argue, for example, that users with darker-skinned profile photos employ skin modifiers more often than users with lighter-skinned profile photos, and that the vast majority of skin tone usage matches the color of a user's profile photo.
The application of well-defined emoji representations in extrinsic tasks is an open area of research. A natural application, however, lies in the context of sentiment analysis. This has fostered research, for example, in creating sentiment lexicons for emojis (Novak et al., 2015; Kimura and Katsurai, 2017; Rodrigues et al., 2018), or in studying how emojis may be used to retrieve tweets with specific emotional content (Wood and Ruder, 2016). Moreover, Hu et al. (2017) study how emojis affect the sentiment of a text message, and show that not all emojis have the same impact. Finally, the fact that emojis carry sentiment and emotion information is verified in the study by Felbo et al. (2017), where an emoji prediction classifier is used as a pre-trained system and is then fine-tuned for predicting sentiment, emotions and irony.
The last item to be covered in this review involves multimodality. Recently, emojis have also been studied from a perspective in which visual signals are incorporated, taking advantage of existing social media platforms with a strong focus on visual content, like Instagram. Recent contributions show that the usage of emojis depends on both textual and visual content, but seem to agree that, in general, textual information is more relevant for the task of emoji prediction (Cappallo et al., 2015, 2018; Barbieri et al., 2018a).

Task Description
Given a text message including an emoji, the emoji prediction task consists of predicting that emoji by relying exclusively on the textual content of that message. In particular, in this task we focused on tweets containing a single emoji, thus relying on Twitter data.
Figure 2: Example of a tweet with an emoji at the end, considered in the emoji prediction task: "Last hike in our awesome camping weekend!"
The task is divided into two subtasks, dealing respectively with the prediction of the emoji associated with English and Spanish tweets. The motivation for providing a multilingual setting stems from previous findings about the idiosyncratic use of emojis across languages (Barbieri et al., 2016b) (see Figure 3): one emoji may be used with completely different meanings depending not only on the language of the speaker, but also on regional dialects (Barbieri et al., 2016a).
For each subtask we selected the tweets that included one of the twenty emojis that occur most frequently in the Twitter data we collected (Table 1). Therefore, since each tweet carries exactly one emoji, the task can be viewed as a multi-class classification problem with twenty labels.
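The label selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the organizers' pipeline: `EMOJI_RE` is a hypothetical, deliberately coarse emoji matcher, and `single_emoji_examples` and `keep_top_k` are names introduced here for the example.

```python
import re
from collections import Counter

# Assumption: a rough emoji matcher over two common Unicode ranges, optionally
# followed by the variation selector U+FE0F. Skin-tone modifiers, ZWJ
# sequences and flags are not handled.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]\uFE0F?")


def single_emoji_examples(tweets):
    """Return (text_without_emoji, emoji) pairs for tweets with exactly one emoji."""
    examples = []
    for tweet in tweets:
        found = EMOJI_RE.findall(tweet)
        if len(found) == 1:
            examples.append((EMOJI_RE.sub("", tweet).strip(), found[0]))
    return examples


def keep_top_k(examples, k=20):
    """Keep examples whose emoji is among the k most frequent; map emojis to ids."""
    counts = Counter(emoji for _, emoji in examples)
    top = [emoji for emoji, _ in counts.most_common(k)]
    label_id = {emoji: i for i, emoji in enumerate(top)}
    data = [(text, label_id[emoji]) for text, emoji in examples if emoji in label_id]
    return data, top
```

The emoji is removed from the text and kept only as the label, so that a classifier sees text-only input, as in the task definition.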
Twitter datasets were shared among participants by providing a list of tweet IDs 4 or directly the text of the tweets.
4 Participants were provided with a Java-based crawler (https://github.com/fra82/twitter-crawler) to ease the download of the textual content of tweets from the ID list.
Figure 3: Example of distinct use of the fire emoji across languages: the first tweet (English, "It's flipping hot out here!") comments on the torrid weather, while the second one (Spanish, "Iniciamos el nuevo año con ilusión!", i.e., 'We start the new year with enthusiasm!') exploits the same emoji to wish a happy new year.

Task Data
The data for the task consists of a list of tweets associated with a given emoji (i.e., label). As explained in the previous section, the dataset includes tweets that contain one and only one emoji, drawn from the 20 most frequent emojis. We split the data into trial 5, training and test data. The quantity of tweets per set is displayed in Table 2.
The tweets were retrieved with the Twitter APIs and geolocated in the United States and Spain for subtasks 1 and 2, respectively. As for the trial and training data, the tweets were gathered from October 2015 to February 2017, whereas for the test data we decided to gather the tweets corresponding to the last months until the evaluation period started (from May 2017 to January 2018). This would prevent participants from gathering these tweets beforehand and also would enable us to test the emoji prediction task in a more realistic setting, as the test data is subsequent to the training data.
5 Trial data was used as development data by participants.
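As a rough illustration of the chronological split described above, the sketch below assigns a tweet to a split from its creation date. The text only gives month-level boundaries, so the exact days used here are assumptions, and `assign_split` is a hypothetical helper name.

```python
from datetime import date

# Assumption: only the months are stated in the text; the exact first and
# last days below are illustrative.
TRAIN_RANGE = (date(2015, 10, 1), date(2017, 2, 28))  # trial + training period
TEST_RANGE = (date(2017, 5, 1), date(2018, 1, 31))    # test period


def assign_split(created):
    """Return 'train', 'test', or None for a tweet creation date."""
    if TRAIN_RANGE[0] <= created <= TRAIN_RANGE[1]:
        return "train"
    if TEST_RANGE[0] <= created <= TEST_RANGE[1]:
        return "test"
    return None  # falls in the gap between the two periods, or outside both
```

Keeping the test period strictly later than the training period means a system is evaluated on tweets it could not have seen, which mirrors real deployment.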

Evaluation
This section introduces the overall evaluation setting of this shared task. We first describe briefly the evaluation metrics used and then provide a succinct description of the baseline system.

Evaluation Metrics
As this was a single-label classification problem, the classic precision (Prec.), recall (Recall), F-score (F1) and accuracy (Acc.) were used as official evaluation metrics. Note that because of the skewed distribution of the label set we opted for the macro average over all labels.
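To make the macro average concrete, here is a minimal sketch that computes per-label precision, recall and F1 and then averages them with equal weight per label. It is not the official scorer: taking the label set as the union of gold and predicted labels, and scoring undefined ratios as 0, are assumptions of this example.

```python
def macro_scores(gold, pred):
    """Macro-averaged precision, recall, F1, plus accuracy, for label lists."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 when undefined
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(1 for g, p in zip(gold, pred) if g == p) / len(gold),
    }
```

Because every label contributes equally, a system that only predicts the most frequent emoji gets a low macro F1 even if its accuracy looks acceptable.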

Baseline
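The baseline system for this task was a classifier based on FastText (Joulin et al., 2017). Given a set of $N$ documents, the loss that the model attempts to minimize is the negative log-likelihood over the labels (in our case, the emojis):
$$-\frac{1}{N} \sum_{n=1}^{N} e_n \log\left(\mathrm{softmax}(B A x_n)\right)$$
where $e_n$ is the emoji included in the $n$-th Twitter post, represented as a one-hot vector and used as label; following the formulation of Joulin et al. (2017), $x_n$ is the normalized bag of features of the $n$-th post, and $A$ and $B$ are weight matrices.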
Hyperparameters were left at their default values.
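The negative log-likelihood minimized by the baseline can be illustrated with a small pure-Python sketch. It assumes the linear formulation of Joulin et al. (2017), in which a bag-of-features vector is mapped through two weight matrices (called `A` and `B` here) and a softmax; the function names are hypothetical and this is not the FastText implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def nll_loss(A, B, docs, labels):
    """Average negative log-likelihood of the gold labels.

    docs: bag-of-features vectors; labels: gold class indices. With a one-hot
    label, only the log-probability of the gold class contributes.
    """
    total = 0.0
    for x, y in zip(docs, labels):
        probs = softmax(matvec(B, matvec(A, x)))
        total -= math.log(probs[y])
    return total / len(docs)
```

With all-zero output weights the model is uniform over the labels, so the loss equals the logarithm of the number of classes; training lowers it by moving probability mass towards the gold emoji.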
We briefly mention the main features of some significant systems ranked above the baseline in either of the subtasks.
• Tübingen-Oslo. This supervised system consists of an SVM classifier with bag-of-n-grams features (both characters and words). Tübingen-Oslo is the top performing system in both subtasks.
• NTUA-SLP (Baziotis et al., 2018). This system uses a Bi-LSTM with attention, and pretrained word2vec vectors. They used external resources for associating each tweet with information on emotions, concreteness, familiarity, and others. They only participated in the English subtask, where they ranked second (according to the F1 score) with the highest recall.
• EmoNLP (Liu, 2018). This system is based on a Gradient Boosting Regression Tree approach combined with a Bi-LSTM on character and word n-grams. It is complemented with several lexicons as well as learned sentiment-specific word embeddings.
• UMDuluth-CS8761 (Beaulieu and Asamoah Owusu, 2018). This supervised system combines an SVM with a bag-of-words approach for extracting salient features. This is one of the most competitive systems, with the highest precision in English and the third best result in Spanish.
• Hatching Chick (Coster et al., 2018). This system builds an SVM classifier (with gradient descent optimization) on word and character n-grams. They obtained the second best result in the Spanish subtask, but their English system performed worse than the baseline.
• TAJJEB (Basile and Lino, 2018). This system made use of an SVM classifier over a wide variety of features such as tf-idf, part-of-speech tags and bigrams. The system was competitive in both languages, outperforming the baseline on the Spanish dataset.
• Duluth UROP (Jin and Pedersen, 2018).
Table 1: Emojis considered for English (20) and Spanish (19). We also report the relative frequency percentage of each emoji in the test set.

Results
Each system was evaluated according to its capacity to perform well across all emojis under consideration. As mentioned, and due to the skewed distribution of the label set, we evaluated each participating system according to Macro F-Score (F1). The overall results are provided in Table 3, and several interesting conclusions can already be drawn from them. For instance, it is noteworthy that the best systems for the two subtasks are more than 10 points apart (English better), which suggests that a one-size-fits-all model may be suboptimal for this task, and that the particularities of each individual language should indeed be taken into consideration for best performance. The most precise systems were EmoNLP and Tübingen-Oslo, whereas the highest recall was obtained by NTUA-SLP and again Tübingen-Oslo (English and Spanish respectively, in both cases). Clearly, the Tübingen-Oslo system shows a fine balance between precision and recall, perhaps due to its limited preprocessing and fine-tuning and its low reliance on external libraries. It seems reasonable to assume, thus, that combining word and n-gram embeddings as features, with SVMs and NN classifiers, provides a robust and high-performing architecture for emoji prediction, with the added value of being resource/knowledge agnostic.

Analysis
This evaluation is finally complemented with the overall emoji-wise performance across all systems (Table 4). The lexical notion of near synonymy seems to apply to emojis as well, as we can clearly see a worse performance on those emojis which are pictorially similar (e.g., the photo camera with and without flash, or the expected confusion between less frequent hearts and the red heart, which accounts for over 20% of the whole label set in the test data).
Finally, emojis that have several interpretations and are less frequent seem to be much more difficult to predict (e.g., one face emoji in both the English and Spanish datasets, and another emoji in the Spanish dataset). Zhou et al. (2018) showed in their system description paper how exploiting user-specific features may provide significant performance boosts. This additional user-specific information may clearly help in these difficult cases, which proved to be hard for all systems.

Conclusions
In this paper we have described the SemEval 2018 shared task on multilingual emoji prediction. The task, consisting of predicting the most likely emoji given the text of a tweet, was well received, with almost 50 system runs submitted to the English subtask and more than 20 to the Spanish subtask. One of the main conclusions that can be drawn is that the baseline we used (FastText) was highly competitive, with only 6 and 5 system runs performing better in English and Spanish, respectively. In terms of participating systems, and according to the post-participation survey the participants completed, we can see a high prevalence of neural approaches, with only 9 systems opting for more traditional non-neural models (6 SVMs, 3 Random Forests). Among the chosen neural architectures, LSTMs and CNNs are by far the preferred ones. Noteworthy, however, is the excellent performance of SVMs as used in the best performing system on both the English and Spanish datasets.
This task has set the foundations for upcoming work on modeling emoji semantics, first, by providing a standardized testbed for emoji prediction in two languages, and second, by providing a comprehensive evaluation with a wide range of ideas, which we hope are of use for future research. Emojis, undoubtedly, are becoming increasingly important in understanding social media communication and in human-computer interaction, and thus we believe the problem of modeling emoji semantics can be further extended as follows.
(1) Leveraging multimodal information (e.g., associated images (Barbieri et al., 2018a)); (2) incorporating more and more diverse languages (one step in this direction will be the re-run of this task for Italian at the Evalita 2018 evaluation campaign); and (3) considering individual and communicative contexts for overall performance improvements.