#TeamINF at SemEval-2018 Task 2: Emoji Prediction in Tweets

In this paper, we describe a methodology to predict emoji in tweets. Our approach combines the classic bag-of-words model with word embeddings and uses Logistic Regression as the classification algorithm. This architecture was evaluated in the context of the SemEval 2018 challenge (task 2, subtask 1).


Introduction
Over the years, technology has significantly changed the way people communicate, especially through social media such as Twitter, Facebook, and WhatsApp. Such media allow users to express their opinions and emotions not only with words but also through images, the so-called emojis.
However, within the context of sentiment analysis, little research has been dedicated to exploring the semantics of emoji (Barbieri et al., 2016), which makes it an interesting challenge to investigate.
Understanding the meaning of emojis in relation to their context of use is important for multimedia indexing, retrieval, and content extraction systems. In addition, an emoji can complement the meaning of a message and can even determine the sentiment of a text. However, such emotive figures may become unreliable in ironic or sarcastic contexts.
In this paper, we develop a methodology to predict emoji in tweets. Our method is based on the bag-of-words model with n-grams, in conjunction with pre-trained GloVe word embeddings, followed by a classification algorithm.
This configuration was employed and evaluated in the SemEval 2018 challenge (task 2, subtask 1), in which the goal is to predict the emoji of a tweet (Barbieri et al., 2018).
This work is organized as follows: section 2 discusses related work, section 3 describes the data set and the task, section 4 presents the methodology applied in the task, section 5 reports the results, and section 6 gives final considerations and future work.

Related Works
Emojis can express diverse types of content visually, adapting to the informal style of communication in social networks. The meaning expressed by emoticons has been explored to enable or improve various tasks related to sentiment analysis, as in (Hogenboom et al., 2013, 2015). Emojis can also be used to label the text excerpts in which they occur, making it possible to construct sentiment lexicons. In this context, Go et al. (2009) and Castellucci et al. (2015) use distant supervision over emotionally marked textual content to train a sentiment classifier and construct a polarity lexicon, while Novak et al. (2015) constructed a lexicon and drew a sentiment map of the 751 most used emoji.
In the work of Barbieri et al. (2017), the authors investigated the relationship between words and emojis, studying the new task of predicting which emoji are evoked by text-based tweet messages. The authors trained several models based on Long Short-Term Memory networks (LSTMs).
In (Barbieri et al., 2016) the authors explore the meaning and use of emojis in four languages: American English, British English, Peninsular Spanish and Italian. Through several experiments, the researchers compared how the semantics of emoji vary across languages. In the first experiment, they investigated whether the meaning of a single emoji is preserved across all language variations. In the second experiment, they compared the general semantic models of the 150 most frequent emoji in all languages. The study found that the general semantics of the most frequent emoji is similar across languages.
Finally, given the context of the SemEval 2018 challenge (task 2, subtask 1), we propose a model capable of predicting the emoji corresponding to a tweet.

Dataset and Task
Dataset. The data for the task consists of 500k tweets in English for training, 50k for trial and 50k for test. The tweets were retrieved with the Twitter APIs, from October 2015 to February 2017, and geolocalized in the United States. The dataset includes only tweets that contain exactly one emoji, drawn from the 20 most frequent emojis. The number of tweets per data set can be seen in Figure 1.
Task details. Because visual icons can provide additional meaning to social messages, and because Twitter plays a key role as one of the most important communication platforms, the SemEval 2018 organizers invite participants to predict the emoji associated with a tweet in English (Barbieri et al., 2018).

Methodology
The methodology applied in this task consists of two phases, one based on the bag-of-words model and another based on the word embeddings (GloVe) model. In the end, the two representations are concatenated, as shown in Figure 2.

Preprocessing
This step consists of eliminating noise and terms that carry no semantic significance for the prediction task. To this end, we remove links, numbers, special characters, and stop words (words with low discriminative power, for example, "is", "that", etc.). Tweets are also converted to lowercase and, finally, stemmed. The purpose of stemming is to reduce words to their stem; for example, the word "believes" is transformed into "believ" (Perkins, 2014).
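The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a small hypothetical sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's. The regular expression for special characters also drops non-ASCII letters, which is a simplification.

```python
import re

# Hypothetical stop-word sample; the paper does not say which list was used.
STOP_WORDS = {"is", "that", "the", "a", "an", "and", "are", "to", "of", "in"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, drop links/numbers/special characters/stop words, then stem."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"\d+", " ", text)                     # numbers
    text = re.sub(r"[^a-z\s]", " ", text)                # special characters
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return [crude_stem(tok) for tok in tokens]


tokens = preprocess("She believes that 2 links http://t.co/abc are GREAT!!!")
```

With a library stemmer in place of `crude_stem`, the rest of the function would stay the same; only the final list comprehension changes.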

Bag-of-words
We apply bag-of-words as a baseline, since it has been successfully employed in various classification tasks (Da Silva et al., 2014; Barbieri et al., 2017; Pak and Paroubek, 2010; Kouloumpis et al., 2011; Socher et al., 2013). We represent each message with a vector of tokens selected using term frequency-inverse document frequency (TF-IDF) over n-grams of up to four words (quadrigrams), with min_df = 1, max_features = 3500, and ngram_range = (1, 4). For Logistic Regression we set C = 10.0, while for the Support Vector Machine and Random Forest we kept the default hyperparameters.
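To make the representation concrete, the sketch below builds TF-IDF vectors over 1- to 4-grams in plain Python, so it does not presume any particular library. Several details are assumptions, since the paper does not spell them out: the vocabulary keeps the `max_features` most frequent n-grams, ties are broken alphabetically, and the IDF uses a smoothed variant. With a minimum document frequency of 1, no n-gram is filtered out for being rare.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=4):
    """All contiguous n-grams of length n_min..n_max, joined with spaces."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs, max_features=3500):
    """docs: list of token lists. Returns (vocabulary, list of dense vectors)."""
    counts_per_doc = [Counter(ngrams(doc)) for doc in docs]

    # Keep the max_features n-grams with the highest corpus frequency (assumption).
    totals = Counter()
    for counts in counts_per_doc:
        totals.update(counts)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [gram for gram, _ in ranked[:max_features]]

    # Smoothed inverse document frequency (one common variant; an assumption).
    n_docs = len(docs)
    doc_freq = Counter(gram for counts in counts_per_doc for gram in counts)
    idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1 for g in vocab}

    vectors = [[counts[g] * idf[g] for g in vocab] for counts in counts_per_doc]
    return vocab, vectors


vocab, vectors = tfidf_vectors([["good", "day"], ["good", "night"]])
```

The dense lists are only for readability; with 3500 features and hundreds of thousands of tweets, a sparse representation would be the practical choice.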

Word embeddings
Word embeddings (Bengio et al., 2003) are obtained from a supervised statistical language model trained using deep neural networks. The purpose of this model is to predict the next word given the previous context in the sentence, so that similar words tend to lie close together in the vector space. The vector representation of words was a great advance over strategies based on bag-of-words. For the proposed task we apply the GloVe model (with 200 dimensions) of Pennington et al. (2014). GloVe is a count-based model, in which the vectors are derived from a co-occurrence matrix used to extract statistical information about the corpus. With this model, a feature matrix was generated by taking the simple arithmetic mean of the word vectors of each tweet.
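The averaging step, and the concatenation with the bag-of-words vector described in the methodology, can be illustrated with a small sketch. The embedding table here is a hypothetical three-dimensional toy rather than real 200-dimensional GloVe vectors, and skipping out-of-vocabulary tokens (with a zero vector when none are known) is an assumption, since the paper does not say how such tokens were handled.

```python
def mean_embedding(tokens, embeddings, dim):
    """Arithmetic mean of the vectors of the tokens found in `embeddings`."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        # No known token: fall back to a zero vector (assumption).
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def combine(bow_vector, embedding_vector):
    """Concatenate the bag-of-words features with the averaged embedding."""
    return list(bow_vector) + list(embedding_vector)


# Hypothetical toy embeddings; real GloVe vectors would have 200 dimensions.
toy_embeddings = {"good": [1.0, 0.0, 2.0], "day": [3.0, 2.0, 0.0]}

tweet_vector = mean_embedding(["good", "day", "unknownword"], toy_embeddings, dim=3)
features = combine([0.5, 0.0], tweet_vector)
```

The concatenated vector is what a classifier such as Logistic Regression would receive as input for each tweet.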

Challenges
Because the task demands high computational power and the feature table is high-dimensional, both in number of attributes and in number of rows, only a 10% sample of the training data was used. This sample preserves the class distribution of the full training set.
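A sample that preserves the class distribution is usually obtained by stratified sampling. The sketch below shows one way to do it; the function name, the fixed seed, and the rule of keeping at least one example per class are illustrative assumptions rather than details reported in the paper.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction=0.1, seed=0):
    """Return sorted indices of a sample that keeps each class's share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = []
    for indices in by_class.values():
        # Draw the same fraction from every class, at least one example each.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


labels = [0] * 700 + [1] * 300
sample = stratified_sample(labels, fraction=0.1)
```

In this toy example the 70/30 split of the full list is carried over to the 100 sampled indices.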

Results
In this section, we report the results obtained by our model according to the evaluation metrics of the challenge: macro F1, precision, recall, accuracy, and F1 for each emoji (Barbieri et al., 2018). Results are reported for five configurations: (i) the system based on word embeddings and bag-of-words with Logistic Regression (LR); (ii) the system based on word embeddings and bag-of-words with Support Vector Machine (SVM); (iii) the bag-of-words system with Logistic Regression (LR); (iv) the bag-of-words system with Support Vector Machine (SVM); and (v) the bag-of-words system with Random Forest (RF). In Table 1 we show the models' performance, and in Figure 3 we present the predicted score for one of the 20 emojis.
The results on the test data indicate that word embeddings together with bag-of-words produce the best F1. On the other hand, the three configurations based only on bag-of-words obtained results close to those of the main model (word embeddings + bag-of-words). It is important to remember that only 10% of the training data was used, a choice that directly influenced the final result.
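For readers unfamiliar with macro F1, the sketch below computes it as the unweighted mean of the per-class F1 scores. It is an illustration of the metric, not the official scorer of the challenge; taking the label set as the union of true and predicted labels, and scoring a class as 0 when precision and recall are both undefined, are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

Because every class counts equally, rare emojis weigh as much as frequent ones, which makes macro F1 stricter than accuracy on imbalanced data.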

Conclusion
In this paper, we proposed several configurations based on word embeddings and bag-of-words for SemEval 2018 task 2, subtask 1. As base classifiers we used Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF) to predict emojis in tweets. Our best model achieved an F1 of 21.497.
As future work, we intend to explore the semantics of emojis further and to apply other word embedding models, such as Word2Vec (Mikolov et al., 2013), FastText (Joulin et al., 2016) and Doc2Vec (Le and Mikolov, 2014), with more computational resources.