The Dabblers at SemEval-2018 Task 2: Multilingual Emoji Prediction

The “Multilingual Emoji Prediction” task focuses on predicting the emoji that corresponds to a given tweet. In this paper, we investigate the relation between words and emojis. To do so, we used supervised machine learning (Naïve Bayes) and deep learning (a Recurrent Neural Network).


Introduction
In the last few years, social media has evolved very quickly and has become an important part of daily life. Several social networking platforms, such as Twitter, Facebook, Instagram, and WhatsApp, were created to let us communicate with each other and share our feelings or opinions on different topics. Despite their differences in purpose, these platforms share one aspect: the use of emojis. From facial expressions to animals, objects, or places, emojis are used to communicate simple things or to enhance feelings and emotions. Because their meaning is not always the same, processing emojis remains a challenge for NLP researchers. The meaning of an emoji can vary a lot from one language to another, across cultures, or depending on the user's sentiment. Understanding this meaning in its context of use is highly relevant in multiple fields, such as human-computer interaction and multimedia retrieval. Twitter emojis are pictures usually combined with text in order to emphasize the meaning of that text. Although these pictures are the same all over the world, they can be interpreted and used in different ways, depending on cultural differences. Despite their wide usage in social media, their underlying semantics have received little attention from a Natural Language Processing standpoint.

Related work
Over the past few years, public and enterprise interest in social media has increased. Analyzing emojis has therefore become important for NLP researchers, because their meaning has remained largely unexplored. Go et al. [9] and Castellucci et al. [6] used distant supervision over emotion-labeled textual content in order to train a sentiment classifier and to build a polarity lexicon. Aoki et al. [1] described a methodology to represent each emoticon as a vector of emotions, while Jiang [10] proposed a sentiment and emotion classifier based on semantic spaces of emojis on the Chinese website Sina Weibo. Cappallo et al. [5] proposed a multimodal approach for generating emoji labels for images (Image2Emoji). Boia et al. (2013) [4] analyzed sentiment lexicons generated by considering emoticons, showing that in many cases they do not outperform lexicons created only with textual features. Barbieri et al. [2] tried to predict the most likely emoji that a Twitter message evokes. They used a model based on Bidirectional Long Short-Term Memory networks (BLSTMs) with standard lookup word representations and character-based representations of tokens. For the word representations, they replaced each word that occurred only once in the training data with a fixed representation (an out-of-vocabulary word vector), similar to the treatment of word embeddings by Dyer et al. (2015).

Data Set and Methods
In this section, we present the data set format and the architecture we used to predict emojis. We implemented two main modules: the first is based on a Recurrent Neural Network (3.3.1) and the second implements the Naïve Bayes algorithm.

Data Set
The corpus consists of 500K tweets in English and 100K tweets in Spanish. The tweets were retrieved with the Twitter APIs, from October 2015 to February 2017, from the United States and Spain. The data set includes tweets that contain one and only one emoji from the 20 most frequent emojis. The data was split into Training Data (80%), Trial Data (10%) and Test Data (10%).
The data set covers the 20 most frequent emojis of each language. To generate the training data, we used the tools provided by the organizers: a crawler for extracting the tweets and an emoji extractor. For each language, the data is represented by two files: one file containing one tweet per line and another file containing the corresponding emoji label on each line.
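The two-file layout described above can be read as in the following minimal Python sketch. It assumes that line i of the label file holds an integer label for line i of the tweet file; the function names are hypothetical and are not part of the organizers' tools.

```python
from typing import Iterable, List, Tuple


def pair_tweets_with_labels(
    tweet_lines: Iterable[str], label_lines: Iterable[str]
) -> List[Tuple[str, int]]:
    """Pair each tweet with the label found on the same line of the label file."""
    tweets = [line.rstrip("\n") for line in tweet_lines]
    # Assumption: every label line holds a single integer emoji index.
    labels = [int(line.strip()) for line in label_lines]
    if len(tweets) != len(labels):
        raise ValueError(
            f"{len(tweets)} tweets but {len(labels)} labels: files are not aligned"
        )
    return list(zip(tweets, labels))


def load_dataset(tweets_path: str, labels_path: str) -> List[Tuple[str, int]]:
    """Read the tweet file and the label file and return (tweet, label) pairs."""
    with open(tweets_path, encoding="utf-8") as tweets_file, \
            open(labels_path, encoding="utf-8") as labels_file:
        return pair_tweets_with_labels(tweets_file, labels_file)
```

For example, `load_dataset("en_train.txt", "en_train.labels")` would return one (tweet, label) pair per line, and it raises an error if the two files have a different number of lines.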

Tweet Pre-Processor
The first step of the preprocessor module cleans the data set (punctuation, stop words) in order to reduce noise in the implemented algorithms. This step removes punctuation marks and links, which we identify with the following regular expression:

(([-\"'/`_%$&*+<>^•()=|¡・;:.,!?@#~]+)|([0-9]+))
We removed stop words and user mentions, but we decided not to eliminate hashtag words, because many tweets consist only of such words. Instead, we removed the hashtag sign and passed the words on to the next preprocessing step. In the last step, we used the Stanford Tokenizer to obtain the list of tokens for each tweet. We then replaced each word with its corresponding lemma using the WordNet dictionary. Words that did not have a lemma were considered noise, and we chose to eliminate them.
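The cleaning steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not our preprocessor: the URL and mention patterns and the tiny stop-word list are hypothetical, whitespace splitting stands in for the Stanford Tokenizer, and the WordNet lemmatization step is omitted. Only the punctuation expression is taken from the text; note that it also matches digit sequences.

```python
import re
from typing import List

# Regular expression from the text: runs of punctuation or runs of digits.
PUNCT_OR_DIGITS = re.compile(r"""(([-\"'/`_%$&*+<>^•()=|¡・;:.,!?@#~]+)|([0-9]+))""")

# Assumed patterns, not given in the text.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "at", "the", "to", "of", "in", "is"}


def preprocess(tweet: str) -> List[str]:
    """Return the cleaned, lower-cased tokens of a tweet."""
    text = URL.sub(" ", tweet)             # drop links before punctuation is stripped
    text = MENTION.sub(" ", text)          # drop user mentions
    text = PUNCT_OR_DIGITS.sub(" ", text)  # drops '#' but keeps the hashtag word
    tokens = text.lower().split()          # stand-in for the Stanford Tokenizer
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("@user Loving the #sunset at 5pm!!! http://t.co/abc"))
```

Links and mentions are removed first in this sketch, because stripping punctuation beforehand would break them into fragments that are harder to recognize.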

Recursive Neural Networks
A recursive neural network (RvNN) is a kind of deep neural network created by applying the same set of weights recursively over a structure, in order to produce a structured prediction over variable-size input structures, or a scalar prediction on them, by traversing a given structure in topological order. Given the proven effectiveness and impact of recurrent neural networks on tasks such as sentiment analysis, we chose to build an emoji prediction model based on a Long Short-Term Memory (LSTM) network.
For each subtask, we divided the training data set into twenty smaller training sets, one for each emoji label. Each training subset contains tweets with only two labels (classes). For instance, the training set for label "0" contains tweets classified as either 0 or !0 (not 0). We then created a model for every emoji label and trained it on the corresponding training set.
In order to unify the models' outputs, we ran the test data set through each of these models. Then, based on the probability each model assigned to its label, we chose the emoji with the highest score.
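The one-model-per-label scheme can be illustrated with the short Python sketch below. It shows only the relabeling of the data and the final selection step; the per-label scorers are placeholders for the trained LSTM models, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def make_binary_sets(
    tweets: Sequence[str], labels: Sequence[int], num_labels: int
) -> Dict[int, List[Tuple[str, int]]]:
    """For each emoji label k, relabel every tweet as 1 (label k) or 0 (not k)."""
    return {
        k: [(tweet, 1 if label == k else 0) for tweet, label in zip(tweets, labels)]
        for k in range(num_labels)
    }


def predict_emoji(tweet: str, scorers: Dict[int, Callable[[str], float]]) -> int:
    """Run every per-label model on the tweet and keep the highest-scoring label.

    scorers[k](tweet) is assumed to return the probability that the tweet
    belongs to label k, as estimated by the binary model trained for k.
    """
    scores = {label: scorer(tweet) for label, scorer in scorers.items()}
    return max(scores, key=scores.get)
```

A usage example with constant placeholder scorers: `predict_emoji("good morning", {0: lambda t: 0.2, 1: lambda t: 0.7})` selects label 1, because the model for label 1 reports the highest probability.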

Naïve Bayes Classifier
Naïve Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Bayes' theorem gives the posterior probability as P(c|x) = P(x|c) P(c) / P(x), where:
P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
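Under the independence assumption, the posterior factorizes over the individual features. The following LaTeX block spells this out, writing x_1, ..., x_n for the features of a tweet (for example its words); this indexed notation is introduced here for illustration only.

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator does not depend on the class, so it can be dropped when choosing the most probable label.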
The Naïve Bayes classifier is written in JavaScript, and we relied on several existing libraries to build the module. Training takes as input a tweet file and a label file (e.g. "en_train.txt" and "en_train.labels"). The files are read line by line into an array of tweets, each tweet in the array is assigned the number found on the matching line of the label file, and the Naïve Bayes implementation is then called on the result. Finally, the test data is fed in and a label is generated for each tweet. The module exposes an interface that generates a label for any input text. Moreover, the generated label does not change if the same tweet is entered multiple times. The output is a txt file in which one label is generated for each line of the test data.
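As an illustration of the technique, a multinomial Naïve Bayes text classifier can be sketched in a few lines of Python. This is not our JavaScript module: the class name, the whitespace tokenization, and the add-one (Laplace) smoothing are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens (illustrative sketch)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                       # assumed Laplace smoothing constant
        self.class_counts: Counter = Counter()   # number of tweets per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary: set = set()

    def fit(self, tweets: Iterable[str], labels: Iterable[int]) -> "NaiveBayesTextClassifier":
        for tweet, label in zip(tweets, labels):
            tokens = tweet.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tweet: str) -> int:
        tokens = tweet.lower().split()
        total_tweets = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # log P(c) + sum_i log P(x_i | c), computed in log space for stability
            score = math.log(self.class_counts[label] / total_tweets)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because prediction depends only on the stored counts, entering the same tweet several times always yields the same label, which matches the behaviour described above.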

Discussions
Based on what we observed during the project implementation, we think that a possible improvement is to reduce the noise in the tweets. For instance, many words in tweets contain duplicated letters (e.g. "aaaaand"). Removing those duplicated letters until the word has a corresponding lemma could significantly reduce the noise.
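One possible way to realize this idea is sketched below in Python. It was not part of the submitted system. The function shortens every run of repeated letters to two letters or one letter and returns the first variant found in a vocabulary; the `vocabulary` set is a hypothetical stand-in for a WordNet lemma lookup.

```python
import itertools
from typing import Optional, Set


def normalize_elongated(word: str, vocabulary: Set[str]) -> Optional[str]:
    """Shorten repeated letters until the word is found in the vocabulary.

    Returns the matching vocabulary entry, or None if no shortened form matches.
    """
    if word in vocabulary:
        return word
    # Split the word into runs of identical letters, e.g. "aaaaand" -> a*5, n, d.
    runs = [(char, len(list(group))) for char, group in itertools.groupby(word)]
    # A repeated run may be kept as a double letter or reduced to a single one.
    options = [
        [char * n for n in sorted({min(length, 2), 1}, reverse=True)]
        for char, length in runs
    ]
    for parts in itertools.product(*options):
        candidate = "".join(parts)
        if candidate in vocabulary:
            return candidate
    return None


vocabulary = {"and", "happy", "love"}  # hypothetical stand-in for a lemma dictionary
print(normalize_elongated("aaaaand", vocabulary))   # and
print(normalize_elongated("happyyyy", vocabulary))  # happy
```

Trying both the double-letter and single-letter form of each run keeps legitimate double letters, as in "happy", while still reducing elongated words.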

Conclusions
Emojis are widely used on social networking sites, but not much is known about their use and semantics. However, it has been noticed that emojis are used differently across communities. In this paper, we tried to predict the corresponding emoji for a given tweet using a deep learning module based on a Recurrent Neural Network and a Naïve Bayes module. The Naïve Bayes implementation obtained better results than the network module.