PickleTeam! at SemEval-2018 Task 2: English and Spanish Emoji Prediction from Tweets

We present a system for emoji prediction on English and Spanish tweets, prepared for the SemEval-2018 task on Multilingual Emoji Prediction. We compared the performance of an SVM, an LSTM, and an ensemble of the two. We found that the SVM performed best on our development set, with an accuracy of 61.3% for English and 83% for Spanish. The features used for the SVM are lowercased word n-grams in the range of 1 to 20, tokenised by a TweetTokenizer and stripped of stop words. On the test set, our model achieved an accuracy of 34% on English, with a slightly lower accuracy of 29.7% on Spanish.


Introduction
The way people communicate with each other has changed since the rise of social media. Many people use visual icons, so-called emojis, to complement their social media messages. Emojis are frequently used on online platforms like Twitter, Facebook, Instagram and WhatsApp. The wide use of emojis in social media means that processing these emojis can be relevant for NLP applications dealing with social media data.
Social media text has been studied in the field of author profiling, but interest in research on emojis has only recently started growing. Author profiling is used in different fields such as marketing, forensics, psychological research and medical diagnosis. Author profiling focuses on stylometric features, and since this popular new way of expressing meaning with emojis has become mainstream, it is important to research if and how this data can be used in addition to the textual data. Emojis could possibly reveal a great deal about the author's gender, location, age or other characteristics.
In this paper, we describe our approach to SemEval-2018 Task 2 on Multilingual Emoji Prediction (Barbieri et al., 2018). We discuss the features and the machine learning methods we used, and analyse the performance of our best method.

Related Work
Author profiling tasks are focusing more and more on social media. Oftentimes, the provided data is obtained from social media platforms (Rangel et al., 2017). Research on emojis, however, is more scarce. Some research on emojis was done by Barbieri et al. (2017). They investigated the relation between words and emojis, and found that neural models outperform baseline bag-of-words models as well as humans when predicting which emojis are used in tweets. Xie et al. (2016) researched automatic emoji recommendation using neural networks. Emojis can express more delicate feelings beyond plain text, and suggesting suitable emojis to users of messaging systems can enhance the user experience. They approached this problem with neural networks, and found that a hierarchical LSTM system significantly outperformed all other LSTM approaches.
Zhao and Zeng (2017) also looked at emoji prediction. The task described in their paper is very similar to the SemEval task. They achieved an accuracy of 40% using a CNN, with the Twitter GloVe embeddings as features. Since they worked with a noisy dataset they constructed themselves, while we are provided with a clean dataset, a similar approach might yield high scores.
Author profiling on tweets is not new. At PAN 2017 (Rangel et al., 2017), Basile et al. (2017) were able to achieve a score of 82% on gender prediction for English tweets. They approached the task with an SVM using combinations of character and tf-idf word n-grams. This yields good results for predicting gender, and can provide a good basis for an emoji prediction system.
In light of this task, sentiment analysis might be helpful: the sentiment of a tweet might point the classifier in the right direction. Mohammad et al. (2013), Han et al. (2013) and Da Silva et al. (2014) all looked into sentiment classification of tweets using machine learning algorithms. Da Silva et al. (2014) achieved an accuracy of 84.85% on predicting sentiment on a tweet dataset using an ensemble in which an SVM, Random Forest and Multinomial Naive Bayes were combined by majority voting. It might be fruitful to try some of the features and methods used in these papers to see whether sentiment can be a distinctive feature for emoji prediction. Unfortunately, we did not manage to experiment with these features.

Data
The dataset used for this task was provided by the organizers of the SemEval task. It is derived from Twitter and only includes Spanish and English tweets from Spain and the United States, respectively. An overview of the emojis in the dataset is shown in Tables 1 and 2.

Method
For the task of emoji prediction, we explored a neural network approach and an SVM approach. We established a basic machine learning model per approach and improved on these models for both the Spanish and the English development datasets. With this approach, we aim to develop a robust model that can accurately predict the emojis for both the Spanish and the English dataset.
Architectures we tried for the neural network approach ranged from a simple sequential model with a few hidden layers to a stacked LSTM model with word embeddings.
The highest results for our neural network approach were achieved by a sequential neural network model. Our first layer was a 200-dimensional embedding layer, using the GloVe Twitter embeddings (Pennington et al., 2014). Secondly, we used an LSTM layer. After the LSTM layer, our model included a Dropout layer with a rate of 0.2 (Srivastava et al., 2014). The output layer was a dense layer with the sigmoid activation function. Our model used a categorical cross-entropy loss and was optimized with the Adam optimizer (Kingma and Ba, 2014). We used zero masking, 20 epochs and a batch size of 128. Other parameters were left at the Keras defaults.
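The layers described above can be sketched in Keras as follows. The vocabulary size, sequence length and LSTM width are illustrative placeholders, as the paper does not report them; in the actual system the embedding weights are initialised from the GloVe Twitter embeddings rather than learned from scratch.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, Dense

VOCAB_SIZE = 20000   # assumption: not reported in the paper
SEQ_LEN = 40         # assumption: padded tweet length
NUM_CLASSES = 20     # 20 emoji labels in the English subtask

model = Sequential([
    Input(shape=(SEQ_LEN,)),
    # 200-dimensional embeddings; zero masking as described in the text
    Embedding(VOCAB_SIZE, 200, mask_zero=True),
    LSTM(128),        # assumption: LSTM width not reported
    Dropout(0.2),
    Dense(NUM_CLASSES, activation="sigmoid"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, batch_size=128)
```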
Starting from a basic SVM system, we tried to improve it with divergent features. Our basic model consisted of word and character n-gram features. We improved on this model with different kinds of preprocessing, tokenization, stemming and POS-tagging methods. We tried tokenization with the NLTK Word Tokenizer and the NLTK Tweet Tokenizer. For stemming, we tried the Porter and Snowball stemmers, also from NLTK; both slightly decreased the accuracy of our system. The POS tagger we tried was NLTK's default POS tagger.
After trying several setups for both systems on the development dataset, we concluded that our SVM approach was the most accurate for both the English and Spanish tweets.
For our best SVM system, we found that some special characters and punctuation had to be removed. In addition, we replaced Twitter URLs with the placeholder 'URL' and substituted '. . .', which was a reference to Instagram, with the placeholder 'INSTAGRAM'. Lastly, we applied a method to reduce each repeated character sequence to a maximum of three characters. E.g., if a user writes 'wooooooooooow', we normalize it to 'wooow', so the textual input to the system is less sparse.
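The character-repetition step can be implemented with a single regular expression; the function name here is our own, hypothetical one.

```python
import re

def normalize_repeats(text, max_repeat=3):
    """Cap any run of a repeated character at `max_repeat` occurrences."""
    # (.)\1{3,} matches a character followed by 3+ copies of itself,
    # i.e. any run of 4 or more; replace the run with exactly 3 copies.
    pattern = r"(.)\1{%d,}" % max_repeat
    return re.sub(pattern, r"\1" * max_repeat, text)

normalize_repeats("wooooooooooow")  # → "wooow"
```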
The SVM system that yielded the best results on the development set used the NLTK Tweet tokenizer and merely one feature, namely a tf-idf word vectorizer with a word n-gram range of (1, 15), no lowercasing, removal of English stop words for both the English and Spanish datasets (unconventional, but it improved the scores) and a minimum document frequency of one. Our model was trained with sklearn's SGDClassifier with a hinge loss and a maximum of 50 iterations. All other parameters were left at the sklearn defaults.

In addition to the SVM and LSTM, we tried an ensemble approach that combines both. Our assumption was that the two systems perform slightly better or worse in different aspects, so by combining our best SVM and LSTM we tried to achieve a higher accuracy. When the LSTM system is at least 95% certain about a label prediction, our ensemble system takes this label as the predicted label; when the LSTM is less certain, the ensemble takes the label predicted by our SVM system. This threshold was chosen after a short trial of different thresholds, in which 0.95 provided the best results. Yet, it turned out that combining both systems yielded a slightly lower accuracy than our best SVM system alone.
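A condensed sketch of this setup and of the ensemble rule is shown below. For self-containment we use scikit-learn's built-in English stop-word list, whereas the paper removed NLTK's English stop words; `ensemble_predict` is our own illustrative name for the fallback rule.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from nltk.tokenize import TweetTokenizer

# Tf-idf over word n-grams of size 1-15, no lowercasing, min_df of one.
svm = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=TweetTokenizer().tokenize,
                              ngram_range=(1, 15),
                              lowercase=False,
                              stop_words="english",
                              min_df=1)),
    ("clf", SGDClassifier(loss="hinge", max_iter=50, random_state=0)),
])

def ensemble_predict(lstm_probs, svm_labels, threshold=0.95):
    """Take the LSTM's label when its top probability clears the
    threshold; otherwise fall back to the SVM's prediction."""
    preds = []
    for probs, svm_label in zip(lstm_probs, svm_labels):
        if max(probs) >= threshold:
            preds.append(probs.index(max(probs)))
        else:
            preds.append(svm_label)
    return preds
```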

Results
The baseline results, obtained by always predicting the most frequent label from the training set, are presented in Tables 4 and 5. The results obtained on the development set are presented in Table 3, with the highest scores, i.e. those achieved by the best systems, printed in bold.
The results of the final SVM model that we submitted on the test set are presented in Tables 4 and 5, for English and Spanish, respectively. The scores on the individual classes (i.e., emojis) are presented in Tables 1 and 2.
Our final system achieves a macro F1-score of 22.86% for English and 15.86% for Spanish. To provide additional insight into the system's performance, the confusion matrices for English and Spanish on the test set are presented in Figures 1 and 2.

Discussion & Conclusion
In the confusion matrices, the diagonal lines of correct predictions can be seen. However, as also reported by Zhao and Zeng (2017), there is a bias towards predicting the most frequent emojis. For the English tweets, the Christmas tree emoji was predicted most accurately. This is understandable, since this emoji is mostly used in very distinct circumstances. For emojis 3, 8, 9 and 13 this is not the case. They were often incorrectly predicted as emoji 0 (a red heart), which is explainable by the fact that all of these emojis relate to love and hearts. For the Spanish tweets, the same issues can be seen with similar emojis.
In this paper, we explored two approaches (an LSTM and an SVM) and a combination of both for predicting emojis in English and Spanish tweets. Ultimately, the SVM classifier achieved the highest results: a macro F1-score of 22.86 for English and 15.86 for Spanish. Compared to the other participating groups, these results were in the mid-range: our system ranks 26th out of 49 for English and 10th out of 22 for Spanish. The results on the test set were lower than what we achieved on the development set. This is possibly due to an apparent overlap between the training set and the development set, which would allow the classifier to make more correct predictions because it had seen the exact same tweets before.

Figure 1: Confusion matrix for predicted and gold labels on the English test set.

Figure 2: Confusion matrix for predicted and gold labels on the Spanish test set.

Table 2: Macro F1-score per emoji on the test set for Spanish.

Table 3: Macro F1-score for various system setups on the development sets for English and Spanish.

Table 4: Macro-averaged F1, Precision & Recall and Accuracy for English on the test set.

Table 5: Macro-averaged F1, Precision & Recall and Accuracy for Spanish on the test set.