DUTH at SemEval-2018 Task 2: Emoji Prediction in Tweets

This paper describes the approach developed by the DUTH team for SemEval-2018 Task 2 (Multilingual Emoji Prediction). First, we employed a combination of pre-processing techniques to reduce the noise of tweets and produce a number of features. Then, we built several N-grams to represent combinations of words and emojis. Finally, we trained our system with a tuned LinearSVC classifier. Our approach ranked 18th among 48 teams on the leaderboard.


Introduction
Emojis are used in everyday life to express words or feelings of microblogging users. They are commonly placed at the end of a sentence or alone. In this paper, we show how our emoji prediction framework was applied to SemEval-2018 Task 2 (Multilingual Emoji Prediction) (Barbieri et al., 2018), specifically on Subtask 1 (Emoji Prediction in English).
In the last few years, many studies have concentrated on emoji prediction and analysis. The prediction of emojis, the connection between emojis and words, and their separation from content-based tweet messages, based on Long Short-Term Memory networks (LSTMs), was examined by Barbieri et al. (2017). The combination of emojis and sentiment was investigated by Novak et al. (2015), who developed the first emoji sentiment lexicon and created a sentiment map of the 751 most frequently used emojis. The study of Barbieri et al. (2016) tested several skip-gram word embedding models to measure the difference in performance between machine-learning models and human annotation. Na'aman et al. (2017) analyzed the viability of a trained classifier to differentiate between emojis used as semantic content words and those used as paralinguistic or emotional multimodal markers. Miller et al. (2017) investigated the hypothesis from previous work that viewing emojis in their natural textual contexts would substantially reduce miscommunication, but they found no such evidence; the potential for miscommunication appeared to remain the same.
The rest of this paper is organized as follows. Section 2 describes the architecture of our system and the dataset. In Section 3, we discuss the various parameters that were used to fine-tune the system, and present the performance of our framework. In Section 4, we lay out our main conclusions and research issues for further investigation.

System Description
The principal goal of SemEval-2018 Task 2, Subtask 1 was emoji prediction in English. The framework we utilized consists of a bag-of-words representation and N-gram extraction. We used Scikit-Learn (Pedregosa et al., 2011), a popular machine-learning library for Python.

Preprocessing
For the preprocessing of tweets, we were guided by the results of our previous research (Effrosynidis et al., 2017). We used the effective combination of the following techniques:
• Replace URLs and User Mentions with the tags 'URL' and 'AT USER', as the majority of tweets contain URLs and mentions, which are considered noise.
• Replace Contractions, as this reduces the dimensionality of the problem and improves speed and accuracy, according to the above-mentioned paper.
• Remove Numbers, because they do not carry any sentiment.
• Replace Repetitions of Punctuation, which merges the intensity of emotions into the same feature. For example, if we find more than two consecutive exclamation marks, question marks, or full stops, we replace them with a single one.
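The four steps above can be sketched as follows. The exact patterns and contraction list used in our pipeline are not given in this paper; the regular expressions and the small contraction map below are illustrative assumptions.

```python
import re

# Illustrative sketch of the preprocessing steps; patterns are assumptions,
# not the exact ones used in the submitted system.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am", "it's": "it is"}

def preprocess(tweet: str) -> str:
    # Replace URLs and user mentions with placeholder tags.
    tweet = re.sub(r"https?://\S+|www\.\S+", "URL", tweet)
    tweet = re.sub(r"@\w+", "AT_USER", tweet)
    # Replace a few common contractions (a fuller mapping would be used in practice).
    for short, full in CONTRACTIONS.items():
        tweet = re.sub(re.escape(short), full, tweet, flags=re.IGNORECASE)
    # Remove numbers.
    tweet = re.sub(r"\d+", "", tweet)
    # Collapse repeated punctuation (e.g. "!!!", "???", "...") into a single mark.
    tweet = re.sub(r"([!?.])\1+", r"\1", tweet)
    return tweet

print(preprocess("@bob don't miss this!!! http://t.co/xyz 100 times..."))
```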

Dataset
The training and testing datasets were provided by the organizers. The training set contained approximately 500,000 tweets, each of which originally contained a single emoji; the organizers removed that emoji and set it as the class label of the particular tweet. We extracted various statistics for the dataset, as can be seen in Table 1. Some class labels contain more sentences per tweet, like labels 10 ( ) and 0 ( ). We also observe that the emoji has, on average, far fewer hashtags per tweet, while the emoji has far more. All the other emojis range within reasonable limits. The emojis with labels 7 ( ) and 3 ( ) are expressed using more words on average, while emojis 10 ( ) and 11 ( ) are expressed with fewer words.
All the above observations are important to understand the dataset and how people use each emoji. One can use these statistics to create more features and test them to see the changes in classification accuracy. For example, one could count the words of each new sentence to be classified and compare the counts with those derived from the training dataset.
In our study, we compared several machine learning algorithms (Ridge, Logistic Regression, Passive-Aggressive, and Linear SVC) and three different word-to-vector representations (tf-idf Vectorizer, count Vectorizer, and hashing Vectorizer). The macro F-measure score was computed with 10-fold cross-validation on the training set, and on the trial set while using the training set for training. We employed word n-grams and character n-grams (n ranging from 1 to 4), with the latter performing poorly.
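The comparison described above can be sketched with Scikit-Learn as follows. The tweets and labels here are toy placeholders rather than the SemEval data, and the exact hyperparameters tuned by the team are not reproduced.

```python
# Sketch of the classifier/vectorizer grid comparison via 10-fold
# cross-validated macro F-measure. Data below is a toy stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.linear_model import RidgeClassifier, LogisticRegression, PassiveAggressiveClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

tweets = ["happy happy day"] * 12 + ["sad sad night"] * 12  # placeholder tweets
labels = [0] * 12 + [1] * 12                                # placeholder emoji class labels

for clf in [RidgeClassifier(), LogisticRegression(), PassiveAggressiveClassifier(), LinearSVC()]:
    for vec in [TfidfVectorizer(), CountVectorizer(), HashingVectorizer()]:
        pipe = make_pipeline(vec, clf)
        scores = cross_val_score(pipe, tweets, labels, cv=10, scoring="f1_macro")
        print(type(vec).__name__, type(clf).__name__, round(scores.mean(), 3))
```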

Experimental Results
In this section, we describe the different classifiers and vectorizers used and present our results.

Table 2: Results per classifier and vectorizer, using 10-fold cross-validation with unigrams.

Classifiers
In order to gain a better perspective on the problem, we trained four different classification algorithms. We tested each classifier by comparing their macro F-measure scores. We chose LinearSVC because of the stability we noticed in the results it returned. Below we discuss every classifier:
• Ridge: an algorithm belonging to the Generalized Linear Models family. Text classification problems tend to be quite high-dimensional, and high-dimensional problems are likely to be linearly separable; this is one reason why Ridge performs quite well.
• Logistic Regression: despite its name, it is used for classification and fits a linear model as well. In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme.
• Passive-Aggressive: belongs to a family of algorithms for large-scale learning, which does not require a learning rate and includes a regularization parameter C. On the one hand, the aggressive mode of the algorithm means that if an incorrect classification occurs, the model updates to adjust to this misclassified example. On the other hand, the model stays unchanged in every correct classification and this is the passive behavior of the algorithm (Crammer et al., 2006).
• Linear SVC: the purpose of this algorithm is to fit the data by finding a set of hyperplanes that separates the space into areas representing classes. The best hyperplane is considered to be the one that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points.

Vectorizers
Nowadays, one can find many vectorizers to use for feature extraction. We used the following three vectorizers, provided by Python's Scikit-Learn library (Pedregosa et al., 2011), to transform tweets into feature vectors.
• tf-idf Vectorizer: a vectorizer which scales the term frequency counts in each tweet by penalising terms that appear more frequently across the dataset.
• count Vectorizer: converts the collection of tweets to a matrix of token counts.
• hashing Vectorizer: a vectorizer which applies a hashing function to term frequency counts in each document. This vectorizer leads to a sparse matrix holding token occurrence counts (or binary occurrence information).
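To make the differences concrete, the three vectorizers can be contrasted on a couple of toy tweets; the hash size of 2^10 below is an arbitrary choice for illustration.

```python
# Small sketch contrasting the three Scikit-Learn vectorizers on toy tweets.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer

tweets = ["good morning world", "good night world"]

X_count = CountVectorizer().fit_transform(tweets)               # raw token counts
X_tfidf = TfidfVectorizer().fit_transform(tweets)               # counts scaled by inverse document frequency
X_hash = HashingVectorizer(n_features=2**10).fit_transform(tweets)  # hashed features, no stored vocabulary

print(X_count.toarray())   # columns are the sorted vocabulary: good, morning, night, world
print(X_tfidf.shape, X_hash.shape)
```

Note that the hashing vectorizer's matrix width is fixed by `n_features` rather than by the vocabulary size, which is what makes it memory-efficient on large corpora.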
Each vectorizer we used is efficient under certain circumstances. In addition, we noticed that the combination of vectorizer and classification algorithm is crucial for our problem. As can be seen in Table 2, the combination of the countVectorizer and Logistic Regression leads to the best single result. However, the tfidfVectorizer achieves better results than the countVectorizer and the hashingVectorizer with the majority of the algorithms; for this reason, we proceeded with the tfidfVectorizer.

Evaluation Results
We evaluate the performance of our system with the macro F-measure score. The macro F-measure score gives equal weight to each emoji category, regardless of its class size. The F-measure per emoji class is the harmonic mean of the precision and recall of the class:

F_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}

The macro-average F-measure score is obtained by taking the average of the F-measure values across the emoji classes:

F_{macro} = \frac{1}{M} \sum_{i=1}^{M} F_i

where M is the total number of classes.
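This metric is available directly in Scikit-Learn; the short sketch below uses toy labels (not SemEval predictions) to show that the macro score is simply the unweighted mean of the per-class F-measures.

```python
# Per-class and macro-averaged F-measure with scikit-learn, on toy labels.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)   # F_i for each class
macro = f1_score(y_true, y_pred, average="macro")    # unweighted mean of the F_i
print(per_class, macro)
assert abs(macro - per_class.mean()) < 1e-12
```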
In Table 3, we present the macro F-measure scores of the tfidfVectorizer combined with the LinearSVC classification algorithm. The first column presents the results of 10-fold cross-validation on the training set. The second column presents the results when training with the training data and testing with the trial data. As can be seen, four-grams achieve the highest value on the trial data, but trigrams perform better under 10-fold cross-validation. This is why we used trigrams to train our model for the submitted runs.
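A minimal sketch of the submitted configuration (tf-idf features over word uni- to trigrams feeding a LinearSVC), with placeholder tweets and emoji-name labels standing in for the real data; the tuned hyperparameters are not reproduced here.

```python
# Sketch of the final pipeline: tf-idf over word n-grams (n = 1..3) + LinearSVC.
# Training data below is a placeholder, not the SemEval corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_tweets = ["love this sunny day", "feeling sad tonight", "best pizza ever", "worst monday ever"]
train_labels = ["red_heart", "loud_cry", "red_heart", "loud_cry"]  # stand-ins for emoji classes

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
model.fit(train_tweets, train_labels)
print(model.predict(["sunny day with pizza"]))
```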

Conclusions
In this paper, we presented the framework we used to participate in the SemEval-2018 emoji prediction competition. We used a tfidfVectorizer combined with a LinearSVC classification algorithm, employing word trigrams, to train our model. Our team ranked in 18th place among 48 teams. For future work, it would be interesting to test Neural Network approaches, to use emoji sentiment lexica (Novak et al., 2015), or to include additional features. Furthermore, it would also be interesting to investigate the miscommunication of emojis in their natural textual contexts.