Manchester Metropolitan at SemEval-2018 Task 2: Random Forest with an Ensemble of Features for Predicting Emoji in Tweets

We present our submission to the SemEval 2018 task on emoji prediction. We used a random forest with an ensemble of bag-of-words, sentiment and psycholinguistic features. Although we performed well on the trial dataset (attaining a macro F-score of 63.185 for English and 81.381 for Spanish), our approach did not perform as well on the test data. We describe our features and classification protocol, as well as initial experiments, concluding with a discussion of the discrepancy between our trial and test results.


Introduction
Written digital communication is increasingly pervaded by the use of emoji. Classic NLP systems are not well geared to handle them. Linguists are still working out how to treat them (Stark and Crawford, 2015; Danesi, 2016). Even their users may disagree on meaning (Tigwell and Flatla, 2016; Miller et al., 2016). A simple approach could be to ignore all emoji and concentrate on the words of a text; however, this approach may miss valuable meaning that can be obtained by treating the emoji as semantic units.
The emoji prediction task (Barbieri et al., 2017, 2018) encourages research into the creation of text classification systems that can identify which emoji was present in a tweet. This could lead to automated suggestion systems for emoji, as well as improving the NLP community's understanding of how to deal with emoji computationally.

Data Acquisition + Preprocessing
The dataset was compiled between October 2015 and May 2016 (Barbieri et al., 2018). Training, trial, and test data come from an 80:10:10 split based on chronological order. We followed the organisers' instructions to obtain the training data; however, we were only able to extract 491,486 tweets, as some had been removed by their authors. We tokenised the tweets using the NLTK tweet tokeniser (Bird et al., 2009), but did not perform any further normalisation.

Features

Word-Class Occurrences
We created a set of features that describe which words occur with each emoji, based on a map of how often each token occurred alongside each class. Let V be the vocabulary in terms of tokens. Let C be the set of classes, where each class represents one emoji. We created a matrix M of size |V| × |C| such that each element M_{i,j} indicates the number of times that token V_i occurs with class C_j. This allowed us to see whether one token occurred mostly in the context of one or two classes, or whether it occurred with similar frequency across all classes. This metric is similar to document frequency in information retrieval.
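The counting step described above can be sketched as follows. This is a minimal illustration and not the code used for the submission: the function name `count_word_class`, the toy tweets, and the integer class labels are all hypothetical, and we assume that every occurrence of a token is counted (the text does not say whether repeated tokens within one tweet count once or several times).

```python
from collections import Counter, defaultdict


def count_word_class(tweets):
    """Build the token-by-class count map M.

    tweets: iterable of (tokens, class_label) pairs.
    Returns a dict mapping token -> Counter(class_label -> count),
    i.e. a sparse version of the |V| x |C| matrix.
    Assumption: every token occurrence is counted.
    """
    counts = defaultdict(Counter)
    for tokens, label in tweets:
        for token in tokens:
            counts[token][label] += 1
    return counts


# Toy data, invented for illustration only.
toy_tweets = [
    (["merry", "christmas"], 2),
    (["love", "my", "family"], 0),
    (["love", "christmas"], 2),
]
M = count_word_class(toy_tweets)
print(dict(M["christmas"]))  # counts per class for the token "christmas"
print(dict(M["love"]))       # "love" is spread over two classes
```

A sparse dictionary is used here because most tokens co-occur with only a few classes; a dense array would work equally well for 20 classes.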
To further improve our metric, we applied a normalisation transformation to the rows, dividing each row by its total. This method favoured lower-frequency terms (e.g., a hashtag that occurs only a few times with one emoji), so we applied a further transformation, multiplying each row by the log frequency of occurrence of the token.
These features produced intuitive results. The top words for a few select classes are as follows: [emoji]: love, heart, my, family; [emoji]: sunglasses, shades, cool; [emoji]: christmas, merry, #christmastree.
These features are at the token level, whereas our classification labels are at the level of the sentence. To convert these features to the sentence level, we used two strategies: average and max. We calculated the average vector as the mean of all token vectors in a tweet. We calculated the max vector by taking the highest value across all tokens for each class. This led to 40 features (20 for average and 20 for max).
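The two transformations and the two aggregation strategies can be sketched as below. This is a reconstruction under stated assumptions, not the original implementation: we read the row normalisation as M'_{i,j} = M_{i,j} / sum_k M_{i,k} and the log weighting as M''_{i,j} = M'_{i,j} * log(sum_k M_{i,k}), using the natural logarithm without smoothing (so a token seen only once receives a weight of zero). The function names, the handling of unseen tokens, and the toy counts are hypothetical.

```python
import math


def weight_rows(counts, classes):
    """Row-normalise the counts, then scale by log token frequency.

    counts: dict token -> dict class -> raw count.
    classes: ordered list of class labels (fixes the column order).
    Assumption: natural log, no smoothing.
    """
    weighted = {}
    for token, row in counts.items():
        total = sum(row.get(c, 0) for c in classes)
        if total == 0:
            continue  # token never seen with any listed class
        weighted[token] = [
            row.get(c, 0) / total * math.log(total) for c in classes
        ]
    return weighted


def tweet_features(tokens, weighted, n_classes):
    """Average and max of the token vectors, concatenated.

    Assumption: tokens without a vector are skipped, and a tweet with
    no known tokens yields an all-zero feature vector.
    """
    vectors = [weighted[t] for t in tokens if t in weighted]
    if not vectors:
        return [0.0] * (2 * n_classes)
    columns = list(zip(*vectors))
    average = [sum(col) / len(vectors) for col in columns]
    maximum = [max(col) for col in columns]
    return average + maximum


# Toy counts over three classes, invented for illustration only.
toy_counts = {"christmas": {2: 8}, "love": {0: 3, 2: 1}}
toy_weighted = weight_rows(toy_counts, classes=[0, 1, 2])
features = tweet_features(["love", "christmas", "unseen"], toy_weighted, 3)
print(len(features))  # 6 here; 40 with the task's 20 classes
```

With 20 classes, `tweet_features` returns 40 values, matching the 20 average and 20 max features described above.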

Sentiment
We employed Vader (Gilbert, 2014), a lexicon- and rule-based sentiment detection system, to derive a set of sentiment features. Vader produces sentence-level features for positive, neutral, and negative polarities, each ranging from 0 to 1 and representing intensity. It also produces a combined sentiment score, with values between -1 (negative) and 1 (positive), where values in [−0.5, 0.5] denote neutrality.
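As an illustration of how these four scores might be assembled into features, the sketch below takes a dictionary with the keys `neg`, `neu`, `pos`, and `compound` (the shape commonly returned by Vader implementations; here it is supplied by hand so that the example needs no external library). The function name, the feature order, and the derived polarity label are assumptions made for illustration; the neutrality band uses the interval stated above.

```python
def sentiment_features(scores, neutral_band=0.5):
    """Turn Vader-style scores into a feature list and a polarity label.

    scores: dict with 'neg', 'neu', 'pos' (each in [0, 1]) and
            'compound' (in [-1, 1]).
    neutral_band: compound values in [-band, band] are treated as
                  neutral, following the interval given in the text.
    The label is illustrative only and is not claimed to be a feature
    of the submitted system.
    """
    compound = scores["compound"]
    if compound > neutral_band:
        label = "positive"
    elif compound < -neutral_band:
        label = "negative"
    else:
        label = "neutral"
    features = [scores["pos"], scores["neu"], scores["neg"], compound]
    return features, label


# Hand-written scores, invented for illustration only.
example = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8}
vector, polarity = sentiment_features(example)
print(vector, polarity)
```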

Psycholinguistic Features
We used the MRC psycholinguistic norms (imagery, concreteness, familiarity, meaningfulness, age of acquisition) (Coltheart, 1981) as token-level features. These were averaged over the tokens of each tweet to give tweet-level features in our classification scheme.
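The averaging step can be sketched as follows. The lexicon below is a hand-made stand-in with invented values, not real MRC entries, and the choices to lower-case tokens, to skip tokens that have no entry, and to return zeros for tweets with no covered tokens are assumptions; the text does not specify how missing tokens were handled.

```python
NORMS = (
    "imagery",
    "concreteness",
    "familiarity",
    "meaningfulness",
    "age_of_acquisition",
)


def tweet_norms(tokens, lexicon):
    """Average each psycholinguistic norm over the tokens of a tweet.

    lexicon: dict token -> dict norm_name -> value.
    Assumption: tokens missing from the lexicon are ignored, and a
    tweet with no covered tokens gets an all-zero vector.
    """
    rows = [lexicon[t.lower()] for t in tokens if t.lower() in lexicon]
    if not rows:
        return [0.0] * len(NORMS)
    return [sum(row[n] for row in rows) / len(rows) for n in NORMS]


# Invented values for illustration; these are NOT real MRC norms.
toy_lexicon = {
    "dog": {"imagery": 600, "concreteness": 620, "familiarity": 580,
            "meaningfulness": 500, "age_of_acquisition": 200},
    "love": {"imagery": 500, "concreteness": 300, "familiarity": 620,
             "meaningfulness": 540, "age_of_acquisition": 300},
}
norm_vector = tweet_norms(["Love", "my", "dog"], toy_lexicon)
print(norm_vector)
```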

LIWC
We used the latest version of the Linguistic Inquiry and Word Count system (Tausczik and Pennebaker, 2010), LIWC2015, to produce a large set of sentence-level features concerning emotional, cognitive, and structural components derived from the texts. As shown in Table 2, our experiments with those features, arranged into different subsets, did not produce any significant improvement; therefore, we decided not to include them in our submissions.

Results
We performed subset analyses to determine the best feature grouping. In Table 2, we show our results for different feature sets when training on the training data and testing on the trial data.
We also optimised the number of trees in our random forest, finding 225 to be the best value for this parameter. Table 3 shows the detailed classification report (precision, recall, F1, and support, by class), and Figure 1 displays the confusion heatmap for our best submission on the English test dataset.
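For readers unfamiliar with the quantities in the classification report, the sketch below computes per-class precision, recall, F1, and support, and then the macro-averaged F1 as the unweighted mean of the per-class F1 values. It is a plain-Python illustration, not the task's official scorer; in particular, averaging over the union of gold and predicted labels is an assumption, and the toy labels are invented. The scores reported in this paper are on a 0 to 100 scale, i.e. the value below multiplied by 100.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for each class."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn}
    return report


def macro_f1(report):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(r["f1"] for r in report.values()) / len(report)


# Toy labels: class 0 is frequent and over-predicted.
gold = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 0, 0]
toy_report = per_class_report(gold, pred)
toy_macro = macro_f1(toy_report)
print(round(toy_macro, 4))
```

In the toy example the over-predicted frequent class has recall above precision, while the rare class that is never predicted contributes an F1 of zero to the macro average.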
Our system ranked 24th, with a macro-averaged F1-score of 24.982 (n=48, median=23.919, min=2.038, max=35.991, Q1=18.278, Q3=28.410). On the Spanish challenge, our best submission (using only the average class-occurrence features) ranked 8th, with a macro-averaged F1-score of 16.338 (n=21, median=14.912, min=3.896, max=22.364, Q1=10.892, Q3=16.696) (see Table 4 and Figure 2 for detailed performance). For lack of space, we restrict our subsequent error analysis and findings to the English data.
The F1-score on the test data was much lower than that on the trial data (63.185). We hypothesise that this discrepancy is largely due to (1) our system overfitting the training data and (2) a test dataset whose class distribution and discriminant features differ in some measure from those of the training and trial data. Figure 4 shows the class distributions (i.e., emoji ranks) on trial and test data. For the training (omitted here for brevity) and trial data, the shapes of the distributions match almost perfectly and, to a large degree, they are rank-preserving. This is in contrast to the class distribution of the test data, which is not rank-preserving, particularly for those labels in the long tail (i.e., below the three most frequent).
From the classification report (Table 3) and the confusion heatmap (Figure 1) on the test data, one can infer, firstly, that our system showed a propensity for predicting the most frequent emoji, particularly [emoji], [emoji], and [emoji] (accounting for about 40% of the data), which can be seen from the consistently high values in the three left-most columns of the heatmap. Consequently, the classes around the peak of the class distribution almost consistently had recall significantly higher than precision.
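The notion of rank preservation used in this comparison can be made concrete with a small check: two datasets are rank-preserving with respect to each other if ordering their classes by frequency gives the same sequence. The sketch below is one simple way to test this and is not the analysis code behind Figure 4; the function names, the tie-breaking rule, and the toy label lists are hypothetical.

```python
from collections import Counter


def class_ranking(labels):
    """Classes ordered from most to least frequent.

    Ties are broken by label value so that the result is deterministic
    (an assumption made for this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def is_rank_preserving(labels_a, labels_b):
    """True if both datasets order their classes identically."""
    return class_ranking(labels_a) == class_ranking(labels_b)


# Toy label lists, invented for illustration only.
toy_train = [0] * 50 + [1] * 30 + [2] * 20
toy_trial = [0] * 5 + [1] * 3 + [2] * 2
toy_test = [0] * 5 + [2] * 3 + [1] * 2   # tail classes swapped

print(is_rank_preserving(toy_train, toy_trial))
print(is_rank_preserving(toy_train, toy_test))
```

A strict equality test is deliberately coarse; a rank correlation coefficient would give a graded measure of how far two distributions diverge.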
For the majority of lower-support emoji, the system had a hard time in separating classes and quite frequently opted for higher-support ones. Secondly, it conflated classes into groups which, intuitively, could be seen as clusters of semantically-similar emoji, taking into account aspects such as emotions (e.g., joy), concepts (e.g., Christmas tree), and occasions (e.g., Christmas), to mention a few.
For instance, most of those associated with affection, elation, and other positive emotions and emotional states (e.g., [emoji], [emoji], [emoji], [emoji], [emoji]) presented extremely low recall and were frequently misclassified as [emoji]. As an example, [emoji] had a recall of 4.18%, with about 64% of its tweets predicted incorrectly as [emoji].

Conclusions
We presented a system for the prediction of a single emoji, out of a set of the twenty most frequent, for Twitter datasets in (1) English and (2) Spanish. Our best model was based on a random forest (225 trees) employing an ensemble of (a) max- and mean-aggregated normalised word-class occurrences, (b) sentiment and (c) psycholinguistic features.
Figure caption (partial): The x-axis shows the classes (i.e., the emoji ranks in Figure 3), and the y-axis represents support (i.e., normalised frequencies).
Our scores on the test data were significantly lower than those on the trial data, and we postulated that the reasons were (1) a random forest that overfitted the training data and (2) large variance between trial and test data. It is worth investigating to what extent, and how, different periods of time explain that variance. For example, trial and test data might have captured different emerging trending topics and events, or might reflect drift in emoji usage, among other factors. It is reasonable to assume that, given the nature and the sparsity of the data, more representative samples might require a much larger number of instances (say, billions of tweets) and longer time periods. F1-scores were consistently low for all participants, which demonstrates the difficulty of the task. We are conscious that idiosyncrasies of Twitter-specific data (e.g., data sparsity, neologisms, informality, lack of grammatical structure) make it all the more problematic, and some of our current research involves devising and incorporating features to address those challenges.
We believe it would be fruitful to investigate evaluation metrics that, rather than all-or-nothing (e.g., misclassification rate), reflect the semantic similarity (or distance) between labels and predicted classes.
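One possible shape for such a metric is sketched below, purely as an illustration of the idea and not as a proposal evaluated in this work. Each prediction earns full credit if it matches the label and otherwise earns partial credit equal to a similarity value between the two classes. The similarity table, its values, and the class names are hypothetical; how such similarities would be obtained is left open.

```python
def similarity_weighted_score(y_true, y_pred, similarity):
    """Mean credit per prediction, with partial credit for near misses.

    similarity: dict mapping (class_a, class_b) -> value in [0, 1].
    Pairs are looked up in either order; missing pairs score 0, which
    reduces the measure to plain accuracy when the table is empty.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == p:
            total += 1.0
        else:
            total += similarity.get((t, p), similarity.get((p, t), 0.0))
    return total / len(y_true)


# Hypothetical class names and similarity value, for illustration only.
toy_similarity = {("heart", "two_hearts"): 0.8}
gold_labels = ["heart", "tree", "two_hearts"]
predictions = ["heart", "heart", "heart"]

soft = similarity_weighted_score(gold_labels, predictions, toy_similarity)
hard = similarity_weighted_score(gold_labels, predictions, {})
print(round(soft, 3), round(hard, 3))
```

In the toy example, confusing two affection-related classes is penalised less than confusing an affection-related class with an unrelated one, which is the behaviour an all-or-nothing measure cannot express.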