BrainT at IEST 2018: Fine-tuning Multiclass Perceptron For Implicit Emotion Classification

We present BrainT, a multi-class, averaged perceptron tested on implicit emotion prediction of tweets. We show that the dataset is linearly separable and explore ways of fine-tuning the baseline classifier. Our results indicate that bag-of-words features benefit the model moderately and that prediction can be improved with bigrams, trigrams, skip-one-tetragrams and POS-tags. Furthermore, we find that preprocessing of the n-grams, including stemming, lowercasing, stopword filtering, and emoji and emoticon conversion, is generally not useful. The model was trained on an annotated corpus of 153,383 tweets, and its predictions on the test data were submitted to the WASSA-2018 Implicit Emotion Shared Task. BrainT attained a Macro F-score of 0.63.


Introduction
Our task is to predict emotions of tweets in a dataset where words explicitly mentioning the emotion are masked (Figures 1 and 2). Following Ekman's (1992) definition of six "basic" emotions, the tweets are labelled joy, fear, surprise, disgust, anger or sadness. As the model has no access to the explicit emotion word, it has to detect the emotion from its implicit context, i.e. the situational or causal description of the event. This aspect of the task makes it comparable to centre word prediction from context words.
Twitter language is characterized by a heterogeneous variety of internet vernaculars, an abundance of abbreviations, emojis and hashtags, and deviation from conventional spelling, grammar, syntax and lexicon. This makes recognition of emotions difficult even for human readers, as is evident from the noticeably low inter-annotator agreement reported by Balabantaray et al. (2012) and from the "testing" of the IEST dataset on native English speakers, which resulted in an F-score of 0.45 (Klinger et al., 2018).

Related Work
Previous research on sentiment analysis and emotion analysis of Twitter data often disagrees on the benefits or disadvantages of the various approaches, algorithms and feature models.
Psomakelis et al. (2014) evaluate linear and multilayer classifiers on sentiment analysis of tweets and find that learning-based approaches outperform lexicon-based approaches. They explain this chiefly by the lack of contextual information that lexical entries (such as polarity scores) express in the unigram model. Kouloumpis et al. (2011) found a mixed feature set of unigrams and n-grams beneficial for sentiment analysis, but found that adding POS-tags to the feature set lowers the model's performance and questioned their usefulness specifically on Twitter data. Aston et al. (2014) observe that the voted perceptron performs quite well using only character n-grams and propose a feature-reduction method that dramatically decreases runtime without compromising performance. Conversely, Balabantaray et al. (2012) evaluate a "greedy" feature model, including n-grams, POS-tags, bigram POS-tags, dependency tags, affection labels etc. Interestingly, the authors of this paper add a seventh class, "no emotion", to the six basic emotion classes and find that the multi-class SVM attains a high accuracy score with a panoramic feature model.

Multi-class Perceptron
We design our model following the "one-against-all" approach described in Allwein et al. (2001) by reducing the multi-class prediction task to k = 6 binary classification problems. We add one weight vector for each emotion class ($w_{joy}$, $w_{fear}$, ...). Prediction is made by assigning each tweet vector $x_i$ the label that gets the highest confidence:

$$\hat{y} = \arg\max_{c} \; w_c \cdot x_i$$

For each incorrect prediction, the model is updated by adding the tweet vector $x_i$ to the weights of the true label $y_i$ and subtracting it from all the other weights:

$$\text{if } \hat{y} \neq y_i: \quad w_{y_i} \leftarrow w_{y_i} + x_i, \qquad w_c \leftarrow w_c - x_i \;\; \text{for all } c \neq y_i$$

After our first experiments we upgraded our model to the averaged perceptron as defined in Collins (2002) and as discussed in Kazama and Torisawa (2007). In this variant, the final weights are set to the average of all updated weights during training. Additionally, we randomize the order of tweets before each training epoch to reduce overfitting.
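The prediction rule, the update on incorrect predictions, the weight averaging and the per-epoch shuffling described above can be sketched as follows. This is a minimal illustration, not the BrainT implementation: the sparse dictionary representation, the function names, the naive averaging after every training example and the tie-breaking by label order are all assumptions, and no learning rate is applied.

```python
import random
from collections import defaultdict

LABELS = ["joy", "fear", "surprise", "disgust", "anger", "sadness"]

def score(weights, features):
    """Dot product of one class's weight dict with a sparse feature dict."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def predict(weights_by_label, features):
    """Return the label with the highest score (ties go to the earlier label)."""
    return max(LABELS, key=lambda c: score(weights_by_label[c], features))

def train(data, epochs=5, seed=0):
    """data: list of (feature dict, gold label) pairs. Returns averaged weights."""
    rng = random.Random(seed)
    data = list(data)
    w = {c: defaultdict(float) for c in LABELS}      # current weights
    total = {c: defaultdict(float) for c in LABELS}  # running sum for averaging
    steps = 0
    for _ in range(epochs):
        rng.shuffle(data)  # new example order in every epoch
        for x, y in data:
            if predict(w, x) != y:
                # add x to the true label's weights, subtract it from all others
                for c in LABELS:
                    sign = 1.0 if c == y else -1.0
                    for f, v in x.items():
                        w[c][f] += sign * v
            # naive averaging: accumulate the weights after every example
            # (simple but slow; assumed here for clarity only)
            for c in LABELS:
                for f, v in w[c].items():
                    total[c][f] += v
            steps += 1
    return {c: {f: v / steps for f, v in total[c].items()} for c in LABELS}
```

For example, `train([({"scared": 1.0}, "fear")], epochs=3)` returns one weight dictionary per label, which can be passed directly to `predict`.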

Features
Our feature set combines unigrams, bigrams, trigrams and what we call skip-one-tetragrams. POS-tags can optionally be added.
The unigrams are modified depending on the selected preprocessing mode. This can be either reductive (the surface word is reduced to its stem or lowercased, stop words and punctuation are removed, emojis and emoticons are replaced by labels, numbers are replaced by a NUM tag) or additive, in which case stems, labels and tags are added to the feature set alongside the surface form. Bigrams and trigrams are added to the feature set as they are. Tetragrams are duplicated: in one copy the second token is replaced with SKIP, in the other the third token. We expect that this will generalize phrases that only differ in one token. For example, "he loves red apples" with the third token skipped is "he loves SKIP apples" and will match "he loves green apples" in another tweet.
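The skip-one-tetragram construction described above can be illustrated with a short sketch. The function names and the tuple representation are assumptions made for this example.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_one_tetragrams(tokens, skip="SKIP"):
    """For every tetragram, emit one copy with the second token masked
    and one copy with the third token masked."""
    features = []
    for a, b, c, d in ngrams(tokens, 4):
        features.append((a, skip, c, d))  # second token replaced
        features.append((a, b, skip, d))  # third token replaced
    return features

red = skip_one_tetragrams("he loves red apples".split())
green = skip_one_tetragrams("he loves green apples".split())
shared = set(red) & set(green)  # features the two phrases have in common
```

Here `shared` contains only `("he", "loves", "SKIP", "apples")`, the feature through which the two phrases match.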
We calculate the feature values using one of the following measures: binary (0 or 1), count, frequency or tf-idf.
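As a rough illustration of the four measures, the sketch below computes feature values for one tokenized tweet. The exact definitions are assumptions: frequency is taken as the count divided by the number of features in the tweet, and tf-idf as that frequency multiplied by log(N / df), which is only one of several common variants. The function name and parameters are likewise hypothetical.

```python
import math
from collections import Counter

def feature_values(doc, measure="frequency", doc_freq=None, n_docs=None):
    """doc: list of features (e.g. tokens) of one tweet.
    doc_freq: number of tweets containing each feature (tf-idf only).
    n_docs: total number of tweets (tf-idf only)."""
    counts = Counter(doc)
    total = sum(counts.values())
    if measure == "binary":
        return {f: 1.0 for f in counts}
    if measure == "count":
        return {f: float(c) for f, c in counts.items()}
    if measure == "frequency":
        return {f: c / total for f, c in counts.items()}
    if measure == "tf-idf":
        # assumed variant: relative term frequency times log(N / df)
        return {f: (c / total) * math.log(n_docs / doc_freq[f])
                for f, c in counts.items()}
    raise ValueError(f"unknown measure: {measure}")
```

Under these definitions, a feature that occurs in every tweet receives a tf-idf value of zero, whereas the binary, count and frequency measures still assign it a positive value.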

Dataset
The dataset we use is provided by the WASSA 2018 Implicit Emotion Shared Task. It is a corpus of 153,383 tweets annotated with distant supervision, where each tweet originally contained one of the six emotion words (joy, fear, surprise, disgust, anger, sadness) or their synonyms. These words are masked in the dataset, as are usernames and URLs. The dataset is described in detail in Klinger et al. (2018). We use a test set consisting of 28,757 tweets, also provided by the IEST.

Preprocessing
We tokenize and normalize tweets using methods that allow for the orthographic anomalies of tweets (e.g., missing space between words and punctuation marks; use of punctuation marks as emoticons). Tokens are labelled by their type (word, punctuation, numerical, emoji, emoticon, hashtag or URL). Depending on our choice between the reductive and additive modes, word tokens are replaced or complemented with stems, and all other types by a label or tag. For example, a laughing emoji and the emoticon :)))) are both replaced or complemented by the label laughing. Numbers such as 1948 are replaced or complemented by NUM.
We also add counts of word classes in the tweet using the NLTK part-of-speech tagger. Optionally, stopwords and punctuation marks can be removed and tokens can be lowercased. These preprocessing options are only applied to unigrams, since they would otherwise disturb the word order in n-grams.
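The difference between the reductive and additive modes can be sketched as below. This is a simplified illustration under several assumptions: lowercasing stands in for stemming, the two regular expressions are hypothetical patterns for numbers and laughing emoticons, and the function names are invented for this example.

```python
import re

NUM_RE = re.compile(r"^\d+([.,]\d+)*$")   # hypothetical pattern for numbers
LAUGH_RE = re.compile(r"^[:;=]-?\)+$")    # hypothetical pattern, e.g. :) or :))))

def normalize(token):
    """Map a token to its label, tag or reduced form."""
    if NUM_RE.match(token):
        return "NUM"
    if LAUGH_RE.match(token):
        return "laughing"
    return token.lower()  # stand-in for stemming in this sketch

def unigram_features(tokens, mode="reductive"):
    """Reductive: replace each token by its normalized form.
    Additive: keep the surface form and add the normalized form if it differs."""
    features = []
    for tok in tokens:
        norm = normalize(tok)
        if mode == "reductive":
            features.append(norm)
        elif mode == "additive":
            features.append(tok)
            if norm != tok:
                features.append(norm)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return features
```

For the tokens `["Born", "1948", ":))))"]`, the reductive mode yields `["born", "NUM", "laughing"]`, while the additive mode keeps each surface form and appends its normalized counterpart.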

Experimental Setup
We evaluate our model on the test data described in section 4.1. We use the Macro F-score as the evaluation metric and calculate Precision and Recall scores for each emotion class. We run our experiments with learning rates ranging from 0.1 to 0.5, but choose 0.3 in later experiments as the model seems to converge slightly better in this case. For the initial model we set the number of epochs T = 150, but with averaging of the weights, T = 50 seems reasonable as the learning curve already plateaus after 30-35 epochs. During each epoch we calculate the accuracy of the predictions on the training data (we refer to this measure as Convergence or Conv).
Additionally, after each epoch the model is evaluated on the test data. The weights are not adjusted during this evaluation, so the test data remains unseen. With these two measures we can track how the model adapts to the training data in comparison to its performance on the test data.
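For reference, per-class Precision and Recall and a Macro F-score can be computed as in the sketch below. It assumes the common definition of the macro score as the unweighted mean of the per-class F1 values; the shared task's official scorer may differ in detail, and the function name is hypothetical.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores over the given labels."""
    f_scores = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f_scores) / len(f_scores)
```

Because every class contributes equally to the mean, a class that is predicted poorly lowers the score regardless of how many tweets it covers.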

Results
We conduct four groups of experiments in increasing complexity of the feature set.
Group 1. First, we test the "vanilla" perceptron with unigrams and with minimal preprocessing (only tokenization). We try all four vector value calculations, but since frequency attains the highest score, we use only that one in the next experiments. Results of this group of experiments are shown in Table 1.
Group 2. We then update our model to the averaged perceptron and shuffle tweets before each epoch. This raises the F-score from 0.44 to 0.52. Subsequently we evaluate the model with more advanced preprocessing options. Both reductive and additive modes are considered. Results of the Group 2 experiments are included in Tables 2 and 3. Since the impact of these options is either negative or positive but negligible, we use no unigram preprocessing options in the next experiments.
Group 3. In Group 3 of the experiments we incrementally add bigrams, trigrams, skip-one-tetragrams and POS-tags to the feature set (Table 5 and Figure 3).
Group 4. Finally, we repeat the experiments of Group 3 non-incrementally. Table 5 shows the results.

Discussion
We observe a strong improvement of the averaged perceptron with shuffling over the baseline perceptron. Predictions get better as more n-grams are added to the feature set, which is expected, as they capture more contextual information. The learning curve converges on the training data after trigrams are added, which indicates that the dataset is linearly separable under this feature set.
In line with Saif et al. (2014), we confirm that classic stopword filtering decreases performance, and we observe that lowercasing, punctuation removal, stemming and emoji/emoticon conversion similarly have a negative or neutral impact.

Future Work
The model and approaches described in this paper can be improved in two directions: enhancing the feature set and addressing the limitations of the multi-class perceptron. In the "one-against-all" model the output of each classifier is treated as a confidence measure; for a more precise prediction, this score can be calibrated into a probability. As demonstrated in Figure 4, models trained on different feature sets show different strengths and weaknesses in their predictions. These disparities can be exploited by adding "redundant" classifiers for the same emotion class and training them differently. A final prediction can be made based on a simple majority vote or a distance measure between the individual predictions. As described in Garcia Cifuentes (2009), this can improve the model's performance. In this scenario, the preprocessing options described in 4.2 could also prove to be helpful. We would also like to try other multi-class reduction approaches on the same implicit emotion prediction task, such as "all-pairs" or "error-correcting code", both known to perform better than the "one-against-all" approach (Allwein et al., 2001).
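The simple majority vote mentioned above could look like the following sketch. It is a hypothetical illustration of the proposed extension, not something evaluated in this paper; the function name and the tie-breaking rule (the earliest prediction among the tied labels wins) are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: labels predicted for one tweet by differently trained models.
    Returns the most frequent label; ties go to the earliest prediction."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

final = majority_vote(["joy", "surprise", "joy"])
```

With three models voting as above, `final` is `"joy"`.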