EmotiKLUE at IEST 2018: Topic-Informed Classification of Implicit Emotions

EmotiKLUE is a submission to the Implicit Emotion Shared Task. It is a deep learning system that combines independent representations of the left and right contexts of the emotion word with the topic distribution of an LDA topic model. EmotiKLUE achieves a macro average F₁ score of 67.13%, significantly outperforming the baseline produced by a simple ML classifier. Further enhancements after the evaluation period lead to an improved F₁ score of 68.10%.


Introduction
The aim of the Implicit Emotion Shared Task (IEST; Klinger et al., 2018) is to infer emotion from the context of emotion words. The working definition of emotion for the shared task implies that emotion is triggered by the interpretation of a stimulus event (Scherer, 2005, p. 697), i.e. the cause of the emotion. Consequently, the data for the shared task have been compiled with the aim of including a description of the cause of the emotion. This has been accomplished by using distant supervision: the organizers collected tweets that contain exactly one of 21 emotion words belonging to six emotions (anger, fear, disgust, joy, sadness, surprise), where the emotion word has to be followed by "that", "because" or "when" as likely indicators for a description of the cause of the emotion. The corpus collected this way comprises more than 190,000 tweets and is split into three data sets: 80% training, 5% trial and 15% test. The emotion words in the tweets are masked, and participants of the shared task have to predict the emotion of the masked emotion word from its context.
EmotiKLUE, our submission to the shared task, is a deep learning system that learns independent representations of the left and right contexts of the emotion word, similar to Saeidi et al. (2016), who use n-gram representations for both the right and the left context around triggerwords in aspect-based opinion mining. Our intuition is that the distribution of the emotions depends on the topics of the tweets; therefore, we train a Twitter-specific LDA topic model and explore different ways of combining the topic distributions with the left and right contexts in order to predict the emotions. EmotiKLUE is available on GitHub.

Related Work
Emotion detection has been an important topic in natural language processing, particularly in the subfield of opinion mining, for several years. The shallowest approaches deal with sentiment polarity detection, either classifying utterances into categories ranging from negative via neutral to positive, or regressing towards a score typically ranging from −1 to 1 (see, for example, Proisl et al., 2013; Evert et al., 2014). Further tasks involve the automatic computation of stances (in favor of vs. against) towards pre-specified topics (Mohammad et al., 2017). Predicting more sophisticated categories of emotion than in the task at hand has been a more recent phenomenon. Generally, the approaches can be classified into two groups, namely rule-based approaches on the one hand and the far more common machine learning approaches on the other.
We give a short list of related work here; for a more comprehensive listing, see the task description (Klinger et al., 2018). A survey of emotion detection from text and speech is given by Sailunaz et al. (2018). For a linguistic analysis of implicit emotions, see Lee (2015). An approach to implicit emotion detection based on textual inference is presented by Ren et al. (2017).
As an example of rule-based emotion detection, we mention Udochukwu and He (2015), who use a pipeline approach based on the OCC model (Ortony et al., 1988) that does not rely on emotion-bearing words.
More recent work deals with ML and deep learning approaches. Rout et al. (2018) use both unsupervised and supervised approaches with different machine learning algorithms such as multinomial naive Bayes, maximum entropy, and support vector machines on unigram feature matrices and report F₁ scores of above 99% when disambiguating tweets according to seven emotion categories. However, their text data are selected via a keyword filter containing exactly the words that represent the emotions, and these words can in turn be used as features by the machine learner, so their high accuracy values are unsurprising.
Other tasks, such as detecting the emotion stimulus in emotion-bearing sentences, are more challenging; Ghazi et al. (2015), for example, use a conditional random fields classifier and report F₁ scores of up to 60% for finding the stimulus in their self-constructed data set. Finally, Firdaus et al. (2018) use different latent features such as emotion and sentiment as input to predict user behaviour (e.g. the act of retweeting).

Data Preprocessing and Additional Data
The data sets released by the organizers of the shared task contain the full text of the tweets, with the emotion word, usernames and URLs being substituted by placeholders. We tokenize the text with the web and social media tokenizer SoMaJo (Proisl and Uhrig, 2016) and convert it to lowercase.
In addition to the official data sets, we use two resources: ENCOW14 (Schäfer and Bildhauer, 2012; Schäfer, 2015) and an in-house collection of 114 million deduplicated English tweets (see Schäfer et al. (2017) for the deduplication algorithm), collected between February 2017 and June 2018. We tokenize the tweets with SoMaJo (but not ENCOW14, which is already tokenized), mask usernames and URLs and convert the text to lowercase.
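To illustrate how a preprocessed tweet can be turned into the two contexts used by the model, the following minimal sketch lowercases a tweet and splits it at the emotion-word placeholder. The placeholder string is the one visible in the example tweets quoted in the error analysis; the function name `split_contexts` is hypothetical, and plain whitespace splitting is only a stand-in for a real tokenizer such as SoMaJo.

```python
# Placeholder for the masked emotion word, as it appears in the example tweets.
TRIGGER = "[#TRIGGERWORD#]"


def split_contexts(tweet, trigger=TRIGGER):
    """Lowercase a tweet and split it into left and right context tokens.

    Whitespace splitting is an assumption made for brevity; it stands in
    for a proper web and social media tokenizer.
    """
    left, separator, right = tweet.partition(trigger)
    if not separator:
        raise ValueError("tweet contains no trigger placeholder")
    return left.lower().split(), right.lower().split()


left, right = split_contexts(
    "I wake up [#TRIGGERWORD#] because I know you doin me wrong"
)
```

Here `left` holds the tokens before the placeholder and `right` the tokens after it; the placeholder itself is not part of either context.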

Representations derived through unsupervised methods
We use our in-house collection of tweets to create Twitter-specific word embeddings and topic models.
Using the Gensim (Řehůřek and Sojka, 2010) implementation of word2vec (Mikolov et al., 2013a,b), we create four sets of embeddings for all words with a minimum frequency of 5: 100- and 300-dimensional vectors using the skip-gram approach and 100- and 300-dimensional vectors using the CBOW approach.
Our intuition is that the distribution of the emotion words depends on the topics of the tweets. To capture these topics, we use Gensim and create an LDA topic model (Blei et al., 2003) with 100 topics based on the most recent 10 million tweets in our collection (ignoring words that only occur once).

Additional Data for Pretraining
We compile an additional data set from ENCOW14 and our collection of tweets that we use to pretrain our model. To this end, we select tweets and ENCOW14 sentences with a maximum length of 110 words that contain a single emotion word from the following set of emotion words: afraid, angry, disgusted, disgusting, happy, sad, surprised, surprising. This list of emotion words was determined by a cursory glance at the official training data and happens to be a subset of the 21 emotion words used by the task organizers (which were only revealed after the evaluation period). Note that we do not restrict the contexts in which the emotion words occur, i.e. the emotion words do not have to be followed by "that", "because" or "when". After balancing the data, we have approximately 159,000 items per class.
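The selection and balancing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the submission: the grouping of the eight emotion words into the six classes is inferred from the class names, balancing by downsampling to the smallest class is one plausible strategy that the text does not confirm, and all function names are hypothetical.

```python
import random
from collections import defaultdict

# Emotion word -> class. The grouping is an assumption based on the six
# emotion classes of the shared task.
EMOTION_WORDS = {
    "afraid": "fear",
    "angry": "anger",
    "disgusted": "disgust",
    "disgusting": "disgust",
    "happy": "joy",
    "sad": "sadness",
    "surprised": "surprise",
    "surprising": "surprise",
}
MAX_LEN = 110  # maximum length in words, as stated in the text


def label_item(tokens):
    """Return the emotion class if exactly one emotion word occurs, else None."""
    if len(tokens) > MAX_LEN:
        return None
    hits = [token for token in tokens if token in EMOTION_WORDS]
    if len(hits) != 1:
        return None
    return EMOTION_WORDS[hits[0]]


def balance(items_by_class, seed=0):
    """Downsample every class to the size of the smallest one (assumed strategy)."""
    rng = random.Random(seed)
    size = min(len(items) for items in items_by_class.values())
    return {label: rng.sample(items, size) for label, items in items_by_class.items()}


def build_pretraining_data(tokenized_items, seed=0):
    """Group usable items by class and balance the class sizes."""
    by_class = defaultdict(list)
    for tokens in tokenized_items:
        label = label_item(tokens)
        if label is not None:
            by_class[label].append(tokens)
    return balance(by_class, seed)
```

Items with zero or several emotion words are discarded, which mirrors the requirement that a tweet or sentence contain a single emotion word.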

Network Architecture
We experiment with three variants of a neural network architecture implemented using Keras (Chollet et al., 2015) and visualized in Figure 1.
The word-level representations for the left and right contexts of the emotion word that are returned by the embedding layers are fed into two LSTM layers (Hochreiter and Schmidhuber, 1997; Gers et al., 2000): a left-to-right layer for the left context from the beginning of the tweet to the masked emotion word, and a right-to-left layer for the right context from the end of the tweet to the masked emotion word. The hidden states of the two LSTM layers are concatenated. We explore three variants of incorporating the 100-dimensional LDA topic distribution into the model:
1. We do not use LDA topics. The output of the LSTMs is fed to a dense layer, followed by a dropout layer and finally a softmax output layer.
2. We use LDA topics as features alongside the LSTM output. The LDA topic distribution and the output of the LSTMs are concatenated. The result is fed to a dense layer, followed by a dropout layer and finally a softmax output layer.
3. We use LDA topics as filter. The output of the LSTMs is fed to a dense layer to reduce dimensionality. The LDA topic distribution is fed to a softmax layer. The output of the two layers is combined using element-wise multiplication. The result is fed to the final softmax output layer.
We train each model for a maximum of 20 epochs with a batch size of 160, using the Adam optimizer (Kingma and Ba, 2014) to minimize categorical crossentropy. If the validation loss (determined on the trial data) fails to improve for two consecutive epochs, training stops early.
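The three ways of combining the LSTM output with the topic distribution can be made concrete with a small, framework-free sketch. It only shows which vector is passed on to the next layer in each variant; it is not the Keras model itself. The layer sizes, the random weights and the variant name "ldafilter" are assumptions for illustration ("nolda" and "ldafeat" follow the model names used in the experiments).

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer without activation: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for a toy dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def combine(lstm_out, topics, variant, reduce_layer=None, gate_layer=None):
    """Return the vector that is passed on to the next layer."""
    if variant == "nolda":  # variant 1: topics are ignored
        return list(lstm_out)
    if variant == "ldafeat":  # variant 2: topics as additional features
        return list(lstm_out) + list(topics)
    if variant == "ldafilter":  # variant 3: topics as multiplicative filter
        reduced = dense(lstm_out, *reduce_layer)
        gate = softmax(dense(topics, *gate_layer))
        return [r * g for r, g in zip(reduced, gate)]
    raise ValueError(f"unknown variant: {variant}")


rng = random.Random(0)
lstm_out = [rng.uniform(-1, 1) for _ in range(8)]  # toy-sized LSTM output
topics = softmax([rng.uniform(-1, 1) for _ in range(100)])  # stand-in topic distribution
reduce_layer = init_layer(8, 6, rng)
gate_layer = init_layer(100, 6, rng)
filtered = combine(lstm_out, topics, "ldafilter", reduce_layer, gate_layer)
```

In the filter variant, both branches must be projected to the same size (6 in this toy setup) so that they can be multiplied element by element; in the feature variant, the vector simply grows by the 100 topic dimensions.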

Experiments
We have three different network architectures that differ in the way they use LDA topic distributions, four sets of embeddings that differ in size and training objective, and three options for the training data (only the official training data, only our additional data, or training on the latter and retraining on the former). In order to quantify the impact of the individual choices, we train and evaluate all 36 possible models (3 × 4 × 3). Results for models using skip-gram-based embeddings are shown in Table 1 and results for models using CBOW-based embeddings in Table 2. The evaluation metric used is the macro average of the F₁ scores of the six classes.
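For reference, the macro-averaged F₁ score can be computed as in the following minimal sketch. The function name is hypothetical, and assigning an F₁ of 0 to a class that is neither predicted nor present is a convention chosen here, not something specified by the shared task.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(gold, predicted))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs.
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


gold = ["joy", "joy", "fear", "fear"]
predicted = ["joy", "fear", "fear", "fear"]
score = macro_f1(gold, predicted, ["joy", "fear"])
```

Because every class contributes equally to the mean, rare classes weigh as much as frequent ones.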
The exact numbers listed in Tables 1 and 2 should not be taken too seriously, as they are subject to a small amount of random variation due to differences in the initialization of the weights and the shuffling of the training data. For example, the 95% confidence interval for the performance of the add+train-skip300-ldafeat model on the test data is 67.12 ± 0.34 (estimated from 20 instances of the model). However, since all the individual options have been used at least nine times, we can still make some fairly reliable claims about their usefulness.
The most obvious observation is that the official training data lead to much better results than our additional data (+12.97 on average). This is probably due to two reasons: we only use a subset of the emotion words that have been used in the official data sets and, more importantly, we use all instances of the emotion words and not only those that are followed by something that is likely to be a description of the cause of the emotion. However, first training the model on the additional data and then retraining it on the official training data is beneficial (+1.96).
We can also see that word embeddings based on the skip-gram approach consistently outperform those based on the CBOW approach (+2.55). 300-dimensional embeddings are notably better than 100-dimensional embeddings (+1.19), an effect that is more pronounced for the skip-gram-based embeddings (+1.57) than for the CBOW-based ones (+0.80).
The LDA topic distributions only have a positive effect when used as additional features alongside the LSTM output, and even then the effect is small: it is positive for models using skip-gram-based embeddings (+0.08) but negative for models using CBOW-based embeddings (−0.24). Using the LDA topic distribution as a filter usually has a negative effect (−0.76).
Consequently, for our submission to the shared task, we chose the second network architecture (LDA topic distribution as feature), used 300-dimensional skip-gram embeddings and trained the model first on our additional data and retrained it on the official training and trial data. That model achieved a macro average F₁ score of 67.13 on the test data and took tenth place in the shared task. For comparison, Klinger et al. (2018) report that human performance on this task is approximately 45%, the MaxEnt uni- and bigram classifier used as a baseline system achieved 59.88% and the best submission (Rozental et al., 2018) 71.45%.
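One way to arrive at average effects of this kind is to compare pairs of configurations that differ only in the option of interest and to average the score differences. The sketch below illustrates this; whether the reported averages were computed in exactly this way is an assumption, the function name is hypothetical, and the scores are placeholders, not values from Tables 1 and 2.

```python
from statistics import mean


def average_effect(scores, factor, option, reference):
    """Mean score difference between two options of one factor.

    `scores` maps configuration tuples to a score and `factor` is the
    tuple position that is varied. Only configurations that exist with
    both options (all other factors being equal) are compared.
    """
    differences = []
    for config, score in scores.items():
        if config[factor] != option:
            continue
        partner = config[:factor] + (reference,) + config[factor + 1:]
        if partner in scores:
            differences.append(score - scores[partner])
    if not differences:
        raise ValueError("no matching configuration pairs")
    return mean(differences)


# Placeholder scores for illustration only: (embedding type, dimensions) -> score.
toy_scores = {
    ("skip", 100): 64.0,
    ("skip", 300): 66.0,
    ("cbow", 100): 62.0,
    ("cbow", 300): 63.0,
}
skip_vs_cbow = average_effect(toy_scores, 0, "skip", "cbow")
dim300_vs_dim100 = average_effect(toy_scores, 1, 300, 100)
```

With a complete grid of configurations, this paired average equals the difference between the mean scores of the two groups.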

Error Analysis
We present detailed error analyses in Table 3 in the form of an extensive confusion matrix including label confusion per triggerword in the test data. We downloaded all available tweets used in the shared task via the Twitter API to gain access to the actual triggerwords. For reasons of interpretability, we report absolute marginal frequencies and relative frequencies of predicted label per real label and triggerword. This corresponds to recall (true-positive rate) for those cases where the prediction equals the true label and to the false-negative rate (FNR) per class for all other cases.
Recall is rather similar across labels: The highest rate can be achieved for joy (78%), the lowest is achieved for sad (59%). High FNRs have to be reported for confusing anger, disgust, and fear with surprise (11% and 10%), as well as sad with anger and disgust (each 11%).
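The relation between absolute confusion counts and the reported relative frequencies can be sketched as follows: each row of counts is divided by its row total, so that the diagonal entry is the recall of the real label and every other entry is the share of false negatives assigned to the respective predicted label. The function name is hypothetical and the counts are placeholders, not values from Table 3.

```python
def row_rates(confusion):
    """Turn absolute confusion counts into per-row relative frequencies.

    `confusion[real][predicted]` holds counts. Rows without any
    observations yield zeros instead of raising a division error.
    """
    rates = {}
    for real, row in confusion.items():
        total = sum(row.values())
        rates[real] = {
            predicted: (count / total if total else 0.0)
            for predicted, count in row.items()
        }
    return rates


# Placeholder counts for illustration only.
toy_counts = {
    "joy": {"joy": 8, "sadness": 2},
    "sadness": {"joy": 1, "sadness": 3},
}
rates = row_rates(toy_counts)
```

The same computation applies to rows for individual triggerwords, since each triggerword belongs to exactly one real label.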
Table 3: Confusion matrix for the six predicted emotion categories (columns) for each real emotion and each triggerword (rows) in the test data.
Looking at the recall values per triggerword, explanations for the macro-values are not far to seek:
1. Performance is generally higher for those triggerwords that have been manually selected by us for producing additional training data (see Section 3.3): angry (62%) shows higher recall than furious (57%), afraid (76%) and happy (79%) perform best in the fear and joy categories, respectively, and surprised and surprising (each 74%) are the best predictors for surprise.
2. Rare triggerwords generally lead to worse results. The most obvious example is sorrowful, which we only observed 28 times in the training data (8 times in the test data) and which yields 25% recall for predicting category sad, confusing it in half of the cases with joy. Additionally, cheerful and joyful (361 and 536 observations in the training data, respectively) perform worse than happy (22,348 observations), although admittedly happy had already been pre-selected for additional training as mentioned above.
3. Many confusions can also be explained from a psycho-linguistic point of view when looking at the actual corpus. Instances involving the triggerword disgusted, for example, are frequently categorized as anger by our system. Corpus evidence shows that these words are hard to disambiguate:
• women are property of Father-In-Law?
• I wake up [#TRIGGERWORD#] because I know you doin me wrong but u dont think its nothing wrong with being in a verbal relationship with another gal
It is hard to see how one could reliably predict the "real" emotion (disgust) in the above examples, since anger, as predicted by our system, seems to be an equally sensible guess. Similar instances can be found for other confusions, most notably when predicting anger in case of the triggerword depressed.

Post-analysis experiments
The analysis in the previous section has shown that our system performs better on the more frequent words that we used for compiling our additional data than on the less frequent words. Therefore, we recompile our additional data as described in Section 3.3 but for all of the 21 emotion words that occur in the official data. After balancing the data, this results in approximately 163,000 items per class. We take the model versions from Section 4.1 that are the basis for our submission and replace the additional data with the updated version. The new models (prefixed with "add2" in Table 4) improve on the old ones both when using only the additional data (+1.66) and when retraining on the official training data (+0.28).
It is also worth pointing out that so far we have not fine-tuned the hyperparameters of our model. As a first step in that direction, we try to use more units in the hidden layers and increase the size of all hidden layers to 300 units (models prefixed with "300-add" in Table 4). This boosts the performance both when using only the additional data (+2.03) and when retraining on the official training data (+0.85).
Combining the recompiled additional data and the larger hidden layers yields further improvements (models prefixed with "300-add2" in Table 4). The retrained model is approximately 1 point better than our submission and would have taken the eighth place in the shared task.
A further error analysis shows that the additional training data indeed yield the desired effect: Recall for category angry improves from 61% to 66%, largely due to better recall in the case of the triggerword furious (rising from 57% to 65%). Further improvements can be found in almost all categories, namely for fear (69% to 72%, especially frightened: 49% to 53%), joy (78% to 79%, with recall for joyful rising from 61% to 65% and for cheerful from 64% to 70%), and sad (59% to 62%, triggerword depressed up two points from 46% to 48%). However, the additional training data had an adverse effect on category surprise; here recall falls from 68% to 65%, with almost all triggerwords dropping a couple of points, the worst being surprised, falling from 74% to 69%.
Finally, we want to take a closer look at the contribution of the LDA topic distribution. To this end, we have trained 20 instances of the 300-add2+train-skip300-ldafeat and 300-add2+train-skip300-nolda models and have calculated the means and 95% confidence intervals. As it turns out, both model variants perform identically on the trial data. On the test data, there are some minor differences, but the performance means lie within one standard deviation of each other. This means that concatenating the LDA topic distribution of the tweet to the LSTM output does not have a statistically significant effect.
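A mean and a 95% confidence interval over repeated training runs can be obtained as in the following minimal sketch. It assumes an interval based on Student's t distribution, which the text does not specify, and hard-codes the critical value for 20 runs (19 degrees of freedom); the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t for 19 degrees of freedom,
# i.e. for 20 runs. A different number of runs needs a different value.
T_CRIT_DF19 = 2.093


def mean_and_ci(scores, t_crit=T_CRIT_DF19):
    """Return the mean and the half-width of a t-based confidence interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are required")
    half_width = t_crit * stdev(scores) / sqrt(n)
    return mean(scores), half_width


# Placeholder scores for illustration only: 20 hypothetical runs.
toy_runs = [1.0] * 10 + [3.0] * 10
run_mean, run_half_width = mean_and_ci(toy_runs)
```

The result is reported as mean ± half-width; comparing two model variants then amounts to checking how far their means are apart relative to the spread of the individual runs.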

Conclusion
We presented EmotiKLUE, a topic-informed deep learning system for detecting implicit emotion. Our experiments showed that for this task skip-gram-based word embeddings outperform CBOW-based embeddings. Additional data that, on their own, yield rather poor results improve the performance when used for pretraining the model. LDA topic models, which we initially believed to have a small positive effect, turned out not to contribute significantly.
The error analysis shows that the objective as set in the shared task at hand is rather difficult: With many instances of tweets showing prima facie ambiguous emotions, it is unsurprising that even perfectly trained classifiers will not be able to achieve 100% accuracy when using the textual data alone.
Future work could nonetheless involve more experimentation with the hyperparameters of the network, e.g. number, size and activation of the hidden layers, choice of regularization strategy and optimizer, etc.
The software is available on GitHub.