Disney at IEST 2018: Predicting Emotions using an Ensemble

This paper describes our participating system in the WASSA 2018 shared task on emotion prediction. The task focuses on implicit emotion prediction in tweets: keywords corresponding to the six emotion labels used (anger, fear, disgust, joy, sad, and surprise) have been removed from the tweet text, making the emotion implicit and the task challenging. We propose a model based on an ensemble of classifiers for prediction. Each classifier stacks a sequence of Convolutional Neural Network (CNN) blocks on top of ELMo (Embeddings from Language Models) input representations. Our system achieves a 66.2% F1 score on the test set; the best performing system in the shared task reported a 71.4% F1 score.


Introduction
Besides understanding the language humans communicate in, AI systems that naturally interact with humans should also understand implicit emotions in language. To be consistent and meaningful, an AI system conversing with humans should reply while taking into account the emotion of the utterance spoken by the human. If the user appears to be unhappy, a subsequent joyful response from the system would likely detract from the engagement of the user in the conversation. In recent years, several researchers have attempted to address this problem by developing automated emotion prediction for text (Medhat et al., 2014).
Predicting emotions implicit in natural language is not trivial. A naïve attempt to classify text based on emotion keywords may not always work due to the presence of various linguistic phenomena (e.g., negation, ambiguity) in the text. Moreover, an emotion may be triggered by a sequence of words rather than a single keyword, requiring an automated system to understand the underlying semantics of the text. In the WASSA shared task, keywords describing the emotion label have been removed, making the emotion implicit in the text and the task more challenging: a system developed for implicit emotion prediction must understand the meaning of the entire text rather than predict from a few keywords. We propose a model which uses a CNN-based architecture (Gehring et al., 2017) for emotion prediction. The model stacks CNN blocks on ELMo (Embeddings from Language Models), as introduced by Peters et al. (2018). Additionally, we include word-level Valence, Arousal, and Dominance (VAD) features to guide prediction. We describe our model in detail in Section 4. As described in Section 6, our model achieves a 66.2% F1 score on the WASSA task. We further investigate the generalizability of our model by experimenting on the Cornell movie dataset, as shown in Section 7.

Related Work
Emotion prediction is related to the task of sentiment analysis. The best performance in sentiment analysis has been attained using supervised techniques, as outlined in a survey by Medhat et al. (2014). Recent breakthroughs in deep learning have shown strong results in sentence classification (Joulin et al., 2016), language modeling (Dauphin et al., 2016), and sentence embedding (Peters et al., 2018). Our emotion prediction model is also based on deep learning techniques. Recently, fastText (Joulin et al., 2016) has been proposed for generating word representations and has shown state-of-the-art performance on a number of text-related tasks; our model makes use of a fastText model for emotion classification. Chen et al. (2018) introduce an emotion corpus based on conversations taken from Friends TV scripts and propose a similar emotion classification model using a CNN-BiLSTM. Our model is similar to the model proposed by Chen et al. (2018), but we use a pre-trained ELMo instead of a BiLSTM.
Mohammad (2018) proposed a VAD lexicon for emotion detection systems. We use these VAD features together with ELMo (Peters et al., 2018), which has recently been shown to boost performance on a number of Natural Language Processing (NLP) tasks. To the best of our knowledge, we are the first to make use of VAD features in a deep learning setting for emotion prediction.

Task Description
The WASSA 2018 shared task (Klinger et al., 2018; http://implicitemotions.wassa2018.com) is about predicting the implicit emotion in a given tweet. The task is challenging because the keyword indicative of the emotion has been removed from the tweet. A participating system is required to predict the implicit emotion from the remaining context using world knowledge or statistical techniques.

Emotion Corpus
The corpus provided for the competition has around 188,000 tweets (∼150,000 for training, ∼9,000 for validation, ∼28,000 for testing) annotated with 6 emotion labels (anger, surprise, joy, sad, fear, disgust). The dataset has a balanced distribution of examples for the six label classes (see Table 1).

Emotion Prediction Model
Our model has two sets of classifiers at its disposal: an ensemble of CNN-based classifiers and a fastText classifier (Joulin et al., 2016). A CNN-based classifier requires a fixed-length input. Since tweets have a variable number of words, shorter word sequences are typically padded so that all sequences in a mini-batch have equal length. In practice, very long sequences may not work well due to the noise introduced by padding. Based on the tweet length distribution (see Figure 1) and our experiments, we set the maximum tweet length to 40 words; tweets up to this length are classified using the CNN-based models, while longer tweets (> 40 words) are handled by the fastText classifier. fastText averages word representations into a text representation using a bag of words (BoW) and a bag of n-grams; the text representation is then fed into a linear classifier with a hierarchical softmax on top. fastText was chosen for its simplicity and efficiency.
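The length-based routing between the two classifier families can be sketched as follows (a minimal illustration; `cnn_ensemble` and `fasttext_clf` are placeholder callables standing in for the trained models, not the paper's actual interfaces):

```python
MAX_LEN = 40  # maximum tweet length handled by the padded CNN ensemble

def classify_tweet(tokens, cnn_ensemble, fasttext_clf):
    """Route a tokenized tweet to the appropriate classifier.

    Tweets of up to MAX_LEN words go to the CNN-based ensemble
    (which pads them to a fixed length); longer tweets are handled
    by the fastText classifier, avoiding heavy padding noise.
    """
    if len(tokens) <= MAX_LEN:
        return cnn_ensemble(tokens)
    return fasttext_clf(tokens)
```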

Deep CNN Classifier
We use an ensemble of CNN-based classifiers for shorter (< 41 words) tweets. Each CNN-based classifier in the ensemble has the network architecture shown in Figure 2 and consists of two sub-modules:
• Text sub-module: At the lowest level, this module captures the dependencies between the words of the tweet using a bi-directional LSTM with sub-word information (extracted via a character-level CNN), as introduced in ELMo by Peters et al. (2018). The weights of this recurrent network were initialized with values provided by the authors (pre-trained on a 1-billion-word benchmark) and updated during training. The subsequent layers of the classifier are CNN blocks (see §4.2).
• Emotion sub-module: This sub-module takes word-level VAD emotion values (see §4.3), followed by a CNN block layer.
Outputs from both networks are mapped to constant size layers, concatenated, mapped to the output (classification) layer of size 6 and normalized using a softmax function. An overview of the system is presented in Figure 2.

Convolutional Block Structure
We base our network on Convolutional Blocks introduced by Gehring et al. (2017). We make use of a CNN encoder, which consists of several convolutional layers (blocks), followed by Gated Linear Units (GLU) layers introduced by Dauphin et al. (2016), and residual connections. The architecture of the CNN block is presented in Figure 2 (right).
Inputs to the first convolutional block are ELMo representations w = (w_1, ..., w_m), mapped to size d, for a given input sequence of length m. Each convolution kernel takes as input X ∈ R^{m×d} and outputs a single element Y ∈ R^{2d}. This output is then mapped to R^d using a GLU. We use m different kernels, whose outputs are concatenated to form a final output matrix Z ∈ R^{m×d} that serves as input to the next block. The output of the last block is mapped to a one-dimensional vector using a linear layer.
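A single block of this kind can be sketched in NumPy as follows. This is an illustrative re-implementation under our own naming and shape conventions, not the authors' code: a 1-D convolution produces 2d channels per position, a GLU gate halves them back to d, and a residual connection adds the block input back.

```python
import numpy as np

def conv_glu_block(X, W, b, k=5):
    """One convolutional block: 1-D convolution to 2d channels,
    a GLU gate reducing them to d, then a residual connection.

    X: (m, d) input representations; W: (k*d, 2*d) kernel; b: (2*d,).
    Returns an (m, d) matrix, the same shape as the input.
    """
    m, d = X.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))          # zero-pad the sequence ends
    out = np.empty_like(X)
    for i in range(m):
        window = Xp[i:i + k].reshape(-1)          # flattened k*d receptive field
        z = window @ W + b                        # (2d,) pre-activation
        a, g = z[:d], z[d:]
        out[i] = a * (1.0 / (1.0 + np.exp(-g)))   # GLU: A * sigmoid(B)
    return X + out                                # residual connection
```

Stacking N such blocks preserves the (m, d) shape throughout, so the final linear layer only has to map the last block's output to the label space.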

VAD Lexicon
To model the emotions carried by each tweet at the word level, we use VAD features extracted from an external lexicon introduced by Mohammad (2018). Each of the 14,000 words in the lexicon is represented by a vector in the VAD space (v ∈ [0, 1]^3), and each sentence is associated with the matrix formed by concatenating the VAD vectors of its words (V ∈ R^{m×3}). To label out-of-vocabulary (OOV) words, the closest word in the lexicon is found using the difflib library in Python (a Ratcliff-Obershelp-style string similarity). If no word with more than 90% similarity is found, a default VAD value (v = [0.5, 0.5, 0.5]) is assigned. At the end of this process, around 50% of the words in the training set are labeled with VAD values.
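The OOV back-off described above can be sketched with the standard-library `difflib` module (a minimal sketch; the function name and the toy lexicon values are ours, not from the released lexicon):

```python
import difflib

DEFAULT_VAD = (0.5, 0.5, 0.5)  # neutral fallback for unmatched words

def vad_features(word, lexicon, cutoff=0.9):
    """Look up a word's (valence, arousal, dominance) vector,
    backing off to the closest lexicon entry when the word is OOV.

    `lexicon` maps words to VAD tuples; `cutoff` is the minimum
    string similarity (90% in the paper) for accepting a match.
    """
    if word in lexicon:
        return lexicon[word]
    match = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return lexicon[match[0]] if match else DEFAULT_VAD
```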

Classifier Ensemble
The model ensemble consists of a 6-emotion (general) CNN classifier and six binary CNN classifiers (e.g., "joy" vs. all other emotions). The final prediction is made by looking for agreement among the binary classifiers: five classifiers predict the "negative" class and the remaining one predicts the "positive" class with a confidence score above a certain threshold T. If these conditions are not met, the tweet is classified using the 6-emotion classifier. The threshold T is tuned based on validation accuracy.
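The agreement rule can be sketched as follows (a simplified illustration; the function and argument names are our own):

```python
def ensemble_decision(binary_scores, general_prediction, T=0.86):
    """Combine six one-vs-all binary classifiers with a 6-way classifier.

    `binary_scores` maps each emotion to its binary classifier's
    confidence for the "positive" class. If exactly one classifier
    votes positive and its confidence exceeds the threshold T, that
    emotion is returned; otherwise fall back to the 6-way prediction.
    """
    positives = [e for e, score in binary_scores.items() if score >= 0.5]
    if len(positives) == 1 and binary_scores[positives[0]] > T:
        return positives[0]
    return general_prediction
```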

Experiments
In this section, we describe the procedure for training classifiers as part of the Ensemble Classifier. The parameters were tuned based on both validation loss and accuracy.

Preprocessing
Each tweet in the dataset is first tokenized using the spaCy tokenizer (https://spacy.io). Then, each of the six most common emojis is mapped to a sequence of ASCII characters (e.g., an emoji mapped to ":d"). As the last step, start- and end-of-sentence tokens (<SOS>, <EOS>) are added, together with pad tokens (<PAD>) to match the maximum sequence length.
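These preprocessing steps can be sketched as follows (a minimal illustration; the single-entry emoji table is a placeholder for the paper's six mappings, whose exact emojis are not listed here):

```python
EMOJI_MAP = {"\U0001F600": ":d"}  # placeholder for the paper's 6 emoji mappings

def preprocess(tokens, max_len=40):
    """Map emojis to ASCII strings, add sentence boundary tokens,
    then truncate or pad to a fixed sequence length."""
    tokens = [EMOJI_MAP.get(t, t) for t in tokens]
    seq = ["<SOS>"] + tokens + ["<EOS>"]
    seq = seq[:max_len]                           # truncate if too long
    seq += ["<PAD>"] * (max_len - len(seq))       # pad to fixed length
    return seq
```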

Training Procedure
Our deep emotion classifier is composed of two CNN blocks (N = 2) stacked on top of ELMo and one CNN block stacked on top of the VAD features. We set the window size of the convolutional block to 5, the ELMo size to 1024 (mapped to d = 256), the initial learning rate of the Adam optimizer (Kingma and Ba, 2014) to 0.001, the dropout rate to 0.5, the batch size to 128, and the threshold T to 0.86.
Each batch of samples used for training the binary classifiers is balanced by randomly drawing half of the batch from positive examples and half from negative examples (the pool of negative labels is five times larger). Sampling in this way makes training more robust to overfitting. Additionally, noise is added to the training samples: a small fraction of negative examples is sampled and presented to the classifier as positive (Section 6.1).
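The balanced sampling with label noise can be sketched as follows (an illustrative implementation under our own names and a hypothetical 5% noise rate; the paper does not report the exact rate):

```python
import random

def balanced_noisy_batch(positives, negatives, batch_size=128,
                         noise_rate=0.05, rng=None):
    """Sample a batch of (example, label) pairs: half positive, half
    negative, flipping a small fraction of negatives to the positive
    label as injected training noise."""
    rng = rng or random.Random()
    half = batch_size // 2
    batch = [(x, 1) for x in rng.choices(positives, k=half)]
    for x in rng.choices(negatives, k=half):
        label = 1 if rng.random() < noise_rate else 0  # label noise
        batch.append((x, label))
    rng.shuffle(batch)
    return batch
```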

Implicit Emotion Prediction Results
In this section we present the results on the implicit emotion prediction task. The six binary classifiers and the 6-emotion classifier used in the Ensemble Classifier were chosen based on the validation accuracy presented in Table 2. Our system achieved a macro F1 score of 66.2%, whereas the top three participating systems reported scores of 71.4%, 71%, and 70.3%, respectively. Table 2 shows that some emotions (e.g., joy, fear) are easier to predict. In some cases we see an improvement in validation accuracy after adding noise; surprisingly, this does not lead to a statistically significant improvement (Figure 3). Results for the 6-emotion classifier and the Ensemble Classifier are presented in Table 3. Some emotions are easier to predict than others, as corroborated by the confusion matrix in Figure 4: joy is easier to predict, whereas predicting anger remains difficult (also shown in Table 2). Some emotions are hard to distinguish (surprise with fear and disgust), whereas others are very unlikely to be confused with each other (e.g., joy with disgust). Our model likely commits errors for two reasons: first, emotions are not disjoint, so a sentence can express more than one emotion at the same time (i.e., a sentence can be classified as either "disgust" or "fear"); second, several emotion labels could be assigned to the same sentence by changing only the trigger word (e.g., the sentence "I am #TRIGGERWORD to see you here." can be classified as either joy or surprise, depending on whether the trigger word was "happy" or "surprised").

Model Generalization
In order to have a better understanding of the performance of our system for real world applications, we tested our system on an explicit emotion prediction task.

Dataset and Task
For our experiments, we used the Cornell Movie Corpus built by Danescu-Niculescu-Mizil and Lee (2011), which is composed of around 300,000 utterances extracted from 600 movies. A group of internal annotators manually annotated a subset of 58,000 lines with at most 2 of 7 emotion labels (fear, surprise, anger, disgust, joy, sad, neutral). We use this data for two experiments. In the first experiment, we measure how well the classifier predictions correlate with human annotations for the six emotions; for this experiment we create the dataset D1 by randomly sampling 4,800 lines, 800 for each emotion class (excluding the neutral class). In the second experiment, we measure how well the classifier is able to predict the neutral emotion; for this we create the dataset D2 by extracting a subset of 45,000 neutral lines.

Prediction on 6 emotions
In the first experiment, we take the top 2 emotions predicted by the final system on D1 and check whether at least one of the predicted labels matches one of the gold labels. F1 scores are presented in Table 4.
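The matching criterion used here can be written down compactly (a sketch; the function name is ours):

```python
def top2_hit(predicted_top2, gold_labels):
    """True if at least one of the top-2 predicted emotions appears
    among the (at most 2) gold labels annotated for an utterance."""
    return bool(set(predicted_top2) & set(gold_labels))
```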

Prediction on neutral emotion
In the second experiment, the classifier predicts the neutral emotion when every emotion is predicted with low confidence (lower than 0.5). We evaluate our system on D2. The final system predicts the neutral emotion for 85% of the sentences, whereas fastText reaches only 4% accuracy, misclassifying the neutral lines as joy with high confidence (> 80%).
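The neutral decision rule amounts to a simple threshold check (a sketch under our own naming; `confidences` maps each of the six emotions to the model's confidence):

```python
def is_neutral(confidences, threshold=0.5):
    """Predict the neutral class when no emotion's confidence
    reaches the threshold (0.5 in our experiment)."""
    return max(confidences.values()) < threshold
```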
In conclusion, the results show that our model generalizes well to the Cornell Movie Corpus compared to a fastText classifier trained in the same way on the task dataset. While we do not expect to reproduce exactly the same performance on this corpus, since its word distribution and writing style are very different, the system generalizes reasonably well.