ANA at SemEval-2019 Task 3: Contextual Emotion detection in Conversations through hierarchical LSTMs and BERT

This paper describes the system submitted by the ANA Team for SemEval-2019 Task 3: EmoContext. We propose a novel Hierarchical LSTMs for Contextual Emotion Detection (HRLCE) model. It classifies the emotion of an utterance given its conversational context. The results show that, on this task, our HRLCE outperforms the most recent state-of-the-art text classification framework, BERT. We combine the results generated by BERT and HRLCE to achieve an overall score of 0.7709, which ranked 5th on the final leaderboard of the competition among 165 teams.


Introduction
Social media has been a fertile environment for the expression of opinions and emotions via text. The manifestation of this expression differs from traditional or conventional opinion communication in text (e.g., essays). It is usually short (e.g., on Twitter) and contains new forms of constructs, including emojis, hashtags, and slang words. This constitutes a new challenge for the NLP community. Most studies in the literature have focused on the detection of sentiment (i.e., positive, negative, or neutral) (Mohammad and Turney, 2013; Kiritchenko et al., 2014).
Recently, emotion classification from social media text has started receiving more attention (Mohammad et al., 2018; Yadollahi et al., 2017). Emotions have been extensively studied in psychology (Ekman, 1992; Plutchik, 2001). Their automatic detection may reveal important information in online social environments, such as online customer service, where a user is conversing with an automatic chatbot. Empowering the chatbot with the ability to detect the user's emotion is a step towards the construction of an emotionally intelligent agent. Given the detected emotion, an emotionally intelligent agent would generate an empathetic response. Despite its potential usefulness, detecting emotion in textual conversation has seen limited attention so far. One of the main challenges is that a single user utterance may be insufficient to recognize the emotion (Huang et al., 2018). Considering the context of the conversation is therefore essential, even for humans, especially given the lack of voice modulation and facial expressions. The use of figurative language, such as sarcasm, and the class imbalance add to the difficulty (Chatterjee et al., 2019a).

In this paper, we describe our model, proposed for the SemEval-2019 Task 3 competition: Contextual Emotion Detection in Text (EmoContext). The competition consists of classifying the emotion of an utterance given its conversational context. More formally, given a textual user utterance along with two turns of context in a conversation, the task is to classify the emotion of the user utterance as Happy, Sad, Angry, or Others (Chatterjee et al., 2019b). The conversations are extracted from Twitter.
We propose an ensemble approach composed of two deep learning models: the Hierarchical LSTMs for Contextual Emotion Detection (HRLCE) model and the BERT model (Devlin et al., 2018). BERT is a pre-trained language model that has shown great success in many NLP classification tasks. Our main contribution consists in devising the HRLCE model; Figure 1 illustrates its main components. We examine a transfer learning approach with several pre-trained models in order to encode each user utterance semantically and emotionally at the word level. The proposed model uses Hierarchical LSTMs (Sordoni et al., 2015) followed by a multi-head self-attention mechanism (Vaswani et al., 2017) for contextual encoding at the utterance level.
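The paper does not spell out the exact rule used to combine the two models, so the sketch below illustrates one common ensembling choice: averaging the per-class probabilities of HRLCE and BERT and taking the argmax. The function and label names are illustrative assumptions, not the authors' code.

```python
# Hypothetical ensemble rule: average the class probabilities of the two
# models and pick the highest-scoring class. The exact combination used
# in the paper is unspecified; this is an illustrative assumption.
EMOTIONS = ["happy", "sad", "angry", "others"]

def ensemble_predict(probs_hrlce, probs_bert):
    """Average the two models' class probabilities and return the argmax label."""
    avg = [(a + b) / 2.0 for a, b in zip(probs_hrlce, probs_bert)]
    return EMOTIONS[max(range(len(avg)), key=avg.__getitem__)]

# HRLCE leans towards "sad", BERT is less certain; the ensemble keeps "sad".
print(ensemble_predict([0.1, 0.6, 0.2, 0.1], [0.2, 0.35, 0.25, 0.2]))
```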
The model evaluation on the competition's test set resulted in a 0.7709 harmonic mean of the macro-F1 scores across the categories Happy, Angry, and Sad. This result ranked 5th on the final leaderboard of the competition among 142 teams with a score above the organizers' baseline.

Embeddings for semantics and emotion
We use different kinds of embeddings that have been deemed effective in the literature at capturing not only the syntactic or semantic information of words, but also their emotional content. We briefly describe them in this section.
GloVe (Pennington et al., 2014) is a widely used pre-trained vector representation that captures fine-grained syntactic and semantic regularities. It has shown great success in word similarity tasks and on Named Entity Recognition benchmarks.
ELMo, or Embeddings from Language Models (Peters et al., 2018), are deep contextualized word representations. These representations encode polysemy, i.e., they capture the variation in the meaning of a word depending on its context. The representations are learned functions of the input, pre-trained with a deep bidirectional LSTM model. They have been shown to work well in practice on multiple language-understanding tasks such as question answering, entailment, and sentiment analysis. In this work, our objective is to detect emotion accurately given the context; hence, employing such contextual embeddings can be crucial.
DeepMoji (Felbo et al., 2017) is a pre-trained model containing rich representations of emotional content. It was pre-trained on the task of predicting the emojis contained in text, using bidirectional LSTM layers combined with an attention layer. A distant-supervision approach was deployed to collect a massive dataset (1.2 billion tweets) with a diverse set of noisy emoji labels, on which DeepMoji was pre-trained. This led to state-of-the-art performance when fine-tuning DeepMoji on a range of target tasks related to sentiment, emotion, and sarcasm.

Hierarchical RNN for context
One of the building components of our proposed model (see Figure 1) is the hierarchical or context recurrent encoder-decoder (HRED) (Sordoni et al., 2015). The HRED architecture is used for encoding dialogue context in multi-turn dialogue generation (Serban et al., 2016) and has proven effective at capturing the context of dialogue exchanges. It contains two types of recurrent neural network (RNN) units: an encoder RNN, which maps each utterance to an utterance vector, and a context RNN, which further processes the utterance vectors. HRED is expected to produce a better representation of the context in dialogues because the context RNN allows the model to represent the information exchanged between the two speakers.
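The two-level encoding described above can be sketched in PyTorch as follows. This is a minimal illustration of the HRED idea (utterance RNN feeding a context RNN), not the paper's implementation; the dimensions and single-direction LSTMs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HREDEncoder(nn.Module):
    """Minimal HRED-style sketch: an utterance-level LSTM maps each
    utterance to a vector, and a context-level LSTM processes the
    resulting sequence of utterance vectors."""
    def __init__(self, emb_dim=32, utt_hidden=64, ctx_hidden=64):
        super().__init__()
        self.encoder_rnn = nn.LSTM(emb_dim, utt_hidden, batch_first=True)
        self.context_rnn = nn.LSTM(utt_hidden, ctx_hidden, batch_first=True)

    def forward(self, utterances):
        # utterances: (batch, n_utts, n_words, emb_dim)
        b, n_utts, n_words, d = utterances.shape
        flat = utterances.reshape(b * n_utts, n_words, d)
        _, (h, _) = self.encoder_rnn(flat)           # h: (1, b*n_utts, utt_hidden)
        utt_vecs = h[-1].reshape(b, n_utts, -1)      # one vector per utterance
        ctx_states, _ = self.context_rnn(utt_vecs)   # (b, n_utts, ctx_hidden)
        return ctx_states

enc = HREDEncoder()
out = enc(torch.randn(2, 3, 10, 32))  # 2 dialogues, 3 turns, 10 words per turn
print(out.shape)  # torch.Size([2, 3, 64])
```

The context RNN sees one vector per turn, so its hidden states summarize the exchange so far rather than individual words.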

BERT
BERT, the Bidirectional Encoder Representations from Transformers (Devlin et al., 2018), is a pre-trained model producing context representations that can be very convenient and effective. BERT representations can be fine-tuned for many downstream NLP tasks by adding just one additional output layer for the target task, eliminating the need to engineer a task-specific architecture. In this setting, it has advanced the state-of-the-art performance on 11 NLP tasks. Using BERT in this work slightly improved the final result when we combined it with our HRLCE in an ensemble setting.

Importance Weighting
Importance Weighting (Sugiyama and Kawanabe, 2012) is used when the label distributions of the training and test sets differ, which is the case for the competition datasets (Table 2). It corresponds to weighting the samples according to their importance when computing the loss.
A supervised deep learning model can be regarded as a parameterized function $f(x; \theta)$. Backpropagation through a differentiable loss is a method of empirical risk minimization (ERM). Denote the training samples as $(x_i^{tr}, y_i^{tr})$, $i \in [1 \dots n_{tr}]$, and the test samples as $(x_i^{te}, y_i^{te})$, $i \in [1 \dots n_{te}]$. The ratio $P_{te}(x)/P_{tr}(x)$ is referred to as the importance of a sample $x$. When the label distributions of the training and test data differ, $P_{te}(x) \neq P_{tr}(x)$, the training of the model $f_\theta$ is said to be under covariate shift. In this situation, the parameter $\theta$ should be estimated through importance-weighted ERM:

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \frac{P_{te}(x_i^{tr})}{P_{tr}(x_i^{tr})} \, \ell\big(f(x_i^{tr}; \theta),\, y_i^{tr}\big) \quad (1)$$
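The weighting scheme above can be sketched in a few lines: each class receives the weight $P_{te}(y)/P_{tr}(y)$, and each training sample's loss is scaled by its class weight. The distributions below are made up for illustration and are not the competition's actual proportions.

```python
# Minimal sketch of importance-weighted ERM with class-level weights:
# each training sample's loss is scaled by P_te(y) / P_tr(y).
def importance_weights(p_test, p_train):
    """Per-class weight = test-set class probability / training-set class probability."""
    return {c: p_test[c] / p_train[c] for c in p_test}

def weighted_loss(losses, labels, weights):
    """Mean of per-sample losses, each scaled by its class weight."""
    return sum(weights[y] * l for l, y in zip(losses, labels)) / len(losses)

# Illustrative distributions: classes rarer in training than in testing
# get up-weighted, and vice versa.
p_tr = {"happy": 0.30, "sad": 0.30, "angry": 0.30, "others": 0.10}
p_te = {"happy": 0.05, "sad": 0.05, "angry": 0.05, "others": 0.85}
w = importance_weights(p_te, p_tr)
print(round(w["others"], 2))  # 8.5
```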

Models
Denote the input as x = [u_1, u_2, u_3], where u_i is the i-th utterance in the three-turn dialogue. y is the emotion expressed in u_3, given u_1 and u_2 as context.
To justify the effectiveness of the modules in HRLCE, we propose two baseline models: SA-LSTM (SL) and SA-LSTM-DeepMoji (SLD). The SL model is part of the SLD model, while the latter constitutes the utterance encoder of our HRLCE. Therefore, we describe the models consecutively in Sections 3.1, 3.2, and 3.3.

SA-LSTM (SL)
Let x be the concatenation of u_1, u_2, and u_3. Hereby, x = [x_1, x_2, ..., x_n], where x_i is the i-th word in the combined sequence. Denote the pre-trained GloVe model as G. Since the GloVe model can be used directly by looking up the word x_i, we use G(x_i) to represent its output. In contrast, the ELMo embedding depends not just on the word x_i but on all the words of the input sequence. When taking the entire sequence x as input, n vectors can be extracted from the pre-trained ELMo model; denote them as E = [E_1, E_2, ..., E_n]. E_i contains both the contextual and semantic information of the word x_i. We use a two-layer bidirectional LSTM as the encoder of the sequence x, denoted LSTM_e for simplicity. In order to better represent the information of x_i, we use the concatenation of G(x_i) and E_i as its feature embedding. Therefore, we have the following recurrence:

$$h_i^e = \mathrm{LSTM}_e\big([G(x_i); E_i],\, h_{i-1}^e\big)$$

Denote $h^e = [h_1^e, \dots, h_n^e]$ as the n hidden states of the encoder given the input x. Self-attention has been proven effective at helping RNNs deal with long-range dependency problems (Lin et al., 2017). We use the multi-head version of self-attention (Vaswani et al., 2017) and set the number of channels for each head to 1.

[Table 1: Macro-F1 scores and their harmonic means of the four models]
Denote the self-attention module as SA; it takes as input all the hidden states of the LSTM and summarizes them into a single vector, $h_x^{sa} = SA(h^e)$. To predict the label, we append a fully connected (FC) layer that projects $h_x^{sa}$ onto the space of emotions. Denote the FC layer as output.
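To make the pooling step concrete, here is a simplified single-head additive self-attention in the style of Lin et al. (2017): score each hidden state, normalize the scores with a softmax, and return the weighted sum as the summary vector. The parameter shapes and random inputs are illustrative, and this omits the multi-head machinery the paper actually uses.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, W, v):
    """Simplified single-head self-attention pooling: one scalar score per
    hidden state, softmax over scores, weighted sum of the hidden states."""
    scores = np.tanh(H @ W) @ v   # (n,) one score per hidden state
    alpha = softmax(scores)       # attention distribution over the n states
    return alpha @ H              # summary vector of dimension d

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))   # 5 hidden states of dimension 8
W = rng.standard_normal((8, 8))
v = rng.standard_normal(8)
h_sa = attention_pool(H, W, v)
print(h_sa.shape)  # (8,)
```

Whatever the sequence length n, the output is a single fixed-size vector, which is what the FC output layer consumes.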

SA-LSTM-DeepMoji (SLD)
SLD is the combination of SL and DeepMoji. An SLD model without the output layer is in fact the utterance encoder of the proposed HRLCE, illustrated on the right side of Figure 1. Denote the DeepMoji model as D; when taking x as input, its output is represented as $h_x^d = D(x)$. We concatenate $h_x^d$ and $h_x^{sa}$ as the feature representation of the sequence x. As in SL, an FC layer is added in order to predict the label:

$$\hat{y} = \mathrm{output}\big([h_x^{d}; h_x^{sa}]\big)$$

HRLCE
Unlike SL and SLD, the input of HRLCE is not the concatenation of u_1, u_2, and u_3. Following the notation in Sections 3.1 and 3.2, an utterance u_i is first encoded as $h_{u_i}^{sa}$ and $h_{u_i}^{d}$. We use another two-layer bidirectional LSTM as the context RNN, denoted LSTM_c. Its hidden states are iterated through

$$h_i^c = \mathrm{LSTM}_c\big([h_{u_i}^{sa}; h_{u_i}^{d}],\, h_{i-1}^c\big),$$

where $h_0^c = 0$. The three hidden states $h^c = [h_1^c, h_2^c, h_3^c]$ are fed as input to a self-attention layer. The resulting vector $SA(h^c)$ is also projected onto the label space by an FC layer.

BERT
BERT (Section 2.3) can take as input either a single sentence or a pair of sentences. A "sentence" here corresponds to any arbitrary span of contiguous words. In this work, in order to fine-tune BERT, we concatenate utterances u_1 and u_2 to constitute the first sentence of the pair; u_3 is the second sentence of the pair. The reason behind this setting is the assumption that the target emotion y is directly related to u_3, while u_1 and u_2 provide additional context information. This forces the model to consider u_3 differently.
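The sentence-pair layout can be sketched as follows, using BERT's standard special tokens and segment ids; the helper function is hypothetical and the actual tokenizer is omitted.

```python
# Hypothetical sketch of the sentence-pair layout used for BERT fine-tuning:
# u1 and u2 form the first segment (token type 0), u3 the second (type 1).
def build_bert_pair(u1, u2, u3):
    tokens = ["[CLS]"] + u1 + u2 + ["[SEP]"] + u3 + ["[SEP]"]
    # segment (token type) ids: 0 covers [CLS], u1, u2 and the first [SEP];
    # 1 covers u3 and the final [SEP]
    segment_ids = [0] * (len(u1) + len(u2) + 2) + [1] * (len(u3) + 1)
    return tokens, segment_ids

tokens, segs = build_bert_pair(["hi"], ["how", "are", "you"], ["i", "am", "sad"])
print(tokens)
# ['[CLS]', 'hi', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'sad', '[SEP]']
```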

Data preprocessing
From the training data we noticed that emojis play an important role in expressing emotions. We first use the ekphrasis package (Baziotis et al., 2017) to clean up the utterances. ekphrasis corrects misspellings, handles textual emoticons (e.g., ':)))'), and normalizes tokens (hashtags, numbers, user mentions, etc.). In order to keep the semantic meaning of the emojis, we use the emojis package 1 to first convert them into their textual aliases and then replace the ":" and "_" with spaces.
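The emoji normalization step can be sketched as below. To stay self-contained, this uses a tiny hand-made alias table instead of the actual emojis package, so the alias spellings are illustrative assumptions.

```python
# Minimal sketch of the emoji normalization: each emoji becomes a textual
# alias, then ":" and "_" are replaced with spaces. The alias table here
# is a tiny illustrative stand-in for the emojis package.
ALIASES = {"\U0001F602": ":joy:", "\U0001F622": ":cry:"}

def normalize_emojis(text):
    for emo, alias in ALIASES.items():
        text = text.replace(emo, " " + alias + " ")
    # strip the alias delimiters and collapse extra whitespace
    return " ".join(text.replace(":", " ").replace("_", " ").split())

print(normalize_emojis("so funny \U0001F602\U0001F602"))  # "so funny joy joy"
```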

Environment and hyper-parameters
We use PyTorch 1.0 as the deep learning framework, and our code in Python 3.6 is accessible on GitHub 2 . For fair comparison, we use the same parameter settings for the common modules shared by SL, SLD, and HRLCE. The dimension of the encoder LSTM is set to 1500 per direction; the dimension of the context LSTM is set to 800 per direction. We use the Adam optimizer with an initial learning rate of 5e-4 and a decay ratio of 0.2 after each epoch. According to the description in (CodaLab, 2019), the label distribution of the dev and test sets is roughly 4% for each of the emotions. However, from the dev set (Table 2) we know that the proportions of the emotion categories are better described as 5% each, so we use 5% as the empirical estimate of the distribution P(x_te). We did not use the exact proportions of the dev set as the estimate, to prevent overfitting to the dev set. The sample distribution of the training set is used as P(x_tr). We use the cross-entropy loss for all the aforementioned models, and the losses of the training samples are weighted according to Eq. 1.

Results and analysis
We run 9-fold cross-validation on the training set. In each iteration, one fold is used to prevent the models from overfitting, while the remaining folds are used for training. Thus, every model is trained 9 times to ensure stability. Inference over the dev and test sets is performed in each iteration, and we use a majority-voting strategy to merge the results of the 9 iterations. The results are shown in Table 1: the proposed HRLCE model performs best. The performances of SLD and SL are very close to each other; on the dev set, SLD performs better than SL, but they have almost the same overall scores on the test set. The macro-F1 scores of the individual emotion categories differ considerably: the classification accuracy for the emotion Sad is the highest in most cases, while the emotion Happy is the least accurately classified by all models. We also notice that the performance on the dev set is generally slightly better than that on the test set.
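The majority-voting merge over the 9 fold models can be sketched as follows; label names and the toy predictions are illustrative.

```python
from collections import Counter

# Sketch of the majority-vote merge across cross-validation folds: for each
# test sample, the label predicted most often across the fold models wins.
def majority_vote(fold_predictions):
    """fold_predictions: one list of labels per fold, all of equal length."""
    merged = []
    for sample_preds in zip(*fold_predictions):
        merged.append(Counter(sample_preds).most_common(1)[0][0])
    return merged

# Three folds voting on two test samples.
folds = [["sad", "others"], ["sad", "angry"], ["happy", "angry"]]
print(majority_vote(folds))  # ['sad', 'angry']
```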

Conclusions
Considering the competitive results generated by BERT, we combined BERT and our proposed model in an ensemble and obtained 0.7709 on the final test leaderboard. From the confusion matrix of our final submission, we notice that there are very few misclassifications among the three emotion categories (Angry, Sad, and Happy); for example, the emotion Sad is rarely misclassified as Happy or Angry. Most of the errors correspond to classifying emotional utterances into the Others category. As a future improvement, we think the models should first perform the binary classification Others versus Not-Others, and then classify the Not-Others instances into their respective emotions.