CAiRE_HKUST at SemEval-2019 Task 3: Hierarchical Attention for Dialogue Emotion Classification

Detecting emotion from dialogue is a challenge that has not yet been extensively surveyed. One could consider the emotion of each dialogue turn to be independent, but in this paper, we introduce a hierarchical approach to classify emotion, hypothesizing that the current emotional state depends on previous latent emotions. We benchmark several feature-based classifiers using pre-trained word and emotion embeddings, state-of-the-art end-to-end neural network models, and Gaussian processes for automatic hyper-parameter search. In our experiments, hierarchical architectures consistently give significant improvements, and our best model achieves a 76.77% F1-score on the test set.


Introduction
Customer service can be challenging for both the givers and receivers of services, leading to emotions on both sides. Even human service-people who are trained to deal with such situations struggle to do so, partly because of their own emotions. Neither do automated systems succeed in such scenarios. What if we could teach machines how to react under these emotionally stressful situations of dealing with angry customers?
This paper represents work on the SemEval 2019 shared task (Chatterjee et al., 2019b), which aims to bring more research on teaching machines to be empathetic, specifically by contextual emotion detection in text. Given a textual dialogue with two turns of context, the system has to classify the emotion of the next utterance into one of the following emotion classes: Happy, Sad, Angry, or Others. The training dataset contains 15K records for emotion classes, and contains 15K records not belonging to any of the aforementioned emotion classes.
The most naive first step would be to recognize emotion from a given flattened sequence, which has been researched extensively despite the very abstract nature of emotion (Socher et al., 2013; Felbo et al., 2017a; McCann et al., 2017; Chatterjee et al., 2019a). However, these flat models do not work very well on dialogue data, since the turns must simply be concatenated, which flattens the hierarchical information. Not only does the sequence get too long, but the hierarchy between sentences is also destroyed (Hsu and Ku, 2018; Kim et al., 2018). We believe that a natural flow of emotion exists in dialogue, and that using such hierarchical information will allow us to better predict the emotion of the last utterance.
Naturally, the next step is to be able to detect emotion with a hierarchical structure. To the best of our knowledge, this task of extracting emotional knowledge in a hierarchical setting has not yet been extensively explored in the literature. Therefore, in this paper, we investigate this problem in depth with several strong hierarchical baselines and by using a large variety of pre-trained word embeddings.

Methodology
In this task, we focus on two main approaches: 1) feature-based and 2) end-to-end. The former compares several well-known pre-trained embeddings, including GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2018), as well as emotional embeddings. We combine these pre-trained features with a simple Logistic Regression (LR) and XGBoost (Chen and  We mainly compare the performances of flat models and hierarchical models, which also take into account the sequential turn information of dialogues.

Feature-based Approach
The pre-trained feature-based approach can be subdivided into two categories: 1) word embeddings pre-trained only on semantic information, and 2) emotional embeddings that augment word embeddings with emotional or emoji information. We also examine the use of both categories.
Word Embeddings These include the standard pre-trained non-contextualized GloVe (Pennington et al., 2014), the contextualized embeddings from the bidirectional long short-term memory (biLSTM) language model ELMo (Peters et al., 2018), and the more recent Transformer-based embeddings from the bidirectional language model BERT (Devlin et al., 2018).
Emotional Embeddings These refer to two types of features equipped with emotional knowledge. The first is a word-level emotional representation called Emo2Vec. It is trained with six different emotion-related tasks and has shown extraordinary performance over 18 different datasets. The second is a sentence-level emotional representation called DeepMoji (Felbo et al., 2017b), a biLSTM with an attention model trained to predict emojis from text on a corpus of 1,246 million tweets. Finally, we use Emoji2Vec (Eisner et al., 2016), which directly maps emojis to continuous representations.
ELMo This model from Peters et al. (2018) is a deep contextualized embedding extracted from a pre-trained bidirectional language model that has shown state-of-the-art performance in several natural language processing (NLP) tasks.
BERT This is the state-of-the-art bidirectional pre-trained language model that has recently shown excellent performance in a wide range of NLP tasks. Here, we use BERT BASE as our sentence encoder. However, the original model fails to capture emoji features because all the emoji tokens are missing from its vocabulary. Therefore, we concatenate each sentence representation from BERT with a bag-of-words Emoji2Vec representation (Eisner et al., 2016). Then, a Universal Transformer (UTRS) is used as a context encoder to encode the whole sequence.
LSTM and Universal Transformer LSTM is the widely known model used almost ubiquitously in the literature, while UTRS is a recently published recurrent extension of the multi-head self-attention based model, the Transformer (Vaswani et al., 2017). Finally, for all models, we consider a hierarchical extension which considers the turn information as well. We add another instance of the same model to also encode sentence-level information on top of the word-level representations. We also apply word-level attention to

Evaluation
In this section, we present the evaluation metrics used in the experiment, followed by results on feature-based, end-to-end, and ensemble approaches and Gaussian process search.

Training Details
Feature-Based For the feature-based approach, we run LR and XGBoost on features using the Scikit-Learn toolkit (Pedregosa et al., 2011) without any additional tuning.
ELMo For the flat model, we pre-train ELMo by only fine-tuning the scalar-mix weights, as suggested in Peters et al. (2018). We extract a 1024-dimension bag-of-words representation for each turn and concatenate the three turns into a 3072-dimension vector, which is passed to a multilayer perceptron (MLP). For the hierarchical model, we employ two methods: 1) run an LSTM model over each turn's representation, and 2) pre-extract all three layer weights (LSTM and CNN) and concatenate them into a 3072-dimension vector representation for each turn, which is then passed to an LSTM model. We report the results of the latter pre-extracted method as it performs better.
BERT For the implementation details of BERT BASE, we refer interested readers to Devlin et al. (2018). Note that for hierarchical BERT, we use a six-layer UTRS as the context encoder. Each layer of UTRS consists of a multi-head attention block with four heads, where the dimension of each head is set to ten, and a convolution feed-forward block with 50 filters. We use the modified Adam optimizer from Devlin et al. (2018) to train our model. The initial learning rate and dropout are 5e-5 and 0.3 respectively.  Figure 2 have a hidden size of 1000 and dropout of 0.5, a hidden size of 1500 and dropout of 0.2, and a hidden size of 1000 and dropout of 0.4 respectively. Then, we train the UTRS using the best hyper-parameters found by the GP. It has a hidden size of 488 with a single hop and ten attention heads. Noam (Vaswani et al., 2017) is used as the learning rate decay.
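The Noam decay mentioned above can be sketched as follows. This is a minimal illustration of the schedule as defined in Vaswani et al. (2017), not our training code: the function name `noam_lr`, the `factor` argument, and the warm-up length of 4000 steps are assumptions made for the example; only the hidden size of 488 comes from the text.

```python
def noam_lr(step, d_model=488, warmup_steps=4000, factor=1.0):
    """Learning rate at a given training step under the Noam schedule.

    The rate grows linearly for `warmup_steps` steps and then decays
    proportionally to the inverse square root of the step number.
    `warmup_steps` and `factor` are illustrative values (assumptions).
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)


if __name__ == "__main__":
    for s in (1, 1000, 4000, 16000):
        print(s, noam_lr(s))
```

The two branches of the `min` meet exactly at `step == warmup_steps`, which is where the learning rate peaks.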

LSTM and Universal Transformer
Gaussian Processes GP hyper-parameter search returns a set of hyper-parameters, both continuous and discrete, together with the validation set F1 score. We implement the GP model using an existing library called GPyOpt. We run a GP for 100 iterations using the Expected Improvement (Jones et al., 1998) acquisition function with 0.05 jitter as a starting point. We use a hierarchical universal transformer (HUTRS) as the base model since it is the model with the most hyper-parameters to tune with a single split.
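For readers unfamiliar with the acquisition function, the sketch below computes the standard closed-form Expected Improvement for a maximization objective (here, validation F1) from a GP posterior mean and standard deviation. It is written in plain Python for illustration and is not GPyOpt's implementation, whose internal sign conventions may differ; the function and argument names are assumptions, and only the jitter value of 0.05 comes from the text.

```python
import math


def _norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best_so_far, jitter=0.05):
    """Expected Improvement at one candidate point (maximization form).

    mu, sigma   : GP posterior mean and standard deviation at the candidate
    best_so_far : best objective value observed so far
    jitter      : exploration margin subtracted from the improvement
    """
    improvement = mu - best_so_far - jitter
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * _norm_cdf(z) + sigma * _norm_pdf(z)


if __name__ == "__main__":
    # Two candidates with the same predicted F1 but different uncertainty.
    print(expected_improvement(0.70, 0.1, 0.70))
    print(expected_improvement(0.70, 0.2, 0.70))
```

With equal predicted means, the candidate with the larger posterior uncertainty receives the higher score, which is what drives exploration during the search.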

Evaluation Metrics
The task is evaluated with a micro F1 score over the three emotion classes, i.e., Happy, Sad and Angry, computed as the harmonic mean of the precision and the recall. This scoring function has been provided by the challenge organizers (Chatterjee et al., 2019b).
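As an illustration of this metric, the following sketch computes a micro-averaged F1 restricted to the three emotion classes. It is our reading of the description above, not the organizers' official scoring script; the function name and the lowercase label strings are assumptions made for the example.

```python
def micro_f1(gold, pred, emotion_classes=("happy", "sad", "angry")):
    """Micro-averaged F1 over the emotion classes only.

    Correct predictions of the "others" class do not count as true
    positives, but confusions between "others" and an emotion class
    are counted as false positives or false negatives.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p in emotion_classes:
            if p == g:
                tp += 1
            else:
                fp += 1  # predicted an emotion that is not the gold label
        if g in emotion_classes and p != g:
            fn += 1      # missed a gold emotion label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["happy", "sad", "angry", "others", "others"]
    pred = ["happy", "others", "sad", "angry", "others"]
    print(micro_f1(gold, pred))
```

In this toy example there is one true positive, two false positives, and two false negatives, so precision, recall, and F1 all equal 1/3.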

Voting Scheme
For each model, we randomly shuffle and split the training set ten times and we apply a voting scheme to create a more robust prediction. We use a majority vote scheme to select the most often seen predictions, and in case of ties, we give the priority to Others. This scheme is applied to all end-to-end models since it improved the validation set performance.
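The voting step can be sketched as below. The function name is hypothetical, and the tie-breaking rule is an assumption: we read "priority to Others" as meaning that any tie for the most frequent label resolves to Others.

```python
from collections import Counter


def majority_vote(predictions, tie_label="others"):
    """Return the most frequent label among the given predictions.

    If two or more labels share the highest count, fall back to
    `tie_label` (assumed interpretation of the tie-breaking rule).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else tie_label


if __name__ == "__main__":
    # Predictions for one test example from ten differently split models.
    votes = ["sad"] * 5 + ["others"] * 3 + ["angry"] * 2
    print(majority_vote(votes))
```

In practice the function would be applied once per test example, over the predictions of the ten models trained on the different splits.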

Ensemble Models
To further refine our predictions, we build ensembles of different models. We create five ensemble models by combining the hierarchical versions of BERT, LSTM, and UTRS. We then gather two lesser-performing models, a hierarchical LSTM and the best feature-based model (XGBoost with ELMo and DeepMoji features), and combine them with the five ensemble predictions using majority voting to get our final prediction. We also show the Pearson correlation between models in Figure 2.

Experimental Results
From Table 1, we can see that the DeepMoji features outperform all the other features by a large margin. Indeed, DeepMoji has been trained using a large emotion corpus, which is compatible with the current task. Emoji2Vec gets a very low F1-score since it includes only emojis, and indeed, by adding GloVe, a more general embedding, we achieve better performance. For the end-to-end approach, the hierarchical biLSTM with GloVe word embeddings achieves the highest score, with a 75.64% F1-score. Our ensemble achieves a higher score than the individual models. The best ensemble model achieves a 76.77% F1-score. As shown in Table 3, the ensemble method is effective in maximizing the performance of a bag of models.

Related work
Emotional knowledge can be represented in different ways. Word-level emotional representations, inspired by word embeddings, learn a vector for each word, and have shown effectiveness in different emotion-related tasks, such as sentiment classification (Tang et al., 2016), emotion classification, and emotion intensity prediction. Sentence-level emotional representations, such as DeepMoji (Felbo et al., 2017a), train a biLSTM model to encode the whole sentence to predict the corresponding emoji of the sentence. The learned model achieves state-of-the-art results on eight datasets. Sentiment lexicons from Taboada et al. (2011) show that word lexicons annotated with sentiment/emotion labels are effective in sentiment classification. This method is further developed using both supervised and unsupervised approaches in Wang and Xia (2017). Also, other models, such as a deep averaging network (Iyyer et al., 2015), attention-based network (Winata et al., 2018), and memory network (Dou, 2017), have been investigated to improve the classification performance. Practically, the application of emotion classification has been investigated on interactive dialogue systems (Bertero et al., 2016; Winata et al., 2017; Siddique et al., 2017).

Conclusion
In this paper, we compare different pre-trained word embedding features by using Logistic Regression and XGBoost along with flat and hierarchical architectures trained in end-to-end models. We further explore a GP for faster hyperparameter search. Our experiments show that hierarchical architectures give significant improvements and we further gain accuracy by combining the pre-trained features with end-to-end models.