EmotionX-Area66: Predicting Emotions in Dialogues using Hierarchical Attention Network with Sequence Labeling

This paper presents our system submitted to the EmotionX challenge, an emotion detection task on dialogues in the EmotionLines dataset. We formulate the task as a hierarchical network in which the network learns data representations at both the utterance level and the dialogue level. Our model is inspired by the Hierarchical Attention Network (HAN) and uses pre-trained word embeddings as features. We formulate emotion detection in dialogues as a sequence labeling problem to capture the dependencies among labels. We report accuracy for four emotions (anger, joy, neutral and sadness). The model achieved an unweighted accuracy of 55.38% on the Friends test dataset and 56.73% on the EmotionPush test dataset. We report an improvement of 22.51% on the Friends dataset and 36.04% on the EmotionPush dataset over the baseline results.


Introduction
Emotion detection and classification constitute a significant part of research in the area of natural language processing (NLP). This research aims to detect the presence of an emotion in a text snippet and to categorize it correctly. Emotions are typically classified using the categories proposed by (Ekman et al., 1987), namely anger, disgust, fear, joy, sadness and surprise. A significant amount of research has been dedicated to emotion classification in a variety of texts such as news and news headlines (Strapparava and Mihalcea, 2008; Staiano and Guerini, 2014), blog posts (Mishne, 2005) and fiction (Mohammad, 2012b).
With the advent of social media and dialogue systems like personal assistants and chatbots, emotion detection has also been studied in conversational short texts: (Krcadinac et al., 2013) for instant messages, (Mohammad, 2012a) and (Wang et al., 2012) for Twitter, and (Preotiuc-Pietro et al., 2016) for status updates on Facebook.
Conversational short texts consist of dialogues between two or more entities. A dialogue naturally has a hierarchical structure, with words contributing to an utterance and a set of utterances contributing to a dialogue (Kumar et al., 2017). Table 1 shows an example of a dialogue which consists of 6 utterances with corresponding speakers and emotions. In these dialogues, context builds as the dialogue progresses. There is a dependency between consecutive utterances, and hence the classification of such utterances can be treated as a sequence labeling problem. In particular, (Stolcke et al., 2000; Venkataraman et al., 2003) and (Kim et al., 2010; Chen and Eugenio, 2013; Kumar et al., 2017) have captured dependencies among utterances for dialogue act classification using Hidden Markov Models (HMM) and Conditional Random Fields (CRF) respectively. Several ways of incorporating such context information in artificial neural networks have also been proposed in (Liu, 2017).

Figure 1: An illustration of the proposed Hierarchical Attention Network.
The EmotionX shared task consists of detecting the emotion of each utterance in the EmotionLines dataset. The dataset (Chen et al., 2018) contains dialogues collected from Friends TV show scripts and private Facebook Messenger chats. Each utterance has been annotated with one of eight emotions: the six basic emotions proposed by (Ekman et al., 1987) and two additional labels, neutral and non-neutral. The shared task focuses on detecting only four of these eight emotions, namely joy, sadness, anger and neutral. In this paper, we present our approach to detecting emotions in utterances. Inspired by (Kumar et al., 2017), we use a Hierarchical Attention Network (HAN) to build context at both the utterance level and the dialogue level. We treat emotion detection at the utterance level as a sequence labeling problem and use a linear chain CRF as the classifier.

Proposed Model
The dataset for the task consists of dialogues. Each dialogue $D_i$ consists of a sequence of utterances denoted as $D_i = (u_1, u_2, \ldots, u_n)$, where $n$ is the number of utterances in the dialogue. Each utterance $u_j$ is associated with a target emotion label $y_j \in Y$. To build context within a dialogue, we consider a moving context window $N_k$ of length $k$ and combine all the utterances within the window with their target labels to create multiple sets of context utterances. These sets of utterances are given as input to our model.
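The windowing step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `context_windows`, the stride of one utterance, and the handling of dialogues shorter than $k$ (kept as a single shorter window) are all assumptions, since the text does not specify them.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (utterance text, emotion label)


def context_windows(dialogue: List[Utterance], k: int) -> List[List[Utterance]]:
    """Slide a window of length k over a dialogue, one utterance at a time.

    Assumptions (not stated in the paper): stride is 1, and a dialogue
    with fewer than k utterances yields one shorter window.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not dialogue:
        return []
    if len(dialogue) <= k:
        return [list(dialogue)]
    return [dialogue[i:i + k] for i in range(len(dialogue) - k + 1)]


# Hypothetical toy dialogue with four labeled utterances.
toy_dialogue = [
    ("hey", "neutral"),
    ("i got the job", "joy"),
    ("that is great", "joy"),
    ("thanks", "neutral"),
]
windows = context_windows(toy_dialogue, k=3)
```

Each window keeps the utterances together with their labels, so a sequence labeler can be trained on the window as one unit.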
The model consists of a HAN (Yang et al., 2016), where the first part is a word-level encoder with an attention layer, encoding each word in an utterance. The second part is an utterance-level encoder, encoding each utterance in the dialogue. The HAN is combined with a linear chain CRF classification layer for detecting emotions. Utterance-level emotion detection is treated as a sequence labeling problem, based on the fact that the emotion in an utterance depends on the emotions of previous utterances. An illustration of the complete model, comprising the embedding layer, word-level encoder, attention layer and utterance-level encoder, with a final CRF classification layer, is depicted in Figure 1.

Model Description
Embedding Layer: A context window $N_k$ consists of $k$ utterances, each having $l$ words. Each word $w_{ij}$ in an utterance $u_j$, where $j \in [1, k]$, is embedded into a low-dimensional vector space $\mathbb{R}^d$ using an embedding layer ($f_{embed}$) of size $d$, which projects the word into a representative word vector $x_{ij}$:

$$x_{ij} = f_{embed}(w_{ij})$$

We initialize the weights of the embedding layer with pre-trained GloVe embeddings (https://nlp.stanford.edu/projects/glove/).

Word-level Encoder: We use a bidirectional Gated Recurrent Unit (GRU) as the word-level encoder in the hierarchical network to summarize information from both directions for words. The bidirectional GRU contains a forward GRU which reads the utterance $u_j$ from $w_{1j}$ to $w_{lj}$ and a backward GRU which reads from $w_{lj}$ to $w_{1j}$:

$$\overrightarrow{h}_{ij} = \overrightarrow{\mathrm{GRU}}(x_{ij}), \quad i \in [1, l]$$

$$\overleftarrow{h}_{ij} = \overleftarrow{\mathrm{GRU}}(x_{ij}), \quad i \in [l, 1]$$

The forward hidden state $\overrightarrow{h}_{ij}$ and the backward hidden state $\overleftarrow{h}_{ij}$ are concatenated to obtain the word encoded representation $h_{ij} = [\overrightarrow{h}_{ij}; \overleftarrow{h}_{ij}]$.

Attention Layer: The intuition for using an attention layer is that a few words in an utterance are more important than others in identifying an emotion. Moreover, the informativeness of words is context dependent, i.e. the same set of words contributes differently in different contexts. We augment the word-level encoder with a deep self-attention mechanism (Baziotis et al., 2017) to obtain a more accurate estimation of the importance of each word. The attention mechanism assigns a weight $\alpha_{ij}$ to each word representation. Formally:

$$s_j = \sum_{i=1}^{l} \alpha_{ij} h_{ij}$$

where $s_j$ is the utterance representation.
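As a rough illustration of the attention step, the sketch below pools a list of word representations into one utterance vector as a weighted sum. The scoring function used here (a single tanh projection with a hypothetical weight vector `w` and bias `b`, followed by a softmax) is an assumption chosen for brevity; the paper only states that a deep self-attention mechanism assigns the weights, and the names `softmax` and `attention_pool` are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden: Sequence[Sequence[float]],
    w: Sequence[float],
    b: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """Return (alpha, s): attention weights and the weighted sum of hidden states.

    hidden holds one vector per word of the utterance; w and b are the
    parameters of an assumed one-layer scorer, e_i = tanh(w . h_i + b).
    """
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden]
    alpha = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]
    return alpha, pooled


# Two hypothetical 2-dimensional word representations.
word_states = [[1.0, 0.0], [0.0, 1.0]]
uniform_alpha, uniform_s = attention_pool(word_states, w=[0.0, 0.0])
skewed_alpha, skewed_s = attention_pool(word_states, w=[2.0, 0.0])
```

With a zero scoring vector every word gets the same weight and the result is the plain average; a non-zero `w` shifts weight toward the words it scores highly.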
Utterance-Level Encoder: As in the word-level encoder, the set of utterance representations $s_j$ is passed to a bidirectional GRU to obtain the final representation $g_j$ at the utterance level. These representations are passed to the CRF classification layer.
Linear Chain CRF: The bidirectional encoder captures dependencies among utterances. To model the dependency among labels, the final utterance representations are passed to the linear chain CRF classifier layer. CRFs are undirected graphical models that predict the optimal label sequence given an observed sequence. For a given context window $N_k$, the probability of predicting a sequence of emotion labels $y$ for a set of utterance representations $g$ is

$$P(y \mid g; w) = \frac{\exp\left(\sum_j w_j F_j(g, y)\right)}{\sum_{y'} \exp\left(\sum_j w_j F_j(g, y')\right)}$$

where $w_j$ is the set of parameters corresponding to the CRF layer, $F_j(g, y)$ is the feature function, and the sum in the denominator runs over all candidate label sequences $y'$ (Maskey, Spring 2010).
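To make the CRF layer concrete, the sketch below scores label sequences, normalizes them with the forward algorithm, and decodes the best sequence with Viterbi. It assumes the weighted feature sum decomposes into per-utterance emission scores and label-to-label transition scores, which is the usual linear chain setup but is not spelled out in the text; the emission scores would come from the utterance representations through some learned layer, and all function names here are illustrative.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def _logsumexp(values: Sequence[float]) -> float:
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_score(emissions: Matrix, transitions: Matrix, labels: Sequence[int]) -> float:
    """Unnormalized score of one label sequence (emissions plus transitions)."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def log_partition(emissions: Matrix, transitions: Matrix) -> float:
    """Log of the denominator, computed with the forward algorithm."""
    n = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j] + _logsumexp([alpha[i] + transitions[i][j] for i in range(n)])
            for j in range(n)
        ]
    return _logsumexp(alpha)


def sequence_probability(emissions: Matrix, transitions: Matrix, labels: Sequence[int]) -> float:
    """P(y | g) for one label sequence."""
    return math.exp(sequence_score(emissions, transitions, labels)
                    - log_partition(emissions, transitions))


def viterbi(emissions: Matrix, transitions: Matrix) -> List[int]:
    """Most probable label sequence under the same scores."""
    n = len(emissions[0])
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        step, pointers = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: best[i] + transitions[i][j])
            pointers.append(i_best)
            step.append(best[i_best] + transitions[i_best][j] + emissions[t][j])
        best = step
        backpointers.append(pointers)
    path = [max(range(n), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for 3 utterances and 2 labels.
toy_emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
flat_transitions = [[0.0, 0.0], [0.0, 0.0]]
sticky_transitions = [[2.0, -2.0], [-2.0, 2.0]]
```

With flat transitions the decoder follows the per-utterance scores; with transitions that favor repeating the previous label, the middle utterance is pulled toward its neighbors, which is the label dependency the CRF layer is meant to capture.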

Data Preparation
The dataset consists of two sets, viz. 1) dialogues collected from Friends TV show script and 2) Facebook messenger private chats. Both these datasets have characteristics of short texts. We describe our preprocessing strategies for these datasets below.

Pre-processing
EmotionPush: These are informal chats between two individuals. This data has the typical characteristics of short texts: it contains incomplete sentences, informal language, emoticons and excessive use of punctuation such as '?' and '!'. As part of preprocessing, we convert all emoticons to an appropriate emotion word. We also replace all occurrences of dates and times with the named entities 'DATE' and 'TIME'. We convert all contracted forms like 'can't' and 'haven't' to the appropriate expanded forms like 'can not' and 'have not'. The dataset contains named entities such as 'PERSON 354', 'ORGANIZATION 78' and 'LOCATION 8'. These entities are important for building the context, but they do not appear in the word embeddings. We convert all these named entities to pseudo entities which are present in the word embeddings but not present in the EmotionPush dataset vocabulary.

Friends TV show scripts: This dataset contains scene snippets with interaction between two or more speakers. Some of the utterances are incomplete and some have excessive punctuation. Unlike the EmotionPush dataset, this data contains no emoticons or tagged named entities. We convert the contracted forms as described above and remove extra punctuation. In this dataset, the speaker and the words uttered by the speaker play an important role in building the context. To incorporate this, we concatenate the speaker information to every utterance.
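The preprocessing steps described above can be sketched roughly as follows. Everything specific in this example is an assumption made for illustration: the emoticon table, the pseudo-entity words, the date and time patterns, and the choice to prepend the speaker name are hypothetical, since the paper does not list the actual mappings or patterns.

```python
import re

# Hypothetical lookup tables; the paper does not list the real ones.
EMOTICON_WORDS = {":)": "happy", ":(": "sad", ":D": "laughing"}
CONTRACTIONS = {"can't": "can not", "haven't": "have not"}
PSEUDO_ENTITIES = {"PERSON": "someone", "ORGANIZATION": "company", "LOCATION": "place"}

# Assumed surface patterns for entities, dates, times and repeated punctuation.
ENTITY_RE = re.compile(r"\b(PERSON|ORGANIZATION|LOCATION)[ _]\d+\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?:\s?(?:am|pm))?\b", re.IGNORECASE)
REPEAT_PUNCT_RE = re.compile(r"([?!.])\1+")


def expand_contractions(text: str) -> str:
    """Replace contracted forms with their expanded forms."""
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text


def preprocess_emotionpush(text: str) -> str:
    """Emoticons to words, dates and times to tags, entities to pseudo entities."""
    for emoticon, word in EMOTICON_WORDS.items():
        text = text.replace(emoticon, f" {word} ")
    text = DATE_RE.sub("DATE", text)
    text = TIME_RE.sub("TIME", text)
    text = expand_contractions(text)
    text = ENTITY_RE.sub(lambda m: PSEUDO_ENTITIES[m.group(1)], text)
    return " ".join(text.split())


def preprocess_friends(speaker: str, text: str) -> str:
    """Expand contractions, collapse repeated punctuation, prepend the speaker."""
    text = expand_contractions(text)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    return " ".join(f"{speaker} {text}".split())


push_example = preprocess_emotionpush("I can't meet PERSON_354 at 10:30 :(")
friends_example = preprocess_friends("Joey", "You haven't seen it?!!")
```

Emoticons are replaced first so that their characters are not altered by the later substitutions.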

Experiments and Results
The EmotionX challenge consists of detecting the emotion of each utterance in the EmotionLines dataset. Each utterance has been annotated with one of eight emotions: anger, sadness, joy, fear, disgust, surprise, neutral and non-neutral. Even though the shared task covers only four emotions, viz. joy, sadness, anger and neutral, we consider all eight emotions in our model. We train the model separately for each dataset. We use pre-trained 100-dimensional GloVe-Tweet embeddings for both datasets. These embeddings are used to initialize the weights of the embedding layer.
We also consider word priors as features. The word prior for a word is computed as

$$P(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\mathrm{count}(c_j)}$$

where $\mathrm{count}(w_i, c_j)$ is the frequency of word $w_i$ in class $c_j$ and $\mathrm{count}(c_j)$ is the total number of words in class $c_j$. We determine word priors for every word for all 8 emotion classes and concatenate these 8 features to the embedding feature vectors.
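A small sketch of the word prior computation follows, assuming whitespace-tokenized utterances paired with their emotion labels. The function names `word_priors` and `augment_embedding` and the toy data are illustrative; how unseen words are handled at test time is not stated in the paper and is left out here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def word_priors(
    utterances: Iterable[Tuple[Sequence[str], str]],
    classes: Sequence[str],
) -> Dict[str, List[float]]:
    """Map each word to [count(w, c) / count(c) for c in classes]."""
    word_class_counts: Dict[str, Counter] = defaultdict(Counter)
    class_totals: Counter = Counter()
    for tokens, label in utterances:
        for token in tokens:
            word_class_counts[token][label] += 1
            class_totals[label] += 1
    return {
        word: [counts[c] / class_totals[c] if class_totals[c] else 0.0 for c in classes]
        for word, counts in word_class_counts.items()
    }


def augment_embedding(embedding: Sequence[float], priors: Sequence[float]) -> List[float]:
    """Concatenate the per-class priors to a word's embedding vector."""
    return list(embedding) + list(priors)


# Hypothetical toy corpus with two classes.
toy_corpus = [
    (["lol", "weird"], "joy"),
    (["sorry", "lol"], "sadness"),
    (["lol"], "joy"),
]
toy_priors = word_priors(toy_corpus, classes=["joy", "sadness"])
```

In the setting described above, each 100-dimensional embedding would grow to 108 dimensions once the 8 class priors are appended.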
The hyper-parameters such as window length for context window, learning rate, optimizer, early stopping and dropout were tuned for performance during experimentation.
Results on both the EmotionPush and Friends test sets are listed in Table 2. We also report model performance on both development datasets in Table 3 and Table 4. The model achieved an improvement of 22.51% on the Friends dataset and 36.04% on the EmotionPush dataset over the baseline (Chen et al., 2018) results. We report an overall unweighted accuracy of 56.73% on the EmotionPush test dataset and 55.38% on the Friends test dataset.
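For readers unfamiliar with the metric, the sketch below computes unweighted accuracy under the assumption that it is the mean of the per-class accuracies over the evaluated emotions, so that every class counts equally regardless of its size. This definition is our reading of the metric rather than something stated in this paper, and skipping classes with no gold examples is a further assumption.

```python
from typing import List, Sequence


def unweighted_accuracy(gold: Sequence[str], pred: Sequence[str], classes: Sequence[str]) -> float:
    """Mean of per-class accuracies; classes absent from gold are skipped."""
    per_class: List[float] = []
    for label in classes:
        indices = [i for i, g in enumerate(gold) if g == label]
        if indices:
            correct = sum(1 for i in indices if pred[i] == label)
            per_class.append(correct / len(indices))
    if not per_class:
        raise ValueError("no evaluated class occurs in the gold labels")
    return sum(per_class) / len(per_class)


# Hypothetical gold and predicted labels.
toy_gold = ["joy", "joy", "neutral", "neutral", "neutral", "anger"]
toy_pred = ["joy", "neutral", "neutral", "neutral", "neutral", "joy"]
toy_uwa = unweighted_accuracy(toy_gold, toy_pred, ["joy", "sadness", "anger", "neutral"])
```

In the toy data the per-class accuracies are 0.5 for joy, 1.0 for neutral and 0.0 for anger, so the majority class cannot dominate the score.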

Discussion
To understand how the context builds over the dialogues, we performed exploratory analysis on both datasets. In the Friends dataset, we found some anomalies which can impact the performance of our system.
1. A few dialogues consist of utterances from different scenes, which breaks the continuity of the dialogue.
2. Some utterances have scene descriptions as part of the utterance. For example, in the record {"speaker": "Joey", "utterance": "and Phoebe picks up a wooden baseball bat and starts to swing as Chandler and Monica enter.)", "emotion": "non-neutral"}, the utterance is a scene description and is not spoken by any speaker.
3. We also found a few utterances having no words but only punctuation ('.' or '!') which is annotated with an emotion. For example, a) {"speaker": "Rachel", "utterance": "!", "emotion": "non-neutral"} and b) {"speaker": "Phoebe", "utterance": ".", "emotion": "non-neutral"}.
We did not find such anomalies in the EmotionPush dataset.
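Two of the anomalies described above can be flagged with simple checks, sketched below on the example records. Both heuristics are illustrative assumptions and not the procedure used in the analysis: an utterance is flagged if it contains only punctuation, or if its parentheses are unbalanced, which is how the scene-description example happens to look.

```python
import string
from typing import Dict, List


def is_punctuation_only(utterance: str) -> bool:
    """True if the utterance is non-empty and has no letters or digits."""
    stripped = utterance.strip()
    return bool(stripped) and all(ch in string.punctuation for ch in stripped)


def has_unbalanced_parentheses(utterance: str) -> bool:
    """Rough signal for a leaked scene description (assumed heuristic)."""
    return utterance.count("(") != utterance.count(")")


def flag_anomalies(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the records whose utterance trips either check."""
    return [
        r for r in records
        if is_punctuation_only(r["utterance"]) or has_unbalanced_parentheses(r["utterance"])
    ]


records = [
    {"speaker": "Joey",
     "utterance": "and Phoebe picks up a wooden baseball bat and starts to swing "
                  "as Chandler and Monica enter.)",
     "emotion": "non-neutral"},
    {"speaker": "Rachel", "utterance": "!", "emotion": "non-neutral"},
    {"speaker": "Phoebe", "utterance": ".", "emotion": "non-neutral"},
    {"speaker": "Monica", "utterance": "Hi there.", "emotion": "neutral"},  # hypothetical
]
flagged = flag_anomalies(records)
```

Checks like these only surface candidates for manual review; dialogues that mix utterances from different scenes would need other signals.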
The word embeddings do not have explicit emotion information for words. To incorporate this, we added word priors per class to word vectors and examined their effect on the performance of our model. Word priors improve the model performance by 17% in EmotionPush dataset and 19% in Friends dataset. For example, utterances like "Lol weird" and "I also have no shoes lol" belonging to emotion class 'joy' were misclassified without using word priors as features. Similarly, utterances such as "Sorry he cannot" and "Sorry about that person 107" belonging to emotion class 'sadness' were also misclassified.

Conclusion
In this paper, we present our submission for the EmotionX emotion detection challenge. We use a Hierarchical Attention Network (HAN) model to learn data representations at both the utterance level and the dialogue level. Additionally, we formalize the problem as a sequence labeling task and use a linear chain Conditional Random Field (CRF) as a classification layer to classify the utterances of dialogues in both the Friends and EmotionPush datasets. The model achieved an improvement of 22.51% on the Friends dataset and 36.04% on the EmotionPush dataset over the baseline results. In future, we would like to explore the speaker-listener relation with emotion and lexical features to improve the performance of the system.