EmotionX-JTML: Detecting emotions with Attention

This paper addresses the problem of automatic recognition of emotions in conversational text datasets for the EmotionX challenge. Emotion is a human characteristic expressed through several modalities (e.g., auditory, visual, tactile), so detecting emotions from text alone is a difficult task, even for humans. This paper evaluates several neural architectures based on attention models (AMs), which extract the relevant parts of the context within a conversation to identify the emotion associated with each utterance. Empirical results on the validation datasets demonstrate the effectiveness of the approach compared to the reference models in some cases, while in other cases simpler models yield better results.


Introduction
With technology increasingly present in people's lives, human-machine interaction needs to be as natural as possible, including the recognition of emotions. Emotions are an intrinsic characteristic of humans, often associated with mood, temperament, personality, disposition or motivation (Averill, 1980). Moreover, emotions are inherently multimodal; as such, we perceive them in great detail through vision or speech (Jain and Li, 2011).
Detecting emotions from text poses particular difficulties. For instance, an issue that arises when working with conversational text data is that the same utterance (message) can express different emotions depending on its context. Table 1 illustrates the issue with utterances from the challenge datasets (Chen et al., 2018) that express different emotions with the same word. Table 1: Two dialogs from Friends TV scripts. The word "Okay!" denotes different emotions depending on the context.
Despite improvements with neural architectures, given an utterance in a conversation without any previous context, it is not always obvious even for human beings to identify the associated emotion. In many cases, utterances that are too short are hard to classify. For instance, the utterance 'Okay' can express either agreement or anger; in such cases the context plays an essential role in disambiguation. Therefore, using context information from the previous utterances in a conversation flow is a crucial step for improving emotion classification.
In this paper, we explore the use of attention models (AMs) to learn the context representation, both as a manner to differentiate the current utterance from its context and as a mechanism to highlight the most relevant information while ignoring parts unnecessary for emotion classification. We propose and compare different neural-based methods for context representation learning by leveraging a recurrent neural network architecture with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Chung et al., 2014) in combination with AMs.

Related Work
The identification of emotions is an essential task for understanding natural language and building conversational systems. Previous works on recognizing emotion in text documents consider three categories: keyword-based, learning-based, and hybrid approaches (Kao et al., 2009).
In recent years, learning methods based on neural architectures have achieved great success. Emotion recognition can be framed as a sentence classification task and has been addressed using various traditional statistical methods, such as hidden Markov models (HMM) (Stolcke et al., 2000), conditional random fields (CRF) (Zimmermann, 2009), and support vector machines (SVM) (Henderson et al., 2012). Recent work has shown advances in text classification using deep learning techniques, such as convolutional neural networks (CNNs) (Kalchbrenner and Blunsom, 2013; Lee and Dernoncourt, 2016), recurrent neural networks (RNNs) (Lee and Dernoncourt, 2016; Ji et al., 2016), and long short-term memory models (LSTM) (Shen and Lee, 2016). Previous works have suggested utilizing context as possible prior knowledge for utterance classification (Lee and Dernoncourt, 2016; Shen and Lee, 2016). Contextual information from preceding utterances has been found to improve classification performance, but the effect depends on specific aspects of the dataset (Ortega and Vu, 2017). These works highlight that such information should be differentiable from the current utterance information; otherwise, the contextual information can have a negative impact.
Attention mechanisms (AMs), introduced by Bahdanau et al. (2014), have contributed to significant improvements in many natural language processing tasks, for instance machine translation (Bahdanau et al., 2014), sentence classification (Shen and Lee, 2016), summarization (Rush et al., 2015), uncertainty detection (Adel and Schütze, 2016), speech recognition (Chorowski et al., 2015), sentence pair modeling (Yin et al., 2015), question answering (Golub and He, 2016), document classification (Yang et al., 2016), and textual entailment (Rocktäschel et al., 2015). AMs let the model decide which parts of the input to pay attention to according to their relevance to the task.

Data
Conversational datasets with utterance information, such as movies, television scripts, or chat records, are readily accessible. However, despite the importance of emotion detection in conversational systems, most datasets do not have emotion tags, so it is not possible to use such data directly to train models to identify emotions. The EmotionX challenge provides two datasets annotated with emotion tags. The first, denoted Friends, contains the scripts of seasons 1 to 9 of the Friends TV show. The second, denoted EmotionPush, consists of private conversations between friends on Facebook Messenger collected by the EmotionPush app (2016).
Each utterance in the datasets has the same format: the user, the message, and the emotion label. The label is either one of the six primary emotions (anger, disgust, fear, happiness, sadness, surprise) defined by Ekman (1987), or neutral. The EmotionPush dataset has a more skewed label distribution than the Friends dataset, as shown in Fig. 1.
Both Friends and EmotionPush datasets contain 1,000 dialogues. Utterances in the EmotionPush dataset are on average much shorter than those of the TV show scripts (6.84 vs. 10.67 words). The EmotionPush dataset is anonymized to hide users' details such as names of real people, locations, organizations, and email addresses. Additional steps were applied to ensure the privacy of users as described in the dataset paper (Chen et al., 2018).

Figure 2: An overview of the architecture of the attention-based model for classifying emotions in the conversation context.

Model
The architecture of the model has two main parts: the CNN-based utterance representation and the attention mechanism for context representation learning. Figure 2 shows an overview of the model. The model feeds the context representation into a softmax layer which outputs the posterior probability of each emotion class given the current utterance and its context.

Utterance Representation
The proposed architecture uses CNNs for the representation of each utterance. For the emotion classification task, the input matrix represents an utterance and its context (i.e., the n previous utterances). Each row of the matrix stores the embedding of the corresponding word, resulting in an input matrix M ∈ R^{m×d}, where m is the number of words and d the embedding dimension. The weights of the word embeddings are initialized with the 300-dimensional GloVe embeddings pre-trained on Common Crawl data (Pennington et al., 2014). The model performs a discrete 1D convolution on the input matrix with a set of filters f of width |f| across all embedding dimensions d, as described by the following equation:

(M ∗ f)(x) = Σ_{i=1}^{|f|} Σ_{j=1}^{d} M(x + i, j) · f(i, j)   (1)

After the convolution, the model applies a max pooling operation that keeps only the highest activation of each filter. Additionally, the model applies filters with different window sizes 3-5 (multi-windows), which span a different number of input words. Then, the model concatenates all feature maps into one vector which represents the current utterance and its context.
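The multi-window convolution and max-over-time pooling described above can be sketched in a few lines of numpy. This is an illustrative re-implementation, not the authors' code; the function name, filter counts, and window sizes are assumptions (only the windows 3-5 and the 300-dimensional embeddings come from the text):

```python
import numpy as np

def utterance_features(M, filters):
    """Encode an utterance matrix M (m words x d dims) with multi-window
    1D convolutions followed by max-over-time pooling, then concatenate
    the pooled feature maps into a single vector."""
    feature_maps = []
    m, d = M.shape
    for width, bank in filters.items():           # bank: (n_filters, width, d)
        n_filters = bank.shape[0]
        positions = m - width + 1
        conv = np.zeros((n_filters, positions))
        for i in range(positions):
            window = M[i:i + width, :]            # |f| x d slice of the input
            conv[:, i] = np.tensordot(bank, window, axes=([1, 2], [0, 1]))
        feature_maps.append(conv.max(axis=1))     # keep the highest activation per filter
    return np.concatenate(feature_maps)           # one vector per utterance

rng = np.random.default_rng(0)
M = rng.standard_normal((10, 300))                # 10 words, 300-dim embeddings
filters = {w: rng.standard_normal((4, w, 300)) for w in (3, 4, 5)}  # windows 3-5
u = utterance_features(M, filters)                # vector of length 3 * 4 = 12
```

With 4 filters per window size and three window sizes, each utterance is mapped to a 12-dimensional vector regardless of its length, which is what allows utterances of different lengths to be compared in the attention layer.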

Attention Layer
The model applies an attention layer to different sequences of input vectors, e.g., representations of consecutive utterances in a conversation. For each of the input vectors u(t − i) at time step t − i in a conversation, the model computes the attention weight for the current time step t as follows:

α(t − i) = exp(f(u(t − i))) / Σ_j exp(f(u(t − j)))   (2)

where f is the scoring function. In the model, f is the linear function of the input u(t − i):

f(u(t − i)) = W^T u(t − i)   (3)

where W is a trainable parameter. The output attentive_u after the attention layer is the weighted sum of the input sequence:

attentive_u = Σ_i α(t − i) · u(t − i)   (4)
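The attention computation described above, a linear score per context vector, a softmax over the scores, and a weighted sum, is short enough to sketch directly. A minimal numpy version (names and dimensions are illustrative assumptions):

```python
import numpy as np

def attend(context, W):
    """Score each context vector u(t-i) with a linear function f(u) = W.u,
    normalize the scores with softmax, and return the attention weights
    together with the weighted sum of the input sequence."""
    scores = context @ W                           # f(u(t-i)) for each time step
    scores = scores - scores.max()                 # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    attentive_u = alpha @ context                  # weighted sum of the sequence
    return alpha, attentive_u

rng = np.random.default_rng(1)
context = rng.standard_normal((4, 12))             # 4 utterance vectors of dim 12
W = rng.standard_normal(12)                        # trainable scoring parameter
alpha, attentive_u = attend(context, W)
```

The weights alpha make explicit which context utterances the model attends to, which is what lets the model ignore irrelevant parts of the conversation history.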

Context Modeling
This paper evaluates different methods to learn the context representation using AMs.
Max This method applies max-pooling on top of the utterance representations, spanning all the context utterances and the embedding dimensions.
Input This method applies the attention mechanism directly on the utterance representations. The weighted sum of all the utterances represents the context information.

GRU-Attention
This method uses a sequential model with GRU cells on top of the utterance representations to learn the relationship between the context and the current utterance over time. The output of the hidden layer of the last state is the context representation.
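The GRU-Attention method above can be sketched with a single-layer GRU unrolled over the utterance vectors, returning the last hidden state as the context representation. This is a generic GRU cell, not the authors' exact configuration; the parameter layout and dimensions are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_context(utterances, params):
    """Run a GRU over the sequence of utterance vectors and return the
    hidden state of the last step as the context representation.
    params holds the gate matrices (Wz, Uz, Wr, Ur, Wh, Uh)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(Uz.shape[0])
    for u in utterances:                           # oldest context first
        z = sigmoid(Wz @ u + Uz @ h)               # update gate
        r = sigmoid(Wr @ u + Ur @ h)               # reset gate
        h_tilde = np.tanh(Wh @ u + Uh @ (r * h))   # candidate state
        h = (1 - z) * h + z * h_tilde              # interpolate old and new state
    return h                                       # context representation

rng = np.random.default_rng(2)
d_in, d_h = 12, 8
params = tuple(rng.standard_normal((d_h, d_in)) if i % 2 == 0
               else rng.standard_normal((d_h, d_h)) for i in range(6))
utterances = rng.standard_normal((4, d_in))        # 4 context utterance vectors
h = gru_context(utterances, params)                # vector of length 8
```

Unlike the Max and Input methods, the GRU variant is order-sensitive: the same utterances in a different order yield a different context representation, which lets the model exploit the conversational flow.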

Experiments
For the experiments, the neural architectures apply an end-to-end learning approach, i.e., with minimum text preprocessing. For cross-validation, the splitting strategy divides the data by dialogues, similar to (Chen et al., 2018). The challenge evaluates performance using the metrics weighted accuracy (WA) and unweighted accuracy (UWA), defined in equations 5 and 6:

WA = Σ_{l∈C} s_l · a_l   (5)

UWA = (1/|C|) Σ_{l∈C} a_l   (6)

where a_l denotes the accuracy of emotion class l, s_l denotes the percentage of utterances in emotion class l, and C is the set of emotion classes.
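Both metrics follow directly from the per-class accuracies. A minimal sketch (the toy labels below are illustrative, not taken from the challenge data):

```python
from collections import Counter

def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Compute WA = sum_l s_l * a_l and UWA = mean of a_l over classes,
    where a_l is the accuracy on class l and s_l its share of utterances."""
    counts = Counter(y_true)                       # utterances per class
    total = len(y_true)
    per_class_acc = {}
    for label in counts:
        correct = sum(1 for t, p in zip(y_true, y_pred)
                      if t == label and p == label)
        per_class_acc[label] = correct / counts[label]
    wa = sum((counts[l] / total) * per_class_acc[l] for l in counts)
    uwa = sum(per_class_acc.values()) / len(per_class_acc)
    return wa, uwa

# Toy example with a skewed, neutral-heavy label distribution
y_true = ["neutral"] * 8 + ["anger", "joy"]
y_pred = ["neutral"] * 8 + ["neutral", "joy"]
wa, uwa = weighted_and_unweighted_accuracy(y_true, y_pred)  # 0.9, ~0.667
```

The toy example shows why the two metrics can disagree: a classifier that favors the majority class scores high WA but is penalized by UWA, which weights every class equally regardless of its frequency.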
Table 2 shows the experimental results, including baselines, for the emotion detection task. This paper evaluated a multinomial Naive Bayes (NB) model and the proposed attention model (AM). Surprisingly, the NB model outperforms the neural models on the UWA metric in both datasets, with 57.4% and 57.3%. This result could be related to the size of the datasets, since neural architectures take advantage of learning on large-scale data.
The attention model performs well on the EmotionPush dataset but fails to improve on the Friends dataset for the WA metric. Further evaluation of the results, as depicted in Fig. 3, shows that the label imbalance toward the neutral emotion affects the predictions of the other labels.

Conclusions and Future Work
This paper presents a neural attention model for the EmotionX challenge. Attention models take advantage of the context information in conversational datasets for recognizing emotions. The results obtained through several experiments outperformed the baseline methods on some metrics for the EmotionPush dataset but were less effective on the Friends dataset.
Despite the promising results with attention models, the model struggles to accurately detect ambiguous utterances in the Friends dataset due to the label imbalance and the small scale of the data. As such, large-scale conversational corpora with annotated data become crucial for pushing the frontiers of emotion recognition.
Attention methods have the potential to improve the accuracy of emotion detection in conversational datasets, and future work can explore additional strategies for attention models.