EmotionX-DLC: Self-Attentive BiLSTM for Detecting Sequential Emotions in Dialogues

In this paper, we propose a self-attentive bidirectional long short-term memory (SA-BiLSTM) network to predict multiple emotions for the EmotionX challenge. The BiLSTM models the word dependencies within an utterance and extracts the most relevant features for emotion classification. Built on top of the BiLSTM, the self-attentive network models the contextual dependencies between utterances, which help to classify ambiguous emotions. We achieve unweighted accuracy scores of 59.6 and 55.0 on the Friends and the EmotionPush test sets, respectively.


Introduction
Emotion detection plays a crucial role in developing a smart dialogue system such as a chit-chat conversational bot [3]. As a typical sub-problem of sentence classification, emotion classification requires not only understanding the sentence of a single utterance, but also capturing the contextual information from the whole conversation.
The problems of sentence-level classification have been investigated heavily by means of deep neural networks, such as convolutional neural networks (CNN) [8], long short-term memory (LSTM) [12], and attention-based CNN [9]. Additional soft attention layers [1] are usually built on top of those networks, such that more attention will be paid to the most relevant words that lead to a better understanding of the sentence. LSTMs [7] are also useful to model contextual dependencies. For example, a contextual LSTM model is proposed to select the next sentence based on the former context [6], and a bidirectional LSTM (BiLSTM) is adopted to detect multiple emotions [3].
In this work, we utilize the self-attentive BiLSTM (SA-BiLSTM) model to predict multiple types of emotions for the given utterances in the dialogues. Our model imitates the two-step procedure a human follows to classify an utterance within its context, i.e., sentence understanding and extraction of the dependencies between contextual utterances. More specifically, we propose the bidirectional long short-term memory (BiLSTM) with the max-pooling architecture to embed the sentence into a fixed-size vector, as the BiLSTM network is capable of modeling the word dependencies in the sentence while the max-pooling helps to reduce the model size and obtains the most relevant features for emotion classification. Since data in this challenge is limited and specific words play a significant role in classifying the corresponding emotion, we apply the self-attention network [14] to extract the dependencies among all the utterances in the dialogue. Technically, the self-attention model computes the influence of utterance pairs and outputs the sentence embedding of one utterance as a weighted sum over all the utterances in the dialogue. The fully connected layers are then applied to the output sentence embedding to classify the corresponding emotion.
Word embeddings [13] are adopted to represent each word (token). A sentence (utterance) with $m$ tokens is then represented by $S = (w_1, w_2, \ldots, w_m)$, where $w_i$ is the $d$-dimensional word embedding of the $i$-th token in the sentence. Suppose a dialogue consists of $n$ sentences; the input then forms an $n \times m \times d$ tensor $M$ (see Figure 1). Via the process of sentence embedding (elaborated in Section 2.1), the tensor is converted to an $n \times 2l$ matrix $U$, where $l$ is the number of hidden units of each unidirectional LSTM. By applying the self-attentive network, we re-weight the sentence embedding matrix to obtain $U'$, which has the same shape as $U$. Finally, fully connected layers are trained to establish the mapping between the input $U'$ and the output emotion labels. ...

Sentence Embedding
In this work, we adopt the BiLSTM to learn the sentence embedding because it is one of the most popular neural network architectures for encoding sentences [5,11]. The forward LSTM and backward LSTM read the sentence $S$ in two opposite directions (see Figure 2):

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(w_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(w_t, \overleftarrow{h}_{t+1}).$$

The vectors $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are concatenated to a hidden state $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. Max-pooling [4] is then conducted along all the words of a sentence to output the final sentence representation, $u$.

The self-attention model, called Transformer [14], is an effective non-recurrent architecture for machine translation. We adopt it to capture the dependencies between utterances. Figure 3 shows the model used to build the dot-product attention of utterances, where the attention matrix is calculated by:

$$f(u_i, u_j) = \frac{u_i u_j^{\top}}{\sqrt{d_k}},$$

where $u_i$ is the $i$-th sentence embedding in the dialogue, and $d_k$ (i.e., $2l$) is the model dimension. An attention mask is applied to waive the inner attention between sentence embeddings and paddings. The $i$-th sentence embedding is finally re-weighted by summing over all the sentence embeddings to enhance the effect:

$$u'_i = \mathrm{softmax}(F_i)\, U,$$

where $F_i$ is an $n$-dimensional vector whose $j$-th element is $f(u_i, u_j)$.
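The two operations described in this section, max-pooling over the BiLSTM hidden states and masked scaled dot-product attention over the utterance embeddings, can be sketched in plain Python as follows. This is a minimal illustration rather than our actual implementation: the function names `max_pool` and `self_attend`, the argument `n_real` (the number of non-padding utterances), and the choice to return zero vectors for padding rows are all assumptions made for the example.

```python
import math


def max_pool(hidden_states):
    """Element-wise max over the token axis: m hidden states -> one vector u."""
    return [max(column) for column in zip(*hidden_states)]


def self_attend(U, n_real):
    """Re-weight utterance embeddings with masked scaled dot-product attention.

    U is a list of row vectors, one per utterance slot. Slots at index
    >= n_real are padding (assumed layout) and are excluded from every softmax.
    """
    d_k = len(U[0])
    scale = math.sqrt(d_k)
    out = []
    for i, u_i in enumerate(U):
        if i >= n_real:
            out.append([0.0] * d_k)  # assumption: padding rows stay empty
            continue
        # Scores f(u_i, u_j) for real utterances only; skipping the padded
        # positions is equivalent to masking them with a score of -inf.
        scores = [sum(a * b for a, b in zip(u_i, U[j])) / scale
                  for j in range(n_real)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # u'_i is the attention-weighted sum of all real utterance embeddings.
        out.append([sum(w * U[j][k] for j, w in enumerate(weights))
                    for k in range(d_k)])
    return out


# Toy usage: three tokens with 2-dimensional hidden states.
u = max_pool([[0.1, -0.3], [0.7, 0.2], [0.4, 0.9]])  # [0.7, 0.9]

# Toy dialogue: two real utterances followed by one padding slot.
U_prime = self_attend([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], n_real=2)
```

Because each output row is a convex combination of the real utterance embeddings, an utterance with ambiguous features can borrow evidence from the rest of the dialogue, while padded slots never contribute.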

Output and Loss
Finally, we apply fully connected layers to predict the corresponding emotions. During training, the weighted cross-entropy is adopted as the loss function. Since the challenge only focuses on classifying four types of emotions, rather than all eight types, we set the weights to zero for the unconsidered emotions.
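The effect of zeroing the class weights can be sketched as below. This is an illustrative sketch under stated assumptions: the label order in `EMOTIONS`, the weight of 1.0 for each of the four considered emotions, and the normalization by the sum of the weights are choices made for the example, not values reported in this paper.

```python
import math

# Assumed label order; the first four are the emotions evaluated in the challenge.
EMOTIONS = ["neutral", "joy", "sadness", "anger",
            "fear", "surprise", "disgust", "non-neutral"]
# Assumed weights: 1.0 for considered emotions, 0.0 for the unconsidered ones.
CLASS_WEIGHTS = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]


def weighted_cross_entropy(logits_batch, labels, weights=CLASS_WEIGHTS):
    """Weighted cross-entropy averaged over the total weight of the batch."""
    total, norm = 0.0, 0.0
    for logits, y in zip(logits_batch, labels):
        top = max(logits)
        # log-sum-exp gives the log of the softmax normalizer, computed stably.
        log_z = top + math.log(sum(math.exp(v - top) for v in logits))
        nll = log_z - logits[y]  # negative log-likelihood of the gold label
        total += weights[y] * nll
        norm += weights[y]
    # Utterances with zero-weight labels contribute nothing to the loss.
    return total / norm if norm > 0 else 0.0
```

With this weighting, an utterance labeled Fear, Surprise, Disgust or Non-neutral still passes through the network as context, but it produces no gradient through the loss.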

Data Preprocessing
The EmotionX dataset consists of the Friends TV scripts and the EmotionPush chat logs on Facebook Messenger, labeled with eight types of emotions, i.e., Neutral, Joy, Sadness, Anger, Fear, Surprise, Disgust and Non-neutral. The training set contains 720 dialogues each for Friends and EmotionPush, which yields a total of 1,440 dialogues, 21,294 sentences, and 9,885 unique words. In the challenge, only the following candidate labels are evaluated: Neutral, Joy, Sadness, and Anger. We also conduct the following steps to clean the data:
• Unicode symbols, except emojis (the direct expressions of human emotions), are removed. Person names, locations, numbers and websites are replaced with special tokens.
• The emoji symbols are converted to their corresponding meanings.
• Duplicated punctuation and symbols. Tokens with duplicated punctuation or letters, such as "oooooh", often imply non-neutral emotions. We rewrite such tokens as oh <duplicate> to avoid informal words. The same rule also applies to similar tokens. For example, "oh!!!!!!" is replaced by oh ! <duplicate>.
• Word tokenization. We use NLTK's TweetTokenizer [2] to split the sentences into tokens. All tokens are lowercased.
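The duplicate-collapsing rule above can be sketched with two regular expressions. This is a hedged illustration only: the thresholds (three or more repeated letters, two or more repeated punctuation marks), the punctuation set, and the helper name `normalize_token` are assumptions for the example, since the exact rules are not spelled out here.

```python
import re

DUP_TOKEN = "<duplicate>"
# Assumed thresholds: 3+ identical letters, 2+ identical punctuation marks.
_LETTER_RUN = re.compile(r"([a-z])\1{2,}")
_PUNCT_RUN = re.compile(r"([!?.,])\1+")


def normalize_token(token):
    """Collapse repeated characters and flag the token with <duplicate>."""
    token = token.lower()
    # "oooooh" -> "oh": keep a single copy of the repeated letter.
    token, n_letters = _LETTER_RUN.subn(r"\1", token)
    # "oh!!!!!!" -> "oh !": keep one mark and split it off as its own token.
    token, n_punct = _PUNCT_RUN.subn(r" \1", token)
    pieces = token.split()
    if n_letters or n_punct:
        pieces.append(DUP_TOKEN)
    return pieces


print(normalize_token("oooooh"))    # ['oh', '<duplicate>']
print(normalize_token("oh!!!!!!"))  # ['oh', '!', '<duplicate>']
print(normalize_token("hello"))     # ['hello']
```

Keeping the `<duplicate>` marker preserves the emphasis signal while mapping the elongated form back to a word that is likely to have a pre-trained embedding.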

Experimental Setup
We conduct two experiments with different model variants, BiLSTM and SA-BiLSTM, to validate whether our proposed model can learn the contextual information. The network settings for each model are summarized as follows:
• BiLSTM: BiLSTM + max-pooling + fully connected layers.
• SA-BiLSTM: BiLSTM + max-pooling + self-attentive network + fully connected layers.
The word embedding is 300-dimensional and taken from GloVe. The pack_padded_sequence and pad_packed_sequence utilities are used to deal with varying sequence lengths. For SA-BiLSTM, we limit the number of utterances to 25 for each dialogue. Because the training data is limited, the LSTM is set to one layer with only 256 hidden units. The fully connected layers consist of two middle layers, each of size 128. The mini-batch size for training BiLSTM is set to 16. Unlike BiLSTM, SA-BiLSTM is fed one dialogue at every training step. We adopt the Adam optimizer [10] with an initial learning rate of 0.0002, decayed by a factor of 0.99 every epoch. The dropout probability is set to 0.3 for the BiLSTM and self-attention layers. We train BiLSTM for 10 epochs and SA-BiLSTM for 20 epochs, which gives the best accuracy on the validation sets.
Table 1 reports the model performance on the validation sets, which consist of 80 dialogues each for Friends and EmotionPush. We evaluate two criteria, the weighted accuracy (WA) and the unweighted accuracy (UWA) [3]. The prediction accuracy for each class is also given in the table.
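To make the difference between the two criteria concrete, the sketch below computes both from gold and predicted labels. It assumes the usual definitions, namely that WA averages the per-class accuracies weighted by each class's share of the evaluated utterances, while UWA is their plain mean; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_and_unweighted_accuracy(y_true, y_pred, classes):
    """Return (WA, UWA, per-class accuracy) over the given candidate classes."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Only classes that actually occur among the gold labels are evaluated.
    evaluated = [c for c in classes if total[c] > 0]
    n = sum(total[c] for c in evaluated)
    per_class = {c: correct[c] / total[c] for c in evaluated}
    wa = sum(total[c] / n * per_class[c] for c in evaluated)
    uwa = sum(per_class.values()) / len(evaluated)
    return wa, uwa, per_class


# Toy example with a Neutral-heavy label distribution.
gold = ["neutral"] * 6 + ["joy"] * 2 + ["sadness", "anger"]
pred = ["neutral"] * 6 + ["joy", "neutral", "neutral", "neutral"]
wa, uwa, per_class = weighted_and_unweighted_accuracy(
    gold, pred, ["neutral", "joy", "sadness", "anger"])
print(round(wa, 3), round(uwa, 3))  # 0.7 0.375
```

In the toy example, a classifier that favors the majority class scores well on WA but poorly on UWA, which is why UWA is the more informative criterion under class imbalance.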

BiLSTM Versus SA-BiLSTM
Interestingly, the simpler model, BiLSTM, achieves higher WA, with up to 0.6% and 0.5% improvement on Friends and EmotionPush, respectively. On the other hand, SA-BiLSTM outperforms BiLSTM in terms of UWA, with up to 1.4% and 1.7% improvement. Note that BiLSTM tends to predict the emotions Neutral and Joy far more accurately than the other two emotions because most utterances are labeled with these two emotions, i.e., 45.03% as Neutral and 11.79% as Joy in Friends, and 66.85% as Neutral and 14.25% as Joy in EmotionPush. Overall, SA-BiLSTM provides a more balanced prediction for each type of emotion than BiLSTM. In particular, SA-BiLSTM gains better predictive accuracy on Sadness and Anger, with improvements of up to 11.3% and 6.0% on Sadness and 5.9% and 5.3% on Anger in Friends and EmotionPush, respectively.

Results of Test Set
We submit the results produced by SA-BiLSTM and obtain the evaluation scores provided by the challenge organizer. Table 2 lists the experimental results evaluated on the test set, which consists of 400 dialogues, 200 each for Friends and EmotionPush.
The results indicate that our model shows a strong bias towards predicting the Neutral and Joy emotions over the Sadness and Anger emotions in both datasets. In particular, our model predicts the Anger emotion in EmotionPush extremely poorly. Moreover, the UWA on EmotionPush is smaller than that on Friends, which differs from our results on the validation set. We conjecture that the distributions of the validation set and the test set may be slightly different. To obtain a more robust solution, we could train multiple models with different random seeds and ensemble them, for example by averaging over checkpoints.
We notice that the speaker information is also important for emotion classification. Table 3 shows two consecutive utterances made by the same speaker from EmotionPush, where the first utterance seems literally less emotional than the second one. Nevertheless, the two utterances should carry the same emotion, i.e., Anger, because they are made by the same speaker consecutively. In contrast, our model gives a false prediction (i.e., Neutral) for the second utterance, probably because it treats the two utterances separately. We believe that our model would gain some improvement from incorporating speaker information.
Table 3:
Speaker ID | Utterance
1051336806 | but /you/ bug /me/
1051336806 | and you hundred percent told peopel stfu.

Conclusion
In this work, we propose SA-BiLSTM to predict multiple emotions for given utterances in the dialogues. The proposed network is a self-attentive network built on top of BiLSTM. Our results on the validation set show that BiLSTM has better WA performance, while SA-BiLSTM outperforms BiLSTM in terms of UWA. According to the test results, SA-BiLSTM predicts the Neutral and Joy emotions more accurately than the Sadness and Anger ones. The bias may be caused by the uneven distribution of the training data. We hope to improve our model by either incorporating more related data or retrieving more linguistic information.