PKUSE at SemEval-2019 Task 3: Emotion Detection with Emotion-Oriented Neural Attention Network

This paper presents our system for SemEval-2019 Task 3, “EmoContext: Contextual Emotion Detection in Text”. We propose a deep learning architecture based on bidirectional LSTM networks, augmented with an emotion-oriented attention network that extracts emotion information from an utterance. Experimental results show that our model outperforms its variants and the baseline. Overall, the system achieves a micro-averaged F1 score of 75.57%.


Introduction
With the rapid development of social media platforms such as Twitter, a huge number of textual dialogues have emerged. It is a challenge for chatbots to generate responses that take user emotions into account and thereby avoid inappropriate conversations. Emotion detection in text (Chatterjee et al., 2019) is a research area within Natural Language Processing that aims to detect the emotion a user expresses in text.
Many techniques have been proposed. Wang et al., Hasan et al., and Liew and Turtle used feature engineering to extract features manually. Deep learning-based approaches have performed well in this area in recent years. Some methods (Wöllmer et al., 2010; Metallinou et al., 2012; Poria et al., 2017; Chernykh et al., 2017) used recurrent neural networks to model the sequence of utterances for emotion detection. However, those models did not highlight the emotion-related parts. We use an attention mechanism to locate the parts of the utterance that express emotions.
Task 3 in SemEval-2019 is to detect contextual emotions in text. For this task, we propose a deep learning approach that combines a Long Short-Term Memory network with an attention mechanism.
* These authors have contributed equally to this work.
The rest of the paper is organized as follows: Section 2 provides a system overview. Section 3 describes our approach in detail. Our experiments are discussed in Section 4. We conclude our work in Section 5.

Text Preprocessing and Word Embedding
We use word embeddings as input to the model. Word embeddings are distributed vector representations of words (Mikolov et al., 2013) that capture their syntactic and semantic information. Good word embeddings lead to better classification performance. After comparison, we find that GloVe (Pennington et al., 2014) performs best, but when we convert words into word vectors, many of them are out of vocabulary (OOV). In view of that, we preprocess the data as follows:
• Emoji used in chat express human emotions well, so we convert them into the corresponding emotion words and add them to the sentence, which not only resolves the OOV problem but also increases the emotion information in the sentence.
• Several emoticons are replaced by the tokens "happy", "sad", "angry".
• All words are lowercased.
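The preprocessing steps above can be sketched as follows. This is only an illustration: the paper does not list its emoji or emoticon mappings, so the two lookup tables and the function name `preprocess` are hypothetical, and the emoji are replaced in place rather than appended.

```python
from typing import List

# Hypothetical lookup tables: the paper does not publish its actual mappings.
EMOJI_TO_WORD = {
    "\U0001F602": "laugh",   # face with tears of joy
    "\U0001F62D": "cry",     # loudly crying face
    "\U0001F621": "angry",   # pouting face
}
EMOTICON_TO_TOKEN = {
    ":)": "happy",
    ":-)": "happy",
    ":(": "sad",
    ":-(": "sad",
    ">:(": "angry",
}


def preprocess(utterance: str) -> List[str]:
    """Replace emoji and emoticons with emotion words, then lowercase and split."""
    for emoji, word in EMOJI_TO_WORD.items():
        utterance = utterance.replace(emoji, f" {word} ")
    # Replace longer emoticons first so that ">:(" is not consumed as ":(".
    for emoticon in sorted(EMOTICON_TO_TOKEN, key=len, reverse=True):
        utterance = utterance.replace(emoticon, f" {EMOTICON_TO_TOKEN[emoticon]} ")
    return utterance.lower().split()


tokens = preprocess("I am SO happy :) \U0001F602")
```

Mapping symbols to ordinary words means every resulting token can be looked up in a pretrained vocabulary such as GloVe's, which is the motivation the text gives for this step.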

Long Short-Term Memory
LSTM is a special form of gated RNN (Hochreiter and Schmidhuber, 1997), designed to deal with sequential data by sharing its internal weights across the sequence. Unlike a plain RNN, an LSTM has three gates, an input gate $i_t$, a forget gate $f_t$ and an output gate $o_t$, together with a memory cell $c_t$. Together they allow the network to store and retrieve information over long periods of time.
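For reference, one common formulation of the LSTM update is given below. The paper does not state which variant it uses, so this is the standard textbook form rather than the authors' exact equations. The parameter names $W^{(\cdot)}$, $U^{(\cdot)}$, $b^{(\cdot)}$ and the candidate cell $\tilde{c}_t$ are notation introduced here; the input at step $t$ is written $e_t$, $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma\left(W^{(i)} e_t + U^{(i)} h_{t-1} + b^{(i)}\right) \\
f_t &= \sigma\left(W^{(f)} e_t + U^{(f)} h_{t-1} + b^{(f)}\right) \\
o_t &= \sigma\left(W^{(o)} e_t + U^{(o)} h_{t-1} + b^{(o)}\right) \\
\tilde{c}_t &= \tanh\left(W^{(c)} e_t + U^{(c)} h_{t-1} + b^{(c)}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate decides how much of the previous cell state $c_{t-1}$ is kept and the input gate how much of the new candidate is written, which is what lets the cell carry information across many time steps.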
In our approach, we use a bidirectional LSTM to better capture the contextual information in sentences. Schuster and Paliwal show that the bidirectional structure performs better in classification experiments. To better handle the relations among the utterances of a dialogue, we use the bc-LSTM architecture (Poria et al., 2017) for dialogue-level classification. This architecture preserves the sequential order of utterances when constructing the dialogue representation.

Attention Mechanism
The attention mechanism was originally applied to image recognition (Itti and Koch, 2001; Mnih et al., 2014), mimicking how the focus of the eye moves across different objects when a person views an image. Similarly, when people read an article, they pay different amounts of attention to each part of the text. The attention mechanism imitates this behavior by giving each feature a different weight: the greater the weight of a feature, the greater its contribution to the current recognition. Neural networks with attention mechanisms have been applied to many NLP tasks, including machine translation (Bahdanau et al., 2014; Luong et al., 2015), text summarization (Rush et al., 2015), text classification (Yang et al., 2016), sentiment classification (Chen et al., 2016) and stance classification (Du et al., 2017).
When learning the representations of text sequences, word embeddings are the most effective intermediate representations for capturing semantic information. We embed the classification label and the words into the same semantic space, and then construct the semantic relatedness between them according to the similarity of their word embeddings. Our model obtains the attention weights of the words through the emotion-oriented attention network, which highlights the emotion words and thus improves the performance of emotion classification.

Model Description
Our model has two steps:
1. Extract the features of each utterance in the dialogue.
2. Construct the dialogue representation from the features of the three utterances for emotion classification.
In the feature extraction step, the embedding of each utterance is fed into a BiLSTM layer to construct the representation of each word; meanwhile, the emotion-oriented attention network produces the attention weight of the corresponding word. We multiply each word representation by its attention weight and feed the result into a second BiLSTM layer. Finally, we obtain the representation of each utterance after the pooling operation (Fig. 2).
In the classification step, the features of the three utterances obtained from the previous step are fed into an LSTM layer as a three-step sequence for emotion classification (Fig. 1).

Embedding Layer
An input sequence $X$ of length $T$ is composed of word tokens: $X = \{x_1, \ldots, x_T\}$. Each token $x_t$ is replaced with its corresponding vocabulary index $V(t)$. The embedding layer maps the token to a vector $e_t \in \mathbb{R}^d$, selected from the embedding matrix $E$ according to the index, where $d$ is the dimensionality of the embedding space.
In order to highlight the emotion words in the sequence, we append the word embedding vector of "emotion" to the embedding of each word in the original text. The emotion-augmented embedding of word $t$ is the concatenation of the embedding vector $e_t$ and the emotion representation $e_z$:
$e^z_t = e_t \,\Vert\, e_z$
where $\Vert$ denotes the concatenation operation, so the dimension of $e^z_t$ is $2d$.
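A minimal NumPy sketch of the embedding lookup and the concatenation with the "emotion" vector is shown below. The toy vocabulary, the random matrix standing in for the pretrained embeddings, and the variable names are assumptions made for illustration only.

```python
import numpy as np

d = 4                                   # toy embedding size (the experiments use 300-d GloVe)
rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; E is a random stand-in for the pretrained embedding matrix.
vocab = {"i": 0, "hate": 1, "mondays": 2, "emotion": 3}
E = rng.normal(size=(len(vocab), d))

tokens = ["i", "hate", "mondays"]
indices = [vocab[w] for w in tokens]    # token -> vocabulary index
e = E[indices]                          # shape (T, d): one row e_t per token
e_z = E[vocab["emotion"]]               # shape (d,): embedding of the word "emotion"

# Append e_z to every e_t to form the emotion-augmented embeddings, shape (T, 2d).
e_aug = np.concatenate([e, np.tile(e_z, (len(tokens), 1))], axis=1)
```

Each row of `e_aug` carries the word's own embedding in its first `d` entries and the same "emotion" vector in its last `d` entries, matching the stated dimension of $2d$.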

BiLSTM Layer
A unidirectional LSTM reads the sequence $X$ in only one direction. We use a bidirectional LSTM to obtain annotations of words by summarizing the contextual information from both directions. A bidirectional LSTM consists of a forward LSTM $\overrightarrow{f}$ that reads the sentence from $x_1$ to $x_T$ and a backward LSTM $\overleftarrow{f}$ that reads the sentence from $x_T$ to $x_1$. We obtain the annotation $h_t$ for each word $x_t$ by concatenating the forward hidden state $\overrightarrow{h}_t$ and the backward one $\overleftarrow{h}_t$:
$h_t = \overrightarrow{h}_t \,\Vert\, \overleftarrow{h}_t$

Emotion-Oriented Attention Network
In this task, the emotion words in the conversation are vital for classification, but the BiLSTM alone does not single them out. To highlight the emotion-related words in the utterance, we design an attention mechanism on top of the BiLSTM that increases the weight of the important words so that they contribute more to the classification decision.
We apply a linear layer to convert the emotion-augmented embedding $e^z_t$ of a word to a scalar value $u_t$, and then obtain a normalized importance weight $\alpha_t$ through a softmax function. This weight is multiplied by the word representation $h_t$ to give a weighted word representation $v_t = \alpha_t h_t$ for each word.
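The attention step can be sketched in NumPy as follows. The parameter names `w` and `b` for the linear layer are hypothetical, the inputs are random stand-ins, and the softmax is assumed to run over the words of one utterance, which the text implies but does not state.

```python
import numpy as np


def emotion_attention(e_aug, h, w, b):
    """Scale BiLSTM states h (T, k) by softmax weights computed from e_aug (T, 2d)."""
    u = e_aug @ w + b                       # (T,): one scalar score u_t per word
    u = u - u.max()                         # shift for numerical stability; softmax is unchanged
    alpha = np.exp(u) / np.exp(u).sum()     # (T,): normalized weights alpha_t
    v = alpha[:, None] * h                  # (T, k): v_t = alpha_t * h_t
    return alpha, v


rng = np.random.default_rng(0)
T, d, k = 5, 4, 6                           # toy sizes, not the paper's settings
e_aug = rng.normal(size=(T, 2 * d))         # stand-in emotion-augmented embeddings
h = rng.normal(size=(T, k))                 # stand-in BiLSTM annotations
w = rng.normal(size=2 * d)                  # hypothetical linear-layer weights
b = 0.0                                     # hypothetical linear-layer bias
alpha, v = emotion_attention(e_aug, h, w, b)
```

Note that the scores depend only on the emotion-augmented embeddings, while the vectors being reweighted are the BiLSTM annotations, so the attention acts as a per-word gate on the contextual representation.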

Pooling Layer
Following the idea of network in network, we apply global max pooling, global average pooling and the last tensor to the output matrix $f$ of the BiLSTM layer. Max pooling extracts the most salient features (Scherer et al., 2010). Average pooling extracts the most common features. The last tensor output $l$ of the matrix $f$ carries the semantic information of the sentence accumulated in the forward and backward directions by the BiLSTM. The utterance representation $z$ is obtained by concatenating the max vector $m$, the average vector $a$ and the last vector $l$:
$z = m \,\Vert\, a \,\Vert\, l$ (6)
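Equation (6) can be illustrated with a few lines of NumPy. The function name `pool_utterance` is hypothetical, and taking the last row of the matrix as $l$ is a simplifying assumption about how the last tensor is read out.

```python
import numpy as np


def pool_utterance(f):
    """Build z = m || a || l from a BiLSTM output matrix f of shape (T, k)."""
    m = f.max(axis=0)       # global max pooling over time steps
    a = f.mean(axis=0)      # global average pooling over time steps
    l = f[-1]               # output at the last time step
    return np.concatenate([m, a, l])   # shape (3k,)


f = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [2.0,  4.0]])
z = pool_utterance(f)
```

For a BiLSTM output of width $k$, the utterance representation therefore has width $3k$ regardless of the utterance length $T$.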

Emotion Classification
We use the three utterance representations obtained by the feature extractor shown in Figure 2 to construct the dialogue representation. The three utterance representations $[z_1, z_2, z_3]$ are fed into the LSTM, and the hidden state $h_3$ at the last time step of the LSTM is taken as the dialogue representation $r$. We pass it to a fully-connected network with a softmax activation function, which applies the nonlinear transformation $\mathrm{softmax}(W_f r + b_f)$ to yield a normalized four-dimensional vector, where $W_f$ and $b_f$ are the weights and bias terms of the fully-connected layer.
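The final softmax layer can be sketched as below, assuming the dialogue representation $r$ has already been produced by the LSTM. The label order, the toy dimensions and the random parameters are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

LABELS = ["happy", "sad", "angry", "others"]   # label order is an assumption


def classify(r, W_f, b_f):
    """Return softmax(W_f r + b_f), a probability vector over the four classes."""
    logits = W_f @ r + b_f
    logits = logits - logits.max()             # shift for numerical stability
    return np.exp(logits) / np.exp(logits).sum()


rng = np.random.default_rng(0)
r = rng.normal(size=8)                         # stand-in dialogue representation
W_f = rng.normal(size=(len(LABELS), 8))        # stand-in weights
b_f = np.zeros(len(LABELS))                    # stand-in bias
p = classify(r, W_f, b_f)
predicted = LABELS[int(p.argmax())]
```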

Data
The datasets are provided by SemEval-2019 Task 3. Table 1 gives an overview of the datasets. All the conversations are collected from Twitter. Each conversation consists of user 1's tweet, user 2's response to the tweet, and user 1's response to user 2. The label is the emotion of the third turn, which human judges mark after considering the context of the three-turn dialogue.

Experiments
The model is implemented using Keras 2.0 (Chollet et al., 2017). We experiment with Stanford's GloVe 300-dimensional word embeddings trained on 840 billion words from Common Crawl. Our model is trained with the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 and a batch size of 64. We use BiLSTMs with hidden state size 256, with a dropout rate of 0.5 on the first BiLSTM layer and 0.3 on the second one to prevent the network from overfitting (Srivastava et al., 2014). In our task, the number of samples per class is not balanced, which biases the model toward the majority class and yields poor accuracy on the minority classes. To counter this, we adjust the parameter 'class_weight' to weight the loss function of each class during training. This tells the model to "pay more attention" to samples from an under-represented class. In this case, we set the parameter 'class_weight' to (Happy: 2, Sad: 1, Angry: 1, Others: 4).
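To make the effect of the class weights concrete, the sketch below reimplements a class-weighted cross-entropy in NumPy. It illustrates the idea only and is not the Keras internals: the label order, the function name `weighted_cross_entropy` and the averaging over the batch are assumptions, while the weight values are the ones given above.

```python
import numpy as np

LABELS = ["happy", "sad", "angry", "others"]   # label order is an assumption
CLASS_WEIGHT = {"happy": 2.0, "sad": 1.0, "angry": 1.0, "others": 4.0}  # values from the text


def weighted_cross_entropy(probs, y_true):
    """Mean per-sample cross-entropy, each term scaled by the weight of its true class."""
    weights = np.array([CLASS_WEIGHT[LABELS[y]] for y in y_true])
    picked = probs[np.arange(len(y_true)), y_true]   # predicted probability of the true class
    return float(np.mean(weights * -np.log(picked)))


probs = np.array([[0.7, 0.1, 0.1, 0.1],    # sample 0, true class "happy"
                  [0.1, 0.1, 0.1, 0.7]])   # sample 1, true class "others"
y_true = np.array([0, 3])
loss = weighted_cross_entropy(probs, y_true)
```

Both samples are predicted with the same confidence, yet the second contributes twice as much to the loss because its class weight is 4 rather than 2.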

Result and Analysis
In order to evaluate the effect of the emotion-oriented attention network and the balanced class weights, we compare our approach with its variants and the baseline.
Variant1: The variant does not adjust the parameter 'class_weight'.
Variant2: The variant replaces the emotion-oriented attention network with the attention mechanism used in (Yang et al., 2016).
Variant3: The variant removes the emotion-oriented attention network from the model.
Table 2 shows that our model outperforms its variants, all of which score above the baseline micro-averaged F1 of 0.5861. Variant3 has the best performance on the 'Sad' class, while our model has the best performance on two classes and on the micro-averaged F1 score.
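For readers unfamiliar with the metric, the sketch below computes a micro-averaged F1 score by pooling true positives, false positives and false negatives across classes. Restricting the pooled counts to the three emotion classes and leaving out 'Others' is an assumption about the task's scoring that this text does not state; the function name and the toy labels are likewise illustrative.

```python
def micro_f1(y_true, y_pred, classes=("happy", "sad", "angry")):
    """Micro-averaged F1 over the given classes, pooling counts before dividing."""
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = ["happy", "sad", "others", "angry", "others"]
y_pred = ["happy", "others", "sad", "angry", "others"]
score = micro_f1(y_true, y_pred)
```

In this toy case the pooled counts are 2 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all equal 2/3.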
To validate that our model is able to capture the emotion-related parts of an utterance, we visualize the attention weights for three dialogues. Figure 3 shows that the emotion words are highlighted in the dialogues, such as 'Haha', 'funny', 'cool', 'like', 'hate', 'felt', 'bad', 'SORRY', but the model also highlights some trivial words, such as 'Give'.

Conclusion
In this paper, we proposed an emotion-oriented neural attention network for SemEval-2019 Task 3. The network uses the attention mechanism to select the emotion-related parts of the utterances. The classification performance of our model is better than that of its variants and the baseline. Meanwhile, the visualization shows that the model captures more decision-relevant information in the dialogue.