Knowledge Aware Emotion Recognition in Textual Conversations via Multi-Task Incremental Transformer

Emotion recognition in textual conversations (ERTC) plays an important role in a wide range of applications, such as opinion mining and recommender systems. ERTC, however, is a challenging task. First, speakers often rely on context and commonsense knowledge to express emotions. Second, most utterances in conversations carry neutral emotion, so the confusion between the few non-neutral utterances and the far more numerous neutral ones restrains emotion recognition performance. In this paper, we propose a novel Knowledge Aware Incremental Transformer with Multi-task Learning (KAITML) to address these challenges. First, we devise a dual-level graph attention mechanism to leverage commonsense knowledge, which augments the semantic information of the utterance. Then we apply the Incremental Transformer to encode multi-turn contextual utterances. Moreover, we are the first to introduce multi-task learning to alleviate the aforementioned confusion and thus further improve emotion recognition performance. Extensive experimental results show that our KAITML model outperforms the state-of-the-art models across five benchmark datasets.


Introduction
Emotion recognition in textual conversations (ERTC), which aims to identify the emotion of each utterance from the transcript of a conversation, has become a popular research topic in recent years. ERTC can be used in a variety of scenarios, such as opinion mining of comments in social media (Chatterjee et al., 2019) and emotion analysis of customers in artificial customer service. It can also be applied to chatbots to analyze the user's emotional state in real time and generate emotion-aware responses (Poria et al., 2019b; Zhou et al., 2018a).
Table 1: Confusion matrix of emotion recognition results on … (Chatterjee et al., 2019) from the Knowledge-Enriched Transformer (Zhong et al., 2019b), which is the current state-of-the-art model. There are barely any misclassifications among the non-neutral categories (Angry, Sad, and Happy). Most of the errors, shown in bold font, correspond to confusion between the few non-neutral categories and the much larger neutral category (Others).
However, there are several challenges when analyzing emotion in natural conversations. First, unlike vanilla emotion recognition of sentences (Wang and Manning, 2012; Seyeditabari et al., 2018), ERTC requires comprehensively considering the context in the conversation. Second, knowledge plays an important role in ERTC, as speakers often express emotions relying on the context and commonsense knowledge (Zhong et al., 2019b). Moreover, most utterances in conversations carry neutral emotion, and the heavily imbalanced class distribution can easily lead to confusion between the few non-neutral utterances (e.g., happy, sad, and angry) and the far more numerous neutral ones (e.g., neutral or others), which restrains emotion recognition performance. Table 1 shows a confusion matrix of emotion recognition results from the current state-of-the-art model, in which there is serious confusion between the few non-neutral categories and the much larger neutral category.
Some prior studies have modeled contextual information for emotion recognition in conversations (Poria et al., 2017). These methods first adopt convolutional neural networks (CNN) to extract utterance-level features and then use context-level recurrent neural networks (RNN) to model the contextual utterances in the conversation. However, RNNs and CNNs have difficulty modeling long-distance dependencies (Vaswani et al., 2017), which may be useful in ERTC. Zhong et al. (2019b) use a context-aware affective graph attention mechanism to incorporate external knowledge for ERTC. However, they do not consider the various relations in the external knowledge base, which may cause a loss of semantic information. In addition, to the best of our knowledge, no existing work addresses the confusion between the few non-neutral utterances and the far more numerous neutral ones.
*The first two authors contributed equally.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
In this paper, we propose a novel Knowledge Aware Incremental Transformer with Multi-task Learning (KAITML) to address the aforementioned challenges. First, we enhance the background and semantic information of the given utterance with relevant knowledge graphs retrieved from a large-scale commonsense knowledge base. Specifically, we propose a dual-level graph attention mechanism to encode these knowledge graphs, which consists of a node-level attention that learns the importance of different neighboring nodes and a relation-level attention that learns the importance of different relations to the current node. Then we apply the Incremental Transformer (Li et al., 2019) to incrementally encode multi-turn contextual utterances, capturing the intra-utterance and inter-utterance correlations with the self-attention (Cheng et al., 2016) and context-attention modules, respectively. Moreover, we introduce multi-task learning to alleviate the confusion between the few non-neutral utterances and the far more numerous neutral ones, as shown in Table 1. Specifically, we first focus on the binary classification, "non-neutral" versus "neutral", and then classify the "non-neutral" utterances into fine-grained emotion categories. These two auxiliary tasks are jointly trained with the original emotion recognition task.
In summary, this paper makes the following contributions:
• We devise a dual-level graph attention mechanism to support better understanding of utterances for ERTC by considering the various relations in the external knowledge base. Furthermore, we apply the Incremental Transformer to model multi-turn contextual utterances and recognize emotions.
• We are the first to introduce multi-task learning with two auxiliary tasks to alleviate the aforementioned confusion and thus further improve emotion recognition performance.
• Experimental results show that our proposed KAITML model outperforms the state-of-the-art models across five benchmark datasets in F1 score. In addition, context, commonsense knowledge and multi-task learning are all beneficial to the emotion recognition performance.

Related Work
Emotion recognition in conversations has attracted much attention from researchers in the past few years due to the proliferation of publicly available conversational datasets (Poria et al., 2019a; Chatterjee et al., 2019; Li et al., 2017; Zhou et al., 2018a) and its widespread applications in opinion mining, recommender systems, emotion-aware dialogue generation, and so on (Poria et al., 2019b). Several deep learning-based models have been proposed for emotion recognition in conversations, in both text-only and multimodal settings (containing textual, acoustic, and visual information). Poria et al. (2017) propose a long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) based model to capture contextual correlations among the utterances of a user-generated video for multimodal sentiment classification. Hazarika et al. (2018b) propose the conversational memory network (CMN), which exploits distinct memory units for each speaker to model emotional dynamics and detect emotion in a dyadic conversation. Later, Hazarika et al. (2018a) improve upon this approach with the interactive conversational memory network (ICON), which uses an interactive memory unit to hierarchically model the self- and inter-speaker emotional influences for emotion recognition in conversational videos. The DialogueRNN model exploits three gated recurrent units (GRU) (Cho et al., 2014) to capture speaker information, context, and emotional information of the preceding utterances, respectively, and achieves state-of-the-art performance on several multimodal conversation datasets. In contrast to these gated RNN and CNN based models, we apply the Incremental Transformer (Li et al., 2019) to incrementally encode multi-turn contextual utterances; the shorter path of information flow in its self-attention (Cheng et al., 2016) and context-attention modules allows our model to exploit contextual information more efficiently.
Recently, a considerable literature has grown up around the theme of incorporating external knowledge in generative conversation systems, including question answering systems (Hao et al., 2017; Mihaylov and Frank, 2018), open-domain dialogue systems (Zhou et al., 2018b; Zhong et al., 2019a), and task-oriented dialogue systems (He et al., 2019; Madotto et al., 2018; Chen et al., 2019). Zhong et al. (2019b) propose a Knowledge-Enriched Transformer (KET) that achieves state-of-the-art performance on multiple textual conversation datasets, where contextual utterances are encoded using hierarchical self-attention and commonsense knowledge is incorporated using a context-aware affective graph attention mechanism. However, they ignore the various relations in the external knowledge base, which may cause a loss of semantic information. By contrast, our dual-level graph attention mechanism can take advantage of these relations to better augment the semantic information of the utterances.

Task Definition and Overview
Let $\{X_j^{(i)}, Y_j^{(i)}\}$, $i = 1, \ldots, N$, $j = 1, \ldots, N_i$, be a collection of {utterance, label} pairs in a given conversation dataset, where $N$ denotes the number of conversations and $N_i$ denotes the number of utterances in the $i$th conversation. The objective of the task is to maximize the following function:

$$\prod_{i=1}^{N} \prod_{j=1}^{N_i} p\left(Y_j^{(i)} \mid X_j^{(i)}, X_{j-1}^{(i)}, \ldots, X_{j-M}^{(i)}; \theta\right)$$

where $Y_j^{(i)}$ denotes the emotion label of the target utterance $X_j^{(i)}$, $\theta$ denotes the model parameters we need to optimize, and $X_{j-1}^{(i)}, \ldots, X_{j-M}^{(i)}$ denote the contextual utterances. Here, we limit the number of contextual utterances to $M$. We follow Su et al. (2018) and Zhong et al. (2019b) and directly discard earlier contextual utterances. Similar to Zhong et al. (2019b) and Poria et al. (2017), we clip and pad each utterance $X_j^{(i)}$ to a fixed number $K$ of tokens. The overview of our KAITML model and the detailed architecture of its components are presented in Figure 1.
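The input preparation described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the padding token `<pad>`, the function names, and the toy conversation are all assumptions introduced here, and indices are 0-based.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token, not specified in the paper


def clip_and_pad(tokens: List[str], K: int) -> List[str]:
    """Clip an utterance to K tokens, or right-pad it with PAD up to K tokens."""
    return tokens[:K] + [PAD] * max(0, K - len(tokens))


def build_input(conversation: List[List[str]], j: int, M: int, K: int) -> List[List[str]]:
    """Return the utterances X_{j-M}, ..., X_j, each with exactly K tokens.

    Utterances earlier than j - M are simply discarded.
    """
    start = max(0, j - M)
    return [clip_and_pad(u, K) for u in conversation[start : j + 1]]


# Toy conversation (made up for illustration).
conversation = [
    ["i", "lost", "my", "keys"],
    ["oh", "no"],
    ["found", "them", "in", "my", "bag", "just", "now"],
    ["great"],
]
window = build_input(conversation, j=3, M=2, K=4)
for utterance in window:
    print(utterance)
```

With `M=2`, the first utterance of the toy conversation falls outside the window and is dropped; the remaining three are clipped or padded to four tokens each.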

Knowledge Interpreter
Commonsense knowledge is fundamental to understanding conversations (Zhou et al., 2018b). We use ConceptNet (Speer et al., 2017) as an external commonsense knowledge base in our model. ConceptNet is a large-scale multilingual semantic graph that describes general human knowledge in natural language, where concepts are nodes in the graph and relations are edges. Each <concept1, relation, concept2> triple is termed an assertion. At present, ConceptNet comprises 5.9M assertions, 3.1M concepts and 38 relations for English.
The knowledge interpreter is designed to facilitate the understanding of an utterance. It takes as input an utterance $X_n^{(i)} = x_1 x_2 \ldots x_K$, $n = j - M, \ldots, j$, and retrieves a set of relevant knowledge graphs $G = \{g_1, g_2, \ldots, g_K\}$, where each token in the utterance corresponds to one graph, as shown in Figure 1 (c). In general, the knowledge interpreter uses each non-stopword token $x_k$, $k = 1, \ldots, K$, in an utterance $X_n^{(i)}$ as the key node to retrieve a graph $g_k$ comprising its immediate neighbors from ConceptNet, as shown in the red box in Figure 1 (c). For each $g_k$, we remove nodes that are stopwords or not in our vocabulary. Each retrieved graph $g_k$ consists of a key node (the red dots) and its neighboring nodes (different colors denote different relations), where each node $c$ is converted into a vector representation $\mathbf{c} \in \mathbb{R}^d$ and $d$ denotes the size of the vector. Then, the knowledge interpreter computes the graph vector $\mathbf{g}_k \in \mathbb{R}^d$ of the retrieved graph $g_k$ using the dual-level graph attention mechanism.
We use a token embedding layer to convert each token $x_k$ in $X_n^{(i)}$ into a vector representation $\mathbf{x}_k \in \mathbb{R}^d$. To encode positional information, the position encoding (Vaswani et al., 2017) is added as follows:

$$\mathbf{x}_k = \mathrm{Embed}(x_k) + \mathrm{Pos}(x_k)$$

Finally, the knowledge-enriched token embedding $\mathbf{e}_k$ is obtained via a linear transformation:

$$\mathbf{e}_k = \mathbf{W}[\mathbf{x}_k; \mathbf{g}_k]$$

where $[;]$ denotes concatenation and $\mathbf{W} \in \mathbb{R}^{d \times 2d}$ denotes a model parameter. All $K$ tokens in $X_n^{(i)}$ form a knowledge-enriched utterance embedding $E_n^{(i)} \in \mathbb{R}^{K \times d}$ that is then fed to the Incremental Transformer, as shown in Figure 1.
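The retrieval step can be illustrated with a small sketch. It assumes a toy in-memory list of triples standing in for ConceptNet (the triples, stopword list, and vocabulary below are made up), and it only follows edges whose head is the key token, which is a simplification of "immediate neighbors". The function name `retrieve_graph` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy stand-in for ConceptNet assertions; these triples are invented for illustration.
ASSERTIONS: List[Tuple[str, str, str]] = [
    ("exam", "Causes", "stress"),
    ("exam", "RelatedTo", "test"),
    ("exam", "RelatedTo", "the"),          # neighbor is a stopword: removed
    ("exam", "RelatedTo", "invigilator"),  # neighbor is out of vocabulary: removed
    ("fail", "Causes", "sadness"),
]
STOPWORDS = {"i", "the", "my"}
VOCAB = {"i", "fail", "my", "exam", "stress", "test", "sadness"}


def retrieve_graph(token: str) -> Dict[str, List[str]]:
    """Return g_k for one key token as {relation: [neighboring concepts]}."""
    if token in STOPWORDS:
        return {}  # stopword tokens are not used as key nodes
    graph: Dict[str, List[str]] = defaultdict(list)
    for head, relation, tail in ASSERTIONS:
        if head == token and tail not in STOPWORDS and tail in VOCAB:
            graph[relation].append(tail)
    return dict(graph)


utterance = ["i", "fail", "my", "exam"]
graphs = [retrieve_graph(token) for token in utterance]
for token, graph in zip(utterance, graphs):
    print(token, graph)
```

Grouping neighbors by relation keeps the structure that the dual-level graph attention mechanism needs: one set of node vectors per relation.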

Dual-level Graph Attention Mechanism
Inspired by Velickovic et al. (2018), the dual-level graph attention mechanism is designed to generate a representation for a retrieved knowledge graph, which is then used to augment the semantics of each token in an utterance. Compared to Velickovic et al. (2018), our graph attention considers not only all nodes in a graph but also the relations between nodes. The dual-level graph attention mechanism, comprising node-level and relation-level attention, learns the importance of different neighboring nodes as well as the importance of different relations to a key node.
Node-level Attention. The node-level attention takes as input the node vectors $F(g_k) = \{\mathbf{c}_s^r\}$, $r = 1, \ldots, R_k$, $s = 1, \ldots, N_r$, in the retrieved knowledge graph $g_k$, where $R_k$ denotes the number of relations in $g_k$ and $N_r$ denotes the number of nodes in the $r$th relation, to produce relation vectors $\mathbf{t}_r$ as follows: [equation not recovered]
Relation-level Attention. The relation-level attention takes as input the relation vectors $\mathbf{t}_r$, $r = 1, \ldots, R_k$, to produce a graph vector $\mathbf{g}_k$ as follows: [equation not recovered]
If $|g_k| = 0$, where $|g_k|$ denotes the number of nodes in $g_k$, we set $\mathbf{g}_k$ to the average of all node vectors (Zhong et al., 2019b).
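The two attention levels can be sketched as follows. The scoring function is an assumption: the sketch scores each node and each relation vector by its dot product with the key node vector and normalizes with a softmax, which may differ from the scoring used in the paper. The function names and the `fallback` argument (standing in for the average of all node vectors) are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(vectors[0]))]


def dual_level_attention(key: Vector, neighbors: Dict[str, List[Vector]], fallback: Vector) -> Vector:
    """Compute a graph vector from node vectors grouped by relation.

    Assumption: attention scores are dot products with the key node vector.
    """
    if not neighbors:
        return list(fallback)  # empty graph: use the supplied average node vector
    # Node level: one relation vector t_r per relation.
    relation_vectors = []
    for nodes in neighbors.values():
        alpha = softmax([dot(key, c) for c in nodes])
        relation_vectors.append(weighted_sum(alpha, nodes))
    # Relation level: combine the relation vectors into the graph vector g_k.
    beta = softmax([dot(key, t) for t in relation_vectors])
    return weighted_sum(beta, relation_vectors)


key = [1.0, 0.0]
neighbors = {"Causes": [[1.0, 0.0], [0.0, 1.0]], "RelatedTo": [[0.0, 1.0]]}
graph_vector = dual_level_attention(key, neighbors, fallback=[0.5, 0.5])
print(graph_vector)
```

Because both levels use convex combinations, the graph vector always lies within the span of the neighboring node vectors; a relation with many nodes does not dominate simply by size, since it is first summarized into a single relation vector.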

Incremental Transformer
We apply the Incremental Transformer (Li et al., 2019) to encode multi-turn contextual utterances, as shown in Figure 1. It contains a Self-Attentive Encoder and an Incremental Encoder.

Self-Attentive Encoder
The Self-Attentive Encoder is a transformer encoder as described in (Vaswani et al., 2017), which encodes the first utterance.
As shown in Figure 1 (a), the Self-Attentive Encoder contains a stack of $L$ identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention (Vaswani et al., 2017). $\mathrm{MultiHead}(Q, K, V)$ is a multi-head attention function that takes a query matrix $Q$, a key matrix $K$, and a value matrix $V$ as input. In the current case $Q = K = V$, hence the name self-attention. The second sub-layer is a simple, position-wise fully connected feed-forward network (FFN) (Vaswani et al., 2017). This FFN consists of two linear transformations with a ReLU activation in between, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, where $W_1$, $b_1$, $W_2$, $b_2$ denote model parameters.
Formally, the representation of the first knowledge-enriched utterance embedding $E_{j-M}^{(i)} \in \mathbb{R}^{K \times d}$ is computed as follows:

$$\hat{C}_{j-M}^{(i)[l]} = \mathrm{MultiHead}\left(C_{j-M}^{(i)[l-1]}, C_{j-M}^{(i)[l-1]}, C_{j-M}^{(i)[l-1]}\right)$$

$$C_{j-M}^{(i)[l]} = \mathrm{FFN}\left(\hat{C}_{j-M}^{(i)[l]}\right)$$

where $l = 1, \ldots, L$, $C_{j-M}^{(i)[0]} = E_{j-M}^{(i)}$, $\hat{C}_{j-M}^{(i)[l]} \in \mathbb{R}^{K \times d}$ is the hidden state computed by multi-head attention at the $l$th layer, and $C_{j-M}^{(i)[l]} \in \mathbb{R}^{K \times d}$ denotes the representation of $E_{j-M}^{(i)}$ after $l$ layers. The residual connection and layer normalization are omitted in the presentation for simplicity. More details can be found in Vaswani et al. (2017).

Incremental Encoder
The Incremental Encoder is a variant of the transformer encoder with an additional context-attention module, which encodes multi-turn utterances using an incremental encoding scheme. It takes the output of the previous utterances and the current utterance as input, and uses an attention mechanism to incrementally model relevant context. As shown in Figure 1 (b), the Incremental Encoder contains a stack of $L$ identical layers, each with three sub-layers. For each knowledge-enriched utterance embedding $E_n^{(i)} \in \mathbb{R}^{K \times d}$, $n = j - M + 1, \ldots, j$, its representation $C_n^{(i)[L]} \in \mathbb{R}^{K \times d}$ is computed as follows.
The first sub-layer is a multi-head self-attention:

$$\hat{C}_n^{(i)[l]} = \mathrm{MultiHead}\left(C_n^{(i)[l-1]}, C_n^{(i)[l-1]}, C_n^{(i)[l-1]}\right)$$

where $l = 1, \ldots, L$, $C_n^{(i)[l-1]} \in \mathbb{R}^{K \times d}$ is the output of the previous layer, and $C_n^{(i)[0]} = E_n^{(i)}$.
The second sub-layer is a multi-head context-attention:

$$\tilde{C}_n^{(i)[l]} = \mathrm{MultiHead}\left(\hat{C}_n^{(i)[l]}, C_{n-1}^{(i)[L]}, C_{n-1}^{(i)[L]}\right)$$

where $C_{n-1}^{(i)[L]} \in \mathbb{R}^{K \times d}$ is the representation of the previous utterances after $L$ layers.
The third sub-layer is a position-wise fully connected feed-forward network:

$$C_n^{(i)[l]} = \mathrm{FFN}\left(\tilde{C}_n^{(i)[l]}\right)$$

Finally, $C_j^{(i)[L]} \in \mathbb{R}^{K \times d}$ is the representation of the relevant context (including the target utterance), as shown in Figure 1, which is then fed into a max-pooling layer to learn discriminative features among positions and derive the final representation $O_j^{(i)}$.
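The incremental encoding scheme can be sketched in a few lines to show the data flow. This is a heavily simplified illustration, not the model itself: it uses single-head scaled dot-product attention without learned projections, keeps the residual connections but omits the feed-forward sub-layer and layer normalization, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # K rows (token positions), each a d-dimensional vector


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention (no learned projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))])
    return out


def residual(x: Matrix, y: Matrix) -> Matrix:
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def encode_conversation(utterances: List[Matrix], num_layers: int) -> List[float]:
    """Encode utterances in order and max-pool the last representation over positions."""
    # Self-Attentive Encoder for the first utterance.
    context = utterances[0]
    for _ in range(num_layers):
        context = residual(context, attention(context, context, context))
    # Incremental Encoder for every later utterance.
    for embedding in utterances[1:]:
        current = embedding
        for _ in range(num_layers):
            hidden = residual(current, attention(current, current, current))  # self-attention
            current = residual(hidden, attention(hidden, context, context))   # context-attention
        context = current
    # Max-pooling over the K positions yields one d-dimensional vector.
    return [max(row[i] for row in context) for i in range(len(context[0]))]


first = [[1.0, 0.0], [0.0, 1.0]]
second = [[0.5, 0.5], [0.2, 0.8]]
final_representation = encode_conversation([first, second], num_layers=1)
print(final_representation)
```

The key point is that each utterance attends only to the already-encoded representation of what came before it, so context is accumulated one turn at a time rather than by concatenating all utterances into a single sequence.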

Multi-task Learning
We introduce multi-task learning, which comprises three different tasks as shown in Figure 1, to alleviate the confusion between the few non-neutral categories (e.g., happy, sad, and angry) and the much larger neutral category (e.g., neutral or others) and thus further improve emotion recognition performance. Task 1 is the original emotion recognition task, which predicts the emotion label, including the non-neutral categories and the neutral category, of the target utterance $X_j^{(i)}$. Its loss on one sample $X_j^{(i)}$ is computed as follows:

$$\hat{Y}1_j^{(i)} = \mathrm{softmax}\left(O_j^{(i)} W_1 + b_1\right), \qquad \mathcal{L}_1 = -\sum_{c=1}^{q} Y1_{j,c}^{(i)} \log \hat{Y}1_{j,c}^{(i)}$$

where $W_1 \in \mathbb{R}^{d \times q}$ and $b_1 \in \mathbb{R}^q$ denote model parameters, $q$ denotes the number of categories, $\hat{Y}1_j^{(i)} \in \mathbb{R}^q$ denotes the predicted probability distribution of task 1, and $Y1_j^{(i)} \in \mathbb{R}^q$ (a one-hot vector in which the position of the corresponding category is 1 and the remaining positions are 0) denotes the ground-truth probability distribution of task 1.
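Task 2 is the binary classification, "non-neutral" versus "neutral", which determines whether the target utterance $X_j^{(i)}$ is "non-neutral" or "neutral". Its loss on one sample $X_j^{(i)}$ is computed as follows:

$$\hat{Y}2_j^{(i)} = \mathrm{softmax}\left(O_j^{(i)} W_2 + b_2\right), \qquad \mathcal{L}_2 = -\sum_{c=1}^{2} Y2_{j,c}^{(i)} \log \hat{Y}2_{j,c}^{(i)}$$

where $W_2 \in \mathbb{R}^{d \times 2}$ and $b_2 \in \mathbb{R}^2$ denote model parameters, $\hat{Y}2_j^{(i)} \in \mathbb{R}^2$ denotes the predicted probability distribution of task 2, and $Y2_j^{(i)} \in \mathbb{R}^2$ (a one-hot vector) denotes the ground-truth probability distribution of task 2.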
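Task 3 classifies the "non-neutral" utterances into fine-grained emotion categories. Its loss on one sample $X_j^{(i)}$ is computed as follows: [equation not recovered] where $W_3 \in \mathbb{R}^{d \times (q-1)}$ and $b_3 \in \mathbb{R}^{q-1}$ denote model parameters, $q - 1$ denotes the number of non-neutral categories, $\hat{Y}3_j^{(i)} \in \mathbb{R}^{q-1}$ denotes the predicted output of task 3, and $Y3_j^{(i)} \in \mathbb{R}^{q-1}$ denotes the ground truth of task 3 (if the target utterance is neutral, $Y3_j^{(i)}$ is a vector of all zeros; otherwise it is a one-hot vector).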
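The relationship between the three tasks can be made concrete with a short sketch that derives all three targets from one emotion label. Several choices here are assumptions not stated in the text: the ordering of the two classes in task 2, the use of cross-entropy for every task, and the unweighted sum used to combine the losses. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def task_targets(label: int, q: int, neutral: int) -> Tuple[List[int], List[int], List[int]]:
    """Derive the targets of tasks 1-3 from a single emotion label in range(q)."""
    y1 = [int(c == label) for c in range(q)]                     # task 1: all q categories
    y2 = [int(label == neutral), int(label != neutral)]          # task 2: [neutral, non-neutral] (assumed order)
    y3 = [int(c == label) for c in range(q) if c != neutral]     # task 3: all zeros for a neutral label
    return y1, y2, y3


def cross_entropy(target: List[int], probs: List[float]) -> float:
    """Cross-entropy; an all-zero target contributes no loss."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t)


def joint_loss(targets, predictions) -> float:
    """Assumed combination: an unweighted sum of the three task losses."""
    return sum(cross_entropy(t, p) for t, p in zip(targets, predictions))


q, neutral = 4, 0  # toy setup: category 0 is the neutral class
uniform = ([0.25] * 4, [0.5] * 2, [1 / 3] * 3)  # placeholder predictions

happy_targets = task_targets(label=2, q=q, neutral=neutral)
neutral_targets = task_targets(label=neutral, q=q, neutral=neutral)
print(happy_targets, joint_loss(happy_targets, uniform))
print(neutral_targets, joint_loss(neutral_targets, uniform))
```

A neutral utterance yields an all-zero task 3 target, so under this sketch it only influences the model through tasks 1 and 2, while non-neutral utterances receive a training signal from all three.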

Experimental Setting

Datasets and Evaluations
We evaluate our model on five benchmark datasets. Some of them, namely MELD, IEMOCAP, and EmoryNLP, are multimodal conversation datasets containing textual, acoustic, and visual information. In this paper, we recognize emotion in conversations based only on textual information. The statistics and evaluation metrics of these datasets are shown in Table 2.
EmoryNLP (Zahiri and Choi, 2018): Scripts collected from the Friends TV series, as in MELD. Its emotion labels include sad, mad, scared, powerful, peaceful, joyful and neutral, which differ from those of MELD.
IEMOCAP (Busso et al., 2008): Two-way emotional conversations. Its emotion labels include happiness, sadness, anger, frustrated, excited and neutral.
The evaluation metric of each dataset is the same as the one used in Zhong et al. (2019b).

Baselines
We compare our proposed model with the following baselines:
cLSTM: It first adopts a bidirectional LSTM to extract utterance-level features and then uses a context-level unidirectional LSTM to model the contextual utterances.
CNN (Kim, 2014): A single-layer CNN trained at the utterance level without context.
CNN+cLSTM (Poria et al., 2017): It first adopts a CNN to extract utterance-level features and then applies a context-level unidirectional LSTM to learn context representations.
BERT BASE (Devlin et al., 2019): The base version of BERT. It takes as input each utterance with its context as a single text.
DialogueRNN: It exploits three gated recurrent units (GRU) to capture speaker information, context, and emotional information of the preceding utterances, respectively.
KET (Zhong et al., 2019b): The state-of-the-art model for ERTC, where contextual utterances are encoded using hierarchical self-attention and commonsense knowledge is incorporated using a context-aware affective graph attention mechanism.

Hyper-parameter Settings
We use the Adam optimizer (Kingma and Ba, 2015) to train our model with a learning rate of 0.0001 and a batch size of 64 on all datasets. We set the class weights in the cross-entropy loss to the ratio of the class distribution in the validation set to the class distribution in the training set for each dataset (Zhong et al., 2019b), which tackles the mismatch in class distribution between the validation set and the training set. The initial token and node embeddings are pre-trained with GloVe (Pennington et al., 2014). The detailed hyper-parameter settings for KAITML are presented in Table 3.
Table 4: Performance comparisons on the five test sets (%). Bold font denotes the best performance.
Table 4 shows the performance of different models on the five benchmark datasets. Our model outperforms all the baselines on all the datasets, which shows the effectiveness of our proposed model for ERTC. A paired t-test shows significant differences between our proposed model and all baselines (p ≤ 0.05). Note that all the baseline results are directly cited from Zhong et al. (2019b). The state-of-the-art KET model performs best overall among the baselines, and our KAITML model surpasses it by around 1.5% on all the datasets tested. To explain this gap in performance, it is important to understand the nature of these models. KAITML and KET both incorporate external commonsense knowledge and model contextual information with a transformer. The other baselines do neither, which is a key limitation: external commonsense knowledge enriches the background and semantic information of utterances, and the self-attention module in the transformer allows a model to exploit contextual information more efficiently than the CNNs and gated RNNs used in the other baselines. As for the difference in performance between KAITML and KET, we believe that it is due to the differences in the graph attention mechanism and to multi-task learning. KET does not consider the various relations in the external knowledge base, which may cause a loss of semantic information. By contrast, KAITML uses a dual-level graph attention mechanism that exploits these relations and thus supports better understanding of utterances. In addition, the multi-task learning in KAITML alleviates the confusion between the few non-neutral utterances and the far more numerous neutral ones and thus further improves emotion recognition performance.
Table 5: Ablation results on the five validation sets (%). Context, commonsense knowledge and multi-task learning are all beneficial to emotion recognition performance.
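The class-weighting rule described in the hyper-parameter settings (validation-set class proportion divided by training-set class proportion) can be sketched as follows. The label lists are made-up toy data and the function name `class_weights` is hypothetical; classes that never occur in the training labels are skipped to avoid division by zero, which is a choice made for this sketch.

```python
from collections import Counter
from typing import Dict, List


def class_weights(train_labels: List[str], valid_labels: List[str]) -> Dict[str, float]:
    """Weight of each class = its share in the validation set / its share in the training set."""
    train_counts = Counter(train_labels)
    valid_counts = Counter(valid_labels)
    n_train, n_valid = len(train_labels), len(valid_labels)
    return {
        label: (valid_counts[label] / n_valid) / (train_counts[label] / n_train)
        for label in train_counts
    }


# Toy label lists (invented): the validation set is more skewed towards "others".
train_labels = ["others"] * 6 + ["happy"] * 2 + ["sad"] * 2
valid_labels = ["others"] * 8 + ["happy"] * 1 + ["sad"] * 1

weights = class_weights(train_labels, valid_labels)
print(weights)
```

A class that is rarer in the validation set than in the training set receives a weight below 1, so the loss is rebalanced towards the distribution the model will be evaluated on.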

Ablation
To comprehensively study the impact of context, knowledge and multi-task learning, we remove them one at a time and investigate their contribution on all datasets. As Table 5 shows, context, knowledge and multi-task learning are all essential to the strong performance of our model on all datasets, and their combination achieves the best performance. Removing knowledge has a greater impact on the small datasets (i.e., EmoryNLP and IEMOCAP) than on the big datasets (i.e., EC, DailyDialog and MELD), which is expected because external commonsense knowledge helps the model understand utterances, especially when data is insufficient. Moreover, compared to the other datasets, performance on EC drops substantially, by around 6%, when context is removed. The reason may be that EC contains more short utterances, like "ok" and "yes", whose emotion depends on the context they appear in. With multi-task learning, we observe that the confusion between the non-neutral categories and the neutral category is alleviated in the confusion matrix, and performance improves by about 1.2% on average across all datasets.

Error Analysis
By analyzing our predicted emotion labels, we found that model errors are mainly caused by the following factors. First, misclassifications often occur among similar emotion classes in the confusion matrix, like 'happy' and 'excited', or 'angry' and 'frustrated'. Second, performance is poor on emotion classes with little available data, like 'fear' and 'disgust' in the DailyDialog dataset. Third, some of the datasets that we use in our experiments, namely MELD, IEMOCAP, and EmoryNLP, are multimodal. We found that the acoustic and visual modalities provide key information for recognizing emotions in a few utterances (e.g., 'okay', 'yes'), while our proposed KAITML model considers only the textual modality.

Conclusion and Future Work
In this paper, we propose a novel Knowledge Aware Incremental Transformer with Multi-task Learning (KAITML) for emotion recognition in textual conversations, which effectively incorporates contextual information and commonsense knowledge and alleviates the confusion between the few non-neutral utterances and the far more numerous neutral ones. Extensive experimental results show that our KAITML model outperforms state-of-the-art models across five benchmark datasets. Future work will focus on the following directions: 1) how to differentiate similar emotions, 2) how to recognize emotion using limited data, and 3) how to incorporate multimodal information for emotion recognition in conversations.