Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations

Messages in human conversations inherently convey emotions. The task of detecting emotions in textual conversations leads to a wide range of applications such as opinion mining in social networks. However, enabling machines to analyze emotions in conversations is challenging, partly because humans often rely on the context and commonsense knowledge to express emotions. In this paper, we address these challenges by proposing a Knowledge-Enriched Transformer (KET), where contextual utterances are interpreted using hierarchical self-attention and external commonsense knowledge is dynamically leveraged using a context-aware affective graph attention mechanism. Experiments on multiple textual conversation datasets demonstrate that both context and commonsense knowledge are consistently beneficial to the emotion detection performance. In addition, the experimental results show that our KET model outperforms the state-of-the-art models on most of the tested datasets in F1 score.


Introduction
Emotions are "generated states in humans that reflect evaluative judgments of the environment, the self and other social agents" (Hudlicka, 2011). Messages in human communications inherently convey emotions. With the prevalence of social media platforms such as Facebook Messenger, as well as conversational agents such as Amazon Alexa, there is an emerging need for machines to understand human emotions in natural conversations. This work addresses the task of detecting emotions (e.g., happy, sad, angry) in textual conversations, where the emotion of an utterance is detected in the conversational context. Being able to effectively detect emotions in conversations leads to a wide range of applications, from opinion mining in social media platforms (Chatterjee et al., 2019) to building emotion-aware conversational agents (Zhou et al., 2018a).
However, enabling machines to analyze emotions in human conversations is challenging, partly because humans often rely on context and commonsense knowledge to express emotions, both of which are difficult for machines to capture. Figure 1 shows an example conversation demonstrating the importance of context and commonsense knowledge in understanding conversations and detecting implicit emotions.
There are several recent studies that model contextual information to detect emotions in conversations. Poria et al. (2017) and Majumder et al. (2019) leveraged recurrent neural networks (RNN) to model the contextual utterances in sequence, where each utterance is represented by a feature vector extracted by convolutional neural networks (CNN) at an earlier stage. Similarly, Hazarika et al. (2018a,b) proposed to use extracted CNN features in memory networks to model contextual utterances. However, these methods require separate feature extraction and tuning, which may not be ideal for real-time applications. In addition, to the best of our knowledge, no attempts have been made in the literature to incorporate commonsense knowledge from external knowledge bases to detect emotions in textual conversations. Commonsense knowledge is fundamental to understanding conversations and generating appropriate responses (Zhou et al., 2018b).
To this end, we propose a Knowledge-Enriched Transformer (KET) to effectively incorporate contextual information and external knowledge bases to address the aforementioned challenges. The Transformer (Vaswani et al., 2017) has been shown to be a powerful representation learning model in many NLP tasks such as machine translation (Vaswani et al., 2017) and language understanding (Devlin et al., 2018). The self-attention (Cheng et al., 2016) and cross-attention (Bahdanau et al., 2014) modules in the Transformer capture the intra-sentence and inter-sentence correlations, respectively. The shorter path of information flow in these two modules compared to gated RNNs and CNNs allows KET to model contextual information more efficiently. In addition, we propose a hierarchical self-attention mechanism allowing KET to model the hierarchical structure of conversations. Our model separates context and response into the encoder and decoder, respectively, which is different from other Transformer-based models, e.g., BERT (Devlin et al., 2018), which directly concatenate context and response, and then train language models using only the encoder part.
Moreover, to exploit commonsense knowledge, we leverage external knowledge bases to facilitate the understanding of each word in the utterances by referring to related knowledge entities. The referring process is dynamic and balances between relatedness and affectiveness of the retrieved knowledge entities using a context-aware affective graph attention mechanism.
In summary, our contributions are as follows:
• For the first time, we apply the Transformer to analyze conversations and detect emotions. Our hierarchical self-attention and cross-attention modules allow our model to exploit contextual information more efficiently than existing gated RNNs and CNNs.
• We derive dynamic, context-aware, and emotion-related commonsense knowledge from external knowledge bases and emotion lexicons to facilitate the emotion detection in conversations.
• We conduct extensive experiments demonstrating that both contextual information and commonsense knowledge are beneficial to the emotion detection performance. In addition, our proposed KET model outperforms the state-of-the-art models on most of the tested datasets across different domains.

Related Work
Emotion Detection in Conversations: Early studies on emotion detection in conversations focus on call center dialogs using lexicon-based methods and audio features (Lee and Narayanan, 2005; Devillers and Vidrascu, 2006). Devillers et al. (2002) annotated and detected emotions in call center dialogs using unigram topic modelling. In recent years, there has been an emerging research trend on emotion detection in conversational videos and multi-turn Tweets using deep learning methods (Hazarika et al., 2018b,a; Zahiri and Choi, 2018; Chatterjee et al., 2019; Zhong and Miao, 2019; Poria et al., 2019). Poria et al. (2017) proposed a long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) based model to capture contextual information for sentiment analysis in user-generated videos. Majumder et al. (2019) proposed the DialogueRNN model that uses three gated recurrent units (GRU) (Cho et al., 2014) to model the speaker, the context from the preceding utterances, and the emotions of the preceding utterances, respectively. They achieved state-of-the-art performance on several conversational video datasets.
Knowledge Base in Conversations: Recently, a growing number of studies have incorporated knowledge bases in generative conversation systems, such as open-domain dialogue systems (Han et al., 2015; Asghar et al., 2018; Ghazvininejad et al., 2018; Young et al., 2018; Parthasarathi and Pineau, 2018; Liu et al., 2018; Moghe et al., 2018; Dinan et al., 2019; Zhong et al., 2019), task-oriented dialogue systems (Madotto et al., 2018; Wu et al., 2019; He et al., 2019) and question answering systems (Kiddon et al., 2016; Hao et al., 2017; Sun et al., 2018; Mihaylov and Frank, 2018). Zhou et al. (2018b) adopted structured knowledge graphs to enrich the interpretation of input sentences and help generate knowledge-aware responses using graph attentions. The graph attention in the knowledge interpreter (Zhou et al., 2018b) is static and only related to the recognized entity of interest. By contrast, our graph attention mechanism is dynamic and selects context-aware knowledge entities that balance relatedness and affectiveness.

Emotion Detection in Text: There is a trend moving from traditional machine learning methods (Pang et al., 2002; Wang and Manning, 2012; Seyeditabari et al., 2018) to deep learning methods (Abdul-Mageed and Ungar, 2017; Zhang et al., 2018b) for emotion detection in text. Khanpour and Caragea (2018) investigated emotion detection in health-related posts from online health communities using both deep learning features and lexicon-based features.
Incorporating Knowledge in Sentiment Analysis: Traditional lexicon-based methods detect emotions or sentiments from a piece of text based on the emotions or sentiments of words or phrases that compose it (Hu et al., 2009; Taboada et al., 2011; Bandhakavi et al., 2017). Few studies have investigated the usage of knowledge bases in deep learning methods. Kumar et al. (2018) proposed to use knowledge from WordNet (Fellbaum, 2012) to enrich the text representations produced by LSTM and obtained improved performance.
Transformer: The Transformer has been applied to many NLP tasks due to its rich representation and fast computation, e.g., document machine translation (Zhang et al., 2018a), response matching in dialogue systems (Zhou et al., 2018c), language modelling (Dai et al., 2019) and understanding (Radford et al., 2018). A very recent work (Rik Koncel-Kedziorski and Hajishirzi, 2019) extends the Transformer to graph inputs and proposes a model for graph-to-text generation.

Our Proposed KET Model
In this section we present the task definition and our proposed KET model.

Task Definition
Let $\{X_j^i, Y_j^i\}$, $i = 1, \ldots, N$, $j = 1, \ldots, N_i$, be a collection of {utterance, label} pairs in a given dialogue dataset, where $N$ denotes the number of conversations and $N_i$ denotes the number of utterances in the $i$th conversation. The objective of the task is to maximize the following function:
$$\Phi = \prod_{i=1}^{N} \prod_{j=1}^{N_i} p(Y_j^i \mid X_j^i, X_{j-1}^i, \ldots, X_1^i; \theta), \quad (1)$$
where $X_{j-1}^i, \ldots, X_1^i$ denote contextual utterances and $\theta$ denotes the model parameters we want to optimize.
We limit the number of contextual utterances to $M$. Discarding early contextual utterances may cause information loss, but this loss is negligible because early utterances contribute the least information (Su et al., 2018). This phenomenon can be further observed in our model analysis regarding context length (see Section 5.2). Similar to Poria et al. (2017), we clip and pad each utterance $X_j^i$ to a fixed number $m$ of tokens. The overall architecture of our KET model is illustrated in Figure 2.
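The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the padding token, the choice to keep the first m tokens, and right-padding are assumptions, and the function names are hypothetical.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token


def clip_and_pad(tokens: List[str], m: int, pad: str = PAD) -> List[str]:
    """Clip an utterance to at most m tokens, then right-pad it to exactly m."""
    clipped = tokens[:m]
    return clipped + [pad] * (m - len(clipped))


def context_window(utterances: List[List[str]], j: int, M: int) -> List[List[str]]:
    """Return up to M utterances immediately preceding the utterance at index j."""
    start = max(0, j - M)
    return utterances[start:j]


# Toy four-utterance conversation; the last utterance is the response.
dialogue = [["happy", "birthday"], ["thanks"], ["how", "was", "it"], ["great"]]
context = [clip_and_pad(u, m=3) for u in context_window(dialogue, j=3, M=2)]
response = clip_and_pad(dialogue[3], m=3)
```

With M = 2, the first utterance is discarded, which mirrors the truncation of early context discussed above.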

Knowledge Retrieval
We use a commonsense knowledge base, ConceptNet (Speer et al., 2017), and an emotion lexicon, NRC VAD (Mohammad, 2018a), as knowledge sources in our model.
ConceptNet is a large-scale multilingual semantic graph that describes general human knowledge in natural language. The nodes in ConceptNet are concepts and the edges are relations. Each ⟨concept1, relation, concept2⟩ triplet is an assertion. Each assertion is associated with a confidence score. An example assertion is ⟨friends, CausesDesire, socialize⟩ with a confidence score of 3.46. Usually assertion confidence scores are in the [1, 10] interval. Currently, for English, ConceptNet comprises 5.9M assertions, 3.1M concepts and 38 relations.
The VAD measure of emotion is culture-independent and widely adopted in Psychology (Mehrabian, 1996). Currently NRC VAD comprises around 20K words.
In general, for each non-stopword token $t$ in $X_j^i$, we retrieve a connected knowledge graph $g(t)$ comprising its immediate neighbors from ConceptNet. For each $g(t)$, we remove concepts that are stopwords or not in our vocabulary. We further remove concepts with confidence scores less than 1 to reduce annotation noise. For each concept, we retrieve its VAD values from NRC VAD. The final knowledge representation for each token $t$ is a list of tuples:
$$(c_1, s_1, \mathrm{VAD}(c_1)), (c_2, s_2, \mathrm{VAD}(c_2)), \ldots, (c_{|g(t)|}, s_{|g(t)|}, \mathrm{VAD}(c_{|g(t)|})),$$
where $c_k \in g(t)$ denotes the $k$th connected concept, $s_k$ denotes the associated confidence score, and $\mathrm{VAD}(c_k)$ denotes the VAD values of $c_k$. The treatment of tokens that are not associated with any concept and of concepts that are not included in NRC VAD is discussed in Section 3.4. We leave the treatment of relations as future work.
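The retrieval and filtering steps above can be sketched as below. This is an illustrative sketch under stated assumptions: the dictionary-based graph, the `retrieve_knowledge` helper, and all toy values except the ⟨friends, socialize⟩ score of 3.46 are hypothetical and do not reflect the real ConceptNet or NRC VAD interfaces.

```python
from typing import Dict, List, Optional, Set, Tuple

VAD = Tuple[float, float, float]  # (valence, arousal, dominance)
Knowledge = Tuple[str, float, Optional[VAD]]


def retrieve_knowledge(
    token: str,
    neighbours: Dict[str, List[Tuple[str, float]]],
    vocab: Set[str],
    stopwords: Set[str],
    vad_lexicon: Dict[str, VAD],
    min_conf: float = 1.0,
) -> List[Knowledge]:
    """Return (concept, confidence, VAD) tuples for the neighbours of a token."""
    if token in stopwords:
        return []  # knowledge is only retrieved for non-stopword tokens
    kept: List[Knowledge] = []
    for concept, score in neighbours.get(token, []):
        if concept in stopwords or concept not in vocab:
            continue  # drop stopword and out-of-vocabulary concepts
        if score < min_conf:
            continue  # drop low-confidence assertions
        # VAD is None when the concept is missing from the lexicon.
        kept.append((concept, score, vad_lexicon.get(concept)))
    return kept


# Toy data: only the (friends, socialize, 3.46) score appears in the text.
neighbours = {"friends": [("socialize", 3.46), ("the", 2.0), ("party", 2.0), ("movie", 0.5)]}
vocab = {"friends", "socialize", "party", "movie", "the"}
stopwords = {"the"}
vad_lexicon = {"socialize": (0.8, 0.6, 0.6)}
knowledge = retrieve_knowledge("friends", neighbours, vocab, stopwords, vad_lexicon)
```

Concepts without a lexicon entry keep a `None` placeholder here so that a later step can assign them a default affectiveness value.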

Embedding Layer
We use a word embedding layer to convert each token $t$ in $X^i$ into a vector representation $\mathbf{t} \in \mathbb{R}^d$, where $d$ denotes the size of word embedding. To encode positional information, the position encoding (Vaswani et al., 2017) is added as follows:
$$\mathbf{t} = \mathrm{Embed}(t) + \mathrm{Pos}(t). \quad (2)$$
Similarly, we use a concept embedding layer to convert each concept $c$ into a vector representation $\mathbf{c} \in \mathbb{R}^d$ but without position encoding.
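As an illustration of adding position information to a word embedding, the sketch below uses the sinusoidal encoding of Vaswani et al. (2017). The text does not say whether the sinusoidal or a learned variant is used, so the sinusoidal form is an assumption, and the function names are hypothetical.

```python
import math
from typing import List


def position_encoding(pos: int, d: int) -> List[float]:
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = []
    for i in range(d):
        # Dimensions 2k and 2k+1 share the same frequency 1 / 10000^(2k/d).
        angle = pos / (10000 ** (2 * (i // 2) / d))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def embed_with_position(word_vec: List[float], pos: int) -> List[float]:
    """Add the position encoding element-wise to a word embedding."""
    pe = position_encoding(pos, len(word_vec))
    return [w + p for w, p in zip(word_vec, pe)]
```

Concept embeddings would skip `embed_with_position` entirely, since retrieved concepts have no position in the utterance.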

Dynamic Context-Aware Affective Graph Attention
To enrich word embedding with concept representations, we propose a dynamic context-aware affective graph attention mechanism to compute the concept representation for each token. Specifically, the concept representation $\mathbf{c}(t) \in \mathbb{R}^d$ for token $t$ is computed as
$$\mathbf{c}(t) = \sum_{k=1}^{|g(t)|} \alpha_k \, \mathbf{c}_k, \quad (3)$$
where $\mathbf{c}_k \in \mathbb{R}^d$ denotes the concept embedding of $c_k$ and $\alpha_k$ denotes its attention weight. If $|g(t)| = 0$, we set $\mathbf{c}(t)$ to the average of all concept embeddings. The attention $\alpha_k$ in Equation 3 is computed as
$$\alpha_k = \mathrm{softmax}(w_k), \quad (4)$$
where $w_k$ denotes the weight of $c_k$.
The derivation of $w_k$ is crucial because it regulates the contribution of $c_k$ towards enriching $t$. A standard graph attention mechanism (Veličković et al., 2018) computes $w_k$ by feeding $\mathbf{t}$ and $\mathbf{c}_k$ into a single-layer feedforward neural network. However, not all related concepts are equally useful for detecting emotions given the conversational context. In our model, we make the assumption that important concepts are those that relate to the conversational context and have strong emotion intensity. To this end, we propose a context-aware affective graph attention mechanism by incorporating two factors when computing $w_k$, namely relatedness and affectiveness.
Relatedness: Relatedness measures the strength of the relation between $c_k$ and the conversational context. The relatedness factor in $w_k$ is computed as
$$\mathrm{rel}_k = \text{min-max}(s_k) \cdot \mathrm{abs}(\cos(\mathrm{CR}(X^i), \mathbf{c}_k)), \quad (5)$$
where $s_k$ is the confidence score introduced in Section 3.2, min-max denotes min-max scaling for each token $t$, abs denotes the absolute function, cos denotes the cosine similarity function, and $\mathrm{CR}(X^i) \in \mathbb{R}^d$ denotes the context representation of the $i$th conversation $X^i$. Here we compute $\mathrm{CR}(X^i)$ as the average of all sentence representations in $X^i$ as follows:
$$\mathrm{CR}(X^i) = \mathrm{avg}(\mathrm{SR}(X^i_{j-M}), \ldots, \mathrm{SR}(X^i_j)), \quad (6)$$
where $\mathrm{SR}(X^i_j) \in \mathbb{R}^d$ denotes the sentence representation of $X^i_j$. We compute $\mathrm{SR}(X^i_j)$ via hierarchical pooling (Shen et al., 2018), where n-gram ($n \le 3$) representations in $X^i_j$ are first computed by max-pooling and then all n-gram representations are averaged. The hierarchical pooling mechanism preserves word order information to a certain degree and has demonstrated better performance than average pooling or max-pooling on sentiment analysis tasks (Shen et al., 2018).
Affectiveness: Affectiveness measures the emotion intensity of $c_k$. The affectiveness factor in $w_k$ is computed as
$$\mathrm{aff}_k = \text{min-max}\left(\left\| \left[ V(c_k) - \tfrac{1}{2}, \; \tfrac{A(c_k)}{2} \right] \right\|_2\right), \quad (7)$$
where $\|\cdot\|_2$ denotes the $l_2$ norm, and $V(c_k)$ and $A(c_k)$ denote the valence and arousal values of $\mathrm{VAD}(c_k)$, respectively. Intuitively, $\mathrm{aff}_k$ considers the deviation of valence from neutral and the level of arousal from calm. There is no established method in the literature to compute the emotion intensity based on VAD values, but empirically we found that our method correlates better with an emotion intensity lexicon comprising 6K English words (Mohammad, 2018b) than other methods such as taking dominance into consideration or taking the $l_1$ norm. For a concept $c_k$ not in NRC VAD, we set $\mathrm{aff}_k$ to the mid value of 0.5.
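The two factors can be sketched in plain Python as follows. The exact formulas are assumptions here: relatedness is taken as the min-max-scaled confidence score multiplied by the absolute cosine similarity with the context representation, and affectiveness as the min-max-scaled l2 norm of (valence − 0.5, arousal / 2). The handling of a degenerate min-max range is also an assumption, and all function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to 0.5 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def relatedness(scores, concept_vecs, context_vec) -> List[float]:
    """Scaled confidence times |cos(context, concept)| for each concept."""
    scaled = min_max(scores)
    return [s * abs(cosine(context_vec, c)) for s, c in zip(scaled, concept_vecs)]


def affectiveness(valence_arousal: Sequence[Optional[Tuple[float, float]]]) -> List[float]:
    """Scaled emotion intensity; concepts without VAD values get 0.5."""
    raw = [None if va is None else math.hypot(va[0] - 0.5, va[1] / 2.0)
           for va in valence_arousal]
    known = [r for r in raw if r is not None]
    if not known:
        return [0.5] * len(raw)
    scaled = iter(min_max(known))
    return [0.5 if r is None else next(scaled) for r in raw]
```

Taking the absolute cosine means that a concept pointing in the opposite direction of the context still counts as related, which matches the use of abs in the description above.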
Combining both $\mathrm{rel}_k$ and $\mathrm{aff}_k$, we define the weight $w_k$ as follows:
$$w_k = \lambda_k \cdot \mathrm{rel}_k + (1 - \lambda_k) \cdot \mathrm{aff}_k, \quad (8)$$
where $\lambda_k$ is a model parameter balancing the impacts of relatedness and affectiveness on computing concept representations. Parameter $\lambda_k$ can be fixed or learned during training. The analysis of $\lambda_k$ is discussed in Section 5.2. Finally, the concept-enriched word representation $\hat{\mathbf{t}}$ can be obtained via a linear transformation:
$$\hat{\mathbf{t}} = W[\mathbf{t}; \mathbf{c}(t)], \quad (9)$$
where $[;]$ denotes concatenation and $W \in \mathbb{R}^{d \times 2d}$ denotes a model parameter. All $m$ tokens in each $X^i_j$ then form a concept-enriched utterance embedding $\hat{X}^i_j \in \mathbb{R}^{m \times d}$.
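Putting the pieces together, the sketch below combines relatedness and affectiveness into attention weights, forms the concept representation as a weighted sum, and applies the final linear transformation to the concatenation. It assumes a convex combination with a single fixed λ and a softmax over the weights; the list-based matrices and function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(ws: Sequence[float]) -> List[float]:
    m = max(ws)  # subtract the maximum for numerical stability
    exps = [math.exp(w - m) for w in ws]
    z = sum(exps)
    return [e / z for e in exps]


def concept_representation(rel, aff, concept_vecs, lam: float) -> List[float]:
    """Attention-weighted sum of concept embeddings for one token."""
    weights = [lam * r + (1.0 - lam) * a for r, a in zip(rel, aff)]
    alphas = softmax(weights)
    d = len(concept_vecs[0])
    return [sum(a * c[i] for a, c in zip(alphas, concept_vecs)) for i in range(d)]


def enrich(token_vec: List[float], concept_vec: List[float], W) -> List[float]:
    """Linear map W (d x 2d) applied to the concatenation [t; c(t)]."""
    concat = token_vec + concept_vec
    return [sum(w * x for w, x in zip(row, concat)) for row in W]


# Two concepts with identical weights receive equal attention.
c_t = concept_representation([0.2, 0.2], [0.4, 0.4], [[1.0, 0.0], [0.0, 1.0]], lam=0.5)
t_hat = enrich([1.0, 2.0], c_t, W=[[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
```

Setting `lam` to 1 or 0 recovers a purely relatedness-based or purely affectiveness-based weighting, respectively.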

Hierarchical Self-Attention
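We propose a hierarchical self-attention mechanism to exploit the structural representation of conversations and learn a vector representation for the contextual utterances $X^i_{j-1}, \ldots, X^i_{j-M}$. Specifically, the hierarchical self-attention follows two steps: 1) each utterance representation is computed using an utterance-level self-attention layer, and 2) a context representation is computed from $M$ learned utterance representations using a context-level self-attention layer.
At step 1, for each utterance $X^i_n$, $n = j-1, \ldots, j-M$, its representation $\hat{X}^{\prime i}_n \in \mathbb{R}^{m \times d}$ is learned as follows:
$$\hat{X}^{\prime i}_n = \mathrm{FF}(L'(\mathrm{MH}(L(\hat{X}^i_n), L(\hat{X}^i_n), L(\hat{X}^i_n)))), \quad (10)$$
where $L(\hat{X}^i_n) \in \mathbb{R}^{m \times h \times d_s}$ is linearly transformed from $\hat{X}^i_n$ to form $h$ heads ($d_s = d/h$), $L'$ linearly transforms from $h$ heads back to 1 head, and
$$\mathrm{MH}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_s}}\right)V, \quad (11)$$
$$\mathrm{FF}(x) = \max(0, xW_1 + b_1)W_2 + b_2, \quad (12)$$
where $Q$, $K$, and $V$ denote sets of queries, keys and values, respectively, $W_1 \in \mathbb{R}^{d \times p}$, $b_1 \in \mathbb{R}^p$, $W_2 \in \mathbb{R}^{p \times d}$ and $b_2 \in \mathbb{R}^d$ denote model parameters, and $p$ denotes the hidden size of the point-wise feedforward layer (FF) (Vaswani et al., 2017). The multi-head self-attention layer (MH) enables our model to jointly attend to information from different representation subspaces (Vaswani et al., 2017). The scaling factor $1/\sqrt{d_s}$ is added to ensure that the dot product of two vectors does not get overly large. Similar to Vaswani et al. (2017), both MH and FF layers are followed by residual connection and layer normalization, which are omitted in Equation 10 for brevity.
At step 2, to effectively combine all utterance representations in the context, the context-level self-attention layer is proposed to hierarchically learn the context-level representation $C^i \in \mathbb{R}^{M \times m \times d}$ as follows:
$$C^i = \mathrm{FF}(L'(\mathrm{MH}(L(\hat{X}^i), L(\hat{X}^i), L(\hat{X}^i)))), \quad (13)$$
where $\hat{X}^i$ denotes $[\hat{X}^{\prime i}_{j-M}; \ldots; \hat{X}^{\prime i}_{j-1}]$, which is the concatenation of all learned utterance representations in the context.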
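The core of each attention layer is scaled dot-product attention, sketched below for a single head in plain Python. This is a simplified illustration assuming the standard formulation of Vaswani et al. (2017); the head-splitting and head-merging projections, the feedforward layer, residual connections and layer normalization are all omitted, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def softmax(row: Sequence[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_s)) V."""
    d_s = len(K[0])
    out: Matrix = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_s).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_s) for k in K]
        weights = softmax(scores)
        # Each output row is a convex combination of the value rows.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out


# Self-attention uses the same matrix for queries, keys and values.
X = [[1.0, 0.0], [0.0, 1.0]]
self_attended = attention(X, X, X)
```

In the hierarchical scheme, this operation would first be applied within each utterance and then again over the stacked utterance representations of the context.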

Context-Response Cross-Attention
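Finally, a context-aware concept-enriched response representation $R^i \in \mathbb{R}^{m \times d}$ for conversation $X^i$ is learned by cross-attention (Bahdanau et al., 2014), which selectively attends to the concept-enriched context representation as follows:
$$R^i = \mathrm{FF}(L'(\mathrm{MH}(L(\hat{X}^{\prime i}_j), L(C^i), L(C^i)))), \quad (14)$$
where the response utterance representation $\hat{X}^{\prime i}_j \in \mathbb{R}^{m \times d}$ is obtained via the MH layer:
$$\hat{X}^{\prime i}_j = L'(\mathrm{MH}(L(\hat{X}^i_j), L(\hat{X}^i_j), L(\hat{X}^i_j))). \quad (15)$$
The resulting representation $R^i \in \mathbb{R}^{m \times d}$ is then fed into a max-pooling layer to learn discriminative features among the positions in the response and derive the final representation $O \in \mathbb{R}^d$:
$$O = \mathrm{max\_pool}(R^i). \quad (16)$$
The output probability $p$ is then computed as
$$p = \mathrm{softmax}(O W_3 + b_3), \quad (17)$$
where $W_3 \in \mathbb{R}^{d \times q}$ and $b_3 \in \mathbb{R}^q$ denote model parameters, and $q$ denotes the number of classes.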
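The final pooling and classification steps can be sketched as follows, assuming max-pooling over the m response positions followed by an affine map and a softmax. The list-based tensors and function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def max_pool(R: List[List[float]]) -> List[float]:
    """Max over the m positions of an m x d matrix, giving a d-vector."""
    return [max(col) for col in zip(*R)]


def classify(R, W3, b3) -> List[float]:
    """Class probabilities softmax(O W3 + b3) with O = max_pool(R)."""
    O = max_pool(R)
    q = len(b3)
    logits = [sum(o * W3[i][c] for i, o in enumerate(O)) + b3[c] for c in range(q)]
    return softmax(logits)


# Toy response representation with m = 2 positions and d = 2 features.
R = [[1.0, 5.0], [3.0, 2.0]]
probs = classify(R, W3=[[0.0, 0.0], [0.0, 0.0]], b3=[0.0, 0.0])
```

With all-zero parameters the logits are equal, so the toy example returns a uniform distribution over the two classes.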
The entire KET model is optimized in an end-to-end manner as defined in Equation 1. Our implementation is available online.

Experimental Settings
In this section we present the datasets, evaluation metrics, baselines, our model variants, and other experimental settings.

Datasets and Evaluations
We evaluate our model on the following five emotion detection datasets of various sizes and domains. The statistics are reported in Table 1.
EC (Chatterjee et al., 2019): Conversations collected from Tweets. The emotion labels include happiness, sadness, anger and other.
DailyDialog (Li et al., 2017): Human-written daily communications. The emotion labels include neutral and Ekman's six basic emotions (Ekman, 1992), namely happiness, surprise, sadness, anger, disgust and fear.
MELD (Poria et al., 2018): TV show scripts collected from Friends. The emotion labels are the same as the ones used in DailyDialog.
EmoryNLP (Zahiri and Choi, 2018): TV show scripts collected from Friends as well. However, its size and annotations are different from MELD. The emotion labels include neutral, sad, mad, scared, powerful, peaceful, and joyful.
IEMOCAP (Busso et al., 2008): Emotional dialogues. The emotion labels include neutral, happiness, sadness, anger, frustrated, and excited.
In terms of the evaluation metric, for EC and DailyDialog, we follow Chatterjee et al. (2019) and use the micro-averaged F1 excluding the majority class (neutral), due to their extremely unbalanced labels (the percentage of the majority class in the test set is over 80%). For the remaining, relatively balanced datasets, we follow Majumder et al. (2019) and use the weighted macro-F1.
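The micro-averaged F1 that excludes the majority class can be sketched as below. This follows the usual definition of micro-averaging (pooling true positives, false positives and false negatives over the retained classes) and is not taken from any official evaluation script; the function name and toy labels are hypothetical.

```python
from typing import Sequence


def micro_f1_excluding(y_true: Sequence[str], y_pred: Sequence[str], excluded: str) -> float:
    """Micro-averaged F1 over all classes except `excluded`."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == p:
            if t != excluded:
                tp += 1  # correct prediction of a retained class
        else:
            if p != excluded:
                fp += 1  # a retained class was predicted wrongly
            if t != excluded:
                fn += 1  # a retained class was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = ["happy", "sad", "neutral", "neutral", "angry"]
y_pred = ["happy", "neutral", "neutral", "sad", "angry"]
score = micro_f1_excluding(y_true, y_pred, excluded="neutral")
```

Correct predictions of the excluded class contribute nothing to the score, so a model that always predicts the majority class obtains 0.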

Baselines and Model Variants
For a comprehensive performance evaluation, we compare our model with the following baselines:
cLSTM: A contextual LSTM model. An utterance-level bidirectional LSTM is used to encode each utterance. A context-level unidirectional LSTM is used to encode the context.
CNN (Kim, 2014): A single-layer CNN with strong empirical performance. This model is trained on the utterance level without context.
CNN+cLSTM (Poria et al., 2017): A CNN is used to extract utterance features. A cLSTM is then applied to learn context representations.
BERT BASE (Devlin et al., 2018): Base version of the state-of-the-art model for sentiment classification. We treat each utterance with its context as a single document. We limit the document length to the last 100 tokens to allow a larger batch size. We do not experiment with the large version of BERT due to the memory constraints of our GPU.
DialogueRNN (Majumder et al., 2019): The state-of-the-art model for emotion detection in textual conversations. It models both context and speaker information. The CNN features used in DialogueRNN are extracted from the carefully tuned CNN model. For datasets without speaker information, i.e., EC and DailyDialog, we use two speakers only. For MELD and EmoryNLP, which have 260 and 255 speakers, respectively, we additionally experimented with clipping the number of speakers to the most frequent ones (6 main speakers + a universal speaker representing all other speakers) and reported the best results.
In addition, we evaluate two variants of our model:
KET SingleSelfAttn: We replace the hierarchical self-attention by a single self-attention layer to learn context representations. Contextual utterances are concatenated together prior to the single self-attention layer.
KET StdAttn: We replace the dynamic context-aware affective graph attention by the standard graph attention (Veličković et al., 2018).

Other Experimental Settings
We preprocessed all datasets by lower-casing and tokenization using spaCy. We keep all tokens in the vocabulary. We use the released code for BERT BASE and DialogueRNN. For each dataset, all models are fine-tuned based on their performance on the validation set.
For our model on all datasets, we use Adam optimization (Kingma and Ba, 2014) with a batch size of 64 and a learning rate of 0.0001 throughout the training process. We use GloVe embedding (Pennington et al., 2014) for initialization in the word and concept embedding layers. For the class weights in the cross-entropy loss for each dataset, we set them as the ratio of the class distribution in the validation set to the class distribution in the training set, which alleviates the problem of unbalanced datasets. The detailed hyper-parameter settings for KET are presented in Table 3.
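The class-weighting rule can be sketched as below, reading "class distribution" as the relative frequency of each class. That reading, the helper name, and the toy labels are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def class_weights(train_labels: Sequence[str], valid_labels: Sequence[str]) -> Dict[str, float]:
    """Weight per class: validation frequency divided by training frequency."""
    train = Counter(train_labels)
    valid = Counter(valid_labels)
    n_train, n_valid = len(train_labels), len(valid_labels)
    # Only classes seen in training get a weight, which avoids division by zero.
    return {c: (valid[c] / n_valid) / (train[c] / n_train) for c in train}


# Toy labels: class "a" is more frequent in validation than in training.
weights = class_weights(["a", "a", "b", "b"], ["a", "a", "a", "b"])
```

A class that is relatively more frequent in the validation set than in the training set receives a weight above 1, so its training examples count more in the loss.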

Result Analysis
In this section we present model evaluation results, model analysis, and error analysis.

Comparison with Baselines
We compare the performance of KET against that of the baseline models on the five datasets introduced above. The results are reported in Table 2. Note that our results for CNN, CNN+cLSTM and DialogueRNN on EC, MELD and IEMOCAP are slightly different from the reported results in (Majumder et al., 2019; Poria et al., 2019). cLSTM performs reasonably well on short conversations (i.e., EC and DailyDialog), but the worst on long conversations (i.e., MELD, EmoryNLP and IEMOCAP). One major reason is that learning long dependencies using gated RNNs may not be effective enough because the gradients are expected to propagate back through an inevitably large number of utterances and tokens in sequence, which easily leads to the vanishing gradient problem (Bengio et al., 1994). In contrast, when the utterance-level LSTM in cLSTM is replaced by features extracted by CNN, i.e., the CNN+cLSTM, the model performs significantly better than cLSTM on long conversations, which further validates that modelling long conversations using only RNN models may not be sufficient. BERT BASE achieves very competitive performance on all datasets except EC due to its strong representational power via bi-directional context modelling using the Transformer. Note that BERT BASE has considerably more parameters than other baselines and our model (110M for BERT BASE versus 4M for our model), which can be a disadvantage when deployed to devices with limited computing power and memory. The state-of-the-art DialogueRNN model performs the best overall among all baselines. In particular, DialogueRNN performs better than our model on IEMOCAP, which may be attributed to its detailed speaker information for modelling the emotion dynamics in each speaker as the conversation flows.
It is encouraging to see that our KET model outperforms the baselines on most of the datasets tested. This finding indicates that our model is robust across datasets with varying training sizes, context lengths and domains. Our KET variants KET SingleSelfAttn and KET StdAttn perform comparably with the best baselines on all datasets except IEMOCAP. However, both variants perform noticeably worse than KET on all datasets except EC, validating the importance of our proposed hierarchical self-attention and dynamic context-aware affective graph attention mechanism. One observation worth mentioning is that these two variants perform on a par with the KET model on EC. Possible explanations are that 1) hierarchical self-attention may not be critical for modelling short conversations in EC, and 2) the informal linguistic styles of Tweets in EC, e.g., misspelled words and slang, hinder the context representation learning in our graph attention mechanism.

Model Analysis
We analyze the impact of different settings on the validation performance of KET. All results in this section are averaged over 5 random seeds.
Analysis of context length: We vary the context length $M$ and plot model performance in Figure 3 (top portion). Note that EC has a maximum of only 2 contextual utterances. It is clear that incorporating context into KET improves performance on all datasets. However, adding more context yields diminishing performance gains or even has a negative impact on some datasets. This phenomenon has been observed in a prior study (Su et al., 2018). One possible explanation is that incorporating long contextual information may introduce additional noise, e.g., polysemes expressing different meanings in different utterances of the same context. A more thorough investigation of this diminishing return phenomenon is a worthwhile direction for future work.
Analysis of the size of ConceptNet: We vary the size of ConceptNet by randomly keeping only a fraction of the concepts in ConceptNet when training and evaluating our model. The results are illustrated in Figure 3 (bottom portion). Adding more concepts consistently improves model performance before reaching a plateau, validating the importance of commonsense knowledge in detecting emotions. We may expect the performance of our KET model to improve with the growing size of ConceptNet in the future.
Analysis of the relatedness-affectiveness tradeoff: We experiment with different values of $\lambda_k \in [0, 1]$ (see Equation 8) for all $k$ and report the results in Table 4. It is clear that $\lambda_k$ makes a noticeable impact on the model performance. Discarding relatedness or affectiveness completely causes a significant performance drop on all datasets except IEMOCAP. One possible reason is that conversations in IEMOCAP are emotional dialogues; therefore, the affectiveness factor in our proposed graph attention mechanism can provide more discriminative power.
Ablation Study: We conduct an ablation study to investigate the contribution of context and knowledge, as reported in Table 5. It is clear that both context and knowledge are essential to the strong performance of KET on all datasets. Note that removing context has a greater impact on long conversations than short conversations, which is expected because more contextual information is lost in long conversations.

Error Analysis
Despite the strong performance of our model, it still fails to detect certain emotions on certain datasets. We rank the F1 score of each emotion per dataset and investigate the emotions with the worst scores. We found that disgust and fear are generally difficult to detect and differentiate. For example, the F1 score of the fear emotion in MELD is as low as 0.0667. One possible cause is that these two emotions are intrinsically similar. The VAD values of both emotions have low valence, high arousal and low dominance (Mehrabian, 1996).
Another cause is the small amount of data available for these two emotions. How to differentiate intrinsically similar emotions and how to effectively detect emotions using limited data are two challenging directions in this field.

Conclusion
We present a knowledge-enriched transformer to detect emotions in textual conversations. Our model learns structured conversation representations via hierarchical self-attention and dynamically refers to external, context-aware, and emotion-related knowledge entities from knowledge bases. Experimental analysis demonstrates that both contextual information and commonsense knowledge are beneficial to model performance. The tradeoff between relatedness and affectiveness plays an important role as well. In addition, our model outperforms the state-of-the-art models on most of the tested datasets of varying sizes and domains.
Given that there are similar emotion lexicons to NRC VAD in other languages and ConceptNet is a multilingual knowledge base, our model can be easily adapted to other languages. In addition, given that NRC VAD is the only emotion-specific component, our model can be adapted as a generic model for conversation analysis.

Figure 1: An example conversation with annotated labels from the DailyDialog dataset (Li et al., 2017). By referring to the context, "it" in the third utterance is linked to "birthday" in the first utterance. By leveraging an external knowledge base, the meaning of "friends" in the fourth utterance is enriched by associated knowledge entities, namely "socialize", "party", and "movie". Thus, the implicit "happiness" emotion in the fourth utterance can be inferred more easily via its enriched meaning.

Figure 3: Validation performance by KET. Top: different context length ($M$). Bottom: different sizes of random fractions of ConceptNet.

Table 1: Dataset descriptions.

Table 2: Performance comparisons on the five test sets. Best values are highlighted in bold.

Table 3: Hyper-parameter settings for KET. $M$: context length. $m$: number of tokens per utterance. $d$: word embedding size. $p$: hidden size in FF layer. $h$: number of heads.

Table 4: Analysis of the relatedness-affectiveness tradeoff on the validation sets. Each column corresponds to a fixed $\lambda_k$ for all concepts (see Equation 8).

Table 5: Ablation study for KET on the validation sets.