Summarize before Aggregate: A Global-to-local Heterogeneous Graph Inference Network for Conversational Emotion Recognition

Conversational Emotion Recognition (CER) is a crucial task in Natural Language Processing (NLP) with wide applications. Prior works in CER generally focus on modeling emotional influences solely with utterance-level features, with little attention paid to phrase-level semantic connections between utterances. Phrases carry sentiment when they refer to emotional events under certain topics, providing a global semantic connection between utterances throughout the entire conversation. In this work, we propose a two-stage Summarization and Aggregation Graph Inference Network (SumAggGIN), which seamlessly integrates inference for topic-related emotional phrases and local dependency reasoning over neighbouring utterances in a global-to-local fashion. Topic-related emotional phrases, which constitute the global topic-related emotional connections, are recognized by our proposed heterogeneous Summarization Graph. Local dependencies, which capture short-term emotional effects between neighbouring utterances, are further injected via an Aggregation Graph to distinguish the subtle differences between utterances containing emotional phrases. The two steps of graph inference are tightly coupled for a comprehensive understanding of emotional fluctuation. Experimental results on three CER benchmark datasets verify the effectiveness of our proposed model, which outperforms state-of-the-art approaches.


Introduction
Conversational Emotion Recognition (CER) has attracted increasing interest for its promising applications in intelligent interactive systems with diverse functionalities, including medical-care systems and online recommendation systems (Zhang et al., 2014; Gkotsis et al., 2016; Shen et al., 2020). As shown in Figure 1, conversations in CER datasets are segmented into multiple utterances based on breaths or pauses of the speaker, and each utterance is associated with an emotion label.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/. * indicates the corresponding author.

Existing works with deep learning approaches generally capture emotional features solely from an utterance-level perspective by modeling interactions between utterances via Recurrent Neural Network (RNN) structures (Hazarika et al., 2018b; Hazarika et al., 2018a) or graph structures (Ghosal et al., 2019). However, phrase-level semantic connections between utterances are still underexplored, creating a substantial barrier to a comprehensive understanding of the source of emotional fluctuation. An example of a topic-related emotional phrase is shown in Figure 1, where the noun phrase "fifty dollars" is not an emotional expression at first, but under the topic of compensation, it becomes associated with anger and frustration by referring to the unsatisfactory compensation. Therefore, it is crucial to recognize phrase-level emotional patterns globally across different utterances. Furthermore, topic-related emotional phrases generally scatter across different utterances throughout the conversation, and the emotions they convey depend on local context. To accurately distinguish the emotions behind these "fifty dollars" phrases, the model is required to be comprehensively aware of both the topic of compensation and the subtle differences in attitude in the responses of the male speaker. By seamlessly

Male: We'd be willing to give you fifty dollars to reimburse you for the bag.
Female: All you're going to do is just give me fifty dollars and say go have fun on your vacation without any of your stuff? What am I gonna do without anything for three weeks? [FRUSTRATED]
⋮
Male: There are some shops here in the airport. [NEUTRAL]
Female: But fifty dollars isn't going to get me anything. [ANGRY]
Male: I realize this. But I have to ask you to move so I can help the next person. [NEUTRAL]
⋮
Female: Can I at least have my fifty dollars, please? [FRUSTRATED]
Female: It sounds like it's no big deal whatever. It's a big deal to me. [ANGRY]
Figure 1: A sample conversation from the IEMOCAP dataset demonstrating topic-related emotional phrases. The female speaker is complaining that her bag was lost during the flight and that the amount of compensation offered by the airline is unsatisfactory. Originally, the noun phrase "fifty dollars" does not carry emotion; however, in this context, it becomes a topic-related emotional phrase repeatedly used by the female speaker to express her disappointment towards the airline service.
combining global features from topic-related emotional phrases and local features on local dependencies from neighbouring utterances, the emotions associated with each utterance are fully excavated in a global-to-local fashion.
Graphs with multiple types of nodes are called heterogeneous graphs and have been applied to various NLP tasks for aligning information from different domains (Yao et al., 2019; Tu et al., 2019; Yu et al., 2019; Wang et al., 2020); nevertheless, they remain a relatively new territory for the CER task. A tricky issue in exploring phrase-level semantic connections is the recognition of topic-related emotional phrases, which requires meticulous reasoning over different utterances to judge whether a phrase refers to an emotional event under the topic. Intuitively, a heterogeneous graph can be applied to align phrase-level and utterance-level features, providing thorough reasoning about topic-related emotional phrases.
In this paper, we propose a two-stage Summarization and Aggregation Graph Inference Network (SumAggGIN) for CER, which seamlessly integrates inference for topic-related emotional phrases and local dependency reasoning over neighbouring utterances in a global-to-local fashion. A heterogeneous Summarization Graph, which consists of two types of nodes (i.e. utterance nodes and phrase nodes), is proposed to infer topic-related emotional phrases. The Summarization Graph enables information propagation between utterances and phrases via overlapping phrases and utterance-phrase relations, thereby enhancing utterance representations with summarized phrase-level semantic connections. Subsequently, an Aggregation Graph is constructed to further inject speaker-related local dependencies into the topic-aware utterance representations for capturing short-term emotional effects between neighbouring utterances. The Summarization Graph and the Aggregation Graph are tightly coupled to bridge global topic-related phrase patterns and local speaker-related utterance-level features for a comprehensive understanding of emotional fluctuation.
The main contributions of our work are highlighted as follows: (1) We propose a novel two-stage Summarization and Aggregation Graph Inference Network (SumAggGIN), which comprehensively captures emotional influences in a global-to-local fashion.
(2) To the best of our knowledge, we are the first to construct a heterogeneous Summarization Graph for inferring global emotional interactions across utterances based on phrases.
(3) Extensive experiments on three publicly available CER datasets demonstrate that our model attains a substantial improvement and achieves state-of-the-art performance.

Emotion Recognition in Conversation
Emotion recognition, closely related to sentiment analysis, is a fundamental task in NLP with wide applications (Xia et al., 2011; Liu, 2012). For medical-care systems, conversational emotion recognition can be incorporated to accurately monitor patients' emotional fluctuation and detect potential mental health issues (e.g. depression and suicidal intention) in time (Gkotsis et al., 2016; Korkontzelos et al., 2016). Moreover, emotion recognition can benefit online recommendation on social media platforms by modeling users' short-term preferences (Zhang et al., 2014; Shen et al., 2020). Recently, an increasing number of models have been proposed to solve CER using various structures. Poria et al. (2017) propose to capture context-aware utterance representations via a bi-directional LSTM network. Additionally, they adopt an attention mechanism to re-weight the outputs for a more informative result. Memory networks proposed by Sukhbaatar et al. (2015) are adopted by several prior works to capture speaker-related historical information. In CMN, proposed by Hazarika et al. (2018b), two distinct memory cells are employed to model the dialogue histories of the two speakers. Hazarika et al. (2018a) further extend CMN by utilizing another memory cell to model global emotional influences across speakers. For distinguishing participants in a multiparty conversation, Majumder et al. (2019) propose DialogueRNN with three GRUs (Tang et al., 2015) tracking individual participant states, global context and emotional states, respectively. Jiao et al. (2020) propose an attention-gated hierarchical memory network for real-time emotion recognition without future context. The state-of-the-art DialogueGCN (Ghosal et al., 2019) leverages speaker and temporal dependencies at the utterance level by performing graph convolution on a homogeneous graph with each utterance as a node. Nevertheless, none of these methods explicitly models global semantic interactions based on topic-related emotional phrases.

Applications of Heterogeneous Graph for NLP
Graphs comprised of multiple types of nodes are called heterogeneous graphs. These graphs are constructed to simulate real-world scenarios with multiple granularity levels of information, and they have been widely adopted for NLP tasks. In text classification, Yao et al. (2019) propose to build a heterogeneous text graph for a corpus based on phrase co-occurrence and term frequency-inverse document frequency (TF-IDF) weights. Tu et al. (2019) design a heterogeneous document-entity graph with candidates, documents and entities in specific document contexts, facilitating accurate reasoning for multi-hop reading comprehension. For the task of visual commonsense reasoning, Yu et al. (2019) construct a vision-to-answer heterogeneous graph and a question-to-answer heterogeneous graph to bridge the semantic alignment between the vision and language domains. Wang et al. (2020) propose to capture relations between sentences by constructing a heterogeneous graph network consisting of semantic units of different granularity for extractive document summarization. These prior studies inspire us to align phrase-level and utterance-level features for recognizing topic-related emotional phrases via a heterogeneous graph.

Methodology
Before diving into the details of our proposed model, we begin by introducing the basic mathematical notations and terminologies for the task of CER. The goal of CER is to infer the emotion label (e.g. happy, sad, neutral, angry, excited, and frustrated) for each utterance in a conversation. Given a CER dataset D, a conversation is denoted as U = {U_1, ..., U_n}, where n is the number of utterances and each utterance U_i contains l_i words. Assuming there are M participants, the speaker corresponding to the i-th utterance U_i is represented as s_i ∈ {0, ..., M − 1}, and y_i ∈ {0, ..., N − 1} indicates the emotion label for utterance U_i, with N being the number of emotion classes.
An overview of our proposed SumAggGIN model is shown in Figure 2, which consists of an Encoding module, a Summarization Graph, an Aggregation Graph and a Classification module.

Figure 2: The architecture of our SumAggGIN model, comprising the Encoding module, the Summarization Graph (over phrase-level and utterance-level features), the Aggregation Graph (with edges typed as towards the same speaker or towards a different speaker), and the Classification module. For simplicity, we demonstrate our graph structures via a dyadic conversation with 4 utterances, and each utterance includes the same 5 overlapping phrases. s1 and s2 represent the two distinct speakers in the conversation. ⊕ denotes the concatenation operation.

Encoding
For words in utterances, we convert them into 300-dimensional pretrained 840B GloVe word embeddings (Pennington et al., 2014). A TextCNN (Kim, 2014) is applied to capture n-gram information from each utterance U_i. We use convolution filters of sizes 3, 4 and 5, each with 50 feature maps. The outputs of the convolutions are further processed by max-pooling and a ReLU activation (Nair and Hinton, 2010). We concatenate these activation results and feed them into a 150-dimensional fully connected layer, whose outputs are denoted as {u_i}_{i=1}^n. Subsequently, based on the local utterance features from the TextCNN, we apply a bidirectional LSTM (BiLSTM) to capture sequential contextual information. We denote v_i ∈ R^{2d} as the sequential context-aware utterance representation for the i-th utterance, where d is the hidden size of each direction of the BiLSTM.
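The convolutional sub-step above can be sketched as follows. This is a minimal illustration with untrained, randomly initialized filters standing in for the learned ones; the helper name `textcnn_features` and the toy shapes are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def textcnn_features(embeddings, filter_sizes=(3, 4, 5), n_maps=50):
    """Minimal TextCNN sketch: for each filter size, convolve over word
    windows, apply ReLU, max-pool over time, then concatenate."""
    l, d = embeddings.shape
    outputs = []
    for k in filter_sizes:
        # one random (untrained) filter bank of shape (n_maps, k * d)
        W = rng.standard_normal((n_maps, k * d)) * 0.1
        windows = np.stack([embeddings[i:i + k].ravel()
                            for i in range(l - k + 1)])   # (l-k+1, k*d)
        conv = windows @ W.T                              # (l-k+1, n_maps)
        pooled = np.maximum(conv, 0.0).max(axis=0)        # ReLU + max-pool
        outputs.append(pooled)
    return np.concatenate(outputs)                        # 3 sizes x 50 maps

# toy utterance of 12 words with 300-d GloVe-like embeddings
u = rng.standard_normal((12, 300))
feat = textcnn_features(u)
print(feat.shape)  # (150,)
```

The 150-dimensional concatenated feature matches the width of the fully connected layer described above; in the real model a trained projection and the BiLSTM follow.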

Summarization Graph
In this section, a heterogeneous Summarization Graph is constructed to recognize topic-related emotional phrases so as to explicitly model topic-related emotional interactions throughout the entire conversation. Phrases are extracted from utterances by TextRank (Mihalcea and Tarau, 2004). Through the exchange of information between utterances and phrases on the Summarization Graph, utterance representations can be enhanced with summarized phrase-level semantic connections from a global perspective.

Summarization Graph Construction
We denote our Summarization Graph as G_sum = (V, E_sum), where V = V_o ∪ V_u is a node set composed of phrase nodes and utterance nodes, and E_sum stands for the edges between nodes. V_o = {o_1, ..., o_m} and V_u = {U_1, ..., U_n} represent the m key phrases and the n utterances of the conversation, respectively. e_{ij} ≥ 0 denotes the weight of the edge between the i-th phrase and the j-th utterance; in particular, e_{ij} = 0 indicates that the i-th phrase does not appear in the j-th utterance. Self-loops are included to ensure that the original features of each node are preserved in the course of message propagation. For phrase nodes in V_o, feature vectors are initialized by averaging the pretrained GloVe embeddings of the constituent words. Utterance nodes U_i ∈ V_u are initialized with the corresponding sequential context-aware utterance representations v_i obtained from the BiLSTM. Therefore, the feature matrices of phrase and utterance nodes are denoted as X_o ∈ R^{m×d_w} and X_u ∈ R^{n×2d}, respectively, where d_w is the dimension of the word embedding. In our experiments, d_w = 2d.
To infuse relation importance into the edge between a phrase node and an utterance node, we use the TF-IDF weight of the phrase in the utterance, as suggested by Yao et al. (2019). Term frequency is the number of times phrase o_i occurs in utterance U_j, while inverse document frequency is the logarithmically scaled inverse fraction of the number of utterances containing phrase o_i.
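A minimal sketch of these edge weights, assuming exact substring matching for phrase occurrence; the helper `tfidf_edges` and the toy utterances are illustrative:

```python
import math

def tfidf_edges(phrases, utterances):
    """Sketch of the phrase-utterance edge weights: term frequency of a
    phrase in an utterance times the log inverse document frequency over
    all utterances. Absent pairs get no edge (weight 0)."""
    n = len(utterances)
    edges = {}
    for i, p in enumerate(phrases):
        df = sum(1 for u in utterances if p in u)  # utterances containing p
        if df == 0:
            continue
        idf = math.log(n / df)
        for j, u in enumerate(utterances):
            tf = u.count(p)
            if tf > 0:
                edges[(i, j)] = tf * idf
    return edges

utts = ["we will give you fifty dollars",
        "fifty dollars will not get me anything",
        "there are shops in the airport"]
edges = tfidf_edges(["fifty dollars", "airport"], utts)
print(edges)
```

Note that a phrase appearing in every utterance would receive idf = 0 under this plain scheme; smoothed idf variants avoid zeroing such edges, but the source does not specify which variant is used.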

Message Propagation
We apply a variant of the Graph Attention Network (GAT) (Veličković et al., 2018) to propagate information among nodes in the Summarization Graph. The hidden states of input nodes are denoted as g_i ∈ R^{2d×1}, i ∈ {1, ..., (m + n)}. A Multi-Layer Perceptron (MLP) is applied to compute the attention coefficient between a node i and its neighbour j (j ∈ N_i) at layer t:

z_{ij}^t = LeakyReLU( W_a^t [ W_b^t g_i^t ⊕ W_b^t g_j^t ] ),    (1)

where W_a^t and W_b^t are trainable parameters at the t-th layer, ⊕ denotes the concatenation operation, and N_i denotes the set of neighbours of node i. Subsequently, the coefficients are normalized using the softmax function:

α_{ij}^t = exp(z_{ij}^t) / Σ_{l ∈ N_i} exp(z_{il}^t).    (2)

Finally, we utilize the normalized attention coefficients to compute a linear combination of the neighbouring features. The updated feature vector for node i at the t-th layer is formulated as:

g_i^{t+1} = σ( Σ_{j ∈ N_i} α_{ij}^t W_b^t g_j^t ).    (3)

Although utterance nodes are not directly connected, stacking 2 layers of GAT enables the indirect exchange of information between pairs of utterances through co-appearing phrases. Inspired by the Transformer (Vaswani et al., 2017), we further apply a position-wise feed-forward network (FFN) layer after each GAT layer. The summarized representation for the i-th utterance after propagation is denoted as g_i = g_i^{(2)}.
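One such attention layer can be sketched in a few lines of numpy; the random weights stand in for trained parameters, `tanh` stands in for the unspecified activation σ, and the toy graph (3 phrase nodes, 2 utterance nodes, self-loops included) is our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def gat_layer(G, neighbours, Wa, Wb):
    """One graph-attention layer: score each edge with a LeakyReLU-activated
    projection of the concatenated projected features, softmax-normalize the
    scores per node, then aggregate neighbour features."""
    def leaky_relu(x):
        return np.where(x > 0, x, 0.2 * x)
    H = G @ Wb.T                                  # project all node features
    out = np.zeros_like(H)
    for i, nbrs in neighbours.items():
        scores = np.array([leaky_relu(Wa @ np.concatenate([H[i], H[j]]))
                           for j in nbrs])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                      # softmax over neighbours
        out[i] = np.tanh(sum(a * H[j] for a, j in zip(alpha, nbrs)))
    return out

d = 8
G = rng.standard_normal((5, d))                   # 3 phrases + 2 utterances
nbrs = {0: [3, 4], 1: [3], 2: [4],                # phrase -> utterance links
        3: [0, 1, 3], 4: [0, 2, 4]}               # utterances, with self-loops
Wa = rng.standard_normal(2 * d)
Wb = rng.standard_normal((d, d)) * 0.3
out = gat_layer(G, nbrs, Wa, Wb)
print(out.shape)  # (5, 8)
```

Stacking this layer twice lets utterance node 3 receive information from utterance node 4 via the shared phrase node 0, mirroring the indirect exchange described above.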

Aggregation Graph
The utterance representation obtained from Summarization Graph mainly captures global topic-related emotional interactions throughout the whole conversation. To further explore short-term emotional effects between neighbouring utterances, we construct an Aggregation Graph for modeling speaker-related context dependencies from a local perspective.

Aggregation Graph Construction
An Aggregation Graph can be denoted as G agg = (V u , E agg , R), where V u represents the node set containing utterance nodes solely, E agg stands for edges between nodes, and R denotes the type of the edges. Each utterance node U i ∈ V u is initialized with the corresponding summarized utterance representation g i obtained from the Summarization Graph.
To explicitly model speaker dependencies between utterances, we divide the edges in E_agg into 2 categories, i.e. edges towards the same speaker and edges towards a different speaker. To capture emotional patterns only from neighbouring utterances, we construct edges within a context window of size W. As a result, each utterance node U_i only links to the W utterances in the past (U_{i−W}, U_{i−W+1}, ..., U_{i−1}) and the W utterances in the future (U_{i+1}, U_{i+2}, ..., U_{i+W}). The edge weight z_{ij} is obtained from the cosine similarity between the feature vectors h_i and h_j of the two utterance nodes U_i and U_j:

z_{ij} = (h_i · h_j) / (‖h_i‖ ‖h_j‖).    (4)

To ensure that the incoming edges of each utterance node receive a total weight contribution of 1, the edge weights are further normalized by the softmax function:

β_{ij} = exp(z_{ij}) / Σ_{l ∈ N_i} exp(z_{il}).    (5)
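The windowed, speaker-typed edge construction with cosine weights and per-node softmax normalization can be sketched as follows; the helper name and toy inputs are our own:

```python
import numpy as np

def aggregation_edges(H, speakers, W=2):
    """Sketch of the Aggregation Graph edges: link each utterance to the
    W previous and W following utterances, weight edges by cosine
    similarity, tag each edge as same-speaker or cross-speaker, and
    softmax-normalize the incoming weights per node."""
    n = H.shape[0]
    edges = []
    for i in range(n):
        js = [j for j in range(max(0, i - W), min(n, i + W + 1)) if j != i]
        sims = np.array([H[i] @ H[j] /
                         (np.linalg.norm(H[i]) * np.linalg.norm(H[j]))
                         for j in js])
        beta = np.exp(sims) / np.exp(sims).sum()  # incoming weights sum to 1
        for j, b in zip(js, beta):
            rel = "same" if speakers[i] == speakers[j] else "cross"
            edges.append((j, i, rel, float(b)))   # directed edge j -> i
    return edges

rng = np.random.default_rng(2)
H = rng.standard_normal((6, 4))                   # 6 utterance features
edges = aggregation_edges(H, speakers=[0, 1, 0, 1, 0, 1], W=2)
total_in_0 = sum(b for j, i, r, b in edges if i == 0)
print(round(total_in_0, 6))  # 1.0
```

Utterances at the conversation boundaries simply have fewer neighbours; the softmax is taken over whatever neighbours exist.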

Message Propagation
To pass messages between neighbouring utterance nodes, graph convolution is performed on the basis of the Aggregation Graph. A message-passing strategy that distinguishes different types of edges (Schlichtkrull et al., 2018) is adopted:

h_i = σ( Σ_{r ∈ R} Σ_{j ∈ N_i^r} (β_{i,j} / c_{i,r}) W_r g_j + β_{i,i} W_0 g_i ),    (6)

where N_i^r represents the set of neighbours of node i under edge type r ∈ R, β_{i,j} and β_{i,i} are the normalized edge weights, and W_r and W_0 are trainable parameters. The normalization constant c_{i,r} is set to |N_i^r|, the number of neighbouring nodes of node i under edge type r.
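A minimal sketch of this relational message-passing step with the two edge types plus a self-loop transform; the weight matrices are random stand-ins for trained parameters, and ReLU stands in for the activation σ:

```python
import numpy as np

def rgcn_step(g, edges, W_same, W_cross, W_self):
    """Relational message passing in the style of Schlichtkrull et al.
    (2018): neighbour messages are transformed per edge type, scaled by
    the normalized edge weight beta and the per-type neighbour count
    c_{i,r} = |N_i^r|, plus a self-loop term."""
    Wr = {"same": W_same, "cross": W_cross}
    out = g @ W_self.T                            # self-loop contribution
    counts = {}
    for j, i, rel, beta in edges:                 # count |N_i^r|
        counts[(i, rel)] = counts.get((i, rel), 0) + 1
    for j, i, rel, beta in edges:                 # typed, scaled messages
        out[i] += (beta / counts[(i, rel)]) * (Wr[rel] @ g[j])
    return np.maximum(out, 0.0)                   # activation

rng = np.random.default_rng(3)
g = rng.standard_normal((4, 5))                   # 4 utterance nodes
edges = [(0, 1, "same", 0.6), (2, 1, "cross", 0.4)]
W = [rng.standard_normal((5, 5)) * 0.2 for _ in range(3)]
h = rgcn_step(g, edges, *W)
print(h.shape)  # (4, 5)
```

Only node 1 has incoming neighbour edges in this toy example; the other nodes are updated through the self-loop term alone.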

Classification
The aggregated utterance representation h_i from the Aggregation Graph is fed into a fully-connected network to obtain the final prediction for emotion classification:

ŷ_i = softmax(W_c h_i + b_c),    (7)

where W_c and b_c are trainable parameters of the classifier.

Datasets and Evaluation Metrics
We evaluate the performance of our SumAggGIN model on three benchmark datasets for the CER task, i.e. IEMOCAP (Busso et al., 2008), AVEC (Schuller et al., 2012), and MELD (Poria et al., 2019), which are also used by several prior works, including DialogueRNN (Majumder et al., 2019) and DialogueGCN (Ghosal et al., 2019). These multimodal datasets originally contain textual, visual and acoustic information about the utterances in each conversation. However, for the task of CER, we focus only on the textual modality in our experiments. IEMOCAP: The Interactive Emotional Dyadic Motion Capture database (Busso et al., 2008) is a multimodal dataset consisting of videos of dyadic sessions. Each video contains a single conversation, which is segmented into multiple utterances. Each utterance is annotated with one of six emotion labels, i.e. happy, sad, neutral, angry, excited and frustrated. The dataset is officially split into a training set of 120 conversations with 5,810 utterances and a test set of 31 conversations with 1,623 utterances. We randomly select 10% of the training conversations as an evaluation split for selecting hyperparameters. We use weighted average accuracy and f1-score to evaluate overall performance.
AVEC: The continuous Audio/Visual Emotion Challenge (Schuller et al., 2012) dataset is a modification of the SEMAINE database (McKeown et al., 2012), which records interactions between human users and virtual agents. Different from IEMOCAP, utterances in the AVEC dataset are labeled every 0.2 seconds with four real-valued attributes: valence ([−1, 1]), arousal ([−1, 1]), expectancy ([−1, 1]), and power ([0, ∞)). Following prior work, the attributes are averaged over the span of an utterance to obtain utterance-level annotations. The standard split of the AVEC dataset contains 63/32 conversations (4,368/1,430 utterances) for training and testing, and 10% of the training conversations are randomly selected for evaluation. Mean Absolute Error (MAE) is used for model evaluation on this regression task.
MELD: The Multimodal EmotionLines Dataset (Poria et al., 2019) is a multimodal and multiparty dataset extended from the EmotionLines dataset (Hsu et al., 2018). It contains more than 1,400 conversations and 13,000 utterances from the Friends TV series. The utterances are annotated with one of seven emotion labels (anger, disgust, sadness, joy, surprise, fear and neutral). Following prior work, we split the dataset into 1,039/114/280 conversations (9,989/1,109/2,610 utterances) for training, evaluation and testing. We use the weighted average f1-score to evaluate our model.

Baselines
We choose the following baseline models to evaluate the performance of our SumAggGIN model: TextCNN (Kim, 2014) is a baseline convolutional neural network model. It is a subcomponent of our Encoding module (Section 3.1) that captures n-gram information from each utterance independently; however, it cannot capture context-aware information from the surrounding utterances.
bc-LSTM+Att (Poria et al., 2017) utilizes a bi-directional LSTM network to capture context-aware utterance representations. These representations are speaker-agnostic, as utterances are encoded irrespective of their speakers. Additionally, an attention mechanism is adopted to re-weight features and provide a more informative output.
CMN (Hazarika et al., 2018b) exploits two distinct GRUs to model the historical utterances of the two speakers, and generates speaker-aware utterance representations.
ICON (Hazarika et al., 2018a) extends CMN by incorporating inter-speaker emotional influences into the original output of CMN, which only takes self-speaker historical information into consideration.
DialogueRNN (Majumder et al., 2019) is a recurrent network with three GRUs capturing speaker states, global context and emotional states, respectively. It can be applied to multiparty datasets to distinguish different participants in a conversation.
AGHMN (Jiao et al., 2020) is an attention-gated hierarchical memory network that keeps track of individual party states throughout the conversation for real-time emotion recognition without future context.
DialogueGCN (Ghosal et al., 2019) is the state-of-the-art model for the CER task. It employs a graph convolutional network to capture self- and inter-speaker influences between utterances.

Implementation Details
All experiments are carried out on a single NVIDIA Tesla M40 24GB card. The batch size is set to 16. We adopt Adam (Kingma and Ba, 2015) as the optimizer with an initial learning rate of 3e-4 and L2 weight decay of 1e-5. We use an early stopping strategy on the f1-score (IEMOCAP and MELD) or MAE (AVEC) of the validation set, with a patience of 20 epochs. Hyper-parameters are tuned on the validation set. The hidden size of the Summarization Graph and the Aggregation Graph is set to 100 for IEMOCAP/AVEC and 200 for MELD. We set the context window size W of the Aggregation Graph to 10 for IEMOCAP/AVEC and 5 for MELD. A dropout rate of 0.5 is applied to avoid over-fitting.
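The early-stopping rule can be illustrated as follows; the helper is our own sketch, and for AVEC one would track MAE with `higher_is_better=False`:

```python
def early_stopping(scores, patience=20, higher_is_better=True):
    """Stop training once the validation metric has not improved for
    `patience` consecutive epochs; return (best_epoch, best_score)."""
    best, best_epoch, waited = None, 0, 0
    for epoch, s in enumerate(scores):
        improved = best is None or (s > best if higher_is_better else s < best)
        if improved:
            best, best_epoch, waited = s, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                      # patience exhausted: stop
    return best_epoch, best

# validation f1 curve that peaks at epoch 2 and then declines
result = early_stopping([0.55, 0.60, 0.64, 0.63, 0.62, 0.61], patience=3)
print(result)  # (2, 0.64)
```

With the paper's patience of 20, training continues for up to 20 epochs past the best checkpoint before stopping, and the best-epoch model is the one evaluated.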

Overall Performance
As shown in Table 1, we compare the performance of our proposed SumAggGIN model with the state-of-the-art DialogueGCN and other strong baseline models on the textual modality of the three CER datasets. The results reveal that our SumAggGIN model attains a substantial improvement over the state-of-the-art on all three datasets. Our SumAggGIN model improves on the state-of-the-art results by 1.51% and 2.43% in weighted average accuracy and f1-score on the IEMOCAP dataset, respectively. On the AVEC dataset, our model outperforms the state-of-the-art DialogueGCN on all four attributes. On the MELD dataset, our SumAggGIN model attains a state-of-the-art average f1-score of 58.45%.

Ablation Study
In order to examine the effectiveness of our proposed SumAggGIN model, we perform an ablation study on the two graph inference modules using the IEMOCAP dataset; the results are listed in Table 2. The two modules are removed one at a time to examine their contribution to the full model.

w/o Aggregation Graph In this setting, the Aggregation Graph is removed from the full model. We observe a 2.87% and 2.50% drop in f1-score and accuracy, respectively, indicating the importance of the Aggregation Graph for capturing speaker-related local dependencies and short-term emotional effects between neighbouring utterances.

w/o Summarization Graph If we only consider local dependencies from the Aggregation Graph and remove the global connections from the Summarization Graph, f1-score and accuracy degrade by 4.03% and 3.98%, respectively. This shows that the global connections via topic-related emotional phrases are of vital importance for emotion recognition.
w/o inference graphs Here, we remove both the Aggregation Graph and the Summarization Graph, and use the sequential context-aware utterance representation output by BiLSTM for emotion recognition. The experimental results show a significant drop of 10.18% and 8.60% in terms of f1-score and accuracy, which demonstrates the efficacy of modeling emotional influences via a two-stage graph inference network in a global-to-local fashion.

Visualization
Figure 3(a) shows the comparison of f1-scores between our SumAggGIN model and three other baselines on the six emotion labels of the IEMOCAP dataset. The bc-LSTM+Att model, which only takes sequentially encoded context information into consideration, attains inferior performance on all six labels compared to the other models, which underlines the importance of modeling speaker-related local dependencies between utterances in the conversation. Our SumAggGIN model surpasses AGHMN on four emotion labels (i.e. 'happy', 'sad', 'neutral' and 'excited') and achieves competitive results on 'angry' and 'frustrated'. We believe our improvement comes from the graph structures we designed for modeling utterance connections. For conversations with over 100 utterances in the IEMOCAP dataset, the RNN structure in AGHMN suffers from long-term information propagation issues resulting from the vanishing gradient problem. In contrast, our Summarization Graph and Aggregation Graph directly aggregate information from neighbouring utterances, which greatly shortens the path of information propagation and yields better results in modeling emotions. The comparison between our SumAggGIN and another graph-based method, DialogueGCN, shows that our model outperforms DialogueGCN on the positive emotion labels (i.e. 'happy' and 'excited') by a large margin, and attains comparable results on the other four labels. We surmise that the improvement comes from our proposed heterogeneous Summarization Graph, which effectively exploits the global semantic connections provided by topic-related emotional phrases.
We further investigate the effect of our Summarization Graph by examining the average number of overlapping phrases per utterance for each emotion label, as shown in Figure 3(b). For utterances labeled with negative emotions (i.e. 'sad', 'angry' and 'frustrated'), the average number of overlapping phrases per utterance is much higher than for utterances with positive emotions. We argue that global semantic connections are beneficial for capturing emotional patterns, but excessive connections from overlapping phrases introduce noise into the Summarization Graph and cause performance degradation.

Conclusion
In this work, we present a two-stage Summarization and Aggregation Graph Inference Network (SumAggGIN) for the CER task. Combining global topic-related emotional interactions from the heterogeneous Summarization Graph and short-term emotional effects from the Aggregation Graph, our model comprehensively captures emotional influences in a global-to-local fashion. Experimental results show that our proposed SumAggGIN outperforms state-of-the-art approaches on three CER benchmark datasets. In the future, we plan to apply our two-stage graph inference framework to other multi-turn dialogue tasks.