End-to-End Emotion-Cause Pair Extraction with Graph Convolutional Network

Emotion-cause pair extraction (ECPE), which aims at simultaneously extracting emotion-cause pairs that express emotions and their corresponding causes in a document, plays a vital role in understanding natural languages. Considering that most emotions have only a few causes mentioned in their contexts, we present a novel end-to-end Pair Graph Convolutional Network (PairGCN) to model pair-level contexts so as to capture the dependency information among local neighborhood candidate pairs. Moreover, in the graphical network, contexts are grouped into three types and each type of context is propagated in its own way. Experiments on a benchmark Chinese emotion-cause pair extraction corpus demonstrate the effectiveness of the proposed model.


Introduction
Emotion-cause pair extraction (ECPE), which was first proposed in Xia & Ding (2019), aims to extract emotion expressions and their corresponding causes in a document simultaneously. Different from emotion cause extraction (ECE) (Gui et al., 2016), which extracts the causes for given emotion expressions, ECPE is a much more challenging task.
There has been a surge of interest in developing neural models for either emotion cause extraction or emotion extraction, while ECPE, as a special causal relation extraction task, is newly proposed and remains largely unexplored. Previous research on ECPE (Xia and Ding, 2019) focused on designing pipeline systems in which emotion clauses and cause clauses are extracted separately, the two sets of clauses are paired to generate candidate emotion-cause pairs, and emotion-cause pairs are then selected from these candidates with a filter. Hence, prediction errors unavoidably accumulate through the pipeline. Therefore, in this work, we aim to design an end-to-end framework in which any two clauses in a document are paired (i.e., one is a candidate emotion clause, and the other is a candidate cause clause) to generate candidate emotion-cause pairs, from which emotion-cause pairs are then selected. E.g., in Fig. 1, there are 25 candidate pairs, and only $(c_4, c_2)$ and $(c_4, c_3)$ are emotion-cause pairs.

Furthermore, modelling contextual information for a candidate pair is also crucial for ECPE. In previous pipeline systems (Xia and Ding, 2019), two features were extracted for each clause in a document, for emotion extraction and emotion cause extraction respectively, and a candidate pair was then represented by the combination of these clause-level features. However, dependency among candidate pairs is not taken into account. In fact, an emotion usually has only a few cause clauses, which occur within a specific distance from the emotion expression. E.g., in the Chinese ECPE corpus provided by Xia & Ding (2019), ∼90% of emotions have one and only one cause clause, and ∼96% of cause clauses occur within a window size of 2 from their corresponding emotion clauses.
The emotion-cause co-occurrence property indicates that, in a local neighborhood, if one candidate pair has been detected as an emotion-cause pair, the other candidate pairs are usually non-emotion-cause pairs. Thus, modelling contextual information should consider pair-level dependency. Here, a local neighborhood refers to a set of candidate pairs whose candidate emotion clauses are the same and whose candidate cause clauses are not far away from each other.

In this paper, we propose a novel Pair Graph Convolutional Network (PairGCN), an end-to-end model for ECPE. We first construct a pair graph to model three types of dependency relations among the candidate pairs in a local neighborhood, where a node represents a candidate pair and an edge connecting two nodes represents a dependency relation between the corresponding two candidate pairs. Then, a Graph Convolutional Network (GCN) is designed to use the three types of edges to propagate contextual information in the pair graph.
In summary, our main contributions are as follows:
• We propose a PairGCN model that utilizes an end-to-end framework for ECPE.
• We design a Graph Convolutional Network to model three types of dependency relations among local neighborhood candidate pairs so as to facilitate the extraction of pair-level contextual information.
• Our model is evaluated on a benchmark Chinese emotion-cause pair extraction dataset for three tasks, i.e., emotion-cause pair extraction, emotion extraction, and emotion cause extraction. Experimental results demonstrate the effectiveness of our PairGCN model.

Related Works
In this section, we briefly summarise related research on emotion cause extraction (ECE), emotion-cause pair extraction (ECPE), and graph neural networks (GNNs).

Emotion Cause Extraction and Emotion-Cause Pair Extraction
The task of emotion cause extraction (ECE), which extracts the causes of given emotion keywords, has been intensively studied for years. Most of the previous works focused on contextual information extraction from the context of the given emotion keyword either with manual rules or with machine learning methods.  constructed an emotion cause corpus from Sinica Corpus and then built a rule-based system to extract linguistic features. Based on this corpus,  proposed a multi-label approach with linguistic patterns that can capture linguistic cues in contexts with manual rules. Other rule-based feature extraction methods (Neviarouskaya and Aono, 2013; Li and Xu, 2014; Gao et al., 2015a; Gao et al., 2015b; Yada et al., 2017; Yu et al., 2019) were also proposed to extract contextual features.
Other than rule-based methods, Gui et al. (2016) constructed a Chinese event-driven ECE corpus with SINA city news and proposed a convolution kernel-based multi-kernel Support Vector Machine (SVM) to extract contextual features from given syntactical trees. Afterward, deep learning attracted attention from the ECE research community. Gui et al. (2017) converted the ECE task to a Question Answering (QA) task and proposed a Convolutional Multiple-Slot Deep Memory Network to store relevant contextual information. Other neural models (Xu et al., 2019) were also proposed to extract contextual information. Besides, Chen et al. (2018) presented a neural network-based joint approach for emotion extraction and emotion cause extraction to capture mutual benefits across these two emotion analysis tasks.

Different from ECE, in which emotion keywords are provided before the extraction of their causes, the task of emotion-cause pair extraction (ECPE) was first proposed in Xia & Ding (2019), in which emotions and their corresponding causes are extracted at the same time. For this new task, they proposed a two-step approach, which first extracted emotion clauses and cause clauses individually using an interactive multi-task learning network consisting of two hierarchical BiLSTM networks; each emotion clause was then paired with each cause clause, and these candidate pairs were filtered by a logistic regression model. Overall, in the previous works on ECE and ECPE, the modelling of contextual information does not consider dependency relations among local neighborhood candidate pairs.

Graph Neural Networks
The Graph Convolutional Network (GCN), which operates directly on a graph, was first proposed in Kipf & Welling (2017) for node classification. After that, Graph Neural Networks have been widely applied to various NLP tasks, such as relation extraction, aspect-level sentiment analysis, and text classification.  used GCNs to capture long-range relations among dependency trees and further applied a novel pruning strategy to the input trees. Sun et al. (2019) proposed a GCN for aspect-level sentiment analysis, which propagated both contextual and dependency information from opinion words to aspect words. In addition, Yao et al. (2019) built a text graph based on word co-occurrence and document-word relations and then learned a Text Graph Convolutional Network for text classification. Ghosal et al. (2019) used two layers of GCNs to capture speaker information for emotion recognition in conversations. In this paper, we attempt to use GCNs to model dependency relations in a local neighborhood so as to capture pair-level contextual information for ECPE.

Methodology
In this section, we briefly introduce the definition of ECPE. Then, we describe our PairGCN which models two types of contexts for ECPE: sequential clause context and pair-level context. The former refers to the clause sequence in a given document that provides sequential information for each clause, and the latter refers to the candidate pairs in a local neighborhood which gives dependency information for each candidate pair. Accordingly, as illustrated in Fig. 2, there are two encoders in our PairGCN: clause-level context encoder and pair-level context encoder. The clause-level context encoder uses two hierarchical BiLSTM networks to model the sequential clause context and then extracts an emotion feature and a cause feature for a clause respectively. The pair-level context encoder uses a Pair Graph Convolutional Network to model the pair-level context and then extracts a contextual feature for a candidate pair. Finally, classification assigns a label to a candidate pair according to its feature representation.

Task Definition
Given a document $D = \{c_1, c_2, \ldots, c_L\}$, the clauses are formed into a set of candidate emotion-cause pairs $P$ using the Cartesian product:

$$P = \{c^e_1, \ldots, c^e_L\} \times \{c^c_1, \ldots, c^c_L\} = \{(c^e_i, c^c_j) \mid 1 \le i, j \le L\} \qquad (1)$$

where $c^e_i$ is clause $c_i$ serving as a candidate emotion clause, $c^c_j$ is clause $c_j$ serving as a candidate cause clause, and there are $L \times L$ candidate pairs in total in $P$. Given a candidate pair $c^p_{i,j} = (c^e_i, c^c_j)$, ECPE assigns a binary label, where "1" means that clause $c^e_i$ expresses an emotion and clause $c^c_j$ provides the cause of this emotion, and "0" indicates that such an emotion-cause relation does not exist in the candidate pair. E.g., in Fig. 1, $(c_4, c_2)$ is an emotion-cause pair (with label "1") and $(c_4, c_5)$ is a non-emotion-cause pair (with label "0").

Figure 2: The overview of our Pair Graph Convolutional Network. In a pair graph, solid lines are D1 edges, dashed lines are D2 edges, and SL edges are eliminated for simplification. In addition, blue nodes represent emotion-cause pairs, and both white nodes and gray nodes are non-emotion-cause pairs.
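As a minimal sketch of the candidate-pair construction described above, the following Python snippet enumerates all pairs for a five-clause document. The function name `candidate_pairs` and the use of 1-based clause indices in place of clause text are illustrative assumptions, not part of the original model.

```python
from itertools import product


def candidate_pairs(num_clauses):
    """Return all (emotion_index, cause_index) pairs, using 1-based clause indices."""
    indices = range(1, num_clauses + 1)
    # Cartesian product of the clause set with itself: L x L candidates.
    return list(product(indices, indices))


pairs = candidate_pairs(5)
# Gold emotion-cause pairs of the Fig. 1 example, as stated in the text.
gold = {(4, 2), (4, 3)}
# Binary label per candidate: 1 for an emotion-cause pair, 0 otherwise.
labels = {pair: int(pair in gold) for pair in pairs}
print(len(pairs), sum(labels.values()))  # 25 2
```

Only 2 of the 25 candidates are positive here, which illustrates on a small scale why the candidate set is heavily skewed towards non-emotion-cause pairs.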

Clause-level Context Encoder
For clause $c_t$, we employ a clause-level context encoder to extract two features based on its sequential clause context: the clause-level emotion feature $v^e_t$ when $c_t$ serves as a candidate emotion clause, and the clause-level cause feature $v^c_t$ when $c_t$ serves as a candidate cause clause. In order to extract the emotion features and the cause features respectively, the clause-level context encoder consists of two hierarchical BiLSTM networks (i.e., the cause encoder and the emotion encoder in Fig. 2).

To further capture contextual information for a clause from the perspective of the whole document, we feed either the word-level feature $u^e$ or $u^c$ to a clause-level BiLSTM. Moreover, although emotions can be identified on their own without their causes, identifying whether an event is the cause of an emotion can be much more difficult if the relevant emotion information is not provided. Thus, our clause-level BiLSTM uses different inputs to extract emotion features (see Eq. 3) and cause features (see Eq. 4) respectively, where $[\cdot, \cdot]$ is the concatenation function, $\mathrm{BiLSTM}^e_c$ is a clause-level BiLSTM that extracts an emotion feature $v^e_t \in \mathbb{R}^{2d_h}$, and $\mathrm{BiLSTM}^c_c$ is another clause-level BiLSTM that extracts a cause feature $v^c_t \in \mathbb{R}^{2d_h}$.

Pair-level Context Encoder
We propose a pair-level context encoder to extract contextual information that can capture dependency among local neighborhood candidate pairs. We construct a pair graph (e.g., Pair Graph 3 and 4 in Fig. 2) to model the candidate pairs in a local neighborhood. Then, we design a feature transformation process (i.e., GCN in Fig. 2) to transform clause-level contextual features into pair-level contextual features.

Pair Graph Construction
Nodes: Given the set of candidate emotion-cause pairs $P$, each candidate pair is considered as a node. Moreover, a candidate pair $c^p_{i,j} = (c^e_i, c^c_j)$ is represented as $v^p_{i,j}$, which concatenates the emotion feature $v^e_i$ and the cause feature $v^c_j$ output from the clause-level context encoder:

$$v^p_{i,j} = [v^e_i, v^c_j] \qquad (5)$$

Edges: Because the candidate pairs in a local neighborhood have the same candidate emotion clause, we build one pair graph per candidate emotion clause. In the case of a document with $L$ clauses, there are $L$ pair graphs in total. E.g., in Fig. 2, there are 5 pair graphs for a document with 5 clauses. Furthermore, a cause clause is likely to appear at an offset of 1 or 2 from its emotion clause. E.g., as illustrated in Table 1, 95.8% of cause clauses are mentioned within a window size of 2 from their corresponding emotion clauses. Therefore, when building a pair graph for a candidate emotion clause $c^e_i$, the nodes in its corresponding pair graph are:

$$\{c^p_{i,j} \mid |i - j| \le 2,\ 1 \le j \le L\}$$

Considering that a node has different influences on its neighboring nodes, three types of edges, namely SL, D1, and D2 edges, are used in a pair graph:
(1) SL edge: This is a self-loop edge for the self-transformation of a node.
(2) D1 edge: This is an edge connecting two nodes which have a distance of 1 between their candidate cause clauses (e.g., the edges between $c^p_{i,i}$ and $c^p_{i,i\pm1}$).
(3) D2 edge: This is an edge connecting two nodes which have a distance of 2 between their candidate cause clauses (e.g., the edges between $c^p_{i,i}$ and $c^p_{i,i\pm2}$).

The incorporation of these edges into a pair graph forms the three types of dependency relations among the candidate pairs in a local neighborhood and allows contextual information to be transmitted through these edges, which in turn facilitates the extraction of the pair-level contextual features.
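The construction above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function name `build_pair_graph`, the `window` parameter, and the representation of each undirected edge as two directed (source, target) tuples are all hypothetical choices.

```python
def build_pair_graph(i, num_clauses, window=2):
    """Build the pair graph for candidate emotion clause i (1-based index).

    Nodes are (i, j) pairs whose candidate cause clause j lies within
    `window` clauses of i. Edges are grouped by type: SL (self-loop),
    D1 and D2 (cause clauses 1 or 2 positions apart). Each undirected
    edge is stored as two directed (source, target) tuples.
    """
    nodes = [(i, j) for j in range(1, num_clauses + 1) if abs(i - j) <= window]
    edges = {"SL": [], "D1": [], "D2": []}
    for a in nodes:
        edges["SL"].append((a, a))
        for b in nodes:
            gap = abs(a[1] - b[1])  # distance between the two cause clauses
            if gap == 1:
                edges["D1"].append((a, b))
            elif gap == 2:
                edges["D2"].append((a, b))
    return nodes, edges


# Pair Graph 3 of a five-clause document.
nodes, edges = build_pair_graph(3, 5)
print(nodes)  # [(3, 1), (3, 2), (3, 3), (3, 4), (3, 5)]
print(len(edges["SL"]), len(edges["D1"]), len(edges["D2"]))  # 5 8 6
```

For an emotion clause near the document boundary the graph is smaller: `build_pair_graph(1, 5)` keeps only the nodes (1, 1), (1, 2), and (1, 3).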

Feature Transformation
Inspired by Ghosal et al. (2019), we use two layers of GCN (i.e., two transformations) to capture the contextual information for a node in a pair graph.
For node $c^p_{i,j}$, the first transformation is applied to obtain its representation $g^1_{i,j}$ using the features output from the clause-level context encoder. Specifically, the features of the nodes in the pair graph are aggregated with different transformation parameters according to the types of the edges that link them to $c^p_{i,j}$. Here, $W^1_{D1} \in \mathbb{R}^{d_{in} \times d_{out}}$, $W^1_{D2} \in \mathbb{R}^{d_{in} \times d_{out}}$, and $W^1_{SL} \in \mathbb{R}^{d_{in} \times d_{out}}$ are weight matrices for the nodes linked to node $c^p_{i,j}$ with D1 edges, D2 edges, and SL edges respectively. In addition, $z$ is a normalization factor equal to the node degree, and $\sigma$ is a non-linear activation function; ReLU (Nair and Hinton, 2010) is used in this paper.
After that, the second transformation is applied to obtain the representation $g^2_{i,j}$ for node $c^p_{i,j}$ using the features output from the first transformation, where $W^2_{D1} \in \mathbb{R}^{d_{out} \times d_{out}}$, $W^2_{D2} \in \mathbb{R}^{d_{out} \times d_{out}}$, and $W^2_{SL} \in \mathbb{R}^{d_{out} \times d_{out}}$ are weight matrices for the normalized nodes linked to node $c^p_{i,j}$. Compared to the feature transformation process used in Ghosal et al. (2019), we distinguish contextual information propagation through D1 edges from that through D2 edges and use different weight matrices to deal with the two propagations separately. Moreover, using the two transformations plus D2 edges, contextual information can be propagated between the two nodes with the greatest distance in a pair graph. E.g., in Pair Graph 3 in Fig. 2, the information on $c^p_{3,1}$ and $c^p_{3,5}$ can be transmitted to each other through the two transformations, which use two D2 edges (i.e., $c^p_{3,1} \leftrightarrow c^p_{3,3} \leftrightarrow c^p_{3,5}$).
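The reachability claim can be checked with a small Python sketch that ignores the learned weights and tracks only which nodes can influence which after a given number of message-passing rounds. The function `receptive_field` and the encoding of nodes as positions 0 to 4 along the cause-clause axis are assumptions made for illustration.

```python
def receptive_field(num_nodes, edge_gaps, num_layers):
    """For each node position, return the set of positions whose features can
    reach it after `num_layers` rounds of message passing.

    Nodes are positions 0..num_nodes-1 along the cause-clause axis; two nodes
    are linked when their gap is in `edge_gaps` (0 = SL, 1 = D1, 2 = D2).
    """
    reach = [{n} for n in range(num_nodes)]
    for _ in range(num_layers):
        new_reach = []
        for n in range(num_nodes):
            gathered = set()
            for m in range(num_nodes):
                if abs(n - m) in edge_gaps:
                    # Node n receives everything node m has gathered so far.
                    gathered |= reach[m]
            new_reach.append(gathered)
        reach = new_reach
    return reach


full = receptive_field(5, {0, 1, 2}, num_layers=2)   # SL + D1 + D2 edges
no_d2 = receptive_field(5, {0, 1}, num_layers=2)     # SL + D1 edges only
print(4 in full[0])   # True
print(4 in no_d2[0])  # False
```

With all three edge types, two rounds are enough for the first and last nodes of a five-node pair graph to exchange information; with D1 edges alone, two rounds only cover a distance of 2.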

Classification
Emotion-Cause Pair Extraction: Since the two clauses in an emotion-cause pair are likely to appear within a specific distance of each other (see Table 1), distance information needs to be taken into consideration during classification (Xia and Ding, 2019). Thus, for a candidate pair $c^p_{i,j}$, its representation for classification is the concatenation of $g^2_{i,j}$ and $d_{i,j}$, where $d_{i,j} \in \mathbb{R}^{d_{dis}}$ is a distance embedding. Then, a softmax function is applied to a linear transformation of this representation, where $W_p \in \mathbb{R}^{(d_{out} + d_{dis}) \times d_p}$ is a weight matrix and $b_p \in \mathbb{R}^{d_p}$ is a bias vector. Finally, we obtain the predicted probability distribution $\hat{p}_{i,j}$ and the corresponding predicted label $\widehat{EC}_{i,j}$ for the candidate pair $c^p_{i,j}$. During model training, we use the cross-entropy loss as the loss function.

Emotion Extraction and Emotion Cause Extraction: After obtaining the ECPE predictions for all candidate pairs, we can extract emotion clauses and cause clauses from them. Specifically, for emotion extraction, the predicted label $\hat{E}_i$ for clause $c_i$ is obtained from the pair-level predictions $\widehat{EC}_{i,j}$ in which $c_i$ serves as the candidate emotion clause. Similarly, for emotion cause extraction, the predicted label $\hat{C}_j$ for clause $c_j$ can be obtained.
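One natural way to derive the clause-level labels from the pair-level predictions is sketched below. It assumes that a clause counts as an emotion (or cause) clause as soon as it appears on the emotion (or cause) side of at least one predicted pair; this rule and the function name `clause_labels_from_pairs` are illustrative assumptions.

```python
def clause_labels_from_pairs(pair_labels, num_clauses):
    """Derive per-clause emotion and cause labels from predicted pair labels.

    `pair_labels` maps (emotion_index, cause_index) to 0 or 1, with 1-based
    clause indices. Assumed rule: a clause is an emotion (cause) clause if it
    is the emotion (cause) side of at least one pair predicted as 1.
    """
    emotion = [0] * num_clauses
    cause = [0] * num_clauses
    for (i, j), label in pair_labels.items():
        if label == 1:
            emotion[i - 1] = 1
            cause[j - 1] = 1
    return emotion, cause


# Predictions for three candidates of the Fig. 1 example; pairs not listed
# are treated as predicted 0.
predicted = {(4, 2): 1, (4, 3): 1, (4, 5): 0}
emotion, cause = clause_labels_from_pairs(predicted, 5)
print(emotion)  # [0, 0, 0, 1, 0]
print(cause)    # [0, 1, 1, 0, 0]
```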

Datasets and Metrics
We evaluate the performance of our model on a Chinese ECPE corpus released by Xia & Ding (2019), which was constructed from a benchmark Chinese ECE corpus (Gui et al., 2016). In the Chinese ECPE corpus, there are 1,945 documents and 490,367 candidate pairs in total, including 2,167 emotion-cause pairs and 488,200 non-emotion-cause pairs. In other words, less than 1% of the candidate pairs in this corpus are emotion-cause pairs.
Similar to previous work (Xia and Ding, 2019), we evaluate our model on three tasks: emotion-cause pair extraction, emotion extraction, and emotion cause extraction. To obtain statistically credible results, we use their data-split setting, repeat the experiments 10 times, and then report the average results of precision (P ), recall (R), and F 1 -score (F 1 ) to evaluate the performances of our model. Moreover, for each experiment, we set aside 10% of training documents as development set.
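For concreteness, pair-level precision, recall, and $F_1$ can be computed as in the following sketch. It assumes exact matching between predicted and gold (emotion, cause) index pairs; the function name `pair_prf` and the toy inputs are hypothetical, and the sketch does not reproduce the data split or the averaging over 10 runs.

```python
def pair_prf(predicted_pairs, gold_pairs):
    """Precision, recall and F1 over sets of (emotion_index, cause_index) pairs.

    A predicted pair counts as correct only if it exactly matches a gold pair.
    """
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# One correct pair out of two predictions, against two gold pairs.
p, r, f1 = pair_prf({(4, 2), (4, 5)}, {(4, 2), (4, 3)})
print(p, r, f1)  # 0.5 0.5 0.5
```

The same function applies to emotion extraction and emotion cause extraction if the sets contain clause indices instead of index pairs.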

Experimental Settings
In our experiments, we follow the experimental settings in Xia & Ding (2019), using the same word embeddings pre-trained on corpora from Chinese Weibo with Word2Vec (Mikolov et al., 2013). Moreover, BERT representations (Devlin et al., 2019) are also utilized, for which we use the base Chinese model. When extracting BERT embeddings, the basic input unit is a clause. Besides, the dimension of distance embeddings is 50, and other parameters of our models are listed in Table 2. Finally, the learnable parameters (including all weight matrices and bias vectors) are randomly initialized from a uniform distribution $U(-0.01, 0.01)$.

During training, we use the Adam optimizer (Kingma and Ba, 2015) to update all parameters. Each training batch contains 32 documents, and the learning rate is set to 0.005. To reduce over-fitting, dropout (Srivastava et al., 2014) is applied to all feature vectors, including word embeddings and hidden representations, with the dropout rate set to 0.5.

Model Comparison
In order to evaluate the performance of our model, we make a comparison with the following three pipeline systems (1), (2), and (3), and one end-to-end system (4).
(1) Indep: This is an interactive multi-task learning pipeline system, which extracts emotion clauses and cause clauses independently using two hierarchical BiLSTMs. Then, the two sets of clauses are paired with each other and emotion-cause pairs are extracted using a filter (Xia and Ding, 2019).
(2) Inter-CE: This is an enhanced version of Indep, which is capable of capturing the correlation between emotions and causes. When extracting emotion clauses and cause clauses, emotion cause extraction is used to improve emotion extraction (Xia and Ding, 2019).
(3) Inter-EC: This is another enhanced version of Indep, which uses emotion extraction to improve emotion cause extraction when extracting emotion clauses and cause clauses (Xia and Ding, 2019).
(4) Hier-BiLSTM: This is an end-to-end model, which extracts emotion features and cause features independently using two hierarchical BiLSTMs, and the concatenation of an emotion feature and a cause feature is used to represent a candidate pair. Specifically, the hierarchical BiLSTM is similar to the one used in our clause-level context encoder, except that the input to the clause-level BiLSTM in the cause encoder is only the word-level cause feature $u^c_t$ (see Eq. 4).

Results
We first compare our PairGCN model with the three pipeline systems. As shown in Table 3, PairGCN outperforms all pipeline systems on ECPE. E.g., compared to the best pipeline system (i.e., Inter-EC), the $F_1$ score of PairGCN rises from 61.28% to 63.21%. Specifically, this performance gain mainly comes from the improvement in the precision score. Furthermore, PairGCN achieves lower recall and yet higher precision than Inter-EC on both emotion extraction and emotion cause extraction. This indicates that although fewer correct emotions and causes are detected by our PairGCN model, they are more likely to be matched with each other, which leads to a significant increase in the precision score of ECPE. Moreover, in terms of emotion extraction, though PairGCN is not trained with only emotion labels, it still shows competitive performance with an $F_1$ score of 78.29%, compared to Inter-CE with the highest performance of 83.0%.

Table 3: Experimental results of different models. "EC Pair Extraction" denotes the ECPE task. * denotes that experimental results are cited from Xia & Ding (2019).
Compared to the end-to-end baseline Hier-BiLSTM, our PairGCN model improves clearly on all three tasks with both types of embeddings. As shown in Table 3, compared to Hier-BiLSTM (or Hier-BiLSTM-BERT), the $F_1$ scores of PairGCN (or PairGCN-BERT) on emotion extraction and emotion cause extraction rise by ∼3%, and as a result, the $F_1$ score on ECPE increases by ∼3%. This performance gain mainly comes from the significant improvement in recall scores on the three tasks. This means that PairGCN is capable of detecting more emotion-cause pairs with the help of the contextual features extracted by the pair-level context encoder.

Ablation Study
To further explore the effects of the three types of edges in our full model (i.e., PairGCN and PairGCN-BERT), we perform an ablation study and show the results in Table 4.
First of all, we investigate the impact of the pair-level context encoder by removing GCN from our full model (i.e., PairGCN w/o GCN and PairGCN-BERT w/o GCN). In other words, only the features extracted by the clause-level context encoder (see Eq. 5) are fed to classification. As we can see from Table 4, compared to our full model, their $F_1$ scores on ECPE drop significantly (∼2%) with both types of embeddings. The decrease in performance indicates that the GCN-based feature transformation process in the pair-level context encoder can effectively augment the effects of the features extracted by the clause-level context encoder on ECPE. This is also reflected by the improved performance of the full model on the other two tasks, i.e., emotion extraction and emotion cause extraction.
Secondly, from Table 4, we observe that after removing one type of edge (i.e., either D1 or D2) from our full model, the overall performance on ECPE degrades. E.g., for models removing D1 (i.e., PairGCN w/o D1 and PairGCN-BERT w/o D1), the $F_1$ score drops by 0.9% and 0.7% with Word2Vec embeddings and BERT embeddings respectively. For models removing D2 (i.e., PairGCN w/o D2 and PairGCN-BERT w/o D2), the $F_1$ score drops by ∼1.4% with both types of embeddings. This indicates that the contextual information propagated either through D1 edges or through D2 edges in the pair-level context encoder is very useful for ECPE. Furthermore, compared to models removing D1, models removing D2 perform worse. Compared to D1 edges, D2 edges allow contextual information to propagate more directly because of their greater distance and their own weight matrices (see Section 3.3.2), and therefore the pair-level contextual information is effectively captured for ECPE.
Finally, compared to models removing D2 (i.e., PairGCN w/o D2 and PairGCN-BERT w/o D2), the performance of models removing both D1 and D2 (i.e., PairGCN w/o D1&D2 and PairGCN-BERT w/o D1&D2) decreases slightly. This also confirms that it is necessary to distinguish D1 edges and D2 edges in a pair graph because of their different ways of propagating contextual information. Although information propagation through D1 edges and information propagation through D2 edges are related, they are not interchangeable. E.g., in Pair Graph 3 in Fig. 2, information propagation between $c^p_{3,1}$ and $c^p_{3,3}$ can be made through either of two paths: the combination of two D1 edges (i.e., $c^p_{3,1} \leftrightarrow c^p_{3,2} \leftrightarrow c^p_{3,3}$), and a D2 edge (i.e., $c^p_{3,1} \leftrightarrow c^p_{3,3}$). During propagation, the first path brings more information because it passes through more nodes (e.g., $c^p_{3,2}$), while the second path is more direct.

Conclusion and Future Work
In this paper, we propose a novel end-to-end Pair Graph Convolutional Network (PairGCN) to extract pair-level contextual features for emotion-cause pair extraction. Experimental results indicate the capability of our PairGCN in capturing dependency among local neighborhood candidate pairs. In the future, we would like to tackle the problem of imbalanced data by reducing non-emotion-cause pairs.