Transition-based Directed Graph Construction for Emotion-Cause Pair Extraction

Emotion-cause pair extraction aims to extract all potential pairs of emotions and corresponding causes from unannotated emotion text. Most existing methods adopt a pipelined framework, which identifies emotions and extracts causes separately, leading to error propagation. Towards this issue, we propose a transition-based model that transforms the task into a procedure of parsing-like directed graph construction. The proposed model incrementally generates the directed graph with labeled edges through a sequence of actions, from which we can recognize emotions and the corresponding causes simultaneously, thereby optimizing the separate subtasks jointly and maximizing their mutual benefits. Experimental results show that our approach achieves the best performance, outperforming the state-of-the-art methods by 6.71% (p<0.01) in F1 measure.


Introduction
Emotion-cause pair extraction (ECPE) is a new task to identify emotions and the corresponding causes from unannotated emotion text. This involves several subtasks: 1) extracting pair components from the input text, i.e., emotion detection and cause detection; and 2) combining the elements of the two sets into candidate emotion-cause pairs and eliminating the pairs that do not hold a causal relationship. For the former subtask, a clause can be categorized as an "emotion", which usually contains an emotion keyword expressing a specific sentiment polarity, or a "cause", which contains the reason for or stimulus of an observed emotion. The set of all possible emotion-cause pairs is then fed into the second subtask to determine the relationship. In general, this is an essential issue in emotion analysis since it provides a new perspective to investigate how emotions are provoked, expressed, and perceived. Figure 1 shows an example of ECPE, where the text is segmented into three clauses. In this instance, only the second clause and the third clause hold an emotion causality: "I lost my phone while shopping" is the cause of the emotion "I feel sad now". Thus, the extracted result for this sample should be {I lost my phone while shopping, I feel sad now}. The goal of ECPE is to identify all the pairs that hold emotion causality in an emotion text.
However, from both theoretical and computational perspectives, due to the inherent ambiguity and subtlety of emotions, it is hard for machines to build a mechanism for understanding emotion causality the way human beings do. Previous approaches mostly focused on detecting causes for a given emotion annotation, a setting followed by most of the recent studies in this field (Gui et al., 2014; Gao et al., 2015; Gui et al., 2016, 2017; Li et al., 2018; Fan et al., 2019). Nevertheless, this setting requires that emotions be annotated before the causes are extracted, which limits its applications in real-world scenarios. Towards this issue,  presented a new task to extract emotion-cause pairs from unannotated text. However, they followed a pipelined framework, which models emotions and causes separately rather than with joint decoding; hence, error propagation may occur. Ideally, the emotion-cause structure should be handled within an integral framework, including representation learning, emotion-cause extraction, and reasoning.
To this end, we transform the ECPE problem into a procedure of directed graph construction, from which emotions and the corresponding causes can be extracted simultaneously based on the labeled edges. The directed graph is constructed by a novel transition-based parsing model, which incrementally creates the labeled edges according to the causal relationship between the connected nodes, through a sequence of defined actions. In this process, emotion detection, cause detection, and their causality association can be learned through joint decoding, without differentiating subtask structures, so that the maximum potential of information interaction between emotions and causes can be exploited. Besides, the proposed model processes the input sequence in a psycholinguistically motivated left-to-right order, consequently reducing the number of potential pairs to be parsed and leading to a speed-up (if all clauses are connected by Cartesian products, the time complexity is O(n^2)).
Regarding feature representation, BERT (Devlin et al., 2019) is used to produce the deep and contextualized representation for each clause, and LSTMs (Hochreiter and Schmidhuber, 1997) are performed to capture long-term dependencies among input sequences. In addition, action history and relative distance information between the emotion-cause pairs are also encoded to benefit the task.
To summarize, our contributions include: • Learning with a transition-based framework, so that end-to-end emotion-cause pair extraction can be easily transformed into a parsing-like directed graph construction task.
• With the proposed joint learning framework, our model can extract emotions with the corresponding causes simultaneously, often with linear time complexity.
• Performance evaluation shows that our model achieves statistically significant improvements over the state-of-the-art methods on all the tasks.

Task Definition
The formal definition of emotion-cause pair extraction is given in . Briefly, given a piece of emotion text d_1^n = (c_1, c_2, . . . , c_n), which consists of several manually segmented clauses, the goal of ECPE is to output all potential pairs that hold emotion causality: P = {· · · , (c^e, c^c), · · · }, where c^e is an emotion clause and c^c is the corresponding cause clause. Note that the previous emotion cause extraction (ECE) task aims to extract c^c given the annotation of c^e: {c^c → c^e}. In comparison, ECPE is a new and challenging task since no annotation is provided in the emotion text. Similar to the traditional ECE task, ECPE is also defined at the clause level, because it is difficult to describe emotion causes at the word or phrase level. That is, in this paper, "emotion" and "cause" refer to "emotion clause" and "cause clause", respectively. (The code and dataset are available at: https://github.com/HLT-HITSZ/TransECPE.)
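As a concrete illustration, the input/output contract of ECPE can be sketched in a few lines of Python. The data structures here are our own illustrative choice (not a format prescribed by the task), and the first clause text is a hypothetical filler since Figure 1 only specifies the cause and emotion clauses:

```python
# ECPE input: a list of manually segmented clauses (after the Figure 1 example).
clauses = [
    "...",                            # c1 (content not specified in the example)
    "I lost my phone while shopping", # c2: the cause clause
    "I feel sad now",                 # c3: the emotion clause
]

# ECPE output: the set of (emotion_index, cause_index) pairs that hold
# emotion causality; indices are 1-based clause positions.
gold_pairs = {(3, 2)}  # emotion c3 is caused by c2

# The earlier ECE task instead assumes the emotion clause is annotated
# and only asks for its causes.
def ece(annotated_emotion_idx, pairs):
    """Causes for a given (annotated) emotion clause."""
    return {c for (e, c) in pairs if e == annotated_emotion_idx}
```

Under this representation, ECE reduces to a lookup (`ece(3, gold_pairs)` yields `{2}`), whereas ECPE must also discover which clauses are emotions in the first place.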

Our Approach
We present a new framework aimed at integrating the emotion-cause pair extraction into a procedure of parsing-like directed graph construction. The proposed framework incrementally constructs and labels the graph from input sequences, scoring partially segmented results using rich non-local features. Figure 2 shows the overall architecture of the proposed framework. In the following, we first introduce how to construct the directed graph based on a novel transition-based system, then the details of feature representation will be described.

Directed Graph Construction
Let G = (V, R) be an edge-labeled directed graph, where V = {1, 2, . . . , n} is the set of nodes that correspond to clauses in the input text and R is the set of labeled edges. We denote a connection between a head node i ∈ V and a modifier node j ∈ V as i --l--> j, where l ∈ {l_t, l_n} is the causality label connecting them: l_t indicates that node i is the cause of the emotion node j, while l_n indicates that node j is an emotion but node i is not the corresponding cause. Nodes irrelevant to the final result have no edges. Note that, in this task, a node can be an emotion and the corresponding cause simultaneously. Furthermore, an emotion node can be associated with multiple causes. Thus, the acyclicity and single-head constraints are not necessary for our model, and arbitrary graphs are allowed.
We build the directed graph by designing a novel transition-based parser. Formally, each state of our parser is represented by a tuple S = (σ, β, E, C, R), where σ and β are disjoint lists called the stack and the buffer, which store the indices of nodes that have been processed and that remain to be processed, respectively. E is the set of emotions, and C is the set of causes. R stores the edges generated so far. Besides, the action history is stored in a list A.
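The state tuple S = (σ, β, E, C, R) and its initialization can be sketched as a small Python structure (the field names are ours; the paper does not prescribe an implementation):

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class ParserState:
    stack: List[int] = field(default_factory=list)    # σ: processed clause indices
    buffer: List[int] = field(default_factory=list)   # β: clause indices to process
    emotions: Set[int] = field(default_factory=set)   # E: detected emotion clauses
    causes: Set[int] = field(default_factory=set)     # C: detected cause clauses
    # R: labeled edges as (head, modifier, label) triples
    edges: Set[Tuple[int, int, str]] = field(default_factory=set)

def initial_state(n: int) -> ParserState:
    """Initial state ([ ], [1, 2, ..., n], ∅, ∅, ∅) for an n-clause text."""
    return ParserState(buffer=list(range(1, n + 1)))
```

A parse then consists of mutating this state with actions until the buffer is empty, at which point E, C, and R hold the extracted emotions, causes, and labeled edges.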
The definition of the action set plays a crucial role in a transition-based system, and it depends on the type of task. As shown in Table 1, we define 6 types of actions based on our empirical observations; their logics are summarized as follows: • SHIFT (SH). Pops β_0 from the buffer and pushes it onto the top of σ. It is legal only when β is not empty.

• RIGHT-ARC_lt (RA_lt). Assigns an edge from σ_1 to σ_0 with label l_t: σ_1 --lt--> σ_0, then copies σ_0 to E and pops σ_1 from σ into C.
• LEFT-ARC_ln (LA_ln). Denotes a relation from σ_0 to σ_1: σ_1 <--ln-- σ_0, and copies σ_1 to E. Note that we move β_0 to the top of σ rather than popping σ_0, to improve coverage, because σ_0 may be the cause of incoming nodes in β.
• CYCLE-ARC (CA). Assigns a loop edge on node σ_0 with label l_t, then copies σ_0 to both E and C.
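The action semantics above can be sketched in pure Python. This covers only the actions described in the text (the remaining arc actions are analogous), and the exact stack bookkeeping, e.g. for LEFT-ARC_ln, is our reading of the description:

```python
def shift(stack, buffer, emotions, causes, edges):
    """SHIFT: pop β0 and push it onto σ (legal only if the buffer is non-empty)."""
    stack.append(buffer.pop(0))

def right_arc_lt(stack, buffer, emotions, causes, edges):
    """RIGHT-ARC_lt: σ1 --lt--> σ0; σ0 is an emotion caused by σ1."""
    s0, s1 = stack[-1], stack[-2]
    edges.add((s1, s0, "lt"))
    emotions.add(s0)           # copy σ0 to E
    causes.add(stack.pop(-2))  # pop σ1 from σ into C

def left_arc_ln(stack, buffer, emotions, causes, edges):
    """LEFT-ARC_ln: σ1 <--ln-- σ0; σ1 is an emotion, σ0 is not its cause.
    σ0 stays on the stack (it may be the cause of an incoming clause);
    instead, β0 is moved to the top of σ."""
    s0, s1 = stack[-1], stack[-2]
    edges.add((s0, s1, "ln"))
    emotions.add(s1)
    if buffer:
        stack.append(buffer.pop(0))

def cycle_arc(stack, buffer, emotions, causes, edges):
    """CYCLE-ARC: loop edge on σ0 with label lt; σ0 causes its own emotion."""
    s0 = stack[-1]
    edges.add((s0, s0, "lt"))
    emotions.add(s0)
    causes.add(s0)
```

Replaying SH, SH, SH, RA_lt on a three-clause text, for instance, yields the single edge (2, 3, "lt"), i.e. clause 2 is extracted as the cause of emotion clause 3.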
Action Constraints. To ensure that each parser state is valid, we specify constraints on the actions. For example, RIGHT-* and LEFT-* can only be conducted when there are at least two elements in σ. We also empirically set a constraint that RIGHT-ARC_ln is performed when σ_1 and σ_0 are both emotions but hold no emotion causality. Additionally, in practice, CYCLE-ARC may conflict with other actions, e.g., σ_0 is the cause of itself but is also the cause of σ_1, which conflicts with LEFT-ARC_lt. For simplicity and efficiency, we separate it from the other actions and handle it by training a binary classifier that depends only on the representation of σ_0. Table 2 illustrates the gold-standard sequence of transitions for the text in Figure 1.
Search Algorithm. For the ECPE task, we transform the problem into a procedure of directed graph construction via a sequence of actions. The input is an emotion text d_1^n = (c_1, c_2, . . . , c_n) and the output is the corresponding sequence of actions A_1^m = (a_1, a_2, . . . , a_m). Hence, the task can be regarded as searching for an optimal action sequence A* given the stream of clauses d_1^n: A* = argmax_A p(A | d_1^n). Formally, at step t, our model predicts the next action based on the current system state S_t and the action history A_1^{t-1}. Thus, the task is modeled as p(A | d_1^n) = ∏_t p(a_t | S_t, A_1^{t-1}), where a_t is the generated action at step t, and S_{t+1} is the updated system state according to a_t.
Let r_t denote the representation for computing the probability of action a_t at step t; this yields p(a_t | S_t, A_1^{t-1}) = softmax_{a ∈ A(S_t)} (w_a^T r_t + b_a), where w_a denotes a learnable parameter vector and b_a is a bias term. The set A(S) represents the legal actions that can be taken given the current parser state. Finally, the overall optimization function maximizes the likelihood of the gold action sequences, so that ECPE is merged into a transition-based action prediction task. For efficient decoding, the maximum-probability action is chosen greedily until the parsing procedure terminates.

Neural Transition-based Model
We apply BERT to produce the representation for each clause and use LSTMs to capture long-term dependencies of each parser state.
Representation of Clause. Given an emotion text d_1^n = (c_1, c_2, . . . , c_n) consisting of n clauses, each clause c_i = (w_i1, w_i2, . . . , w_il) contains l words. We formulate each clause as a sequence x_i = ([CLS], w_i1, . . . , w_il, [SEP]), where [CLS] is a special classification token whose final hidden state is used as the aggregate sequence feature, and [SEP] is a dummy token not used in our model. Thus, we obtain the hidden representation h_i, where d_h is the size of the hidden dimension and |x_i| is the length of sequence x_i. The text d_1^n can then be represented as the sequence of clause representations.
Representation of Parser State. When parsing starts, the parser state is initialized to ([ ], [1, 2, . . . , n], ∅, ∅, ∅), and a series of actions consumes the clauses in the buffer to incrementally build an output until the terminal state ([. . . , $], [ ], E, C, R) is reached, as shown in Table 2.
Specifically, at step t, consider the triple (σ_t, β_t, A_t), where σ_t = (. . . , σ_1, σ_0), β_t = (β_0, β_1, . . .) and A_t = (. . . , a_{t-2}, a_{t-1}). For the stack, to summarize information from both directions, we use a bidirectional LSTM with two parallel passes; the feature representation of σ_t is the concatenation of the two passes, where d_l is the size of the LSTM hidden dimension and |σ_t| is the size of σ_t. Similarly, we obtain the representation for β_t with another bidirectional LSTM. For the action sequence, we map each action a to a distributed representation e_a through a look-up table E_a and apply a unidirectional LSTM to obtain the complete history of actions from left to right. Once a new action a_t is generated, its embedding e_{a_t} is added to the rightmost position of LSTM_a. To capture the positional relation between the pair (σ_1, σ_0), we also represent their relative distance d as an embedding e_d from a look-up table E_d. The final representation of the parser state at step t is the combination of these features.
Action Reversal. Let us revisit the example in Figure 1. Reading it from left to right, as shown at the top of Figure 3, the clause "I lost my phone while shopping" triggers the emotion "I feel sad now", so the predicted action would be RIGHT-ARC_lt. From a different perspective, reading it from right to left, as shown at the bottom of Figure 3, the cause "I lost my phone while shopping" appears after the emotion "I feel sad now", so the predicted action should be reversed to LEFT-ARC_lt. That is, the forward state representation →s_t and the backward state representation ←s_t should be regarded as different features producing different actions. Based on this observation, we apply r_t and r̂_t to predict the original action and the reversed action, respectively, which mines the deep directional information for this task, where ReLU is an activation function for nonlinearity. Indices 0 and 1 indicate the first and second representations of σ and β, and -1 indicates the last representation of the action history.
Training. Learning with the transition-based framework, we convert the gold output structure of the training data into a gold sequence of defined actions. For each parser state at step t, we maximize the log-likelihood of the classifier in formula (5), revised to cover the original action a_t, the reversed action â_t, and the CYCLE-ARC decision, where p(c_t | s_t^0) is the predictive distribution of CYCLE-ARC, which is separated from the other actions due to the action constraints. λ is the coefficient of the L_2-norm regularization, and θ denotes all the parameters of the model. Note that during test decoding, only r_t and s_t^0 are used to predict the next action.
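The per-step training objective can be written out as a small helper. This is our reading of the revised formula (5): the negative log-likelihoods of the gold action, its reversed counterpart, and the CYCLE-ARC classifier, plus an L_2 penalty; the parameter vector and λ here are placeholders:

```python
import math

def step_loss(p_gold, p_reversed, p_cycle, params, lam=1e-5):
    """-log p(a_t) - log p(â_t) - log p(c_t) + λ‖θ‖².

    p_gold, p_reversed, p_cycle: probabilities the three heads assign to
    the gold action, the reversed gold action, and the gold CYCLE-ARC
    decision at step t. params: flat list of model parameters (placeholder).
    """
    nll = -(math.log(p_gold) + math.log(p_reversed) + math.log(p_cycle))
    l2 = lam * sum(w * w for w in params)
    return nll + l2
```

As expected, the loss is zero when every head assigns probability 1 to its gold target (with zero weights), and it decreases monotonically as those probabilities grow.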
Dataset

The corpus is collected from SINA city news, and its details are summarized in Table 3.

Implementation Details
In this paper, we randomly divide the corpus into training/development/test sets with a ratio of 8:1:1. To obtain statistically credible results, we evaluate our method 20 times with different data splits following , and then perform a one-sample t-test on the experimental results. The average Precision (P), Recall (R) and F-measure (F1) are employed to measure performance. Note that when we extract the emotion-cause pairs, we obtain the emotions and causes for each text simultaneously; thus, we also evaluate the performance of emotion extraction and cause extraction in our model. We adopt BERT-Chinese as the basis of this work. The Adam optimizer is used (Kingma and Ba, 2015), and the initial learning rates for the BERT layer and the top MLP layer are set to 1e-5 and 1e-3, respectively. The hidden size of the MLP layer is set to 256, and the hidden size of all LSTMs is set to 128 with 1 layer. The position and action embeddings are initialized randomly with dimension 128 and kept unchanged during training. The dropout rate is 0.5, the batch size is 3, and the coefficient of the L_2 term is 1e-5. We train the model for 10 epochs in total and adopt an early stopping strategy based on the performance on the development set. The model with the highest F-measure on the development set is then used to evaluate the test set.
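For reference, the hyperparameters above can be collected into a single configuration; all values are copied from this section, and the key names are our own:

```python
# Hyperparameters reported in the Implementation Details section.
CONFIG = {
    "split_ratio": (8, 1, 1),   # train / development / test
    "n_runs": 20,               # evaluations with different data splits
    "bert_lr": 1e-5,            # learning rate of the BERT layer
    "mlp_lr": 1e-3,             # learning rate of the top MLP layer
    "mlp_hidden": 256,
    "lstm_hidden": 128,
    "lstm_layers": 1,
    "embedding_dim": 128,       # position and action embeddings (frozen)
    "dropout": 0.5,
    "batch_size": 3,
    "l2_coeff": 1e-5,
    "epochs": 10,
}
```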

Baselines
We first compare our transition-based model with the pipelined methods proposed by , i.e., Indep, Inter-CE and Inter-EC.

To compare with other joint models, we implement SL-BERT (Zheng et al., 2017) and MT-BERT (Caruana, 1993) for this task. The former jointly extracts entities and relations based on a novel tagging scheme with multiple labels, and the latter is a multi-task learning framework that shares hidden layers among all tasks. We implement both on top of BERT to be consistent with our experimental setting.
We also evaluate our model with the transition procedure removed, to reveal the effect of the transition-based algorithm, denoted as "-transition". Besides, for a fair comparison, we use LSTM instead of BERT as the basic clause encoder and keep the same experimental setting following , namely the LSTM-based variant. Table 4 shows the experimental results. With the transition-based algorithm, our proposed model achieves the best performance on all three tasks, outperforming a number of competitive baselines by at least 1.74%, 3.30% and 3.33% in F1 score, respectively. The improvements are significant with p < 0.01 in a one-sample t-test.

Main Analysis
Regarding the pipelined approaches, Indep treats the subtasks individually and ignores the fact that emotions and causes are usually mutually indicative, leading to the lowest performance. On the contrary, Inter-CE and Inter-EC yield better results by exploiting the relevance between emotions and causes. Comparing the two, we find that the improvement of Inter-EC on cause extraction is much larger than the improvement of Inter-CE on emotion extraction, so Inter-EC shows better results. Differently, our model jointly extracts emotion-cause pairs and shows consistent performance improvements over Inter-CE and Inter-EC, demonstrating the superiority of a one-stage model in reducing error propagation.
In comparison with other joint models, our proposed model significantly outperforms SL-BERT by 12.56%, 4.44% and 5.52% in F1 measure, respectively. We conjecture that SL-BERT jointly identifies emotion-cause pairs but still follows an emotion → cause pipelined decoding order. In contrast, we achieve fully joint decoding with interleaving actions for all three tasks, thereby achieving better information interaction. Besides, our model also yields better results than MT-BERT; one possible reason is that the interdependence between emotions and causes cannot be mined effectively through parameter sharing alone.
We also show the results where the BERT encoder is replaced by LSTM at the input. It can be seen that the results still outperform the existing methods by at least 3.06% in F1 score. Furthermore, when we remove the transition procedure, the performance drops heavily on all three tasks, with in particular a 7.87% decrease in F1 measure on the ECPE task. These results show that the improvement provided by the proposed transition system is more noticeable than that of the other components.

Table 5: Feature ablation experiments. The results are average scores over 20 runs, and the best scores are in bold.

Ablation Study
To further evaluate the contribution of the neural components, we conduct feature ablation experiments to study the effects of the different parts. As shown in Table 5, the F1 score decreases most heavily without LSTM (-4.40%), indicating that it is necessary to capture non-local dependencies among input clauses and that our model benefits from them effectively. Distance is also particularly relevant to the model, capturing the position information between emotions and causes; this is consistent with the intuition that the closer a clause is to the emotion, the more likely it is to be the cause. The results show that the history of actions stored in action has a crucial influence on predicting the next action. They also show that reversal, which can be regarded as a data augmentation strategy, is useful for exploring deep directional information. Without buffer, the F1 score drops 1.8% on the ECPE task, presumably because the buffer provides valuable information about the succeeding sequence.

Action Set Validation
To gain more insights into the parsing procedure, we analyze the situations in which the emotion-cause pairs in an emotion text cannot be extracted entirely by our defined actions, as illustrated in Figure 4. In both situations, our model can only extract one emotion-cause pair (i.e., RA_lt (1 --lt--> 3) and RA_lt (1 --lt--> 2), respectively), because the cause that belongs to the other emotion has been popped during the parsing procedure.
Based on this observation, one crucial question about the proposed model is how many emotion-cause configurations can be covered by the action set defined here. Although a formal theoretical proof is beyond the scope of this paper, Table 4 empirically verifies that the action set works well. Going one step further, to validate the actions, we feed the texts into our transition system to obtain the "pseudo-gold" emotion-cause pairs P̂ based on the annotations, which give the correct action to take for a given parse state. We then compare P̂ with the gold-standard emotion-cause pairs P to see how similar they are. On the whole dataset, we obtain an overall 98.5% F1 score between P̂ and P, which indicates that the upper bound of our transition system is 98.5% in F1 score. Thus, the defined action set is capable of extracting emotion-cause pairs through a sequence of actions.
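The coverage check is ordinary set arithmetic over pair sets: compare the pseudo-gold pairs P̂ (those reachable by the action set) against the gold pairs P with pair-level precision, recall, and F1. A minimal sketch:

```python
def pair_f1(pred_pairs, gold_pairs):
    """Micro precision, recall, and F1 over (emotion, cause) pair sets."""
    tp = len(pred_pairs & gold_pairs)             # exactly matching pairs
    p = tp / len(pred_pairs) if pred_pairs else 0.0
    r = tp / len(gold_pairs) if gold_pairs else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Running this with P̂ equal to the pairs reconstructed by the transition system and P equal to the annotations reproduces the upper-bound measurement described above.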

Error Analysis
We also perform an experiment to understand the impact of action reversal on performance. Figure 5 shows the confusion matrices comparing the predicted actions with the correct actions. The results show that SHIFT, LEFT-ARC_ln and RIGHT-ARC_ln yield higher accuracy in both Figure 5(a) and Figure 5(b), since they account for a large proportion of the total actions. As expected, our model makes more mistakes on RIGHT-ARC_lt and LEFT-ARC_lt, which play decisive roles in identifying the emotion-cause pairs. Especially for LEFT-ARC_lt, which makes up only about 0.43% of the total actions, it turns out to be the most difficult action to learn given the relatively small number of training samples. Thus, as shown in Figure 5(a), the accuracy for LEFT-ARC_lt is 0, which heavily degrades the overall performance. However, applying action reversal boosts the accuracy of LEFT-ARC_lt by 58.8% and further improves the overall performance. We conjecture that, with action reversal, the original RIGHT-* actions can be reversed to LEFT-* and vice versa, effectively doubling the training actions. The results in Figure 5 show that our proposed model captures this subtlety of emotions effectively by exploiting deep directional information through the action reversal strategy.

Related Work
Different from traditional emotion analysis, which aims to identify emotion categories in text, emotion cause extraction (ECE) reveals the essential information about what causes a certain emotion and why there is an emotional change. It is a more challenging task due to the inherent ambiguity and subtlety of emotion expressions.  first defined emotion cause extraction as a word-level extraction task. They manually constructed a dataset from the Academia Sinica Balanced Chinese Corpus and generalized a series of linguistic rules based on the dataset. Under this setting, some studies explored rule-based methods (Gao et al., 2015; Yada et al., 2017) and machine learning methods (Ghazi et al., 2015; Song and Meng, 2015).  converted the task from the word level to the clause level, since a clause may be the most appropriate unit for detecting causes, and extracted causes using six groups of manually constructed linguistic cues. Following this task setting, Gui et al. (2014) extended the rule-based features to 25 linguistic cues and trained SVM and CRF classifiers to detect causes. Gui et al. (2016) released a new Chinese emotion cause dataset collected from SINA city news and proposed a multi-kernel method to identify emotion causes. On this corpus,  proposed a learning-to-rerank method based on a series of emotion-dependent and emotion-independent features. Recently, inspired by the success of deep learning architectures, some studies have focused on identifying emotion causes with well-designed neural networks and attention mechanisms (Gui et al., 2017; Li et al., 2018, 2019; Fan et al., 2019). All of the above studies extract emotion causes relying on given emotion annotations, which limits their application in real-world scenarios due to the expensive annotations.
Targeting this problem,  proposed a novel task based on ECE, namely emotion-cause pair extraction (ECPE), which aims at extracting emotions and the corresponding causes from unannotated emotion text. However, they followed a pipelined framework that first detects emotions and causes with individual learning frameworks and then performs emotion-cause pairing to eliminate the unmatched pairs, leading to error propagation.
In this work, we design a novel transition-based model to extract emotions and causes simultaneously to maximize the mutual benefits of the subtasks, thus alleviating error propagation. Transition-based systems are usually designed to model chunk-level relations in a sentence for dependency parsing (Zhang and Nivre, 2011; Fernández-González and Gómez-Rodríguez, 2018). Apart from dependency parsing, transition-based methods have also achieved great success in other natural language processing tasks, such as word segmentation (Zhang et al., 2016), information extraction (Wang et al., 2018b), disfluency detection, and nested mention recognition (Wang et al., 2018a). To the best of our knowledge, this is the first work that extracts emotion-cause pairs in an end-to-end manner.

Conclusion
In this paper, we present a novel transition-based framework that casts emotion-cause pair extraction as a procedure of directed graph construction. Unlike previous pipelined approaches, the proposed framework incrementally outputs the emotion-cause pairs as a single task, so that the interdependence between emotions and causes can be exploited more effectively. Experimental results on a standard benchmark demonstrate the superiority and robustness of the proposed model compared to a number of competitive methods.
In the future, one possible direction is to build complete graphs whose nodes are the input clauses, to achieve full coverage. Besides, graph neural network-based methods (Kipf and Welling, 2016) are also worth investigating to model the relations among nodes for this task.