Emotion-Cause Pair Extraction as Sequence Labeling Based on a Novel Tagging Scheme

The task of emotion-cause pair extraction deals with finding all emotions and the corresponding causes in unannotated emotion texts. Most recent studies are based on the likelihood of the Cartesian product among all clause candidates, resulting in a high computational cost. Targeting this issue, we regard the task as a sequence labeling problem and propose a novel tagging scheme that codes the distance between linked components into the tags, so that emotions and the corresponding causes can be extracted simultaneously. Accordingly, an end-to-end model is presented to process the input texts from left to right, always with linear time complexity, leading to a speed-up. Experimental results show that our proposed model achieves the best performance, outperforming the state-of-the-art method by 2.26% (p < 0.001) in F1-measure.


Introduction
Emotion-cause pair extraction (ECPE) aims to extract all potential pairs of emotions and the corresponding causes from unannotated emotion texts, such as (c3, c1) and (c3, c2) in: Ex.1 A policeman visited the old man with the lost money, (c1) | and told him that the thief was caught. (c2) | The old man was very happy. (c3) This pair extraction task closely relates to the traditional emotion cause extraction task, which aims at identifying the causes for a given emotion expression. Many works related to emotion cause extraction (Gui et al., 2017; Li et al., 2018, 2019; Fan et al., 2019) have been published recently, and all of them are evaluated on the dataset released by Gui et al. (2016). However, this setting suffers from the requirement that emotions must be annotated before extracting the causes, which is labor intensive and limits the applications in real-world scenarios. Towards this issue, Xia and Ding (2019) present a new task, namely emotion-cause pair extraction, to extract emotions and the corresponding causes together. In comparison, it is a more challenging task due to the inherent ambiguity and subtlety of emotions, especially when no annotation information is provided before extraction. Following this task setting, they propose a two-step approach to solve this task. However, limited by the inherent drawback of the pipelined framework, errors may propagate from the first step to the second. Recent studies (Song et al., 2020; Tang et al., 2020) have focused on solving this task using a multi-task learning framework (Caruana, 1993) with well-designed attention mechanisms (Bahdanau et al., 2015), but they extract emotion-cause pairs by computing a pair matrix based on the likelihood of the Cartesian product among all clauses in the text, which makes the computational cost expensive, i.e., the time complexity is O(n^2).
* Equal Contributions.
† Corresponding author.
In this paper, we define joint emotion-cause pair extraction as a sequence labeling problem (Eger et al., 2017; Zheng et al., 2017), so that the emotion-cause structure can be integrated into a unified framework, including representation learning, pair extraction, and causality reasoning. The challenge is to also encode emotion causality into the tagging scheme, i.e., the traditional BIO tagging is not suitable for this task. Targeting this problem, we design a novel tagging scheme with multiple labels that contain the information of causes and the emotions triggered by these causes, realized by coding the distance between linked components into the tags. Accordingly, an end-to-end model based on this tagging scheme is presented to process the input sequences from left to right, consequently reducing the number of potential pairs that need to be parsed and leading to a speed-up.
Specifically, BERT (Devlin et al., 2019) is trained with the objectives of masked language modeling and next-sentence prediction; we therefore base our model on BERT to generate powerful, general-purpose linguistic representations for each clause. LSTMs (Hochreiter and Schmidhuber, 1997) are then applied to capture long-range dependencies among different clauses.
To summarize, our contribution includes: • We frame the ECPE task as a sequence labeling problem and propose an end-to-end model based on a novel tagging scheme with multiple labels, so that the emotion-cause structure can be extracted simultaneously.
• The proposed model incrementally processes the input sequences from left to right, always with linear time complexity.
• Performance evaluation shows the superiority and robustness of the proposed model compared to a number of competitive baselines.

A Tagging Problem
We define X = (x_1, x_2, . . . , x_n) as an ordered clause sequence for an emotion text. The text contains one or more emotions, and at least one cause corresponds to each of these emotions. The goal is to output all potential pairs in which emotion causality exists. Due to the difficulty of describing the emotion/cause at the word or phrase level (Chen et al., 2010), in this paper "emotion" and "cause" refer to "emotion clause" and "cause clause", respectively. This research investigates such a problem by sequentially tagging each clause x ∈ X with a two-part tag (b, d), where b ∈ {C, O} and d ∈ {−(n−1), . . . , −1, 0, 1, . . . , n−1, ⊥}. Tag "C" represents the "Cause" tag, which means the current clause is a cause clause, while tag "O" represents the "Other" tag, indicating the current clause is irrelevant to the final result. Moreover, d encodes the distance between the cause and the triggered emotion it relates to, e.g., "-1" denotes that the previous clause is the related emotion, while "1" stands for the subsequent clause. The special symbol ⊥ indicates that a particular slot is not filled, e.g., a non-cause clause (b = O) has no related emotion, thus it always associates with the symbol ⊥. For example, we could incrementally label the text in Ex.1 by such a sequence. However, the total number of tags in Y is N_t = 2 * (n − 1) + 1 + 1, which relies on the size of text X, resulting in inconsistency during the training stage. Empirically, for emotion events, causes generally occur at positions very close to the emotions. As shown in Figure 1, in the dataset released by Gui et al. (2016), about 55% of all emotion-cause structures have distance "1", that is, the emotion appears right behind the cause it attaches to. Overall, around 95% of all emotion-cause distances lie in {-2, -1, 0, 1, 2}. Thus, we let a hyperparameter l denote the left and right boundary that limits the scope of the emotion corresponding to the current cause, i.e., d ∈ {−l, . . . , −1, 0, 1, . . . , l, ⊥}.
Then, we have in total N_t = 2 * l + 1 + 1 tags, which stays consistent during the training stage.
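As a concrete illustration, the scheme above can be sketched in a few lines of Python. This is a minimal sketch of our own: the tag string format "C|d" and the helper names are illustrative assumptions, not released code, but the tag count and the left-to-right decoding match the description above.

```python
# Illustrative sketch of the tagging scheme (assumed "C|d" string format).
# Each clause receives either "O" or "C|d", where d is the signed distance
# from the cause clause to the emotion clause it triggers, capped by l.

def make_tagset(l):
    """All N_t = 2*l + 1 + 1 tags for a scope boundary l."""
    return ["O"] + [f"C|{d}" for d in range(-l, l + 1)]

def encode(pairs, n, l):
    """Turn gold (emotion_idx, cause_idx) pairs into a tag sequence.

    Pairs whose emotion-cause distance exceeds l cannot be encoded and
    are skipped (empirically ~95% of distances lie within 2)."""
    tags = ["O"] * n
    for emo, cause in pairs:
        d = emo - cause
        if abs(d) <= l:
            tags[cause] = f"C|{d}"
    return tags

def decode(tags):
    """Recover (emotion_idx, cause_idx) pairs in one left-to-right pass,
    i.e., linear in the number of clauses."""
    pairs = []
    for i, tag in enumerate(tags):
        if tag.startswith("C|"):
            d = int(tag.split("|")[1])
            pairs.append((i + d, i))
    return pairs
```

Note that decoding never enumerates clause pairs: each clause contributes exactly one decision, which is the source of the linear time complexity claimed above.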

The End-to-End Model
In this section, we describe the details of our model based on the novel tagging scheme.
BERT Encoder Given an emotion text X = (x_1, x_2, . . . , x_n) consisting of n clauses, each clause x_i = (w_i1, w_i2, . . . , w_ik) contains k words. We formulate each clause as a sequence x̃_i = ([CLS], w_i1, . . . , w_ik, [SEP]), where [CLS] is a special token whose final hidden state is used as the aggregate sequence feature and [SEP] is a dummy token not used for this task. We first obtain the hidden representation as h_i = BERT(x̃_i) ∈ R^{d_h × |x̃_i|}, where d_h is the size of the hidden dimension and |x̃_i| is the length of the sequence x̃_i. Then the text X can be represented as H_X = [h_1, h_2, . . . , h_n], which is fed into the LSTM encoder.
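The per-clause input construction can be sketched as follows. This is a simplified illustration: the real model uses a BERT tokenizer, whereas here we split on whitespace purely to show the ([CLS], w_1, . . . , w_k, [SEP]) wrapping; the function names are our own.

```python
# Sketch of clause wrapping before BERT encoding (whitespace split is a
# stand-in for the real subword tokenizer).

def format_clause(clause):
    """Build the input sequence x~_i = ([CLS], w_1, ..., w_k, [SEP])."""
    return ["[CLS]"] + clause.split() + ["[SEP]"]

def format_text(clauses):
    """Wrap every clause of a text independently; the [CLS] position of
    each encoded clause later serves as its aggregate feature h_i."""
    return [format_clause(c) for c in clauses]
```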
LSTM Encoder Based on the representation H_X, an LSTM layer is applied to model the context-dependent information among different clauses. To summarize the information from both directions, we use a bidirectional LSTM with two parallel passes, which yields:

→h_i = LSTM(h_i, →h_{i−1}),
←h_i = LSTM(h_i, ←h_{i+1}),

where i ∈ [1, n]; both →h_i and ←h_i lie in R^{d_r} (so the stacked states lie in R^{d_r × n}), and d_r is the hidden size of the LSTMs. The two directional hidden states are concatenated as the final clause representation ĥ_i = [→h_i; ←h_i].

Training The representation ĥ_i serves as the final feature for tag prediction, and the model is trained by minimizing the cross entropy. Specifically,

p_i^{(j)} = softmax(FFN(ĥ_i^{(j)})),
J(θ) = − Σ_{j=1}^{|D|} Σ_{i=1}^{n} y_i^{(j)} log p_i^{(j)} + λ ||θ||_2^2,

where FFN is a feed-forward neural network with randomly initialized parameters, |D| is the size of the training set, n is the length of text X_j, y_i^{(j)} is the label of clause i in text X_j, and p_i^{(j)} is the normalized predictive probability over our special tags. Besides, θ denotes all the parameters in this model and λ is the coefficient of the L2-norm regularization.
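The training objective can be made concrete with a small NumPy sketch. This is our own reconstruction under the assumptions stated in the text (softmax over FFN outputs, cross entropy over all clauses, plus an L2 penalty on the parameters); it is not the authors' implementation, and for simplicity it averages the negative log-likelihood over clauses.

```python
import numpy as np

# Sketch of the loss: cross entropy of the gold tags under softmax(FFN
# outputs), plus lambda * ||theta||_2^2 regularization.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss(logits, labels, params, lam):
    """Cross entropy + L2 penalty.

    logits: (num_clauses, N_t) FFN outputs for every clause in a batch
    labels: (num_clauses,) gold tag indices
    params: flat vector of model parameters theta
    lam:    coefficient of the L2 term
    """
    p = softmax(logits)
    nll = -np.log(p[np.arange(len(labels)), labels]).mean()
    return nll + lam * np.sum(params ** 2)
```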

Experimental Setting
Dataset The dataset released by Gui et al. (2016), the only one available for this task, is used to evaluate our proposed model. We also pre-process the whole dataset following Xia and Ding (2019). In detail, there are 1746 samples with one emotion-cause pair, 177 samples with two pairs, and 22 samples with more than two pairs. Besides, the quartile information about the number of clauses per sample is shown in Table 1. Moreover, we stochastically divide the corpus into training/development/test sets with a ratio of 8:1:1 and evaluate our method 20 times with different data splits to obtain statistically credible results.
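A minimal sketch of the stochastic 8:1:1 split described above (the seed handling and helper name are our own; the paper repeats this with 20 different splits):

```python
import random

# Shuffle sample indices with a given seed, then slice 80/10/10.

def split_corpus(samples, seed):
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_train = int(0.8 * len(samples))
    n_dev = int(0.1 * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    dev = [samples[i] for i in idx[n_train:n_train + n_dev]]
    test = [samples[i] for i in idx[n_train + n_dev:]]
    return train, dev, test
```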
Evaluation We adopt standard Precision (P), Recall (R), and F-measure (F1) to measure the performance and report the average results over 20 runs. Note that when we extract emotion-cause pairs, we obtain the emotions and causes for each text simultaneously; thus, we also evaluate the performance of emotion extraction and cause extraction. Hyperparameters Our proposed model is trained using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 3e-5. We set the batch size to 4, the coefficient of the L2 term to 1e-2, and the hidden size of all LSTMs and the FFN to 256. We regularize our network using dropout with rate 0.5 and adopt Chinese BERT as the basis in this work 1 . We perform grid search over the emotion boundary l ∈ {1, 2, 3, 4, 5, 6}. The model is trained for 10 epochs in total, and the model with the highest F1-measure on the development set is used to evaluate the test set.
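Pair-level scoring can be sketched as follows. We assume the standard convention for this task that a predicted emotion-cause pair counts as correct only when both clause indices exactly match a gold pair; the function name is ours.

```python
# Exact-match precision / recall / F1 over sets of (emotion, cause) pairs.

def prf1(pred_pairs, gold_pairs):
    pred, gold = set(pred_pairs), set(gold_pairs)
    tp = len(pred & gold)  # true positives: exact index matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

The same routine applies to emotion extraction and cause extraction by scoring the sets of emotion indices and cause indices separately.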
Baselines In this paper, we compare our model with the following methods.
• Indep: Emotion extraction and cause extraction are trained independently; the outputs are then paired, and pairs without emotion causality are eliminated. Inter-CE: The predictions of cause extraction are used to improve emotion extraction. Inter-EC: The predictions of emotion extraction are used to improve cause extraction. All of them are pipelined frameworks proposed by Xia and Ding (2019).
• E2EECPE (Song et al., 2020): An end-to-end multi-task learning framework for constructing pair matrix through biaffine attention.
• LAE-MANN (Tang et al., 2020): The current state-of-the-art method, using a multi-level attention mechanism based on an LSTM or a BERT encoder, denoted as LML and LMB here, respectively.

Main Analysis
The experimental results are shown in Table 2. Indep yields the lowest performance because it ignores the fact that emotions and causes are usually mutually indicative. Inter-CE and Inter-EC benefit from this interaction information and thus achieve better performance. Meanwhile, the joint models (E2EECPE, LML, LMB) perform better than the pipelined methods by reducing error propagation. Specifically, LML outperforms E2EECPE by capturing the mutual interdependence between emotions and causes with a multi-level attention mechanism, and LMB further improves the performance based on BERT embeddings. Our model performs worse on emotion extraction, because it inherently learns to detect causes first and then identifies emotions through the distance tags. Overall, our model achieves better performance than the previous methods, significantly improving cause extraction by 5.15% and emotion-cause pair extraction by 2.26% in F1-measure with p < 0.001. The reason may be that our model always processes the texts with linear time complexity, instead of relying on the Cartesian product whose time complexity is O(n^2), thereby greatly reducing the search space.

Emotion Scope Limitation Analysis
Intuitively, the larger the allowed emotion scope, the more situations are covered. In this section, we evaluate the effect of the distance limitation l on this task, varying l from 1 to 6. Results are shown in Table 3. As the emotion scope increases, the performance improves. However, when the scope exceeds 3, the performance decreases, because a larger emotion scope also enlarges the search space. As such, we choose l = 3 in our final model since it gives the best performance in our experiments.

Runtime Analysis
Theoretically, our proposed model always labels the input texts from left to right with linear time complexity, while all the previous end-to-end models have O(n^2) time complexity. Nevertheless, we perform a further experiment to confirm this superiority empirically. For simplicity and efficiency, we only conduct the runtime analysis between LMB and ours, since LMB is also based on BERT and is the current state-of-the-art method. Figure 2 shows the running time consumed by the models per epoch on different data folds. The results suggest that our model is 36% and 44% faster than LMB in the training and inference stages, respectively, indicating the efficiency of the proposed method.
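The asymptotic gap can be stated as a back-of-the-envelope count (an illustrative sketch, not a measured benchmark): matrix-based methods score every (emotion clause, cause clause) combination, while the tagging model makes one decision per clause.

```python
# Search-space sizes for a text of n clauses.

def num_matrix_candidates(n):
    """Pair-matrix methods score all n * n clause combinations: O(n^2)."""
    return n * n

def num_tagging_decisions(n):
    """Sequence labeling emits one tag per clause: O(n)."""
    return n
```

For a 20-clause text this is 400 candidate scores versus 20 tag decisions, and the gap widens with text length.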

Conclusion
In this paper, we cast emotion-cause pair extraction as a sequence labeling problem and propose an end-to-end model based on a novel tagging scheme with multiple labels. The proposed model is capable of integrating the emotion-cause structure into a unified framework, so that emotions and their related causes can be extracted simultaneously.
Moreover, the proposed model parses the input texts in order from left to right, greatly reducing the search space and leading to a speed-up. Experimental results demonstrate the effectiveness and robustness of the proposed method on a benchmark dataset. In the future, we will explore extensions of this approach to achieve full coverage.