A Symmetric Local Search Network for Emotion-Cause Pair Extraction

Emotion-cause pair extraction (ECPE) is a new task which aims at extracting the potential pairs of emotion clauses and corresponding cause clauses in a document. To tackle this task, a previous study proposed a two-step method which first extracts emotion clauses and cause clauses individually, then pairs the emotion and cause clauses and filters out the pairs without causality. Unlike this method, which separates the detection and the matching of emotions and causes into two steps, we propose a Symmetric Local Search Network (SLSN) model that performs detection and matching simultaneously through local search. SLSN consists of two symmetric subnetworks, namely the emotion subnetwork and the cause subnetwork. Each subnetwork is composed of a clause representation learner and a local pair searcher. The local pair searcher is a specially designed cross-subnetwork component which can extract the local emotion-cause pairs. Experimental results on the ECPE corpus demonstrate the superiority of our SLSN over existing state-of-the-art methods.


Introduction
Emotion cause analysis is a research branch of sentiment analysis and has gained increasing popularity in recent years (Gui et al., 2016). Its goal is to identify the potential causes that lead to a certain emotion. This is very useful in fields such as electronic commerce, where sellers may be concerned about users' emotions towards products as well as the causes of those emotions.
Previous studies on emotion cause analysis mainly focus on the task of emotion cause extraction (ECE), which is usually formalized as a clause-level sequence labeling problem (Gui et al., 2016; Yu et al., 2019; Fan et al., 2019). Given an annotated emotion clause, the goal of the ECE task is to identify, for each clause in the document, whether that clause is the corresponding cause. However, in practice, the emotion clauses are naturally not annotated, which may limit the application of the ECE task in real-world scenarios. Motivated by this, the emotion-cause pair extraction (ECPE) task was recently proposed, which aims to extract all potential pairs of emotions and corresponding causes in a document. As shown in Figure 1, the example document has 17 clauses, among which the emotion clauses are c_4, c_13, and c_17 (marked as orange), and their corresponding cause clauses are c_3, c_12, and c_15 (marked as blue). The goal of the ECPE task is to extract all emotion-cause pairs: (c_4, c_3), (c_13, c_12), and (c_17, c_15).
The ECPE task is a new and more challenging task. To tackle it, a two-step method was proposed, which has been demonstrated to be effective. In the first step, emotion clauses and cause clauses are extracted individually. In the second step, the Cartesian product is used to pair the clauses, and a logistic regression model then filters out the emotion-cause pairs without causality. In this method, the detection of emotions and causes and the matching of emotions and causes are implemented separately in two steps.
Figure 1: An example document (excerpt). (c_1) But when Hans heard these, (c_2) he seemed very jealous. (c_3) When Mr. Song had a son, (c_4) Hans was also very happy. (c_5) Hans had taught him to speak English since the boy was young. (c_6) Hans also speaks Spanish and German, (c_7) and he often went downstairs to the community to teach children English. (c_8) During the Martial Arts Festival, (c_9) he also helped with a lot of translation work, (c_10) and was rated as an advanced worker. (c_11) After the meeting, (c_12) the city organized all participants to travel, (c_13) Hans was very excited. (c_14) But before getting on the bus, (c_15) the tour guide said he was too old to go.

However, when humans deal with the ECPE task, they usually consider the detection and matching problems at the same time. This is mainly achieved through a process of local search. For example, as shown in Figure 1, if a clause is detected as an emotion clause (e.g., c_4), humans will search for its corresponding cause clause (i.e., c_3) within its local context (i.e., c_2, c_3, c_4, c_5, c_6). The advantage of local search is that wrong pairs (e.g., (c_4, c_12)) beyond the local context scope can be avoided. Additionally, when locally searching for the cause clause corresponding to the target emotion clause, humans not only judge whether a clause is a cause clause, but also consider whether it matches the target emotion clause. In this way, they can avoid extracting pairs (e.g., (c_13, c_15)) that are within the local context scope but mismatched. Similarly, when a cause clause is encountered, the corresponding emotion clause can also be searched for within its local context scope.
Inspired by this local search process, we propose a Symmetric Local Search Network (SLSN) model. The model consists of two subnetworks with symmetric structures, namely the emotion subnetwork and the cause subnetwork. Each subnetwork consists of two parts: a clause representation learner and a local pair searcher (LPS). The clause representation learner is designed to learn the emotion or cause representation of a clause. The local pair searcher is designed to perform the local search of emotion-cause pairs. Specifically, the LPS introduces a local context window to limit the scope of context for local search. In the process of local search, the LPS first judges whether the target clause is an emotion (cause) clause, and then judges whether each clause within the local context window is the corresponding cause (emotion). Finally, SLSN outputs the local pair labels (i.e., the labels of the target clause and the clauses within its local context window) for each clause in the document, from which we can obtain the emotion-cause pairs.
The main contributions of this work can be summarized as follows: • We propose a symmetric local search network model, which is an end-to-end model and gives a new scheme to solve the ECPE task.
• We design a local pair searcher in SLSN, which allows simultaneously detecting and matching the emotions and causes.
• Experimental results on the ECPE corpus demonstrate the superiority of our SLSN over existing state-of-the-art methods.

Symmetric Local Search Network
In this section, we first present the task definition. Then, we introduce the SLSN model, followed by its technical details. Finally, we discuss the connection between the SLSN model and the previous two-step method. The emotion-cause pair extraction (ECPE) task is formalized as follows. Each document d in the dataset D consists of multiple clauses d = [c_1, c_2, ..., c_n]. A clause with emotional polarity (such as happiness, sadness, fear, anger, disgust, or surprise) is labeled as an emotion clause c^e. A clause that causes the emotion is called a cause clause c^c. The pair of an emotion clause and its corresponding cause clause is called an emotion-cause pair (c^e, c^c). The goal of the ECPE task is to extract the set of all emotion-cause pairs in d:

P = {..., (c^e, c^c), ...}.

Note that each document may contain several (at least one) emotion clauses, and each emotion clause may correspond to several (at least one) cause clauses. Besides, the emotion clause and its corresponding cause clause may be the same clause.
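The task setup above can be sketched as a minimal data structure. The example document and annotation below are hypothetical, not taken from the corpus:

```python
# Minimal sketch of the ECPE task setup: a document is a sequence of clauses,
# and the target is the set of (emotion, cause) clause-index pairs.
# The document and annotation here are illustrative only.

document = [
    "When Mr. Song had a son,",        # c_1
    "Hans was also very happy.",       # c_2 (emotion clause)
    "Hans had taught him English.",    # c_3
]

# Gold annotation (1-based indices, as in the paper's notation): emotion c_2
# is caused by c_1. An emotion clause may have several causes, and the
# emotion and cause may even be the same clause, i.e. a pair (i, i).
gold_pairs = {(2, 1)}

def is_correct(pred_pairs, gold_pairs):
    """Exact-match criterion: every pair must match in both components."""
    return pred_pairs == gold_pairs
```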

An Overview of SLSN
As shown in Figure 2, SLSN receives a sequence of clauses from a document as input and predicts the local pair labels for these clauses, which can be directly converted into the corresponding emotion-cause (E-C) pairs. For each clause c_i, SLSN predicts two types of local pair labels: the E-LC label ŷ^elc_i and the C-LE label ŷ^cle_i. The E-LC label ŷ^elc_i contains the emotion label (E-label) ŷ^e_i of the i-th clause and the local cause labels (LC-labels) (ŷ^lc_{i-1}, ŷ^lc_i, ŷ^lc_{i+1}) of the clauses near the i-th clause. Similarly, the C-LE label ŷ^cle_i contains the cause label (C-label) ŷ^c_i of the i-th clause and the local emotion labels (LE-labels) (ŷ^le_{i-1}, ŷ^le_i, ŷ^le_{i+1}) of the clauses near the i-th clause. Whether a clause is near the target clause is defined by the local context window, whose size is denoted as k (the case in Figure 2 is k = 1). That is, for a target clause, the scope of its local context includes the previous k clauses, itself, and the following k clauses. Note that both the E-LC label ŷ^elc_i and the C-LE label ŷ^cle_i can be converted into their corresponding emotion-cause (E-C) pairs. We denote the E-C pair set corresponding to ŷ^elc_i as P^elc, and the E-C pair set corresponding to ŷ^cle_i as P^cle. The final E-C pair set of our method is the union of P^elc and P^cle. Of course, P^elc, P^cle, or the intersection of P^elc and P^cle is also an option for the final pair set.
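The decoding of local pair labels into E-C pairs can be sketched as follows. The exact label layout is an assumption for illustration (0-based indices, binary flags):

```python
# Sketch: converting predicted local pair labels into emotion-cause pairs.
# For clause i, the E-LC label holds an E-label for i plus LC-labels for the
# 2k+1 clauses in i's local context window; pairs are read off directly.

def pairs_from_elc(e_labels, lc_labels, k):
    """e_labels[i] = 1 if clause i is predicted as an emotion clause.
    lc_labels[i] = list of 2k+1 cause flags for clauses i-k .. i+k."""
    n = len(e_labels)
    pairs = set()
    for i in range(n):
        if e_labels[i] != 1:
            continue
        for offset, flag in zip(range(-k, k + 1), lc_labels[i]):
            j = i + offset
            if flag == 1 and 0 <= j < n:
                pairs.add((i, j))  # (emotion index, cause index)
    return pairs

def pairs_from_cle(c_labels, le_labels, k):
    """Symmetric conversion in C-net: the roles are swapped."""
    return {(j, i) for (i, j) in pairs_from_elc(c_labels, le_labels, k)}

# Toy document with 4 clauses and k = 1: clause 1 is an emotion clause whose
# cause is the previous clause 0.
e = [0, 1, 0, 0]
lc = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]
p_elc = pairs_from_elc(e, lc, k=1)

c = [1, 0, 0, 0]
le = [[0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
p_cle = pairs_from_cle(c, le, k=1)

final_pairs = p_elc | p_cle  # SLSN-U takes the union of both pair sets
```

The intersection `p_elc & p_cle` corresponds to the SLSN-I variant discussed in the experiments.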

Components of SLSN
As shown in Figure 3, SLSN contains two subnetworks: the emotion subnetwork, referred to as E-net, which is mainly for the E-LC label prediction, and the cause subnetwork, referred to as C-net, which is mainly for the C-LE label prediction. E-net and C-net have similar structures in terms of word embedding, clause encoder, and hidden state learning. After the hidden state learning layer, E-net and C-net use two types of local pair searchers (LPS) with symmetric structures for the local pair label prediction.

Figure 3: Framework of the SLSN model.

The local pair searcher is a specially designed cross-subnetwork module, which uses the hidden states of the clauses in both subnetworks for prediction. In the following, we introduce the components of SLSN in technical detail.

Word Embedding
Before representing the clauses in the document, we first map each word in the clauses into a word embedding, which is a low-dimensional real-valued vector. Formally, given a sequence of clauses [c_1, c_2, ..., c_n], each clause c_i = [w^1_i, w^2_i, ..., w^{l_i}_i] consists of l_i words. We map each clause into its word-level representation v_i = [v^1_i, v^2_i, ..., v^{l_i}_i], where v^j_i is the word embedding of word w^j_i.

Clause Encoder
After word embedding, we use a Bi-LSTM layer followed by an attention layer as the clause encoder in both E-net and C-net to learn the representation of clauses. Formally, in E-net, given the word-level representation of the i-th clause v_i = [v^1_i, v^2_i, ..., v^{l_i}_i] as input, the word-level Bi-LSTM layer first maps it to the hidden states r_i = [r^1_i, r^2_i, ..., r^{l_i}_i]. Then, the attention layer maps each r_i to the emotion representation of the clause s^e_i by weighting each word in the clause and then aggregating them through the following equations:

u^j_i = tanh(W_w r^j_i + b_w),  (1)
a^j_i = exp((u^j_i)^T u_s) / Σ_j exp((u^j_i)^T u_s),  (2)
s^e_i = Σ_j a^j_i r^j_i,  (3)

where W_w, b_w, and u_s are the weight matrix, bias vector, and context vector respectively, and a^j_i is the attention weight of r^j_i. Similarly, in C-net, the cause representation of the i-th clause s^c_i is obtained using a clause encoder with a similar structure.
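The attention pooling step can be sketched in plain Python. The weights below are toy values, not learned parameters, and the dimensions are kept tiny for readability:

```python
import math

# Sketch of the attention layer that pools word hidden states r_i^j into a
# clause representation s_i: scores via tanh(W_w r_j + b_w) . u_s, softmax
# over words, then a weighted sum of the r_j.

def attention_pool(r, W_w, b_w, u_s):
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    scores = []
    for r_j in r:
        u_j = [math.tanh(h + b) for h, b in zip(matvec(W_w, r_j), b_w)]
        scores.append(sum(u * v for u, v in zip(u_j, u_s)))
    m = max(scores)                       # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    a = [e / sum(exps) for e in exps]     # attention weights over words
    dim = len(r[0])
    return [sum(a_j * r_j[d] for a_j, r_j in zip(a, r)) for d in range(dim)]

# Three word hidden states of dimension 2, identity-like toy weights.
r = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_w = [[1.0, 0.0], [0.0, 1.0]]
b_w = [0.0, 0.0]
u_s = [1.0, 1.0]
s = attention_pool(r, W_w, b_w, u_s)  # convex combination of the r_j
```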

Hidden State Learning
After the clause encoder, we use a hidden state learning layer to learn the contextualized representation of each clause in the document. Formally, in E-net, given a sequence of emotion representations [s^e_1, s^e_2, ..., s^e_n] as input, the clause-level Bi-LSTM layer maps it to a sequence of emotion hidden states [h^e_1, h^e_2, ..., h^e_n]. Similarly, in C-net, the sequence of cause hidden states [h^c_1, h^c_2, ..., h^c_n] is obtained from the sequence of cause representations.

Local Pair Searcher
After obtaining the two types of hidden states, we design two types of local pair searchers (LPS) with symmetric structures in E-net and C-net respectively, to predict the local pair labels of each clause.
In E-net, the LPS predicts the E-LC label for each clause, which contains the E-label and the LC-labels. For the E-label prediction, the LPS only uses the emotion hidden state of the clause. Formally, given the emotion hidden state h^e_i of the i-th clause, the LPS uses a softmax layer to predict its E-label ŷ^e_i through the following equation:

ŷ^e_i = softmax(W_e h^e_i + b_e),  (4)

where W_e and b_e are the weight matrix and bias vector respectively. For the LC-label prediction, there are two cases. If the predicted E-label of the i-th clause is false, the corresponding LC-label is a zero vector, because it is unnecessary to predict the LC-label. Otherwise, the LPS predicts the LC-label for all the clauses within the local context of the i-th clause. We denote these clauses as local context clauses. Assuming that the local context window size is k = 1 (the case in Figure 3), the local context clauses of the i-th clause are c_{i-1}, c_i, and c_{i+1}. For the LC-label prediction, both the emotion and cause hidden states of the clauses are used. Formally, given the emotion hidden state h^e_i of the i-th clause and the cause hidden states h^c_j of the corresponding local context clauses, the LPS first calculates an emotion attention ratio λ_j for each local context clause using the following formulas:

γ(h^e_i, h^c_j) = (h^e_i)^T h^c_j,  (5)
λ_j = exp(γ(h^e_i, h^c_j)) / Σ_{j'} exp(γ(h^e_i, h^c_{j'})),  (6)

where γ(h^e_i, h^c_j) is an emotion attention function which estimates the relevance between the local cause and the target emotion. We choose the simple dot attention based on the experimental results (Luong et al., 2015). The emotion attention ratio λ_j is then used to scale the original cause hidden states as follows:

q^lc_j = λ_j h^c_j,  (7)

where q^lc_j is the scaled cause hidden state of the j-th local context clause. The ⊗ used in Figure 3 refers to Eq. (5), Eq. (6), and Eq. (7).
We further use a local Bi-LSTM layer to learn the contextualized representation of each local context clause:

→o_j = →LSTM(q^lc_j),  (8)
←o_j = ←LSTM(q^lc_j),  (9)

Finally, the LC-label ŷ^lc_j of the j-th local context clause is predicted through the following equation:

ŷ^lc_j = softmax(W_lc o_j + b_lc),  (10)

where o_j is the concatenation of →o_j and ←o_j, and W_lc and b_lc are the weight matrix and bias vector respectively. Similarly, in C-net, an LPS with a structure symmetric to that in E-net is used to predict the C-LE label for each clause, which contains the C-label ŷ^c_i and the LE-labels ŷ^le_j.
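The scaling step in the LPS can be sketched as follows. The softmax normalization of the dot-attention scores is an assumption about the exact form of the emotion attention ratio:

```python
import math

# Sketch of the LPS scaling: dot attention between the target emotion hidden
# state h_e and each cause hidden state h_c_j in the local context window,
# softmax-normalized into ratios lambda_j, then q_j = lambda_j * h_c_j.

def scale_local_causes(h_e, h_c_window):
    scores = [sum(a * b for a, b in zip(h_e, h_c)) for h_c in h_c_window]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    lam = [e / sum(exps) for e in exps]   # emotion attention ratios
    q = [[l * x for x in h_c] for l, h_c in zip(lam, h_c_window)]
    return q, lam

h_e = [1.0, 0.0]                                  # target emotion hidden state
window = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # k = 1: clauses i-1, i, i+1
q, lam = scale_local_causes(h_e, window)
```

The cause hidden state most aligned with the target emotion receives the largest ratio, so mismatched clauses inside the window are suppressed before label prediction.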

Model Training
The SLSN model consists of two subnetworks, i.e., E-net and C-net. Given a sequence of clauses as input, E-net is mainly used to predict their E-LC labels, and C-net is mainly used to predict their C-LE labels. Thus, the loss of SLSN is a weighted sum of two components:

L = α L^elc + (1 − α) L^cle,  (11)

where α ∈ [0, 1] is a tradeoff parameter. Both L^elc and L^cle consist of two parts:

L^elc = β L^e + (1 − β) L^lc,  (12)
L^cle = β L^c + (1 − β) L^le,  (13)

where β ∈ (0, 1) is another tradeoff parameter. L^e, L^lc, L^c, and L^le are the cross-entropy losses of the predictions of the E-label ŷ^e_i, LC-label ŷ^lc_j, C-label ŷ^c_i, and LE-label ŷ^le_j respectively:

L^e = −Σ_{i=1}^{n} [η y^e_i log ŷ^e_i + (1 − y^e_i) log(1 − ŷ^e_i)],  (14)
L^lc = −(1/p_e) Σ_{i=1}^{n} I(y^e_i = 1) Σ_{j=i−k}^{i+k} [y^lc_j log ŷ^lc_j + (1 − y^lc_j) log(1 − ŷ^lc_j)],  (15)
L^c = −Σ_{i=1}^{n} [η y^c_i log ŷ^c_i + (1 − y^c_i) log(1 − ŷ^c_i)],  (16)
L^le = −(1/p_c) Σ_{i=1}^{n} I(y^c_i = 1) Σ_{j=i−k}^{i+k} [y^le_j log ŷ^le_j + (1 − y^le_j) log(1 − ŷ^le_j)],  (17)

where y^e_i, y^lc_j, y^c_i, and y^le_j denote the ground truth, I(·) is an indicator function, p_e and p_c denote the number of times that I(·) equals 1 in Eq. (15) and Eq. (17) respectively, and η is used to deal with the class imbalance problem.
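The weighted loss can be sketched numerically. Placing the positive-class weight η on the E- and C-label losses is an assumption; the text only states that η handles class imbalance:

```python
import math

# Sketch of the SLSN training loss: L = alpha*L_elc + (1-alpha)*L_cle with
# L_elc = beta*L_e + (1-beta)*L_lc, and symmetrically for C-net. Local
# losses are averaged only over clauses whose gold E/C-label is positive.

def bce(y, p, pos_weight=1.0):
    eps = 1e-12
    return -(pos_weight * y * math.log(p + eps)
             + (1 - y) * math.log(1 - p + eps))

def slsn_loss(y_e, p_e, y_lc, p_lc, y_c, p_c, y_le, p_le,
              alpha=0.6, beta=0.8, eta=5.0):
    L_e = sum(bce(y, p, eta) for y, p in zip(y_e, p_e))
    L_c = sum(bce(y, p, eta) for y, p in zip(y_c, p_c))
    pos_e = [i for i, y in enumerate(y_e) if y == 1]
    pos_c = [i for i, y in enumerate(y_c) if y == 1]
    L_lc = sum(bce(y, p) for i in pos_e
               for y, p in zip(y_lc[i], p_lc[i])) / max(len(pos_e), 1)
    L_le = sum(bce(y, p) for i in pos_c
               for y, p in zip(y_le[i], p_le[i])) / max(len(pos_c), 1)
    L_elc = beta * L_e + (1 - beta) * L_lc
    L_cle = beta * L_c + (1 - beta) * L_le
    return alpha * L_elc + (1 - alpha) * L_cle

# Nearly perfect predictions on a 2-clause toy document should give a loss
# close to zero.
loss = slsn_loss(
    y_e=[1, 0], p_e=[0.999, 0.001], y_lc=[[1], [0]], p_lc=[[0.999], [0.001]],
    y_c=[1, 0], p_c=[0.999, 0.001], y_le=[[1], [0]], p_le=[[0.999], [0.001]],
)
```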

Connection to the Two-Step Method
The two-step method for the ECPE task first extracted the emotion clause set and the cause clause set individually, then generated candidate emotion-cause pairs, and finally filtered out irrelevant pairs. In that approach, the Cartesian product was used to pair the two sets. In our model, by contrast, the local context window size k controls the scope of the local pair search. Consider the extreme case of k = n: the LPS then treats all clauses in the document as the local context of the target clause, which is actually equivalent to taking a Cartesian product and searching for emotion-cause pairs globally. Therefore, in the extreme case k = n, our method is approximately an end-to-end version of the two-step method. In practice, k usually takes a much smaller value than n, which reduces the complexity of the model while achieving better performance.
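The candidate-space argument can be made concrete: with window size k, each clause pairs with at most 2k + 1 clauses, i.e. O(nk) candidates, versus the n² ordered pairs of the Cartesian product, and the two sets coincide when k = n:

```python
# Candidate pair sets under the two schemes (0-based clause indices).

def local_candidates(n, k):
    """Pairs reachable by a local search with window size k."""
    return {(i, j) for i in range(n)
            for j in range(max(0, i - k), min(n, i + k + 1))}

def cartesian_candidates(n):
    """All ordered pairs, as produced by the two-step Cartesian product."""
    return {(i, j) for i in range(n) for j in range(n)}

n = 17  # document length of the Figure 1 example
assert local_candidates(n, k=n) == cartesian_candidates(n)
assert len(local_candidates(n, k=2)) < len(cartesian_candidates(n))
```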

Dataset and Experimental Settings
We evaluate our proposed model on a Chinese ECPE corpus, which was constructed based on the ECE corpus (Gui et al., 2016). In this paper, we adopt the same setting as previous ECPE work. The dataset is split into two parts: 90% for training and the remaining 10% for testing. The results reported in the following experiments are an average over 10-fold cross-validation. We use Precision (P), Recall (R), and F1-score to measure the performance.
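A sketch of the pair-level evaluation, under the usual exact-match convention that a predicted (emotion, cause) pair counts as correct only if it matches a ground-truth pair in both components:

```python
# Pair-level precision, recall, and F1 for emotion-cause pair extraction.

def pair_prf(pred, gold):
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

gold = {(4, 3), (13, 12), (17, 15)}   # gold pairs of the Figure 1 example
pred = {(4, 3), (13, 12), (13, 15)}   # two correct pairs, one spurious
p, r, f1 = pair_prf(pred, gold)
```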
We use the word embeddings provided by NLPCC, which were pre-trained on a 1.1-million-post Chinese Weibo corpus with the word2vec toolkit (Mikolov et al., 2013); the dimension of the word embeddings is 200. The number of hidden units of each Bi-LSTM in this paper is set to 100. The network is trained with the Adam optimizer, where the mini-batch size and the learning rate are set to 32 and 0.005 respectively. The tradeoff parameters α and β and the size of the local context window k are set to 0.6, 0.8, and 2 respectively. The parameter η is set to 5.

Comparison with Other Methods
We compare our model with three baseline methods: Indep, Inter-CE, and Inter-EC. All three methods are based on the two-step framework, and the differences among them mainly lie in the first step. Indep extracts emotions and causes individually, Inter-CE uses cause extraction to enhance emotion extraction, while Inter-EC uses emotion extraction to enhance cause extraction. For our model, we consider four variants of SLSN: SLSN-E, SLSN-C, SLSN-I, and SLSN-U, which correspond to the output of P^elc, the output of P^cle, the intersection of P^elc and P^cle, and the union of P^elc and P^cle, respectively. As shown in Table 1, for the target ECPE task, our methods (i.e., SLSN-E, SLSN-C, SLSN-I, and SLSN-U) all achieve better F1-scores than the two-step methods (i.e., Indep, Inter-CE, and Inter-EC), and SLSN-U achieves the best performance (0.6545 F1-score). Compared with the best baseline method Inter-EC, SLSN-E, SLSN-C, SLSN-I, and SLSN-U achieve improvements of 0.0401, 0.0337, 0.0326, and 0.0417 in F1-score, respectively. More specifically, by observing the results on emotion extraction and cause extraction, we find that compared with the baseline methods, our models perform better on cause extraction (an increase of about 0.02 in F1-score) and worse on emotion extraction (a decrease of about 0.02 in F1-score). This indicates that SLSN has no advantage in emotion extraction, and its good performance on the ECPE task may be owed to the effectiveness of cause extraction or emotion-cause pairing.

Effect of Components in LPS
In this section, we explore how the components in LPS affect the performance of SLSN. We mainly study two components of LPS, i.e., the attention function and the local encoder. We conduct the experiments by ablating them from the model or substituting them with other components. The results are shown in Table 2.
By comparing SLSN-U and case (a) in Table 2, we find that the variant using the attention function outperforms the one without it by about 0.0083 in F1-score. Similarly, by comparing SLSN-U and case (b), we find that the variant using the local encoder (Bi-LSTM) outperforms the one without it by about 0.091 in F1-score. This indicates that both the attention function and the local encoder are effective designs. For the local encoder, we also try substituting the Bi-LSTM with other kinds of transformation layers, such as an FC (i.e., fully-connected) layer, a CNN, and a Transformer (Vaswani et al., 2017). Among these layers, the Bi-LSTM achieves the best performance. The CNN and the Transformer achieve similar performance, only slightly worse than the Bi-LSTM (by about 0.01 in F1-score). The FC layer performs worst, bringing only a small improvement over the variant without an encoder (about 0.01 in F1-score). This indicates that the context-aware transformation layers (i.e., Bi-LSTM, CNN, and Transformer) are more suitable as the local encoder, and the Bi-LSTM is the best choice.

Effect of Local Context Window
In this section, we explore how the setting of local context window affects the performance of SLSN.
We study three settings of the local context window: only previous k (i.e., the target clause and its previous k clauses), only following k (i.e., the target clause and its following k clauses), and symmetric (i.e., the target clause, its previous k clauses, and its following k clauses). For each setting, we study its effect on the performance of the two basic models (i.e., SLSN-E and SLSN-C) and the best model (i.e., SLSN-U). We conduct the experiments by varying k from 0 to 6 in steps of 1. The results are shown in Figure 4. From Figure 4(a) and Figure 4(b), we find that the performance of SLSN-C is poor under the only-previous-k setting (about 0.3 F1-score), while the performance of SLSN-E is poor under the only-following-k setting (about 0.3 F1-score). This implies that cause clauses tend to appear before their corresponding emotion clauses. From Figure 4(c), we find that both SLSN-C and SLSN-E maintain good performance (about 0.63 F1-score) under the symmetric setting, which indicates that the symmetric setting is more robust than the other two. By observing the first three subfigures in Figure 4, we find that SLSN-U maintains a stable performance (about 0.65 F1-score) no matter which setting of the local context window is adopted. This means that SLSN-U is more robust to the setting of the local context window than SLSN-C and SLSN-E. From Figure 4(d), we find that SLSN-U achieves its best performance under the symmetric setting, which indicates that the symmetric setting is the best choice. In addition, by observing the window size k, we find that all three models achieve relatively stable performance as k varies, except for the case k = 0.
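The three window settings compared above can be sketched as an index helper, which also makes the degenerate case k = 0 explicit:

```python
# Clause indices searched for a target clause i (0-based) in a document of
# length n, under the three local-context-window settings.

def window(i, n, k, setting="symmetric"):
    if setting == "previous":      # only previous k: i-k .. i
        lo, hi = i - k, i
    elif setting == "following":   # only following k: i .. i+k
        lo, hi = i, i + k
    else:                          # symmetric: i-k .. i+k
        lo, hi = i - k, i + k
    return [j for j in range(lo, hi + 1) if 0 <= j < n]

# With k = 0 every setting degenerates to the target clause itself, so only
# pairs whose emotion and cause are the same clause can be found.
assert window(5, 10, 0) == [5]
assert window(5, 10, 2, "previous") == [3, 4, 5]
assert window(5, 10, 2, "following") == [5, 6, 7]
assert window(5, 10, 2, "symmetric") == [3, 4, 5, 6, 7]
```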

Effect of Tradeoff Parameters
In this section, we explore the effect of the tradeoff parameters α and β on the performance of SLSN. For each parameter, we study its effect on the performance of three models (i.e., SLSN-E, SLSN-C, and SLSN-U). We conduct the experiments by varying α from 0 to 1 in steps of 0.1, and varying β from 0.1 to 0.9 in steps of 0.1. The results are shown in Figure 5.
From the left subfigure of Figure 5(a), we can find that when α = 0 or α = 1, the models perform poorly. When α ∈ [0.1, 0.9], the performance of SLSN-E and SLSN-C improves. This implies that both kinds of loss (i.e., L^elc and L^cle) are important to SLSN, and that E-net and C-net can benefit from each other. In most cases, SLSN-U achieves the best performance, and SLSN-E performs better than SLSN-C.

Table 3: Examples for the case study (ground truth and the predictions of Inter-EC and SLSN-U).

Example 1: Ms. Huang married a guy 20 years younger than her (c_1). In order to avoid losing her wealth (c_2), they have notarized their property before marriage (c_3). However, the man stole Ms. Huang's money twice after marriage (c_4), and Ms. Huang submitted a case to the court helplessly (c_5). Ground truth: (c_5, c_4); Inter-EC: None; SLSN-U: (c_5, c_4).

Example 2: Chen and his wife have two boys and one daughter (c_1). The 18-year-old son works in Taizhong (c_2). When he knew his mother had been killed by his father (c_3), he was quite emotional (c_4). Then he saw his younger brother and sister (c_5), and they cried together (c_6).
From the left subfigure of Figure 5(b), we can find that when β ∈ [0.1, 0.9], the performance of all three models exhibits an upward trend as β increases. This means that the prediction of the E-label and C-label plays a more important role than the prediction of the LC-label and LE-label in the training process of SLSN. Again, in most cases, SLSN-U achieves the best performance, and SLSN-E performs better than SLSN-C. From the right subfigure of Figure 5(b), we can find that the recall and F1-score of SLSN-U exhibit an upward trend and the precision of SLSN-U exhibits a downward trend as β increases. This implies that the LPS tends to extract more emotion-cause pairs as β increases.

Case Study
For the case study, we select two examples in the test dataset to demonstrate the effectiveness of our model. The ground-truth and the predicted results of Inter-EC and SLSN-U are given in Table 3.
For the first example, Inter-EC predicts None, while SLSN-U predicts the correct emotion-cause pair (c_5, c_4). The Inter-EC method outputs the right pair only when the emotion clause set includes c_5 and the cause clause set includes c_4. In contrast, our method can extract an emotion-cause pair according to either the emotion clause or the cause clause, which makes it easier to establish the matching relationship.
For the second example, we can observe that Inter-EC predicts many wrong answers (e.g., (c_4, c_5) and (c_6, c_5)). Due to the Cartesian product operation, the connection between emotion clauses and cause clauses may be ignored, and many irrelevant pairs are introduced. This indicates that the Cartesian product brings a lot of redundancy and that the filtering step fails to remove the irrelevant pairs. In contrast, our method extracts emotion-cause pairs directly within the local context window, and thus avoids the redundancy brought by the Cartesian product. This is one reason why our method outperforms Inter-EC.

Related Work
Emotion cause analysis has been studied for about a decade (Gui et al., 2016). Previous studies on emotion cause analysis mainly focused on the emotion cause extraction (ECE) task (Fan et al., 2019). Recently, based on the ECE task, a new and more challenging task named emotion-cause pair extraction (ECPE) was proposed.
The ECE task was first formalized as a word-level sequence labeling problem, but later work suggested that the clause may be a more appropriate unit than the word for detecting causes. Based on this idea, Gui et al. (2016) released a Chinese ECE corpus built from public SINA city news. In this corpus, the ECE task was defined as a clause-level sequence labeling problem, the objective of which is to predict the cause clauses in a document given the emotion. This corpus has become a benchmark dataset for subsequent studies on the ECE task. While early studies mainly adopted rule-based methods (Gao et al., 2015; Gui et al., 2014) and machine learning methods (Ghazi et al., 2015) to deal with the ECE task, recent studies have begun to apply deep learning methods to this task (Gui et al., 2017; Chen et al., 2018; Yu et al., 2019; Li et al., 2019; Fan et al., 2019).
Although the ECE task is valuable in practice, its application in real-world scenarios is limited because the emotion clauses are naturally not annotated. Considering this situation, the ECPE task was proposed together with a corresponding ECPE dataset based on the ECE corpus. To tackle the task, a two-step method was proposed, which first extracts emotions and causes individually with a multi-task framework, and then obtains the emotion-cause pairs by pairing and filtering.

Conclusion and Future Work
In this paper, we propose a symmetric local search network (SLSN) to perform end-to-end emotion-cause pair extraction. SLSN can directly extract emotion-cause pairs through a process of local search. This is realized by a specially designed component, the local pair searcher, which allows simultaneously detecting and matching emotions and causes. Experimental results on the ECPE corpus demonstrate the effectiveness of our model.
In the future, we will consider further improving the performance of emotion extraction and cause extraction by employing more powerful pre-trained encoders (e.g., BERT (Devlin et al., 2019)) or by designing auxiliary tasks to utilize extra knowledge. Besides, we will further explore the process of local pair search and seek more advanced implementations of the local pair searcher.