Conditional Causal Relationships between Emotions and Causes in Texts

The causal relationships between emotions and causes in text have recently received a lot of attention. Most of the existing works focus on extracting the causally related clauses from documents. However, none of these works has considered the possibility that the causal relationships among the extracted emotion and cause clauses may only be valid under a specific context, without which the extracted clauses may not be causally related. To address this issue, we propose a new task of determining whether or not an input pair of emotion and cause has a valid causal relationship under different contexts, and construct a corresponding dataset via manual annotation and negative sampling based on an existing benchmark dataset. Furthermore, we propose a prediction aggregation module with low computational overhead to fine-tune the prediction results based on the characteristics of the input clauses. Experiments demonstrate the effectiveness and generality of our aggregation module.


Introduction
Recently, research on the causal relationships between human emotions and their corresponding causes has received much attention. Recognizing the causes of a specific emotion in a document is considered more useful than only identifying the emotion, due to the great potential of helping people make reasonable decisions and avoid unnecessary losses (Gui et al., 2017). There are currently two well-designed tasks concerning the causal relationships between emotions and their causes: the Emotion Cause Extraction (ECE) task (Gui et al., 2016a,b) and the Emotion-Cause Pair Extraction (ECPE) task (Chen et al., 2018). Specifically, the ECE task focuses on extracting the causes for a given emotion, while the ECPE task focuses on extracting emotions and the corresponding causes as pairs.
Despite their increasing popularity, both tasks only aim to extract the clauses containing causal relationships, and have neglected the possibility that the context clauses may be indispensable for the extracted clauses to have a valid causal relationship. Let us consider the following example:

• Wu was diagnosed with advanced liver cancer at the beginning of 2014,
• since when he began to update his health condition in Microblog and has attracted much attention from many users.
• If Wu didn't update his microblog for a long time,
• people worried that he may have passed away.
• There was one time that Wu hadn't updated his microblog for about two months,
• and he had received a lot of messages concerned about his health conditions.
• ...

Figure 1: Example document, where the yellow-green clauses are the context clauses with important information, the red one is the corresponding cause clause, and the blue one is the targeted emotion clause.
In the above example, the cause and emotion clauses may not have a causal relationship if we ignore the context clauses, since the reasons for not updating one's social media account can be many besides the owner having passed away (e.g., forgetting his/her password, using a new account, etc.). Only when the extra information contained in the context clauses is available can these two clauses have a valid causal relationship.
Therefore, it is essential to consider the context clauses as seriously as the targeted emotion and cause clauses when determining their causal relationships. With these context-dependent causal relations figured out, more complete and meaningful information about these causal relationships can be extracted. Specifically, we can learn that some types of events may evoke different types of emotions under different circumstances, while some will always evoke one type of emotion under any circumstance. Such useful information about causal relationships can be beneficial in many emotion-related applications, such as accurately predicting one's emotions with context taken into consideration when a specific event occurs.
Despite the ubiquitousness of the context-dependent causal relationships described above, few works have paid attention to them. In this work, we articulate the importance of context in the problem of causal relationship recognition, and take the first step of studying such special causal relationships in text data.
We propose a task to determine whether or not the input pair of emotion and cause clauses has a causal relationship given some specific context clauses. As our task is new, without any existing dataset available, we manually label the documents in the ECPE benchmark dataset and follow the procedure of negative sampling (Mikolov et al., 2013) to build our own dataset. The constructed dataset can be further processed and used in other important tasks that we aim to focus on in the future, such as quantifying the effect of context on causal relationships.
Furthermore, we propose a prediction aggregation module with low computational overhead to fine-tune the prediction results according to the characteristics of the input clauses. The experiments on our constructed dataset demonstrate the effectiveness and generality of our aggregation module.
The contributions of this work can be summarized as follows.
• To address the issue that context can be indispensable for some causal relationships to be valid, we define a new task to determine whether or not an input pair of emotion and cause has a causal relationship under different contexts.
• Based on the ECPE dataset, we construct a dataset for our proposed task via manual annotation and negative sampling, which can also be used in some other important tasks, such as quantifying the effect of context on the causal relationships.
• We propose a general prediction aggregation module with low computational overhead, which can be used together with most existing models and significantly improve the prediction results on our proposed task.

Related Works
Context has been utilized in many text-related applications to provide semantic information and improve task performance (Kruengkrai et al., 2017; Kayesh et al., 2019; Zhou et al., 2016; Li and Mao, 2019). In traditional causal reasoning, the term "context" is mostly discussed in the task of causal effect estimation, which is to estimate the influence of the cause variable on the effect variable (Guo et al., 2018). In some cases, the change of the effect variable relies not only on the cause variable, but also on some other relevant variables, which can be viewed as context variables and are called confounders. Therefore, these confounders should be discovered and "removed" by specially designed algorithms in order to accurately estimate the causal effect, such as the propensity score method (Gu and Rosenbaum, 1993; Austin, 2011; Imbens, 2004; Lunceford and Davidian, 2004), the front-door criterion (Pearl, 1995), the instrumental variable estimator combined with structural causal models or the potential outcome framework (Guo et al., 2018), etc.
Unfortunately, most of these existing works are not designed for unstructured text data. Also, before the noisy causal effect from the confounders can be removed, these confounders need to be discovered and represented in an explicit form, which is another challenge for fitting text data into existing causal effect estimation models. A potential solution proposed by Sridhar and Getoor (2019) is to represent the confounders in text data as topic distribution vectors obtained by a Latent Dirichlet Allocation model. However, there is no theoretical guarantee that such a representation carries enough information to represent these confounders. Also, before estimating the causal effect, we need to first discover the causal relationships that are determined jointly by cause and context in text data.
Document with a non-conditional pair (Table 1 excerpt): The convenience store was at the corner of the street. Recalling the bloody murder in the early morning, the store owner still felt terrified. She was tallying the goods when she heard a scream from the outside. ...

The tasks concerning causality extraction in text can be mainly divided into two categories, causal phrase extraction and causal clause extraction, where the former focuses on extracting word phrases that have a causal relationship within one sentence (Hashimoto et al., 2014; Zhao et al., 2017), and the latter focuses on extracting multiple clauses from a document (Gui et al., 2016a,b; Chen et al., 2018). Existing models for causal phrase extraction are mostly based on combinations of syntactic patterns and machine learning techniques, which first extract candidate phrases based on predefined templates, and then train a classifier to classify the candidate causal pairs. On the other hand, existing models for causal clause extraction are mostly based on deep learning models that extract abstract features for each clause, in order to accurately classify whether or not some clauses are causally related. Although the context of the input document is always involved in providing more semantic information to enhance the embedding vectors of clauses, none of these works has paid attention to the possible effect of context on the causal relationship itself. Moreover, as we focus on emotion causal relationships in this paper, for some emotions (e.g., shame, envy, guilt, etc.) to arise in the first place, a particular social setting may be necessary. Therefore, taking social contexts into consideration may be an essential step in studying whether a specific event can cause an emotion (Wilutzky, 2015; Marsella et al., 2010; Jurafsky, 2004). In our work, we articulate the importance of context in the problem of causal relationship recognition, in view that the context can be essential in order for a pair of emotion and cause to have a valid causal relationship.

Task Definition
In this section, we first formally define the term "conditional" based on the concept of emotion-cause pair used in the ECE and ECPE tasks, and then formulate our proposed task based on such conditional emotion-cause pairs.
By definition, an emotion-cause pair (ECP) contains an emotion clause indicating an emotion (e.g., Happiness) and a set of corresponding cause clauses. We define a "conditional ECP" as follows.

Definition 1 (Conditional Emotion-Cause Pair)
If an emotion-cause pair is considered to have a causal relationship only when a specific context is given, it is called a conditional emotion-cause pair.
Examples of documents with a conditional pair and a non-conditional pair can be found in Table 1. Specifically, for the document with a conditional pair, in general, most people would not worry about whether a social media user updates his/her account or not, but with specific context like the one in the document, Wu had already gained much attention and hence people cared about his life. As for the document with a non-conditional pair, one shall feel frightened whenever he/she witnesses a bloody murder, which is unlikely to change with different contexts.
Definition 1 indicates that the conditional pairs should not be judged to have causal relationship when an irrelevant context or no context is given. Considering such a property, our task is formulated as follows.
The proposed task Given a specific context con_i and an emotion-cause pair x_i = (C_i, e_i) containing a set of cause clauses C_i and an emotion clause e_i, determine a binary label y_i indicating whether or not the input pair x_i has a causal relationship under the context con_i.
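A minimal data representation of one instance of this task might look like the following sketch; the class and field names are illustrative, not from any released dataset or code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskInstance:
    """One instance of the proposed task: an emotion-cause pair plus its context.

    Field names are hypothetical and only mirror the notation in the text.
    """
    cause_clauses: List[str]    # the set of cause clauses C_i
    emotion_clause: str         # the emotion clause e_i
    context_clauses: List[str]  # the context clauses con_i
    label: int                  # y_i: 1 if causally related under con_i, else 0

example = TaskInstance(
    cause_clauses=["There was one time that Wu hadn't updated his microblog "
                   "for about two months,"],
    emotion_clause="people worried that he may have passed away.",
    context_clauses=["Wu was diagnosed with advanced liver cancer at the "
                     "beginning of 2014,"],
    label=1,
)
```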
As defined above, our proposed task is not to directly distinguish the conditional pairs from the non-conditional ones. The reason is that the recognition of a conditional pair is based on its different labels under different contexts instead of the text itself. Therefore, such a task formulation spares the models from worrying about how to transform the labels of causal relationship to the labels of conditional pairs, and simplifies the process of training.

Dataset Construction
Different from the ECPE task, in our task the ECPs are directly given, and the goal is to judge whether or not they are causally related under a specific context. Therefore, our proposed task is new, without any existing annotation available, so we construct our own dataset based on the ECPE dataset via the following two steps: manual annotation and negative sampling.

Manual Annotation
In the ECPE dataset, each document contains one ECP composed of an emotion clause and a set of cause clauses. These documents are mainly snippets of news articles or social media documents.
To manually annotate the documents with the labels of conditional pairs, we have recruited three human experts who are required to give a binary label to each document: 1 indicates the ECP in this document is conditional, and 0 indicates it is not. Specifically, these three experts are experienced academic partners in the area of emotion cause extraction.
For an ECP to be labeled as conditional, the cause events and the effect emotions should be weakly relevant or irrelevant under normal circumstances. For example, in general one shall not reject the care of nurses when he/she is ill, but someone with racial prejudice may feel disgusted with foreign nurses. The three experts are required to find such context information outside of the cause and emotion clauses and to judge whether it is essential for the targeted ECP to have a causal relationship.
We have inspected the labels provided by the three experts, and the average kappa value among them is 0.8675, which indicates the reliability of these manual labels. With three labels from the three human experts, we adopt the majority voting scheme to determine the final label for each document. For example, given a document, if two experts agree on labeling the ECP in this document as a conditional pair, then the final label for this document is 1. The details of the annotated dataset are shown in Table 2, where N_noncon denotes the number of documents with non-conditional pairs, and N_con denotes the number of documents with conditional pairs.
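The majority voting step can be sketched as follows; this is a hypothetical helper, not the annotation tooling actually used:

```python
from collections import Counter

def majority_label(labels):
    """Return the majority-vote label from an odd number of binary annotations.

    An odd number of annotators (here, three) guarantees no ties.
    """
    assert len(labels) % 2 == 1, "use an odd number of annotators to avoid ties"
    return Counter(labels).most_common(1)[0][0]

# Two of three experts labeling the ECP as conditional (1) decides the label.
print(majority_label([1, 1, 0]))  # -> 1
print(majority_label([0, 1, 0]))  # -> 0
```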

Negative Sampling
Although through manual annotation we have obtained the labels of conditional pairs, all ECPs in the resultant dataset are supposed to have valid causal relationships, since the conditional ones are all given their correct contexts and the non-conditional ones do not depend on any context. In other words, the current dataset only has "positive" instances, but for our proposed task we also need "negative" instances to train a classification model.
To generate such "negative" samples, we follow the procedure of negative sampling (Mikolov et al., 2013). Specifically, we define the following two types of "negative" samples:

• Context-type: The context-type negative sample of a document is generated by replacing its original context with a randomly sampled context from the other documents, while keeping the ECP unchanged.
• Emotion-type: The emotion-type negative sample of a document is generated by replacing its emotion clause with a randomly sampled emotion clause (indicating a different emotion) from the other documents, while keeping the other clauses unchanged.

Table 3 shows examples of the two types of generated "negative" samples. Specifically, compared with the original document, the generated context-type document has a totally different set of context clauses (i.e., the italic clauses), which does not provide the information that Wu was diagnosed with cancer and had been updating his health condition in Microblog. Therefore, the generated context-type document is expected to have no causal relationship due to the irrelevant context. As for the generated emotion-type document, the emotion clause is replaced with a clause of Happiness, which makes no sense since the cause clause and the new emotion clause are now irrelevant.
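The two sampling operations can be sketched as follows; the dictionary fields and helper names are our own illustration, not from any released code:

```python
import random

def context_type_negative(doc, pool, rng):
    """Replace the document's context with the context of another randomly
    sampled document, keeping its emotion-cause pair unchanged."""
    other = rng.choice([d for d in pool if d is not doc])
    return {**doc, "context": other["context"]}

def emotion_type_negative(doc, pool, rng):
    """Replace the document's emotion clause with one indicating a different
    emotion, keeping the other clauses unchanged."""
    candidates = [d for d in pool if d["emotion_label"] != doc["emotion_label"]]
    other = rng.choice(candidates)
    return {**doc, "emotion": other["emotion"]}

rng = random.Random(0)
docs = [
    {"context": "c1", "emotion": "people worried that he may have passed away",
     "emotion_label": "Sadness", "cause": "x1"},
    {"context": "c2", "emotion": "he was really happy",
     "emotion_label": "Happiness", "cause": "x2"},
]
neg = context_type_negative(docs[0], docs, rng)
print(neg["context"])  # -> c2 (context swapped in; the ECP is untouched)
```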

Context-type
When Bai was notified that his advice was adopted by the National public security bureau, he was cooking dinner for his children. If Wu didn't update his microblog for a long time, people worried that he may have passed away. ...

Emotion-type
Wu was diagnosed with advanced liver cancer at the beginning of 2014 and began to update his health condition in Microblog. If Wu didn't update his microblog for a long time, he was really happy that he could finally afford his dream house. ...

To summarize the labels of causal relationships of the generated documents: for the generated context-type documents, those with a conditional pair will not have causal relationships due to their irrelevant contexts, while those with a non-conditional pair will still have causal relationships. As for the generated emotion-type documents, all of them will not have causal relationships, since the original cause clauses should only lead to the original emotion given the original context. Therefore, supposing that n denotes the number of each type of "negative" documents generated for each original document, we can calculate the number of documents with and without causal relationships in the constructed dataset as follows:

N_pos = N + n · N_noncon,
N_neg = n · N_con + n · N,

where N = N_con + N_noncon is the number of original documents, N_pos denotes the number of documents with causal relationships, and N_neg denotes the number of documents without causal relationships. In order to generate a balanced dataset for our proposed task, we need to make sure that N_pos = N_neg, from which n = N / (2 · N_con), which should be around 1.858. Since n should be an integer, the possible choices are n = 1 and n = 2, while a larger n may create an imbalanced dataset with too many negative samples and bias the models towards the negative labels. Our preliminary experiments conducted to validate the setting of n show that the dataset constructed with n = 1 is not reasonable. Therefore, we set n to 2 to construct our dataset, and the details of the constructed dataset are shown in Table 4.
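The balance calculation can be reproduced in a few lines. Note that the counts used below are hypothetical, chosen only so that the formula yields the value n ≈ 1.858 mentioned in the text; the real counts are those in Table 2:

```python
def balancing_n(n_con, n_noncon):
    """Solve N_pos = N_neg for n, where each original document yields n
    context-type and n emotion-type negative samples.

    N_pos = N + n * N_noncon   (originals, plus context-type negatives of
                                non-conditional pairs, which stay positive)
    N_neg = n * N_con + n * N  (context-type negatives of conditional pairs,
                                plus all emotion-type negatives)
    With N = N_con + N_noncon, setting N_pos = N_neg gives n = N / (2 * N_con).
    """
    n_total = n_con + n_noncon
    return n_total / (2 * n_con)

# Hypothetical counts, picked only to reproduce n ≈ 1.858 from the text.
print(round(balancing_n(500, 1358), 3))  # -> 1.858
```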

Architecture
In this section, we introduce our architecture for our task, and propose a simple, general and effective prediction aggregation module based on the characteristics of conditional ECPs.

The Framework
As shown in Figure 2, our framework contains three main modules: a clause embedding module, a context encoding module, and a newly proposed prediction aggregation module. First, in the clause embedding module, the word embedding vectors of the input clauses are passed into a Bi-directional Long Short-Term Memory (BiLSTM) model to obtain an informative clause embedding vector for each clause. Then, in the context encoding module, context information is encoded into the clause embedding vectors of the input ECPs using one of three classic methods: explicit concatenation, implicit encoding, or an attention-based method. Finally, in the prediction aggregation module, the predictions with and without context information are combined to generate the final prediction. Below, we introduce these three modules in detail. For simplicity, the formulas in subsequent discussions only consider the case where there is only one cause clause. Note that they can be easily extended to multiple cause clauses by concatenating their embedding vectors together.

Clause Embedding Module
To obtain an embedding vector for each word, we use pre-trained word embedding vectors trained with the word2vec algorithm (Mikolov et al., 2013). To encode the words' embedding vectors into a clause embedding vector, we adopt a BiLSTM model, which is capable of generating an informative vector for each clause by passing the words' information along the clause forwards and backwards. Specifically, for the i-th document, the input of this module includes three parts: the cause clause c_i, the emotion clause e_i, and the context clauses con_i. Note that the cause clauses and the emotion clause are already annotated in the dataset, so the remaining clauses of the input document are denoted as the context clauses.

Context Encoding Module
After we retrieve an embedding vector for each clause, in order to determine the causal relationship of the input ECP under a specific context, we need to encode context information into the embedding vectors of the input ECP for the subsequent prediction. In this aspect, we consider three classic methods used most frequently in the area of text processing: explicit concatenation, implicit encoding, and attention-based method. The performance of these methods will be shown and discussed in Section 6.
Explicit concatenation As indicated by the name, this method directly concatenates the embedding vectors of the context clauses to those of the input pair, and passes them to the next module so that the final prediction is based on all clauses, i.e., x̃_i = [c_i; e_i; con_i].

Implicit encoding The second method aims to encode context information implicitly into the embedding vectors of the input ECP via an extra layer, such as a BiLSTM or a Convolutional Neural Network (CNN). Considering that the relevant information may be located anywhere in the context clauses, CNN may not be a good choice due to its fixed neighborhood size. Therefore, we adopt a BiLSTM model for the implicit encoding of context information.
Attention-based method The third method is based on the self-attention module proposed by Vaswani et al. (2017), which has recently achieved great success in machine translation. We adopt a 1-layer multi-head self-attention module to encode context information. Specifically, instead of calculating the attention scores among all clauses as in the original self-attention module, we only calculate the attention scores between the input pair and the context clauses, which reduces unnecessary attention weights and aims to generate the context-encoded embedding vectors of the input ECP for the subsequent prediction.
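The restricted attention pattern can be sketched as follows. This is a minimal single-head version with no learned projections and an illustrative residual connection, i.e., a simplification of the full multi-head module:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pair_to_context_attention(pair_vecs, context_vecs):
    """Attend from each pair clause (query) over the context clauses only,
    instead of over all clauses as in full self-attention."""
    d = len(pair_vecs[0])
    out = []
    for q in pair_vecs:
        scores = [dot(q, k) / math.sqrt(d) for k in context_vecs]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, context_vecs))
                    for j in range(d)]
        # Residual connection: add the attended context to the original vector.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out

pair = [[1.0, 0.0], [0.0, 1.0]]      # cause and emotion clause vectors
context = [[0.5, 0.5], [1.0, -1.0]]  # context clause vectors
encoded = pair_to_context_attention(pair, context)
```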

Prediction Aggregation Module (PAM)
As defined in Section 3, the conditional pairs will no longer have causal relationships if an irrelevant context or no context is given, whereas the non-conditional pairs will always have valid causal relationships. Taking such a difference into consideration, here we propose a simple, general and effective prediction aggregation module.
First, to get the prediction with context, we pass the context-encoded embedding vectors of the input pair, x̃_i, to a fully-connected layer with a softmax activation function:

P(y_i^c) = softmax(W_c · x̃_i),

where W_c is a trainable weight matrix.
Next, we add an extra step of predicting the labels of causal relationship directly based on the original embedding vectors of the input pair, without encoding the context information. Specifically, we pass the original embedding vectors obtained in the clause embedding module, x_i, to a fully-connected layer with a softmax activation function:

P(y_i^o) = softmax(W_o · x_i),

where W_o is a trainable weight matrix. The proposed module works as follows. If P(y_i^o) has already shown that the input pair has a valid causal relationship (i.e., P(y_i^o = 1) > P(y_i^o = 0)), then this pair is more likely to still have a causal relationship under any specific context, and the final result should depend more on the prediction without encoded context information. On the other hand, if the input pair is predicted to have no causal relationship without context, the final result should give more weight to the prediction taking context information into consideration. Following this logic, we arrive at the following aggregation formula:

P(ŷ_i) = P(y_i^o = 1) · P(y_i^o) + P(y_i^o = 0) · P(y_i^c).

With this aggregation module, the model can handle both conditional and non-conditional pairs, and give a better prediction on the causal relationship of an input pair under a specific context.
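The gating logic can be illustrated in a few lines. This sketch operates on the positive-class probabilities and is our own instantiation of the described behavior, not necessarily the exact formula used in the model:

```python
def pam_aggregate(p_o, p_c):
    """Aggregate the no-context prediction p_o with the context-encoded
    prediction p_c (both probabilities of a causal relationship).

    The more confident the no-context prediction is that a causal relationship
    exists, the more the final result relies on it; otherwise the
    context-encoded prediction dominates.
    """
    return p_o * p_o + (1.0 - p_o) * p_c

# Conditional pair: no causal link without context (p_o low), but the
# context-encoded prediction finds one (p_c high) -> final leans on p_c.
print(round(pam_aggregate(0.1, 0.9), 2))  # -> 0.82

# Non-conditional pair: already causal without context (p_o high) -> final
# stays high even if the context-encoded prediction is low.
print(round(pam_aggregate(0.9, 0.2), 2))  # -> 0.83
```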

Experiment
In Section 4, we described the process of dataset construction and the details of the constructed dataset. In this section, we conduct experimental studies to evaluate our approach and analyze the experimental results.

Baseline Models
As mentioned in Section 5.3, there are three options for the context encoding module. Therefore, we consider three baseline models without PAM, each of which contains one of the three context encoding methods we described.
• BiLSTM + Concatenation: this baseline model uses BiLSTM at word level to get the clause embedding vectors and directly concatenates the context clauses' vectors to those of the input pair.
• BiLSTM + BiLSTM: this model uses BiLSTM at both the word level and the clause level to get the context-encoded embedding vectors of the input pair for the final prediction.
• BiLSTM + Self-Attention: this model uses BiLSTM at word level and uses Self-Attention at clause level to encode the context information.

Experiment Settings
We randomly select 90% of the data for training and the remaining 10% for testing. To avoid the effect of randomness, we divide the whole dataset into 10 folds and repeat the experiments 10 times, with each fold serving as the test set once. The averaged experimental results are reported in the following sections. Since our proposed task is a binary classification task, we adopt the traditional precision, recall and F1 scores to evaluate the prediction performance. († and ‡ denote statistical significance at p < 0.01 and p < 0.001, respectively.)
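For reference, the three metrics on the positive (causally related) class can be computed as follows; this is a standard textbook implementation, not tied to the paper's code base:

```python
def prf1(y_true, y_pred):
    """Precision, recall and F1 for the positive (label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1([1, 1, 0, 0], [1, 0, 1, 0])
print(p, r, f)  # -> 0.5 0.5 0.5
```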
As for the detailed design of the models, the number of hidden units in the BiLSTM is set to 100, and the number of heads in the Self-Attention module is set to 2. All weight matrices are randomly initialized with a uniform distribution. For training, we use the stochastic gradient descent algorithm with the Adam optimizer, with the batch size set to 32 and the learning rate set to 0.005. Also, for regularization, dropout is applied with the dropout rate set to 0.2, and an L2-norm regularization term is added to constrain the softmax parameters, where the weight of the regularization term is set to 1e-5.

Experiment Results
In this section, we first report the experiment results in Table 5 to validate our setting of n. We conduct experiments on the datasets constructed by setting n to 1 and 2. We notice that all our baseline models achieve suspiciously high recall values (i.e., 0.98 ∼ 0.99, see the gray cells in Table 5) when n = 1. After looking into the detailed predictions, we find that when n = 1, the models unreasonably predict all test samples to have positive labels, which reveals that the models are heavily biased towards the positive labels due to insufficient negative samples. In contrast, the performance of the baseline models becomes more reasonable when n = 2. Therefore, we conduct the following experiments on the dataset constructed with n = 2.

Effect of PAM
To validate the effectiveness of PAM, we conduct experiments on the three baselines and report the results in Table 6. As shown in the table, before we add PAM, "BiLSTM + BiLSTM" achieves the highest F1 score compared with the other two models, possibly because the Self-Attention module needs a larger-scale dataset to train well, while simple concatenation cannot yield semantically informative embedding vectors. After adding PAM, the F1 scores of the three baseline models are improved by 3.47% on average, and the attached p-values indicate that the models containing PAM significantly outperform those without PAM.
Specifically, ranking the improvement from adding PAM from largest to smallest gives "BiLSTM + Concatenation", "BiLSTM + Self-Attention", and "BiLSTM + BiLSTM". This seems to imply that the improvement from PAM is small when the model without PAM can already encode contexts well. The above results demonstrate the generality of PAM: it can be easily used together with existing classic models, and for our proposed task, it can significantly improve their prediction performance.

Case Study
To further illustrate how our aggregation module (i.e., PAM) improves the performance, we inspect the predictions of four examples given by the "BiLSTM + Concatenation + PAM" model. As shown in Table 7, Documents #1 and #2 share the same conditional ECP, and Documents #3 and #4 share the same non-conditional pair.
(Table 7 notes: P(y_i^o) is the prediction based on only the input pair, P(y_i^c) is the prediction with context encoded, and P(ŷ) is the final predicted probability; the red clause is the cause clause and the blue clause is the emotion clause.)

For Documents #1 and #2, since people would not care about Wu's health condition if he had not begun to update his information in his social media account, we can judge that Document #1 should have a causal relationship while Document #2 should not, corresponding to their true labels being 1 and 0, respectively. The prediction without context, P(y_i^o), indicates that both documents have no causal relationship, since the pair is a conditional pair. Taking context into consideration, the prediction with context, P(y_i^c), indicates that Document #1 has a causal relationship, while Document #2 still has no causal relationship due to its irrelevant context. The difference among these predictions corresponds to the characteristic of conditional ECPs, which is to depend more on P(y_i^c) when P(y_i^o) indicates no causal relationship. As for Documents #3 and #4, one shall feel frightened whenever he/she witnesses a bloody murder around him/her, which is unlikely to change with different contexts. Therefore, both documents should have causal relationships. As shown in the table, P(y_i^o) already indicates that the pair is causally related regardless of context, and hence the final prediction indicates the same result.
The above cases illustrate that our simple aggregation module enables the model to simultaneously deal with documents containing conditional and non-conditional ECPs, and to fine-tune the final predictions accordingly.

Conclusion and Future Work
In this paper, we articulate the importance of context in determining the causal relationships between emotions and their causes. To address this problem, we define a new task of determining whether or not an input emotion-cause pair has a causal relationship under a specific context. We construct a dataset for our task through manual annotation and negative sampling based on the ECPE dataset. Furthermore, we propose a prediction aggregation module (PAM) with low computational complexity, to enable the models to dynamically adjust the final prediction according to the type of emotion-cause pair contained in a document. Experiments demonstrate the effectiveness and generality of our proposed PAM.
In view of the importance of context in the conditional causal relationships we define in this work, what we have done is only the first step. There remain many important and interesting problems ahead of us. For example, how to quantify the effect of context on the targeted causal relationship is another important task to study this type of causal relationship. Besides, how to enable the existing emotion-cause pair extraction models to consider the effect of context is also a meaningful task.