KnowDis: Knowledge Enhanced Data Augmentation for Event Causality Detection via Distant Supervision

Modern models of event causality detection (ECD) are mainly based on supervised learning from small hand-labeled corpora. However, hand-labeled training data is expensive to produce, has low coverage of causal expressions, and is limited in size, which makes it hard for supervised methods to detect causal relations between events. To address this data scarcity problem, we investigate a data augmentation framework for ECD, dubbed Knowledge Enhanced Distant Data Augmentation (KnowDis). Experimental results on two benchmark datasets, the EventStoryLine corpus and Causal-TimeBank, show that 1) KnowDis can augment the available training data for ECD via distant supervision, assisted by lexical and causal commonsense knowledge, and 2) our method, trained with the automatically labeled data, outperforms previous methods by a large margin.


Introduction
Event causality detection (ECD) aims to identify causal relations between events in text, which can provide crucial clues for many NLP tasks, such as information extraction, logical reasoning, and question answering (Girju, 2003; Oh et al., 2013; Oh et al., 2017). For example, the causal relation that Kimani Gray was killed because of a police attack needs to be detected in the following sentence: "Kimani Gray, a young man who likes football, was killed in a police attack shortly after a tight match." This task is usually modeled as a classification problem, i.e., determining whether there is a causal relation between two events in a sentence. To this end, most existing methods adopt a supervised learning paradigm (Mirza and Tonelli, 2016; Riaz and Girju, 2014; Hashimoto et al., 2014; Hu and Walker, 2017; Gao et al., 2019; Zuo et al., 2020). Although these methods have achieved good performance, they usually need large-scale annotated training data, and existing event causality detection datasets are relatively small. For example, the EventStoryLine Corpus (Caselli and Vossen, 2017) contains only 258 documents, 4316 sentences, and 1770 causal event pairs. Such small datasets have low coverage of causal expressions and hinder NLP applications deployed on large-scale data. Distant supervision has proven effective for labeling training data in several tasks, such as relation extraction (Mintz et al., 2009) and event detection (Chen et al., 2017). Therefore, to address the data scarcity problem in ECD, we investigate a distant data augmentation framework, dubbed Knowledge Enhanced Distant Data Augmentation (KnowDis), that automatically labels available data.
We argue that a sentence that contains an event pair with a high probability of causality and expresses its causal semantics can be labeled as training data for the ECD task. To automatically label a large amount of training data, we need to solve the following three challenges: (1) how to collect a large number of event pairs with a high probability of causality and employ them to label training data; (2) how to handle noisy distantly labeled sentences that lack well-formed textual expressions of causal semantics; and (3) how to make better use of distantly labeled sentences for training. To this end, we first design a Lexicon Enhanced Annotator (LexiAnno) that extracts a large number of event pairs with a high probability of causality based on lexical knowledge and employs them to automatically label sentences via distant supervision. Second, we propose a Commonsense Filter (CommonFilter) that refines the distantly labeled sentences with causal commonsense knowledge, retaining those that express causal semantics in a well-formed way. Third, we employ Relabeling and Annealing strategies to make better use of distantly labeled sentences for training. Finally, we evaluate KnowDis on two datasets and achieve the best performance on ECD when training with distantly labeled sentences. The following sections describe the architecture of KnowDis (Section 2) and the experimental results on the ECD task (Section 3).
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

KnowDis
This section describes the three main components of KnowDis, which are shown in Figure 1.

Lexicon Enhanced Annotator (LexiAnno)
LexiAnno aims to extract a large number of event pairs with a high probability of causality from external lexicons, starting from the annotated causal event pairs, via a Causal Event Pair Extractor. It then employs these pairs to collect preliminary noisy labeled sentences from external documents via a Distant Annotator.

Table 1: Extracting causal event pairs from lexical knowledge bases.
Knowledge | How to extract | Why causality | Abbr.
WordNet | 1) Extracting the synonyms and hypernyms from WordNet of the head word of each event in $e_{ij}$. 2) Assembling the items from the two groups of the two events to generate the causal event pair set. | Items in each group are the synonyms and hypernyms of the original causal event pairs. | $E_{wn}$
VerbNet | 1) Extracting the words from VerbNet under the same class as the head word of each event in $e_{ij}$. 2) Assembling the items from the two groups of the two events to generate the causal event pair set. | Items in each group are in the same class as the original causal event pairs. | $E_{vn}$
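The two steps in Table 1 amount to building one expansion group per event and taking the Cartesian product of the two groups. The following is a minimal sketch under stated assumptions: `TOY_LEXICON` is a hypothetical in-memory stand-in for the WordNet or VerbNet lookup, and the function names are illustrative rather than taken from our implementation.

```python
from itertools import product

# Hypothetical toy lexicon: head word -> related words. It stands in for
# WordNet synonyms/hypernyms or VerbNet class members.
TOY_LEXICON = {
    "attack": ["assault", "strike"],
    "kill": ["slay", "murder"],
}


def expand_group(head_word, lexicon):
    """Return the head word plus its lexicon neighbours, without duplicates."""
    group = [head_word]
    for word in lexicon.get(head_word, []):
        if word not in group:
            group.append(word)
    return group


def expand_pair(cause, effect, lexicon):
    """Assemble candidate pairs from the two expansion groups (step 2)."""
    candidates = product(expand_group(cause, lexicon),
                         expand_group(effect, lexicon))
    # Drop the original annotated pair so that only new pairs remain.
    return [pair for pair in candidates if pair != (cause, effect)]


extracted = expand_pair("attack", "kill", TOY_LEXICON)
```

Because the product grows quickly with group size, a ranking step is needed afterwards to discard implausible combinations.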
Causal Event Pair Extractor. We expand each event pair $e_{ij}$ in the annotated causal event pair set $E_g$ via external dictionaries. Table 1 details how $E_{wn}$ and $E_{vn}$ are extracted from WordNet (Miller, 1995) and VerbNet (Schuler, 2005). We then construct a filter via TransE (Bordes et al., 2013), trained with a max-margin objective:

$$L = \sum_{(e_i, e_j) \in S} \sum_{(e_i', e_j') \in S'} \left[\lambda + d(e_i, e_j) - d(e_i', e_j')\right]_+$$

where $S$ and $S'$ are the causal and non-causal event pair sets respectively, $d$ is the distance between two events, $\lambda$ is the margin, and $[x]_+ = \max(0, x)$. We sort the extracted event pairs in ascending order of their distance and pick out the ones at the top, which have a high probability of causality.
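The margin objective and the distance-based ranking can be illustrated with a small sketch. It assumes that $d$ is the Euclidean distance between two event embedding vectors, which is a simplification chosen for illustration; the embeddings below are hypothetical 2-d toy values, not learned ones.

```python
import math


def distance(u, v):
    """Euclidean distance between two embedding vectors (assumed form of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_loss(causal_pairs, noncausal_pairs, emb, margin=1.0):
    """Sum of [margin + d(causal) - d(non-causal)]_+ over all pairings."""
    total = 0.0
    for ci, cj in causal_pairs:
        for ni, nj in noncausal_pairs:
            gap = (margin
                   + distance(emb[ci], emb[cj])
                   - distance(emb[ni], emb[nj]))
            total += max(0.0, gap)
    return total


def rank_pairs(pairs, emb):
    """Sort candidate pairs by ascending distance (smaller = more causal)."""
    return sorted(pairs, key=lambda p: distance(emb[p[0]], emb[p[1]]))


# Hypothetical 2-d embeddings, for illustration only.
emb = {"attack": (0.0, 0.0), "kill": (0.0, 1.0), "match": (3.0, 4.0)}
loss = margin_loss([("attack", "kill")], [("attack", "match")], emb)
ranked = rank_pairs([("attack", "match"), ("attack", "kill")], emb)
```

In this toy case the causal pair is already closer than the non-causal pair by more than the margin, so the loss is zero and the causal pair is ranked first.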
Distant Annotator. We keep the top 10% of the sorted extracted event pairs to obtain $E_f$, a set of pairs with a high probability of causality. We then automatically label a randomly selected 5% of the sentences in the NYT corpus that contain any event pair $e_{ij}$ in $E_g$ or $E_f$, and use them as the noisy distantly labeled training data $D_n$.
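A minimal sketch of the Distant Annotator follows. It assumes lowercase whitespace tokenization and exact token matching, and it samples after matching; the matching procedure and the order of sampling are assumptions made for illustration, and all function names are hypothetical.

```python
import random


def keep_top_fraction(sorted_pairs, fraction=0.10):
    """Keep the leading fraction of an already sorted list (at least one)."""
    k = max(1, int(len(sorted_pairs) * fraction))
    return sorted_pairs[:k]


def distant_label(sentences, event_pairs):
    """Return (sentence, pair) for each sentence mentioning both events."""
    labeled = []
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        for cause, effect in event_pairs:
            if cause in tokens and effect in tokens:
                labeled.append((sentence, (cause, effect)))
    return labeled


def sample_fraction(items, fraction=0.05, seed=0):
    """Randomly select a fraction of the items (at least one if any exist)."""
    if not items:
        return []
    k = max(1, int(len(items) * fraction))
    return random.Random(seed).sample(items, k)


sentences = ["The attack will kill many", "A tight match ended"]
labeled = distant_label(sentences, [("attack", "kill")])
noisy_data = sample_fraction(labeled, fraction=0.05)
```

A real pipeline would match lemmatized event head words rather than raw tokens, which this sketch leaves out.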

Commonsense Filter (CommonFilter)
CommonFilter aims to refine $D_n$ with causal commonsense knowledge, picking out the labeled sentences that express causal semantics between events. Inspired by Luo et al. (2016), we use Pointwise Mutual Information (PMI) statistics (Church and Hanks, 1989) to indicate causal semantics, drawing on the if-then reasoning dataset Choice of Plausible Alternatives (COPA) (Gordon et al., 2011) and on causal connectives. As shown in Table 2, we employ the causality co-occurrences ($f$) of each word pair between cause-related ($T_c$) and effect-related ($T_e$) text, and combine necessity causality ($CS_{nec}$) with sufficiency causality ($CS_{suf}$) to model the causal relation. Specifically, we calculate $CS_{nec}$ and $CS_{suf}$ for each word pair $(i_c, j_e)$ in $(T_c, T_e)$ from COPA and the annotated data, and then the causality score $CS_s$ of the two text spans $(SP_1, SP_2)$ obtained by splitting each sentence $s$ in $D_n$ at the connective between its two events. Here, $N$ is the number of $(T_c, T_e)$ pairs, $W$ is the set of all words involved in the calculation, and $\alpha$ is a penalty value that penalizes high-frequency words. Next, we sort the sentences in $D_n$ by $CS_s$ and divide them into two parts: $D_n^c$, in which the two events are connected by a causal connective from $C_{signal}$, extracted from FrameNet (Baker et al., 1998) and PDTB2 (Group and others, 2008), and $D_n^{nc}$, in which they are not. Finally, we keep the top 50% of the data in $D_n^c$ and the top 10% of the data in $D_n^{nc}$ as the refined distantly labeled training data $D_r$.
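The final split-and-keep step can be sketched as follows. The sketch assumes the causality score of each sentence has already been computed and that a higher score means stronger causal semantics; the tuple layout and the function name `refine` are hypothetical.

```python
def refine(scored_sentences, causal_connectives, keep_c=0.5, keep_nc=0.1):
    """Keep the top-scoring share of each part of the noisy data.

    scored_sentences: list of (sentence, connective_or_None, score) tuples.
    causal_connectives: set of connectives treated as causal signals.
    """
    with_conn = [s for s in scored_sentences if s[1] in causal_connectives]
    without_conn = [s for s in scored_sentences
                    if s[1] not in causal_connectives]

    def top(items, fraction):
        # Sort by descending score and keep the leading fraction.
        ranked = sorted(items, key=lambda s: s[2], reverse=True)
        return ranked[: int(len(ranked) * fraction)]

    return top(with_conn, keep_c) + top(without_conn, keep_nc)


# Toy data: 4 sentences with a causal connective, 10 without.
toy = [("c%d" % i, "because", score)
       for i, score in enumerate([0.9, 0.8, 0.2, 0.1])]
toy += [("n%d" % i, None, i / 10) for i in range(10)]
refined = refine(toy, {"because"})
```

Keeping a larger share of the connective-bearing part reflects that those sentences carry an explicit causal signal, while the other part is admitted more sparingly.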

Relabeling and Annealing
Event Causality Detector. We formulate event causality detection as a sentence-level binary classification problem and build the Event Causality Detector as a binary classifier based on BERT (Devlin et al., 2019). The input of the detector is the event pair $e_{ij}$ and its corresponding sentence. We convert the sentence into BERT's input form, i.e., the sum of the WordPiece embedding (Wu et al., 2016), position embedding, and segment embedding, and obtain the event representations $e_i$ and $e_j$ encoded by BERT. We then concatenate the manually designed feature vector $f$ (the same lexical, causal potential, and syntactic feature representations as Gao et al. (2019)) with $e_i$ and $e_j$ as the input to the top MLP classifier. Finally, the output is a binary vector indicating the causality of the input event pair $e_{ij}$.
We employ relabeling and annealing strategies to make better use of distantly labeled data for training.
(1) Relabeling: We pre-train a detector on the annotated data and employ it to relabel the refined distantly labeled training data $D_r$ via self-training (Asai and Hajishirzi, 2020). We then collect the sentences that are relabeled as causal to obtain the distantly relabeled training data $D_{rr}$, which are more causal and informative for training on the ECD task.
(2) Annealing: Because of noise, distantly labeled training data may not be appropriate at the beginning of training for building an effective detector. Therefore, we employ an annealing training strategy (Kirkpatrick et al., 1983) to maximize the effectiveness of the distantly labeled training data. At the beginning, we employ only the annotated data for training; as the number of epochs increases, we incrementally add $D_{rr}$ to the training data in a proportion of $\beta$.
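The two strategies can be sketched in a few lines. The sketch assumes a linear schedule in which the share of relabeled data grows by $\beta$ per epoch and is capped at the full set; the exact schedule is an assumption, and `predict_is_causal` is a hypothetical callable standing in for the pre-trained detector.

```python
def relabel(distant_data, predict_is_causal):
    """Keep only the distantly labeled items the detector calls causal."""
    return [item for item in distant_data if predict_is_causal(item)]


def annealed_training_set(annotated, relabeled, epoch, beta=0.1):
    """Annotated data plus a share of relabeled data growing with the epoch."""
    share = min(1.0, beta * epoch)  # epoch 0 -> annotated data only
    n_extra = int(len(relabeled) * share)
    return annotated + relabeled[:n_extra]


# Toy data: each distant item carries a hypothetical detector confidence.
distant = [{"id": i, "score": i / 10} for i in range(20)]
relabeled = relabel(distant, lambda item: item["score"] > 0.95)
annotated = ["a1", "a2"]
epoch0 = annealed_training_set(annotated, relabeled, epoch=0)
epoch3 = annealed_training_set(annotated, relabeled, epoch=3)
epoch20 = annealed_training_set(annotated, relabeled, epoch=20)
```

With 10 relabeled items and $\beta = 0.1$, training starts on the 2 annotated items alone, uses 3 relabeled items at epoch 3, and uses all of them once the share reaches 1.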
(1) ESC: We partition the dataset in the same way as the SOTA method on ESC (Gao et al., 2019) and, like that method, use the last two topics as the development set. (2) Causal-TB: This dataset contains only 318 causal links, so it can further demonstrate the effectiveness of the proposed framework in addressing data scarcity. We use the same development set as for ESC, because the SOTA method on this dataset (Mirza and Tonelli, 2014) does not define a development set. Specifically, we conduct 5-fold cross-validation on the two datasets. We tune the augmented proportion, $\alpha$, and $\beta$ on the development set. All results are the average of three independent experiments.
Parameters Setting. We use BERT-base-uncased as the pre-trained BERT model. We set the learning rate of the detector to 1e-5. The dimension of the causal semantic space is 100. We set $\alpha$ and $\beta$ to 0.5 and 0.1 respectively, based on the development set. We apply early stopping and optimize all models with SGD. We adopt Precision (P), Recall (R), and F1 value (F1) as the evaluation metrics.
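For reference, the three metrics can be computed as in the sketch below. It assumes gold and predicted causal links are represented as sets of event-pair identifiers, which is an illustrative choice rather than the evaluation script used in our experiments.

```python
def precision_recall_f1(gold, predicted):
    """Compute P, R, and F1 over sets of event pairs labeled as causal."""
    true_positives = len(gold & predicted)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy example: 4 gold causal pairs, 3 predicted, 2 of them correct.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {1, 2, 5})
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7.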
Compared Methods. We evaluate the performance of ECD on the same EventStoryLine corpus v0.9 (ESC) (Caselli and Vossen, 2017) and Causal-TimeBank (Causal-TB) (Mirza and Tonelli, 2014) datasets as the SOTA methods. We select some typical methods and the SOTA methods on ESC and Causal-TB respectively for comparison: (1) Cheng et al. (2017) and Choubey et al. (2017) ... (3) with extra data augmented by EDA, an easy data augmentation framework (Wei and Zou, 2019). We also employ the relabeling and annealing strategies when training with EDA. Finally, we automatically label 10132 sentences via KnowDis. We sample 100 sentences for manual evaluation, 82% of which clearly express causal semantics (3 assessors, Cohen's kappa = 0.88).
Tables 3 and 4 show the results of our model compared with the SOTA methods. From the results, we make the following main observations. (1) Effectiveness of our method: Our method (KnowDis) significantly improves the performance of ECD, by 5.0 and 2.0 F1 points on the two datasets respectively. This illustrates that augmented training data labeled via distant supervision and refined via causal commonsense knowledge can provide more effective assistance for the ECD task.

Comparisons with SOTA Methods on Event Causality Detection
(2) Necessity of causal-related knowledge: Training with extra data augmented by either EDA or KnowDis improves the performance of the ECD task, which shows that more training data can introduce more causal knowledge and alleviate data scarcity. However, the sentences produced via EDA are not refined by causal-related knowledge such as causal lexical and causal commonsense knowledge. In comparison, the further significant improvement of KnowDis demonstrates the necessity of causal-related knowledge, and also illustrates that our model can produce more suitable training data for the ECD task.
Tables 5 and 6 show the effectiveness of the key parts of our method on event causality detection (ECD). † denotes the same filtering and training processes, except that the causal event pairs employed for distant labeling are different. From the results, we make the following observations.

Effectiveness of Main Components on Event Causality Detection
(1) Effectiveness of distant labeling: The results of +Annotated causal ep. † and +Extracted causal ep. † illustrate that the distantly labeled augmented training data can effectively alleviate the problem of data scarcity on the ECD task (Section 2.1). (2) Effectiveness of LexiAnno: Compared to sentences labeled only on the basis of annotated causal event pairs (+Annotated causal ep. †), training with sentences labeled on the basis of causal event pairs extracted from knowledge bases (+Extracted causal ep. †) brings effective and diverse knowledge for understanding event-causal semantics (Section 2.1). (3) Effectiveness of CommonFilter: The results of -Causal connective and -Causality co-occurrence show that our proposed commonsense filter (Section 2.2), which introduces causal commonsense knowledge to refine distantly labeled training data, can effectively enhance the causal-related semantics of the augmented data. Specifically, causal commonsense knowledge is more useful than causal connective knowledge, because the former covers a wider range of expressions of cause and effect than the latter. (4) Effectiveness of Relabeling: The results of -Relabeling show that relabeling (Section 2.3) can reduce the noise in distantly labeled data. (5) Effectiveness of Annealing: The results of -Annealing show that annealing (Section 2.3) can make better use of noisy distantly labeled data. Relabeling and annealing can both be applied to other distant supervision tasks.
(1) The more distantly labeled data in $D_n^c$ is retained, the more effective knowledge is brought to training. However, when the retained data exceeds 50%, the noise introduced by $D_n^c$ outweighs the benefit of the effective knowledge. (2) Introducing an appropriate amount of distantly labeled data from $D_n^{nc}$ can bring additional effective knowledge, but this data contains more harmful noise than the data in $D_n^c$.

Conclusion
In this paper, we employ distant supervision to alleviate the data scarcity problem in causality-related tasks. We propose a knowledge enhanced distant data augmentation framework (KnowDis) for event causality detection. Assisted by knowledge enhanced distantly labeled training data, our method achieves SOTA performance on the EventStoryLine corpus and the Causal-TimeBank dataset. In the future, we will introduce more causal-related resources and apply KnowDis to other relational tasks.