A Review of Dataset and Labeling Methods for Causality Extraction

Causality represents the most important kind of correlation between events, and extracting causality from text has become an active topic in NLP. However, there are no mature research systems or datasets for public evaluation, and there is no unified causal sequence labeling method; both are key factors hindering progress in causality extraction research. We comprehensively survey the limitations and shortcomings of existing causality research from the aspects of basic concepts, extraction methods, experimental data, and labeling methods, so as to provide a reference for future research on causality extraction. We summarize the existing causality datasets, explore their practicability and extensibility from multiple perspectives, and create a new causal dataset, ESC. For the problem of causal sequence labeling, we analyse the existing methods, summarize their rules, and propose a new core word causal labeling method. Multiple candidate causal label sequences are put forward according to the labeling controversies, the optimal labeling method is explored through experiments, and suggestions are provided for selecting a labeling method.


Introduction
Causality is a relationship between a "cause" and an "effect", in which the occurrence of the cause triggers the occurrence of the effect; it forms the basis for inference and reasoning. In the early days, people identified causal relations from production processes and daily life manually, which was slow and inaccurate. Ever-growing web resources have made it feasible to extract causality from text automatically, making it an emerging topic in NLP with many downstream applications such as event detection and prediction (Radinsky et al., 2012) and question answering (Hashimoto et al., 2014). For example, the novel coronavirus pneumonia has recently drawn worldwide attention. An expert system with causal knowledge can infer the causal relationship between "new coronavirus" and "eating wild animals", and may predict that "novel coronavirus pneumonia" can lead to "death". Artificial intelligence can thus be used to assist medical research and to strengthen prevention and treatment during the early stage of infection. Causal relation extraction is therefore a basic task in text mining, driven by the instinctive human desire for knowledge.
Many researchers have devoted themselves to causal relationship extraction. However, it is still a new field with open problems to solve and no publicly evaluated datasets. Existing studies each use their own research setup and cannot be compared with one another, so a systematic research framework is a key factor for progress in causality extraction research. Targeting the limitations and problems in causality research, we comprehensively review the basic concepts, extraction methods, experimental data, and labeling methods from multiple perspectives, as a reference for future research on causality extraction. The main contributions of this paper are as follows:
• We summarize the concept of causality and the existing research methods for causality extraction.
• For the sequence labeling approach to causality extraction, we comprehensively summarize and analyze the existing methods. Multiple candidate causal labeling sequences are defined to resolve labeling ambiguity, and the optimal method is explored through experiments. Our results show that the "core word" causal sequence labeling method achieves the best results, which provides guidance for selecting a labeling method.
Causality and causality extraction

Causality
Causality is the correspondence between "cause" and "effect": the "cause" produces the "effect", and the "effect" is the outcome of the "cause". Many elements contribute to expressing causal semantics.

Causal unit
Different studies draw causal boundaries differently, and different sentences call for different "causal units". We summarize four kinds of causal units:
• Word: In sentences where a single word can fully express the causal semantics, the word can be used as the unit for causal boundary division. For example, in the sample "<e1>Suicide</e1> is one of the leading causes of <e2>death</e2>.", the word "suicide" is the cause and the word "death" is the effect.
• Phrase: In some sentences a single word does not fully express the causal semantics, so the phrase must be taken as the unit. For example, in the sample "<e1>Financial stress</e1> is one of the main causes of <e2>divorce</e2>.", the phrase "financial stress" expresses the causal semantics more completely than the word "stress".
• Clause: In sentences from which no core word or phrase can be extracted to express the causality, the clause should be used as the unit. For example, in the sample "<e1>We play with a steady beat</e1> so that <e2>dancers can follow it</e2>.", no core word or phrase can be extracted from the text to express the causality.
• Event: An event is defined as a fact that takes place at a particular time and place, involving several actors and several actions. For example, in the sample "<e1>A car travelling from Guizhou to Guangdong collided head-on with a bus</e1> results the <e2>ten people, six men and four women, including the driver, died at the scene</e2>.", event 1 semantically causes event 2.

• Sufficient causality (Luo et al., 2016): The "cause" is sufficient to lead to the "effect". For example, in the causal pair (storm, damage), the occurrence of a "snowstorm" will inevitably lead to "damage", but "damage" is not necessarily caused by a "snowstorm"; it may also be caused by an "earthquake", a "flood", and so on.
• Necessary causality (Luo et al., 2016): The occurrence of the "effect" necessarily implies the occurrence of the "cause". For example, in the causal pair (rainfall, flooding), when a "flood" occurs, "rain" is very likely the cause, but "rain" does not necessarily lead to a "flood"; it can also lead to a "traffic jam".
• Temporal causality and Granger causality (Granger, 1988): In addition to "cause" and "effect", causality also involves a "time" factor. Granger causality requires that the cause precede the effect.
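One possible formal reading of the three types above is sketched below. It is an idealization, not the definition used by the cited works: $C$ and $E$ stand for the occurrence of the cause and the effect, and $t_C$, $t_E$ for their times, all of which are symbols introduced here for illustration. In real text the implications hold only with high probability, as the flood example shows.

```latex
\begin{align*}
\text{Sufficient causality:} \quad & C \Rightarrow E \\
\text{Necessary causality:} \quad & E \Rightarrow C \\
\text{Temporal precedence (required by Granger causality):} \quad & t_C < t_E
\end{align*}
```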

Multiple Causality
According to the correspondence between cause and effect, causality can be divided into one-cause one-effect, one-cause multi-effect, multi-cause one-effect, and multi-cause multi-effect. As the number of causal entities increases, the causal relationship can take special forms.
Embedded-causality A particular kind of multiple causality that manifests itself in the causal semantics: an entity has different causal roles in different causal pairs. For example, in the sample "He testified that cause of <e1>death</e1> was <e2>massive bleeding</e2> into the blood sac of the heart caused by <e3>a stab wound</e3> on the left chest", the two causal pairs "(e2,e1), (e3,e2)" mean that "massive bleeding" is the cause of "death" and the effect of "a stab wound". The causality forms a causal chain "a stab wound→massive bleeding→death", so we also call it "chain-causality".
Cross-causality A particular kind of multiple causality that manifests itself in the position of the causal entities. In traditional causality, causes and effects appear contiguously in the text, so two adjacent causal entities are assumed to belong to the same causal pair. However, in some special cases of multiple causality, the entities of one causal pair are separated by entities of another, and multiple causal pairs "cross" in the text; we call this "cross-causality". For example, the sample "A <e1>fire</e1> broke out in the school, due to the help of <e2>wind</e2>, the fire department <e3>dispatched</e3> 8 fire engines and 33 commanders to <e4>help</e4>" has two causal pairs "(e1,e3), (e2,e4)": the cause "wind" of the second pair appears between the two entities "fire" and "dispatched" of the first pair. If only the label sequence "CCEE" is used, the correspondence between causes and effects cannot be recovered, so it has to be identified through other labeling methods.
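The two special forms can be checked mechanically once causal pairs are written as (cause, effect) entity indices. The following is a minimal sketch under that assumption; the function names and the index-based representation are illustrative and do not come from any of the surveyed systems.

```python
from typing import List, Tuple

# A causal pair is (cause, effect), where each value is the index of the
# entity in the order it appears in the sentence (e1 -> 1, e2 -> 2, ...).
Pair = Tuple[int, int]


def embedded_entities(pairs: List[Pair]) -> List[int]:
    """Entities that are the effect of one pair and the cause of another."""
    causes = {cause for cause, _ in pairs}
    effects = {effect for _, effect in pairs}
    return sorted(causes & effects)


def has_cross_causality(pairs: List[Pair]) -> bool:
    """True if two pairs interleave in the text, e.g. (1, 3) and (2, 4)."""
    spans = [tuple(sorted(pair)) for pair in pairs]
    for i, (a_lo, a_hi) in enumerate(spans):
        for b_lo, b_hi in spans[i + 1:]:
            if a_lo < b_lo < a_hi < b_hi or b_lo < a_lo < b_hi < a_hi:
                return True
    return False


# Embedded-causality sample from the text: pairs (e2, e1) and (e3, e2).
chain_pairs = [(2, 1), (3, 2)]
# Cross-causality sample from the text: pairs (e1, e3) and (e2, e4).
cross_pairs = [(1, 3), (2, 4)]

print(embedded_entities(chain_pairs))    # e2 is both an effect and a cause
print(has_cross_causality(cross_pairs))  # the two pairs interleave
```

Entity order is used here instead of token offsets because only the relative positions matter for detecting a crossing.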
Explicit causality (1) With explicit connectives: connectives such as "because" that have an obvious causal meaning.
(2) With ambiguous connectives: there are three forms, such as "increase" (verbs whose causal meaning is realized through instrumental verb patterns), "generate (by)" (causal agency is inseparable from the situation), and "due to" (non-verb pattern).
Implicit causality Implicit causality is expressed without causal connectives in the text, or with only one of the cause and effect entities stated explicitly while the other is hidden in the semantics of the context.

Causality extraction
Many scholars have devoted themselves to causality extraction research; however, existing approaches are scattered and lack a systematic research framework. Early research used pattern matching to extract causality according to the structural characteristics of causal text. With advances in machine learning, the range of text forms covered has kept expanding. With the increasing popularity of deep learning, models such as CNNs (convolutional neural networks) and RNNs (recurrent neural networks) have been applied to causality extraction.
Based on these scattered research results, we organize causality extraction research into three fields: text classification, relation extraction, and sequence labeling.
Text classification method Text classification automatically assigns a given text to a category in a given classification scheme. Causality extraction based on text classification classifies sentences according to whether they contain a causal relation (Blanco et al., 2008; Hidey and McKeown, 2016; Kayesh et al., 2019; Paul, 2017). The method does not need to extract the causal events or entities in the text; it only judges whether the whole sentence contains causality. It is therefore applicable to causal data in which events or entities are difficult to extract from the text (the causal unit is the clause).
Relation extraction method Relation extraction judges whether an entity pair in a sentence has a specified relation, which is a binary classification problem. Causality extraction based on relation extraction determines whether a given pair of entities in the text has a causal relationship, and is applicable to sentences in which the causal entities have already been extracted. Existing models include naive Bayes (Zhao et al., 2016), BGRU (bidirectional gated recurrent network) (Feng et al., 2018), multi-column CNN (Kruengkrai et al., 2017), and K-CNN (knowledge-oriented CNN) (Li and Mao, 2019).
Sequence labeling method In sequence labeling, each element of a one-dimensional linear input sequence is labeled with a tag from a tag set. If the tags are causal semantic tags, causality extraction becomes a sequence labeling problem: a causal tag is assigned to each word in the sentence, which extracts the causal entities and determines the direction of the causal relation. Existing models include CRFs (Fu et al., 2011), SCIFI (Self-Attentive Bi-LSTM-CRF with Flair Embeddings), L-BL (linguistically informed Bi-LSTM) (Dasgupta et al., 2018), and Bi-LSTM+CRF+S-GAT (graph attention networks based on the syntactic dependency graph) (Xu et al., 2020).
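To illustrate how a predicted tag sequence yields both the entities and the direction of the relation, the sketch below decodes a simple "C"/"E"/"O" tagging back into cause and effect strings. The tag set and the function name are assumptions made for this example; adjacent entities with the same tag would be merged, which is one reason the labeling schemes discussed later add boundary tags.

```python
from typing import Dict, List


def decode_entities(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group maximal runs of 'C' or 'E' tags into cause and effect entities."""
    entities: Dict[str, List[str]] = {"C": [], "E": []}
    current_tag = "O"
    current_words: List[str] = []
    # A trailing sentinel with tag 'O' flushes an entity that ends the sentence.
    for token, tag in zip(tokens + [""], tags + ["O"]):
        if tag != current_tag:
            if current_tag in entities:
                entities[current_tag].append(" ".join(current_words))
            current_tag, current_words = tag, []
        if tag in entities:
            current_words.append(token)
    return entities


tokens = "Financial stress is one of the main causes of divorce .".split()
tags = ["C", "C", "O", "O", "O", "O", "O", "O", "O", "E", "O"]
print(decode_entities(tokens, tags))
```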
Besides the three common causality extraction methods above, there are many other research directions.
• Causal relationship of events: Some studies of causality take the "event" as the unit (Fu et al., 2011; Hashimoto et al., 2012; Do et al., 2011; Zhao et al., 2017; Mirza and Tonelli, 2014; Mirza, 2014).
• Causal network: A network is built from the causal events or entities extracted from a large corpus, either to improve the accuracy of a model or for use in specific research.
Causal relationship extraction has been widely used in various fields, such as biomedical information (Nordon et al., 2010; Khoo et al., 2000; Raja et al., 2013; Fluck et al., 2016; Casillas et al., 2016; Bollegala et al., 2018), psychology (White, 1990), analysis of social discrimination (Qureshi et al., 2016), log query (Sun et al., 2007), and images (Kocaoglu et al., 2017). As an important task, causal relationship extraction has thus spread into the research of many fields.

Dataset for causality
Experimental data is the basis of all research. However, there is currently no publicly evaluated dataset, which is one of the key factors hindering the progress of causality research. In view of the deficiencies in causal dataset construction, we analyze the existing datasets as a reference for subsequent studies.

Existing public dataset
SemEval (Semantic Evaluation) A public dataset for relation extraction that covers multiple relations, including instrument-agency and product-producer; the cause-effect relation is the subject of this paper. There are 1368 sentences with causality and 107 sentences without causality. Its advantages are: (1) Strong credibility. (2) Clear "cause" and "effect": the cause and effect can be obtained directly from the marked causal entities and the direction of causality. (3) Wide range of application: with simple data processing, it can be used with all three traditional causality extraction methods. It also has several disadvantages: (1) Small data amount: it cannot meet the needs of experiments, and researchers have extended it (Feng et al., 2018; Li and Mao, 2019; Xu et al., 2020). (2) Sample imbalance: the ratio of positive to negative cases of causality is about 12:1, so sentences with other relations must be added as negative cases for the classification methods (Feng et al., 2018; Silva et al., 2017). (3) Inconsistent labeling standards: there is no unified labeling standard, so most related works use their own labeling methods (Dasgupta et al., 2018; Xu et al., 2020). (4) It is a relation extraction dataset, which only considers whether the marked entities hold the relation and ignores entities outside the marks, so it needs to be expanded manually (Xu et al., 2020).
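Advantage (2) can be made concrete with a small parser that reads the marked entities and the relation direction from one sample. This is a sketch under stated assumptions: entities are assumed to be wrapped in angle-bracket marks such as <e1> and </e1>, and the relation string is assumed to look like Cause-Effect(e1,e2) with the cause listed first, in the style of SemEval-2010 Task 8 labels. The function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed entity markup: <e1>text</e1> and <e2>text</e2>.
TAG_PATTERN = re.compile(r"<(e[12])>(.*?)</\1>")


def parse_sample(sentence: str, relation: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (plain_sentence, cause, effect).

    cause and effect are None when the relation is not a cause-effect relation.
    """
    entities = {name: text.strip() for name, text in TAG_PATTERN.findall(sentence)}
    plain = re.sub(r"</?e[12]>", "", sentence)
    plain = re.sub(r"\s+", " ", plain).strip()
    # Assumed label format: the first argument is the cause, the second the effect.
    match = re.fullmatch(r"Cause-Effect\((e[12]),(e[12])\)", relation)
    if match is None:
        return plain, None, None
    cause_key, effect_key = match.groups()
    return plain, entities[cause_key], entities[effect_key]


sample = "<e1>Suicide</e1> is one of the leading causes of <e2>death</e2>."
print(parse_sample(sample, "Cause-Effect(e1,e2)"))
```

Samples with any other relation string fall through to the non-causal branch, which matches the practice of using other relations as negative cases.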

Causal-TimeBank and Event StoryLine (Li and Mao 2019)
The format and usage are basically the same as for SemEval. However, the authors have released only a small part of the data.
Altlex (Hidey and McKeown, 2016) It extracts sentences with causality from the English Wikipedia corpus. It is large, with 4,595 causal sentences and 39,645 non-causal sentences. However, because entities are not marked, it can only be used with the text classification method.
SCIFI In view of the shortcomings of SemEval, SCIFI extends single causality to multiple causality and the word unit to the phrase unit. The dataset contains 1270 sentences with causality and 3966 sentences without (SCIFI is the name of the model; we use the same name for the dataset).

Summary of existing dataset for research methods
The application of the six existing publicly available causality datasets to the three traditional research methods is summarized in Table 3.2. If the sequence labeling method is adopted, all sentences with causality in the dataset are taken as the experimental data. For the two classification methods, all sentences with causality can be taken as positive cases, and sentences without causality or with other relations can be taken as negative cases. If causal entities are marked in the text, the data can serve all three methods: remove the entity marks and judge whether the plain sentence contains causality (text classification), determine directly whether the given entity pair has a causal relation (relation extraction), or define labeling rules to generate the causal tag sequence (sequence labeling). If causal entities are not marked in the dataset, only the text classification method can be used to extract causal relations.
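The conversion of one entity-marked causal sentence into the three task formats can be sketched as follows. The token-span representation, the dictionary keys, and the plain "C"/"E"/"O" labeling rule are assumptions chosen for illustration; an actual study would substitute its own labeling rules.

```python
from typing import Dict, List, Tuple


def to_three_tasks(
    tokens: List[str],
    cause: Tuple[int, int],
    effect: Tuple[int, int],
) -> Dict[str, object]:
    """Derive one positive sample per task from a sentence with marked entities.

    cause and effect are (start, end) token spans with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for i in range(*cause):
        tags[i] = "C"
    for i in range(*effect):
        tags[i] = "E"
    sentence = " ".join(tokens)
    cause_text = " ".join(tokens[slice(*cause)])
    effect_text = " ".join(tokens[slice(*effect)])
    return {
        # Text classification: plain sentence plus a causal / non-causal label.
        "text_classification": (sentence, 1),
        # Relation extraction: the given entity pair plus a label.
        "relation_extraction": (cause_text, effect_text, 1),
        # Sequence labeling: one causal tag per token.
        "sequence_labeling": list(zip(tokens, tags)),
    }


tokens = "Financial stress is one of the main causes of divorce .".split()
sample = to_three_tasks(tokens, cause=(0, 2), effect=(9, 10))
print(sample["relation_extraction"])
```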

Causal sequence labeling method
For causality extraction based on sequence labeling, how the causal sequence is labeled is the key to constructing the experimental data. However, there is no fixed labeling rule, so horizontal comparison is currently impossible. We summarize the existing labeling methods and analyze them from multiple angles, as a reference for follow-up research.
Phrase boundary The method uses the "BIO" (Begin, Inside, Other) entity boundary tags from NER to mark causal phrase boundaries. Three causal semantic tags, "C" (Cause), "E" (Effect), and "Emb" (Embedded-causality), represent the causal semantics; the "Emb" tag is introduced because embedded-causality cannot otherwise be labeled accurately.
Figure 2: Example for the causal labeling method of clause subscript. The tags "C1", "E1" and "CC1" belong to the first causal pair, and the compound tag "E1C2" indicates that the entity is the effect of the first causal pair and the cause of the second, which handles embedded-causality.
Advantages: (1) Using the clause as the unit handles text from which no core word or phrase can be extracted to express the causality. (2) The subscripts allow a wide range of causal categories to be extracted.
Disadvantages: (1) With the clause as the unit, many irrelevant words are included in the causal text. In most texts the sentence is merely cut into two parts at the causal conjunction, so the boundary is too loose and the essence of causality extraction is lost. (2) The subscripts introduce many new causal tags, and the tags are distributed unevenly across sentences, which makes the features difficult to train.
Event sequence block (Fu et al., 2011) The method uses the tags "C", "E" and "N" to represent "Cause", "Effect" and "None", and takes the event indicator as the representative of the whole event, without labeling the other text in the event. The "BIO" tags mark the boundaries of causal event blocks, which separate different causal pairs and thereby handle multiple causality.
Core word (Xu et al., 2020) The method uses "C" (Cause), "E" (Effect) and "O" (Other) as causal tags, with the core word as the labeling unit. It handles multiple causality through the number of "C" and "E" tags, and embedded-causality by "making rules to specify existing tags".
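The difference between phrase-level BIO tags and core word tags can be seen by generating both from annotated positions. The sketch below is illustrative only: the span format, the function names, the shortened version of the embedded-causality sample, and the choice of "stress" as the core word of "financial stress" are assumptions, not the exact rules of the cited methods.

```python
from typing import Dict, List, Tuple


def phrase_boundary_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """BIO tags for (start, end, label) spans, end exclusive.

    label is one of 'C', 'E' or 'Emb'; tokens outside every span get 'O'.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def core_word_tags(n_tokens: int, heads: Dict[int, str]) -> List[str]:
    """One 'C' or 'E' tag per core word; every other token is 'O'."""
    return [heads.get(i, "O") for i in range(n_tokens)]


# Shortened embedded-causality sample: "massive bleeding" is both effect and cause.
tokens = "cause of death was massive bleeding caused by a stab wound".split()
spans = [(2, 3, "E"), (4, 6, "Emb"), (8, 11, "C")]
print(list(zip(tokens, phrase_boundary_tags(len(tokens), spans))))

# Core word labeling of "Financial stress is one of the main causes of divorce".
print(core_word_tags(10, {1: "C", 9: "E"}))
```

The phrase version needs seven distinct tags here (B/I for each of C, E and Emb, plus O), while the core word version needs three, which is the trade-off between tag complexity and simplicity discussed in this section.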

Three steps of labeling rules
Based on the existing labeling methods, we summarize the labeling rules in three steps. First, setting the causal semantic tags determines whether an entity is a cause or an effect. Second, the causal labeling unit determines the boundary of each entity. Finally, special kinds of causality are handled through the choice of tags, the unit division, and other auxiliary methods.
Figure 4: Example for the causal labeling method of core word. The sentence is three-cause and one-effect.
Setting the causal semantic label The causal semantic tags are roughly the same across studies. The first letters "C" and "E" of the words "cause" and "effect", the abbreviations "Emb" and "CC" for "embedded-causality" and "causal connectives", and abbreviations of other causal words serve as the causal semantic tags.
Establishing the causal labeling units There are three causal labeling units: core word (Xu et al., 2020), phrase (Dasgupta et al., 2018) and clause. The labeling sequence and the completeness of the causal semantics differ according to the labeling unit.

Summary and analysis of the existing labeling methods
We comprehensively summarize and analyze the four existing labeling methods from multiple angles in Table 4.3. The more complex the tags are, the clearer the tag semantics and the wider the scope of causality covered. The stricter the labeling boundary is, the less complete the expression of causal semantics, but the less non-causal content is included. The core word method sacrifices the clarity and completeness of causality for the simplicity and purity of labeling. In contrast, the clause subscript method uses complex tags and loose boundary division to cover a wide range of causal categories and complete causal semantics, at the cost of the purity of the causal semantics. The phrase boundary method is a compromise between the two.

The experiment of optimal causal sequence labeling method
From Section 4, we conclude that most methods strike a balance between tag feature complexity and the breadth of causal scope, so there is no perfect labeling method. We combine the advantages and avoid the disadvantages of the existing labeling methods to define several candidate labeling sequences, explore the optimal labeling method through experiments, and give suggestions for method selection.

Experiment data
Because the causal tags of the clause subscript method are complex and their features are difficult to train, we adopt the tags of the phrase boundary method. Since the "clause" unit is too broad and loses the original purpose of causality extraction, the "core word" and "phrase" units are selected. The phrase unit involves several controversies: (1) whether an attributive or adjective modifying the phrase should be labeled; (2) whether a function word such as "the" before the phrase should be labeled; (3) whether to label the part before "of", the part after "of", or the whole "of" phrase. Based on these controversies in the phrase unit, several candidate labeling sequences are proposed on top of the phrase boundary method. The details are in Appendix A.
Table 5: Summary for four labeling methods. The tag semantic clarity refers to whether the tag explicitly expresses the causality semantics such as embedded-causality, causal correspondence, etc. "CW", "PB", "CS" and "ESB" are the abbreviations of core word, phrase boundary, clause subscript and event sequence block. The degree of comparison goes deeper to the prototype.
In addition, the SCIFI dataset described in Section 3.1 is used as the base experimental data. Each of the 7 candidate labeling sequences is applied to it, giving 7 sets of experimental data with the same text and different labels, which together form the new dataset E-SCIFI (Extended SCIFI) for studying labeling sequence methods.

Experiment content
We run comparative experiments on the 7 candidate labeling methods from Section 5.1. The basic Bi-LSTM+CRF model proposed in Huang et al. (2015) is used to reduce the influence of the model itself on the experimental results. For the parameter settings, the word vector dimension is 300 throughout, and the model is trained for 200 epochs with a learning rate of 0.5 using the Adam optimizer.
Fine-grained evaluation criteria Taking the sentence as the unit, we judge whether the causal extraction is correct according to the labeling sequence. The causal relation extraction of a sentence is correct only if the labels of all the words in the sentence are correct, which implies that: (1) the words and boundaries of the cause and the effect are extracted correctly; (2) the direction of causality is correct; (3) the cause and the effect are extracted simultaneously; (4) for multiple causality, all causal pairs satisfy the above three conditions at the same time. The accuracy of sequence labeling is calculated as follows:
$$\mathrm{Accuracy} = \frac{m}{M}$$
where $m$ is the number of correct sentences and $M$ is the total number of sentences.
Coarse-grained evaluation criteria Taking the tag as the unit, we evaluate the F1 value of each tag. The X-F1 is measured as the sum of the F1 values of the "B-X" and "I-X" tags, weighted in proportion to their numbers:
$$\text{X-F1} = \frac{n_B f_B + n_I f_I}{n_B + n_I}$$
where $n_B$ and $n_I$ are the numbers of "B-X" and "I-X" tags, and $f_B$ and $f_I$ are the corresponding F1 values.
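Both evaluation criteria can be sketched in a few lines. The code below is a minimal illustration, not the evaluation script used in the experiments: it assumes that per-tag F1 is computed at the token level and that the tag counts used as weights are taken from the gold sequences, neither of which is specified in the text. All function names are hypothetical.

```python
from typing import List

Sequences = List[List[str]]


def sentence_accuracy(gold: Sequences, pred: Sequences) -> float:
    """Fine-grained criterion: m / M, where a sentence counts as correct
    only if every tag in it matches the gold sequence."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


def tag_f1(gold: Sequences, pred: Sequences, tag: str) -> float:
    """Token-level F1 of a single tag (assumed definition)."""
    tp = fp = fn = 0
    for g_seq, p_seq in zip(gold, pred):
        for g, p in zip(g_seq, p_seq):
            if p == tag and g == tag:
                tp += 1
            elif p == tag:
                fp += 1
            elif g == tag:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_x_f1(gold: Sequences, pred: Sequences, x: str) -> float:
    """Coarse-grained criterion: (n_B * f_B + n_I * f_I) / (n_B + n_I),
    with n_B and n_I counted in the gold sequences (assumed)."""
    b_tag, i_tag = f"B-{x}", f"I-{x}"
    n_b = sum(seq.count(b_tag) for seq in gold)
    n_i = sum(seq.count(i_tag) for seq in gold)
    if n_b + n_i == 0:
        return 0.0
    f_b = tag_f1(gold, pred, b_tag)
    f_i = tag_f1(gold, pred, i_tag)
    return (n_b * f_b + n_i * f_i) / (n_b + n_i)


gold = [["B-C", "I-C", "O", "B-E"], ["O", "B-C", "O"]]
pred = [["B-C", "I-C", "O", "B-E"], ["O", "O", "O"]]
print(sentence_accuracy(gold, pred))
print(weighted_x_f1(gold, pred, "C"))
```

In this toy input one of the two sentences is fully correct, and the missed "B-C" in the second sentence lowers the recall of "B-C" without affecting "I-C".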

Experimental results and analysis
[Table: overall accuracy (%) of each label method]
Causal unit analysis The accuracy of the core word method is higher than that of the phrase methods: the core word method has fewer labels and fewer label types, so its features are simple and easy to learn.
"Emb" tag analysis The number of "Emb" tags in the data is too small to train accurate features, which lowers the accuracy of the whole experiment by 2.09% compared with the core word method without "Emb". Introducing a new tag marks embedded-causality words clearly in semantic terms, but reduces the experimental performance. Therefore, "making rules to specify existing tags" can be used as a compromise that sacrifices the semantic clarity of tags for the accuracy of the trained features.
Causal tag analysis The "O" tag labels the largest number of tokens, so its features are the most distinctive and its F1 value is the highest. In most labeling methods, the F1 value of tag "E" is higher than that of tag "C", which indicates that the model identifies effect entities better than cause entities.
Boundary tag analysis In all phrase labeling methods (except the "Emb" tag of phrase(of)), the F1 value of the "B-X" tag is higher than that of "I-X", so the model identifies the beginning boundary of a causal phrase better than its middle and end. This is consistent with the fact that most phrase boundary disputes concern the end boundary of the phrase, while disputes over the start boundary are fewer.
Phrase labeling analysis (1) Labeling articles is better than not labeling them: the article features (the word vectors of "the", "an" and "a") explicitly mark the starting position of a causal phrase even though articles carry no causal semantics. (2) Labeling part of "of" is better than labeling the whole "of": labeling the whole "of" phrase avoids the dispute over which part to label, but the causal phrase becomes too long, which introduces boundary disputes and adds more useless features. (3) A continuous labeling sequence is better than a discontinuous one: the "phrase(of)" method produces discontinuities in the labeling sequence, which breaks the continuity of the causal entity and makes feature extraction more difficult.
Combining the accuracy of the labeling sequences and the F1 values of the tags, and using experimental performance as the only criterion, the labeling methods rank from best to worst as: core word, core word(Emb), phrase(article), phrase, phrase(articles and of), phrase(articles after of), phrase(of).

Analysis of optimal causal sequence labeling method
If the research has no special requirements, the ranking in Section 5.3 can be used as the priority order for selecting a causal labeling method. In practice, however, the choice of labeling method still depends on the research purpose and the experimental data.
When there is no requirement on the integrity of causal semantics and embedded-causality extraction is not the focus, the "core word" labeling method is preferred. If the experiment focuses on embedded-causality, "core word(Emb)" should be selected.
If the research has requirements on the integrity of causal semantics, the labeling method should be chosen from the phrase unit. If there is no requirement on causal purity, the "phrase(article)" labeling method is preferred; otherwise, the phrase method can be adopted at the cost of a little accuracy.

Conclusions
Causal relationship extraction is still a new research field without a publicly evaluated dataset or a fixed labeling method. Both are fundamental to research, and their absence is one of the important factors hindering the progress of causality extraction. We summarize and analyze the defects in the construction of experimental data and the disputes over causal labeling methods, as a reference for further research on causality. We also explore the optimal causal sequence labeling method through experiments and put forward suggestions for selecting a labeling method. In addition, we examine the existing research on causality from multiple perspectives and summarize the concepts related to causality extraction (see Appendix B for the detailed table).

A Candidate labeling sequences
In our experiments, the details of the proposed candidate labeling sequences are as follows:
• Core word: The core word causal labeling method in Section 4.1.
• Core word(Emb): The core word method with the new tag "Emb" to label embedded-causality.
• Phrase: Label the attributive adjective and part of "of", ignore the articles.
• Phrase(articles): Label the attributive adjective, articles and part of the "of".
• Phrase(of): Ignore all articles, label the whole "of", with a causal break in the labeling sequence.
• Phrase(articles after of): Ignore the articles before "of" and label the whole "of".
• Phrase(articles and of): Label all articles and the whole "of".