Counterfactual Generator: A Weakly-Supervised Method for Named Entity Recognition

Past progress on neural models has shown that named entity recognition is no longer a hard problem when enough labeled data is available. However, collecting and annotating enough data is labor-intensive, time-consuming, and expensive. In this paper, we decompose the sentence into two parts, entity and context, and rethink the relationship between them and model performance from a causal perspective. Based on this, we propose the Counterfactual Generator, which generates counterfactual examples by intervening on existing observational examples to enhance the original dataset. Experiments across three datasets show that our method improves the generalization ability of models under limited observational examples. Besides, we provide a theoretical foundation by using a structural causal model to explore the spurious correlations between input features and output labels. We investigate the causal effects of entity and context on model performance under two conditions: non-augmented and augmented. Interestingly, we find that the non-spurious correlations are located more in entity representation than in context representation. As a result, our method eliminates part of the spurious correlations between context representation and output labels. The code is available at https://github.com/xijiz/cfgen .


Introduction
The natural language processing community has witnessed the paradigm shift from small data to big data, exemplified by the transformer (Vaswani et al., 2017) and its successors. It is not surprising that machine learning methods can easily surpass human performance if sufficient data is available (Wang et al., 2018). However, data acquisition is a challenging task in some special domains. For example, medical concept normalization, a basic subtask of named entity recognition (NER) in the medical area, has long been hampered by the lack of sufficient Electronic Health Records due to privacy protection. Small data with selection biases (Torralba and Efros, 2011) often induces poor performance of machine learning models on inputs whose distribution differs from that of the training data, even when those inputs seem trivial to humans. The same issues are also discussed under the terms dataset bias, model robustness, and real understanding. In natural language inference, models trained on hypotheses only (vs. hypotheses plus premises) can outperform a majority-class baseline (Poliak et al., 2018; Gururangan et al., 2018). In reading comprehension, models trained on questions only or passages only (vs. question-passage pairs) still achieve high accuracy (Kaushik and Lipton, 2018), and models given a broken question (vs. the original question) still make the same correct prediction (Feng et al., 2018a).
The key challenge behind this phenomenon is the spurious correlations induced by statistical learning. Spurious correlations can be vividly explained by an example from computer vision (Arjovsky et al., 2019): if we consider an image dataset of cows and camels in their natural habitats, a classifier trained on this dataset will establish spurious correlations between the output labels (cows, camels) and the landscape of the image (green pastures, deserts). As a result, an image of cows taken on a sandy beach makes the classifier produce a wrong prediction. Against this background, we cannot help asking: is there any way to eliminate spurious correlations other than having humans annotate more data? From a causal perspective, spurious correlations are caused by confounding factors rather than by a direct or indirect causal path. If we directly intervene on the precursor variable in a spurious correlation to create counterfactual data, we can eliminate the impact of spurious correlations on models to a certain degree (Volodin et al., 2020).
In this paper, based on the above analysis, we mainly focus on exploring the spurious correlations in NER from a causal perspective. We decompose the sentence into two different parts, entity and context, and rethink the relationship between them and the generalization ability of the NER model. Considering the sentence "John lives in New York", we observe that the location entity "New York" and the context "John lives in" are highly correlated but not causal to each other. In other words, we can intervene on the location entity to set it to a different location entity without destroying the grammatical correctness of the sentence. Therefore, we propose the Counterfactual Generator, which generates new counterfactual examples by intervening on existing observational examples. Our method requires neither an additional entity dictionary nor a similar-domain dataset. Figure 1 demonstrates the intervention process for observational examples. We utilize the new counterfactual examples to enhance the existing observational examples. Experiments show that our method improves generalization ability under limited observational examples. Before the enhancement, we find that model performance is mainly driven by entity representation. After the enhancement, the importance of entity representation increases in most cases, and generalization ability improves in all cases. We conclude that some non-spurious correlations between input features and output labels do reside in context representation (Lin et al., 2020), but the previous two phenomena show that they are located more in entity representation.
In summary, our work makes the following contributions: • We provide a theoretical foundation from a causal perspective to describe the mechanism of NER model inference and to explore the spurious correlations between input features and output labels.
• Based on the interventions on the entity, we propose a weakly-supervised method for named entity recognition under limited observational examples. Experiments across three NER datasets demonstrate that our method boosts model performance.

Counterfactual Generator
In this section, we first define the NER problem. Then, we present our method by introducing a structural causal model to describe the mechanism of NER model inference.

Task Definition
In this paper, we regard named entity recognition as a sequence labeling problem. In general, we let x = (x_1, x_2, ..., x_n) denote a sequence of tokens. For each token x_i, we have a label y_i, where y_i ∈ Y. For example, Y can be {O, B-Diagnosis, I-Diagnosis} in the medical area. The possible labels come from the BIO tagging schema for labeling tokens in the sentence. For each sentence, we have an entity set E that contains all entities in the sentence. Finally, we have a labeled dataset D = {(x, y)}.
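To make the task definition concrete, the following sketch (our own illustration; the tokens and labels are hypothetical, not drawn from the paper's datasets) shows a BIO-labeled sequence and how entity spans are read off it:

```python
# A minimal illustration of the BIO tagging scheme for NER.
# Each token x_i is paired with a label y_i from Y = {O, B-Diagnosis, I-Diagnosis}.
x = ["The", "patient", "has", "papillary", "adenocarcinoma", "."]
y = ["O",   "O",       "O",   "B-Diagnosis", "I-Diagnosis",  "O"]

def extract_entities(tokens, labels):
    """Collect (entity_text, entity_type) spans from a BIO-labeled sequence."""
    entities, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:  # close any span already in progress
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

print(extract_entities(x, y))  # [('papillary adenocarcinoma', 'Diagnosis')]
```

The entity set E of a sentence is exactly the output of such a span extraction.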

Causal Model
What determines whether a certain segment of a sentence is an entity mention? Why is this entity mention a diagnosis entity? These are causal questions, because they require some information about the generation process of the data rather than observational data alone (Pearl et al., 2009). Observational data with selection biases often gives rise to spurious correlations that result in low generalization ability of the NER model under limited data. For this problem, causality can provide an in-depth view of its essence.
To investigate the causal relationship between the NER model and the data clearly, we introduce a Structural Causal Model (SCM) (Judea, 2000) to describe the mechanism of the inference process of the NER model. An SCM is expressed visually by a directed acyclic graph (DAG), in which vertices are random variables and a directed edge represents direct causation from variable A to variable B. Here, to simplify the problem, we decompose the sentence into two variables: entity E and context C. As shown in Figure 2(a), we assume the following SCM:

e := f_E(g, u_E)
c := f_C(g, u_C)
x := f_X(e, c, u_X)
y := f_Y(x, u_Y)

where G is a confounding variable that influences the generation of both entity E and context C, X is the input example generated from E and C, Y is the evaluation result (the F1 score) of the NER model, and U_* denotes the unmeasured variables.
Causal effects help us better understand the causal relationships in a system. The basic method of estimating causal effects is simulating interventions in the SCM. We use a mathematical operator do(v_0) to simulate physical interventions by fixing the value of a variable V as v_0. For example, in order to simulate an intervention do(c_0) in the structural causal model M, we fix the variable C to c_0, as shown in Figure 2(b), denoted as:

M_{c_0}: e := f_E(g, u_E), c := c_0, x := f_X(e, c, u_X), y := f_Y(x, u_Y)

This intervention blocks the influence of the variable G on the variable C. The post-intervention distribution P(y|do(c_0)) gives the proportion of individuals that would attain response level Y = y under the hypothetical situation in which treatment C = c_0 is administered uniformly to the population (Pearl et al., 2009). Here, we have P(y|do(c_0)) = 1. More details can be found in the appendix.
A way to estimate the treatment effect, or causal effect, is to measure the average difference between the post- and pre-intervention distributions by using the expectation E, called the Average Causal Effect (ACE), denoted as:

ACE = E[Y | do(c_0)] − E[Y | do(c)]

where c_0 and c are the intervened value and the original value, respectively. Similarly, in order to estimate the causal effect of the variable E on the variable Y, we can also intervene on the variable E, denoted as do(e_0) (see Figure 2(c)).

Method
Our method automatically replaces an entity in an observational example with another, different entity to create a new counterfactual example. These counterfactual examples help our NER model deal with spurious correlations on limited observational examples and learn more invariant and stable features. As shown in Figure 3, our method has the following three parts:

1) Set Preparation
The core idea of our method is finding a different entity with which to intervene on an entity in the observational example. However, building a new entity set for a specific domain needs human effort to collect entities, which is no different from annotating more data. Hence, as shown in Figure 3(1), we adopt local entities as the entity set, extracted from the original dataset. For example, we iterate over all observational examples in the training dataset to collect all diagnoses, forming a diagnosis set E_d.

[Figure 3: An example of the workflow of the Counterfactual Generator on the medical dataset. 1) We prepare the entity sets by entity type (diagnosis) from the original dataset. 2) We randomly choose an entity (papillary adenocarcinoma) in the observational example and replace it with another entity (cataract) from the entity set to form a new counterfactual example; the replaced entity and the candidate entity have the same entity type. 3) We send the counterfactual example to the discriminator to find the good ones, which are added to the augmented dataset. Observational example: "Papillary adenocarcinoma" was admitted to the hospital. Since the onset of the disease, the patient has normal appetite, a clear mind, acceptable spirit, acceptable sleep, normal stool, normal urine, and no significant change in body weight. Counterfactual example: "Cataract" was admitted to the hospital, with the rest of the sentence unchanged. If the discriminator recognizes the replaced entity (cataract) correctly, we regard this counterfactual example as reasonable and add it to the dataset.]

2) Entity Intervention
We consider using the intervention on the entity to create new counterfactual examples. As shown in Figure 2(c) and Figure 3(2), for each observational example, we randomly select an entity e ∈ E (here with the entity type diagnosis) and replace it with another entity e′ ∈ E_d. Importantly, in order to preserve the linguistic correctness of the new counterfactual example, we require that the replaced entity and the candidate entity have the same entity type.
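The intervention step can be sketched as follows; this is a simplified illustration under our own assumptions about the data representation (token lists with BIO labels, entities stored as token lists), not the authors' implementation:

```python
import random

def intervene_on_entity(tokens, labels, entity_sets, rng=random):
    """Create a counterfactual example by replacing one entity with another
    entity of the same type drawn from the prepared entity set."""
    # Find entity spans (start, end, type) from the BIO labels.
    spans = []
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            j = i + 1
            while j < len(labels) and labels[j] == "I-" + label[2:]:
                j += 1
            spans.append((i, j, label[2:]))
    if not spans:
        return None
    start, end, etype = rng.choice(spans)
    original = tokens[start:end]
    # Candidates must share the entity type and differ from the original.
    candidates = [e for e in entity_sets.get(etype, []) if e != original]
    if not candidates:
        return None
    new_entity = rng.choice(candidates)
    new_tokens = tokens[:start] + new_entity + tokens[end:]
    new_labels = (labels[:start]
                  + ["B-" + etype] + ["I-" + etype] * (len(new_entity) - 1)
                  + labels[end:])
    return new_tokens, new_labels
```

For example, with entity_sets = {"LOC": [["New", "York"], ["Los", "Angeles"]]}, the sentence "John lives in New York" yields the counterfactual "John lives in Los Angeles" with its BIO labels adjusted to the new entity length.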

3) Example Discrimination
A key issue is that not all counterfactual examples are correct or useful. We need a mechanism to discriminate which counterfactual examples are good and to make sure they do not introduce noise. An intuitive solution is to regard the NER model trained on the original dataset as the discriminator, which provides good prior knowledge for inspecting our counterfactual examples. More specifically, as shown in Figure 3(3), the discriminator checks whether the replaced entity is successfully predicted. If not, the counterfactual example is discarded; otherwise, it is kept.
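The discrimination step can be sketched as follows; `model_predict` is a hypothetical stand-in for the NER model trained on the original dataset:

```python
def keep_counterfactual(model_predict, tokens, labels, span):
    """Keep a counterfactual example only if the discriminator (the NER model
    trained on the original dataset) recognizes the replaced entity correctly.

    model_predict(tokens) -> predicted BIO labels (hypothetical interface).
    span: (start, end) token indices of the replaced entity.
    """
    predicted = model_predict(tokens)
    start, end = span
    # The replaced entity must be predicted with exactly the gold labels.
    return predicted[start:end] == labels[start:end]
```

Examples passing this check are added to the augmented dataset; the rest are discarded.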
After executing all procedures, we have an augmented dataset.

Experiments
In this section, we mainly evaluate our method across three NER datasets, including two medical concept recognition datasets, and a conventional NER dataset.

Dataset
CNER. CNER is a Chinese clinical NER dataset from the CCKS-2019 challenge, covering anatomy, disease, imaging examination, laboratory examination, drug, and operation. We extract 1650 available medical records from CNER, which contain entities of the disease type only.
IDiag. To guarantee the diversity of the experimental data, we use Label Studio to create a new medical NER dataset. We collect 12127 health record images from the hospital, which are converted into text paragraphs by optical character recognition (OCR). We hire annotators to label diagnoses in these text paragraphs. To ensure the high quality of the dataset, we remove 539 data examples from the final dataset. It is worth noting that the distribution of IDiag differs substantially from that of CNER due to text recognition errors from OCR.
CLUENER (Xu et al., 2020). In addition to the medical NER datasets, we also use a conventional NER dataset, CLUENER, released by the CLUE organization. It is a well-defined and fine-grained dataset for named entity recognition in Chinese, including 10 categories such as Person Name, Organization, and Book. We extract 12090 available instances from this dataset.
Each dataset is divided into three portions: 80% (D_1), 10% (D_2), and 10% (D_3). D_1 is used to train models, D_2 to tune hyperparameters, and D_3 to test model performance (see Table 1).
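The 80/10/10 split can be sketched as follows (an illustrative sketch; the seed and shuffling strategy are our assumptions, not stated in the paper):

```python
import random

def split_dataset(data, seed=13):
    """Split a dataset into 80% train (D1), 10% dev (D2), and 10% test (D3)."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])
```

The dev and test portions stay fixed across all experiments; only subsets of D1 are used for training.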

Models
We conduct our experiments by using the following two classic models: LSTMTagger (Chiu and Nichols, 2016) and BERTTagger (Devlin et al., 2019). Our LSTMTagger consists of a bidirectional LSTM for encoding the input example and a dense layer (Tagger) for tagging all tokens. Each token is embedded by the pretrained word embedding (Song et al., 2018). Similarly, our BERTTagger consists of a pretrained BERT for encoding the input example and a dense layer (Tagger) for tagging all tokens.

Setup
In our experiments, we evaluate our method in two settings: NoAug and Aug. NoAug means that we train our models on the original dataset; Aug means that we train our models on the augmented dataset. We also set up five groups of experiments for each dataset. In each group, we select only N (100, 200, 300, 400, or 500) examples from the train set to train models, in order to evaluate performance under limited observational examples. The dev set and test set are kept unchanged in all experiments.
Additionally, we conduct another experiment to calculate the ACE of entity E or context C on the model performance Y. We design a special token [EMPTY] to replace tokens in entity E and tokens in context C separately, which we view as interventions. Once a token is replaced with [EMPTY] in an input example, all dimensions of its token embedding become zero. There are two intervention schemes, corresponding to Figure 2(b) and Figure 2(c) respectively: • do(e_0) Replacing all tokens in entities with [EMPTY] and keeping the context unchanged.
• do(c 0 ) Replacing all tokens in the input example with [EMPTY] except tokens in entities.
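The two intervention schemes can be sketched as follows; this is an illustration at the token level (in the actual setup, [EMPTY] maps to an all-zero embedding):

```python
EMPTY = "[EMPTY]"

def is_entity_token(label):
    """A token belongs to an entity if its BIO label is B-* or I-*."""
    return label.startswith(("B-", "I-"))

def do_e0(tokens, labels):
    """do(e0): replace all entity tokens with [EMPTY], keep context unchanged."""
    return [EMPTY if is_entity_token(l) else t for t, l in zip(tokens, labels)]

def do_c0(tokens, labels):
    """do(c0): replace all context tokens with [EMPTY], keep entities unchanged."""
    return [t if is_entity_token(l) else EMPTY for t, l in zip(tokens, labels)]
```

Running the intervened inputs through the trained model and comparing F1 scores yields the ACE estimates.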

NER Evaluation
In this work, we mainly consider performance at the entity level, which means the ground truth and the prediction must have the same entity type, and overlapping boundaries are taken into account. Hence, we use the relaxed metrics (Chinchor and Sundheim, 1993): micro-average F1 score (F1), precision (P), and recall (R). Besides, we also use the micro-average F1 score at the token level for the later causal analysis, which evaluates predictions token by token.
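A minimal sketch of micro-averaged scoring over entity spans is shown below; note that the paper's relaxed metric also credits overlapping boundaries, while this sketch uses exact span matching for brevity:

```python
def micro_f1(gold_spans, pred_spans):
    """Micro-averaged precision/recall/F1 over entity spans.
    Each span is a (start, end, type) tuple; exact matches count as correct."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # true positives
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

The token-level variant applies the same formula, but counts label matches per token rather than per span.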

RI Index
We design an index to evaluate the Relative Importance (RI) between entity E and context C, denoted as:

RI = ACE_C − ACE_E

The higher the RI, the more important the entity representation is during model inference; otherwise, the context representation is more important. For example, the two have the same importance when RI = 0. We adopt two different ways (entity level and token level) to calculate the variable Y (the F1 score), obtaining both coarse-grained and fine-grained results.
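Given F1 scores measured before and after each intervention, ACE and RI can be computed as sketched below (assuming RI = ACE_C − ACE_E, the difference described in the Table 3 caption; the example values are hypothetical):

```python
def ace(f1_intervened, f1_original):
    """Average Causal Effect: E[Y | do(v0)] - E[Y | do(v)]."""
    return f1_intervened - f1_original

def relative_importance(f1_original, f1_do_c0, f1_do_e0):
    """RI = ACE_C - ACE_E: higher RI means entity representation matters more."""
    ace_c = ace(f1_do_c0, f1_original)  # context emptied, entities kept
    ace_e = ace(f1_do_e0, f1_original)  # entities emptied, context kept
    return ace_c - ace_e
```

For instance, if the original F1 is 0.85, emptying the context drops it to 0.70 (ACE_C = -0.15) while emptying the entities drops it to 0.20 (ACE_E = -0.65), then RI = 0.50 > 0, indicating entity representation dominates.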

Main Results
Table 2 shows the comparison between NoAug and Aug. We can see that our method achieves a substantial improvement in almost all settings. For CNER, our method achieves the best results, yielding a boost of 8.68% on average. Even for IDiag, which contains much noise, our method still achieves a gain of +5.2% on average. We notice that our method only achieves a boost of 4.43% on average for CLUENER. Compared to the former two datasets, the smaller performance boost comes from the fact that CLUENER contains more entity types than CNER and IDiag (10 vs. 1 and 1). Table 3 demonstrates the ACE results for different combinations of datasets, models, and evaluation ways on the test set without augmentation. Interestingly, checking the RI index, we observe that the importance of entity representation is always far greater than that of context representation.

[Table 3 caption: ACE_C denotes the ACE when intervening on the variable C by c_0. ACE_E denotes the ACE when intervening on the variable E by e_0. The RI index denotes the difference between ACE_C and ACE_E, which indicates the relative importance between entity representation and context representation. The higher the RI index, the more important the entity representation is during model inference.]

Discussion
In this section, we first review our previous results and try to answer some potential questions for a deeper understanding of our method. We then provide some real counterfactual examples to vividly illustrate our method. Finally, we present some limitations of our method that we have found so far, to guide future research and help readers understand our method better.

Analysis
Our method achieves significant improvements across three datasets, but a few questions remain. Q1: Do counterfactual examples change the causal effects of entity E and context C on model performance Y? Q2: Why does this simple method perform well on small training data? Q3: Are those counterfactual examples, created out of thin air, correct or reasonable? Q4: A related work also makes use of entity replacement to enhance a pretrained language model for the zero-shot fact completion task, but it treats the replaced entity as a negative sample (Xiong et al., 2020). Why can using a counterfactual example as a positive sample here still improve performance?

Answer for Q1
Since the RI index is large when there is no data augmentation with counterfactual examples, a question arises: how does the RI index change after using counterfactual examples to train the NER model? Hence, we design another experiment to compare the RI between the non-augmented and augmented settings. As shown in Figure 4, we observe that the RI index increases in most cases after using counterfactual examples. Even in the experimental groups where it does not increase, it is almost always positive. Compared to context representation, entity representation dominates the performance of the NER model in most cases, especially for models based on BERT. This phenomenon suggests that the non-spurious correlations are located more in entity representation than in context representation.

Answer for Q2
The essence of our method is to force the disentanglement of entity E and context C in the input example and to recombine them to generate new counterfactual examples. Earlier, we argued that the rationality of this operation lies in the fact that entity and context are not causally related. Our causal results show that the NER model pays more attention to entities than to context during prediction. Agarwal et al. also find that entity representation contributes more than context representation to system performance. Hence, to a certain extent, context representation may carry more of the spurious correlations between input features and output labels. The recombination of entity and context can increase the diversity of the training data.

[Figure 5: Examples of counterfactual generation and entity recognition. Observational example: "Mt. Rainier National Park has very complete roads and service facilities, allowing you to see the turquoise waters under the deep canyons and the colorful wildflowers all over the mountains." Counterfactual example 1 replaces the location entity with "Fraser Island"; counterfactual example 2 replaces it with "Kiyomizu Temple". Entity recognition is compared between the model without augmentation and the model with augmentation on the sentence "Famous sights such as Brooklyn Bridge, South Street Seaport, Governors Island and Ellis Island, the most beautiful scenery in New York will be yours!"]

Answers for Q3 and Q4

Counterfactual examples created in this way may conflict with factual knowledge, which is why related work treats the replaced entity as a negative sample (Xiong et al., 2020). On the other hand, these counterfactual examples are reasonable for named entity recognition, because the task only focuses on finding entities and ignores the factual information. More importantly, our method preserves the linguistic correctness of the counterfactual example, since the replaced entity and the candidate entity have the same entity type. These are the deep reasons why the generalization ability of the NER model improves after we treat the counterfactual example as a positive sample.

Case Study
As shown in Figure 5, we illustrate counterfactual generation and entity recognition using the LSTMTagger model trained on the CNER dataset with training sample size N = 200. This illustration shows that our method can break the entanglement of the spurious and non-spurious features in the input example in the setting of limited observational examples.

Limitations
Although the experimental results have shown the effectiveness of our method, it can be further improved in terms of obtaining the most reasonable counterfactual examples. The capability of the current discriminator is limited, and the number of counterfactual examples judged reasonable is large, which is problematic especially for larger train sets (N > 500). Although these examples increase the diversity of entity-context combinations over the existing observational examples, they contain many repeated text fragments. As far as we know, too many repeated text fragments can prevent the CRF layer from converging.

Related Works
We introduce the related works from two aspects:

Data Modification and Causality
Recently, there has been an increasing number of research works on data modification for providing interpretability of neural models. For example, Feng et al. and Gururangan et al. reveal that neural models are overconfident in their predictions by reducing words or sentences; Ebrahimi et al. also find that adversarial examples generated by character-level or word-level manipulations can trick a neural classifier. Additionally, considerable attention has been paid to utilizing data modification for augmenting datasets or providing a supervised signal during training. For example, a rule-based data augmentation protocol has been proposed to provide a compositional inductive bias (Andreas, 2020); Kaushik et al. create new counterfactual sentences by modifying the original sentence to ameliorate the harm of spurious correlations; Xiong et al. introduce type-constrained entity replacements to provide extra training signal for learning better factual knowledge. Interestingly, both lines of work on data modification are highly connected with causal inference, because we can regard data modification as an intervention (Pearl et al., 2016).

Named Entity Recognition

In this paragraph, we mainly focus on named entity recognition with limited supervision. One way to train the NER model with low resources is dictionary-based distant supervision (Fries et al., 2017; Shang et al., 2018; Yang et al., 2018), which builds a dictionary of entities for creating training data without too much effort. Few-shot learning is another promising way to train the NER model under limited supervision by transferring prior knowledge from a source domain to a new domain (Fritzler et al., 2019; Hou et al., 2019). There are also some works that focus on redefining NER as a different problem to reduce the need for hand-labeled training data.
For example, Linking Rules (Safranchik et al., 2020) recognize entities through votes on whether adjacent elements in a sequence belong to the same class; Lin et al. propose "entity triggers", an effective proxy for human explanations, to encourage label-efficient learning of NER models.

Conclusion
In this paper, we propose a weakly-supervised method from a causal perspective and provide the interpretability of our method with a structural causal model. Our method improves generalization ability under limited observational examples. Our causal experiments suggest that the non-spurious correlations are located more in entity representation than in context representation. Importantly, our method eliminates part of the spurious correlations between input features and output labels.
e := f_E(g)
c := f_C(g)
x := f_X(e, c)

where G is a confounding variable that influences the generation of both entity E and context C, X is the input example generated from E and C, and Y is the evaluation result (the F1 score) of the NER model. For clarity, we omit the unmeasured variables.
We use the mathematical operator do(c_0) to simulate physical interventions by fixing the value of the variable C as c_0 (see Figure 2(b)). The post-intervention distribution P(y|do(c_0)) gives the proportion of individuals that would attain response level Y = y under the hypothetical situation in which treatment C = c_0 is administered uniformly to the population. In order to calculate P(y|do(c_0)), based on Bayes' rule, we have

P(y|do(c_0)) = Σ_x P(y|do(c_0), x) P(x|do(c_0)) = Σ_x P(y|do(x)) P(x|do(c_0))

For gauging the effect of context C on the input example X, we need to calculate P(x|do(c_0)). However, there is a confounding variable G that affects both entity E and context C. Fortunately, the variable E meets the backdoor criterion and blocks the backdoor path C ← G → E → X. Using the adjustment formula, we have

P(x|do(c_0)) = Σ_e P(x|c_0, e) P(e)

In this condition, we have P(e) = 1 because our entity E = e is unchanged and unique for each input example. Besides, we also have P(x|do(c_0)) = 1 because our input example X = x, though changed, is unique. Therefore, we have P(y|do(c_0)) = 1 due to the determinism of the NER model for a given input example. Similarly, as shown in Figure 2(c), we can also intervene on the variable E, denoted as do(e_0), and obtain the same post-intervention distribution P(y|do(e_0)) = 1.