PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation

De-identification is the task of identifying protected health information (PHI) in clinical text. Existing neural de-identification models often fail to generalize to new datasets. We propose PHICON, a simple yet effective data augmentation method, to alleviate this generalization issue. PHICON consists of PHI augmentation and Context augmentation, which create augmented training corpora by replacing PHI entities with named entities sampled from external sources and by changing the background context through synonym replacement or random word insertion, respectively. Experimental results on the i2b2 2006 and 2014 de-identification challenge datasets show that PHICON helps three selected de-identification models boost F1-score by up to 8.6% in the cross-dataset test setting. We also discuss how much augmentation to use and how each augmentation method influences performance.


Introduction
Clinical text in electronic health records (EHRs) often contains sensitive information. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) (e.g., name, street address, phone number) be removed before EHRs are shared for secondary uses such as clinical research (Meystre et al., 2014).
The task of identifying and removing PHI from clinical text is referred to as de-identification. Although many neural de-identification models, both LSTM-based (Dernoncourt et al., 2017; Liu et al., 2017; Jiang et al., 2017; Khin et al., 2018) and BERT-based (Alsentzer et al., 2019; Tang et al., 2019), have achieved very promising performance, identifying PHI remains challenging in real-world scenarios: even well-trained models often fail to generalize to a new dataset. For example, we conduct a cross-dataset test on the i2b2 2006 and i2b2 2014 de-identification challenge datasets (i.e., we train a widely-used de-identification model, NeuroNER (Dernoncourt et al., 2017), on one dataset and test it on the other). The result in Figure 1 shows that the model's F1-score on the new dataset decreases by up to 33% compared to the original test set. This poor generalization on de-identification is also reported in previous studies (Stubbs et al., 2017; Yang et al., 2019; Johnson et al., 2020; Hartman et al., 2020).
To explore what factors lead to poor generalization, we sample some error examples and find that the model may focus too much on specific entities without really learning language patterns. For example, in Figure 2, given the sentence "She met Washington in the Ohio Hospital", the model tends to recognize the entity "Washington" as a "Location" instead of a "Name" if "Washington" appears as a "Location" many times in the training data. Such cases appear more frequently in a new test set, leading to poor generalization.
To prevent the model from overfitting on specific cases and to encourage it to learn general language patterns, one possible way is to enlarge the training data (Yang et al., 2019). However, clinical texts are usually difficult to obtain, not to mention the tremendous expert effort required for annotation (Yue et al., 2020). To address this, we introduce our data augmentation method PHICON, which consists of PHI augmentation and Context augmentation. Specifically, PHI augmentation replaces an original PHI entity in the training set with a same-type named entity sampled from external sources (such as Wikipedia). For example, in Figure 2, "Ohio Hospital" is replaced by a randomly-sampled "Hospital" entity, "Alaska Health Center". For context augmentation, we randomly replace or insert some non-stop words (e.g., verbs, adverbs) in sentences to create new sentences, as shown in Figure 2. The augmented data does not change the meaning of the original sentences but increases their diversity, which helps the model learn contextual patterns and prevents it from focusing on specific PHI entities.

Data augmentation is widely used in many NLP tasks (Xie et al., 2017; Ratner et al., 2017; Kobayashi, 2018; Yu et al., 2018; Bodapati et al., 2019; Wei and Zou, 2019) to improve models' robustness and generalizability. However, to the best of our knowledge, no prior work explores its potential for clinical text de-identification. We test two LSTM-based models, NeuroNER (Dernoncourt et al., 2017) and DeepAffix (Yadav et al., 2018), and one BERT-based (Devlin et al., 2019) model, ClinicalBERT (Alsentzer et al., 2019), with our PHICON. Cross-dataset evaluations on the i2b2 2006 and i2b2 2014 datasets show that PHICON can boost the models' generalization performance by up to 8.6% in terms of F1-score. We also discuss how much augmentation is needed and conduct an ablation study to explore the effect of PHI augmentation and context augmentation.
To summarize, our PHICON is simple yet effective and can be used together with any existing machine learning-based de-identification system to improve its generalizability on new datasets.


PHICON
As discussed in the introduction, de-identification models might focus too much on specific entities (e.g., recognizing "Washington" as "Location") but fail to learn general language patterns (e.g., "met" is not usually followed by a "Location" entity but by a "Name" entity). Consequently, unseen or out-of-vocabulary PHI entities might be hard to identify correctly, leading to lower performance. To help models better identify these unseen PHI entities, we encourage them to learn contextual patterns and linguistic characteristics and prevent them from focusing too much on specific PHI tokens.

PHI Augmentation. To achieve this goal, we first introduce PHI augmentation: creating more training corpora by replacing original PHI entities in a sentence with other named entities of the same PHI type. For example, in Figure 2, "Washington" is replaced by a randomly-sampled Name entity "William" and "Ohio Hospital" is replaced by a randomly-sampled Hospital entity "Alaska Health Center". We construct 11 candidate lists for sampling the different PHI types. The lists are either obtained by scraping online web sources (e.g., Wikipedia lists) or randomly generated from predefined regular expressions (the size and source of each candidate list are shown in Table 1).

Context Augmentation. To further help models focus on contextual patterns and reduce overfitting, inspired by previous work (Wei and Zou, 2019), we leverage two text editing techniques to modify the background context: synonym replacement (SR) and random insertion (RI) (examples are shown in Figure 2). Specifically, SR finds four types of non-stop words (adjectives, verbs, adverbs and nouns) in a sentence and replaces them with synonyms from WordNet (Fellbaum and Miller, 1998). RI inserts random adverbs in front of verbs and adjectives, as well as random adjectives in front of nouns.

For each sentence containing PHI entities in the corpus, we apply both PHI augmentation and Context augmentation to obtain the augmented data D_aug. We can run the procedure α times (with different random seeds) to obtain different amounts of augmented data (e.g., α = 2 means augmenting the original dataset twice). Although a larger α yields a larger augmented training corpus, it may also introduce more noise; we recommend a small value for α (see Section 4.2 for more discussion). We then merge D_aug with the original dataset D to form the final training set D_new:

D_new = D ∪ D_aug

In summary, PHICON can significantly increase the diversity of training data without involving more labeling effort. The augmented data enriches contextual patterns, which helps prevent the model from focusing too much on specific PHI entities and encourages it to learn general language patterns.
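The two augmentation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the candidate lists, synonym table and adverb pool are tiny placeholders (the paper samples from 11 lists built from web sources and regular expressions, and uses WordNet for synonyms), and the part-of-speech checks that restrict SR and RI are omitted for brevity.

```python
import random

# Placeholder candidate lists; the paper scrapes these from sources
# such as Wikipedia lists or generates them from regular expressions.
CANDIDATES = {
    "Name": ["William", "Maria Lopez"],
    "Hospital": ["Alaska Health Center", "Bay General Hospital"],
}

# Toy synonym table standing in for WordNet lookups.
SYNONYMS = {"met": ["encountered"], "transferred": ["moved"]}
ADVERBS = ["recently", "promptly"]


def phi_augment(tokens, labels, rng):
    """Replace each contiguous PHI span with a sampled same-type entity."""
    out_tokens, out_labels, i = [], [], 0
    while i < len(tokens):
        if labels[i] != "O":
            phi_type = labels[i]
            j = i
            while j < len(tokens) and labels[j] == phi_type:
                j += 1  # consume the whole PHI span
            replacement = rng.choice(CANDIDATES[phi_type]).split()
            out_tokens.extend(replacement)
            out_labels.extend([phi_type] * len(replacement))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append("O")
            i += 1
    return out_tokens, out_labels


def context_augment(tokens, labels, rng, p=0.3):
    """Synonym replacement (SR) and random insertion (RI) on non-PHI tokens.

    The paper restricts both edits by part of speech (e.g., adverbs are
    inserted only before verbs and adjectives); that check is omitted here.
    """
    out_tokens, out_labels = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "O" and tok in SYNONYMS and rng.random() < p:
            tok = rng.choice(SYNONYMS[tok])         # SR on a non-stop word
        if lab == "O" and rng.random() < p:
            out_tokens.append(rng.choice(ADVERBS))  # RI of a random adverb
            out_labels.append("O")
        out_tokens.append(tok)
        out_labels.append(lab)
    return out_tokens, out_labels
```

Applied to the Figure 2 sentence, `phi_augment` swaps the "Washington" and "Ohio Hospital" spans for sampled entities while keeping the label sequence aligned with the tokens.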

Datasets
We adopt two widely-used de-identification datasets, the i2b2 2006 dataset and the i2b2 2014 dataset, and split each into training, validation and test sets in a 7:1:2 ratio by number of notes. We remove low-frequency PHI types (those occurring fewer than 20 times) from the datasets. To avoid PHI inconsistency between the two datasets, we map and merge some fine-grained PHI types into coarse-grained types, finally preserving five PHI categories: Name (Doctor, Patient, Username), Location (Hospital, Location, Zip, Organization), Date, ID (ID, Medical Record), and Contact (Phone). The statistics of the datasets are shown in Table 2.
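The note-level 7:1:2 split can be sketched as below; the note IDs and seed are illustrative, and splitting by note (rather than by sentence) keeps all sentences of a record in the same partition.

```python
import random

def split_notes(note_ids, seed=42):
    """Shuffle note IDs and split them 7:1:2 into train/dev/test sets."""
    ids = list(note_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.7 * len(ids))
    n_dev = int(0.1 * len(ids))
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test
```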

Does PHICON improve generalization?
In our preliminary experiments, we find that poor generalization tends to be more severe when the training set is small. Thus, we consider the following training set fractions (%): {20, 40, 60, 80, 100}, and we set the augmentation factor α = 2, balancing effectiveness and time efficiency (see the influence of α in Section 4.2). Table 3 shows the overall results; interesting findings include: (1) PHICON consistently improves the generalizability of each de-identification model under different training sizes. This is not surprising, as both PHI augmentation and context augmentation increase linguistic richness and encourage models to focus more on language patterns, which helps train more generalizable models.
(2) In general, the performance boost is larger when the training set is relatively small. This is because PHICON plays a larger role in the low-resource case, where it can significantly increase data diversity, language patterns, and linguistic richness.
(3) The performance boost on the BERT-based model is less obvious than on the LSTM-based models. Since ClinicalBERT has already been pretrained on a large-scale corpus (MIMIC-III clinical notes; Johnson et al., 2016), it is reasonable that the augmented data does not yield a large boost. However, the boost is still significant when the training set is relatively small.
(4) The boost in the "2006→2014" setting is larger than in the "2014→2006" setting, because the i2b2 2014 dataset has more data and more comprehensive PHI patterns than the i2b2 2006 dataset; data augmentation is usually more effective when the training set is smaller (Wei and Zou, 2019).

Improvement for each PHI category. To further understand PHICON, Figure 3 shows the per-category performance ("2014→2006") of the base NeuroNER model and NeuroNER + PHICON. When the training set is relatively small (e.g., 20%), the improvement on each PHI category is generally significant. As the training set size increases, the contribution of the augmented data shrinks. However, for PHI categories with less training data (e.g., Location and ID; see Table 2), PHICON still contributes substantial improvement. We thus conclude that PHICON may be most helpful in the low-resource case.

How much augmentation?
In this section, we discuss the influence of the augmentation factor α on cross-dataset test performance. In Figure 4, we report the dev-set performance of NeuroNER for α = {1, 2, 3, 4}. In the first setting ("2006→2014"), performance is steadily boosted as α increases, while in the second setting ("2014→2006"), performance first goes up and then drops. This difference might be caused by the sizes of the two datasets (the 2014 dataset is larger). When the corpus is large, enlarging the augmentation factor might not lead to better performance, as the real data may already cover very diverse language patterns. In addition, more augmented data might bring noise, which could decrease performance. In terms of time efficiency, when α is increased by 1, the training time would roughly double under the same number of epochs. So, considering effectiveness, efficiency and data size, we recommend setting α to a relatively small value (e.g., 2) in real applications.
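The trade-off above follows directly from how the final training set is built; a minimal sketch of the α-fold merge, with a hypothetical `augment_fn` standing in for PHICON's two augmentation steps:

```python
import random

def build_training_set(dataset, augment_fn, alpha=2):
    """Return D_new = D ∪ D_aug, where D_aug holds alpha augmented copies.

    Each pass uses its own seed so the copies differ; the merged set has
    (alpha + 1) times as many examples as the original D.
    """
    augmented = [augment_fn(example, random.Random(seed))
                 for seed in range(alpha)
                 for example in dataset]
    return list(dataset) + augmented
```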

Ablation Study
In this section, we perform an ablation study of PHICON based on NeuroNER to explore the effect of each component: PHI augmentation and context augmentation. Table 4 shows that both components contribute to boosting model generalization. The performance boost from PHI augmentation is more pronounced than that from context augmentation, i.e., PHI augmentation plays the major role. When the two are combined, PHICON yields a larger boost than either alone.

Conclusion
In this paper, we explore the generalization issue in the clinical text de-identification task. We propose a data augmentation method named PHICON that augments both PHI and context to boost model generalization. The augmented data increases data diversity and enriches the contextual patterns in the training data, which may prevent the model from overfitting on specific PHI entities and encourage it to focus more on language patterns. Experimental results demonstrate that PHICON helps improve models' generalizability, especially in the low-resource case (i.e., when the original training set is small). We also discuss how much augmentation to use and how each augmentation method influences performance. In future research, we will explore more advanced data augmentation techniques for improving de-identification models' generalization performance.