De-identifying Free Text of Japanese Dummy Electronic Health Records

Japan recently enacted a law that promotes the use of EHRs for research and development, but EHRs must be de-identified before such use. However, automatic de-identification in the healthcare domain has not been actively studied for the Japanese language, and, as far as we know, no de-identification tool with practical performance is available for Japanese medical text. Previous work shows that rule-based methods are still effective, although deep learning methods have recently been reported to perform better. To implement and evaluate a de-identification tool at a practical level, we implemented three methods: rule-based, CRF-based, and LSTM-based. We prepared three datasets of pseudo EHRs with manually annotated de-identification tags. These datasets are derived from shared task data, which allows comparison with previous work, and from our new data, which increases the amount of training data. Our results show that our LSTM-based method performs better and is more robust; as future work, we plan to apply our system to actual de-identification tasks in hospitals.


Introduction
Recently, the amount of healthcare data held by companies and government has been increasing. In particular, the utilization of Electronic Health Records (EHRs) is one of the most important tasks in the healthcare domain. Although EHRs must be de-identified to protect personal information, automatic de-identification of EHRs has not been studied sufficiently for the Japanese language.
As in other countries, new laws on the handling of medical data have been established in Japan. The "Act Regarding Anonymized Medical Data to Contribute to Research and Development in the Medical Field" was established in 2018. This law allows specific third-party institutions to handle EHRs. As commercial and non-commercial health data has already been increasing in recent years 1, this law promotes the utilization of more health data. At the same time, developers are required to de-identify personal information. The "Personal Information Protection Act" was established in 2017 and requires EHRs to be handled more strictly than other personal information. This law defines personal identification codes, including individual numbers (e.g. health insurance card, driver's license card, and personal number), biometric information (e.g. fingerprint, DNA, voice, and appearance), and information on disabilities.
De-identification of structured data in EHRs is easier than that of unstructured data, because de-identification methods such as k-anonymization (Sweeney, 2002) can be applied straightforwardly.
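To illustrate why structured data is the easier case, the following minimal sketch checks the k-anonymity property on a small table. The records, the column names, and the function name `is_k_anonymous` are hypothetical and only serve the illustration; the sketch assumes the values have already been generalized (age bands, truncated postal codes).

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized records (not taken from any real dataset).
records = [
    {"age": "50-59", "zip": "430**", "diagnosis": "diabetes"},
    {"age": "50-59", "zip": "430**", "diagnosis": "hypertension"},
    {"age": "60-69", "zip": "432**", "diagnosis": "asthma"},
    {"age": "60-69", "zip": "432**", "diagnosis": "diabetes"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))
print(is_k_anonymous(records, ["age", "zip"], k=3))
```

Each (age, zip) combination appears twice here, so the table satisfies the property for k=2 but not for k=3. No comparable column-wise check exists for free text, which is why the rest of this paper treats de-identification as a named entity recognition problem.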
While rule-based methods, SVM, and CRF (Lafferty, McCallum, & Pereira, 2001) were often used in these previous NER tasks, deep neural network models have shown better results recently. However, rule-based methods are still often better than machine learning methods, especially when there is not enough data; one example is the best system in MedNLPDoc (Aramaki, Morita, Kano, & Ohkuma, 2016). The aim of the MedNLPDoc task was to infer ICD codes of diagnoses from Japanese EHRs.
In this paper, we focus on de-identification of the free text of EHRs written in the Japanese language. We compare three methods (rule-based, CRF-based, and LSTM-based) using three datasets that are derived from EHRs and discharge summaries.
We follow the MedNLP-1 standard of personal information, which requires de-identifying "age", "hospital", "sex" and "time".

Methods
We used the Japanese morphological analyzer kuromoji 2 with our customized dictionary, in the same way as the team with the best result (Sakishita & Kano, 2016) in the MedNLPDoc task. We implemented three methods, described below: rule-based, CRF-based, and LSTM-based.

Rule-based Method
Unfortunately, the details and implementation of the best method in the MedNLP-1 de-identification task (Imaichi, Yanase, & Niwa, 2013) are not publicly available. We implemented our own rule-based program based on the descriptions in their paper. Our rules are shown below. For a target word x:
age (subject's age with its suffix)
- If the detailed POS is number, apply the rules in Table 1.
- If the detailed POS of x is number and x is followed by either 歳 (old), 才 (old), or 代 ('s), mark as age.
- If it is further followed by any of "より", "まで", "前半", "後半", "以上", "以下", "時", "頃", "ごろ", "ころ", "から", "前半から", "後半から", "頃から", "ごろから", "ころから" and so on, include these words in the marked age.
hospital (hospital name)
- If one of the following keywords appears, mark as hospital: 近医 (a near clinic or hospital), 当院 (this clinic or hospital), 同院 (same clinic or hospital).
- If the POS is noun and the detailed POS is not non-autonomous word, or x is one of "•", "◯", "▲" or "■" (these symbols are used for manual de-identification because the datasets are pseudo EHRs), then if the suffix of x is one of the following keywords, mark as hospital: 病院 (hospital or clinic), クリニック (clinic), 医院 (clinic).
sex
- If x is one of 男性 (man), 女性 (woman), men, women, man, woman, mark as sex.
time (subject's time with its suffix)
- If the detailed POS is number and x is a concatenation of a four-, two-, or one-digit number, a slash, and a two-digit number (e.g. yyyy/mm or mm/dd), mark as time.
2 https://www.atilika.com/en/kuromoji/
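The rules above can be sketched in a few lines of Python. This is a minimal illustration and not our actual program: it assumes the input is already tokenized into (surface, POS, detailed POS) triples, the English POS labels ("noun", "number", "non-autonomous") are hypothetical stand-ins for the analyzer's output, "suffix of x" is read as the token that follows x, and a number followed by 歳, 才, or 代 is treated as an age expression. The Table 1 rules are not reproduced.

```python
import re

# Keyword lists copied from the rules; POS labels are hypothetical English names.
HOSPITAL_KEYWORDS = {"近医", "当院", "同院"}
HOSPITAL_SUFFIXES = {"病院", "クリニック", "医院"}
MASK_SYMBOLS = {"•", "◯", "▲", "■"}
SEX_WORDS = {"男性", "女性", "men", "women", "man", "woman"}
AGE_SUFFIXES = {"歳", "才", "代"}
EXTENSIONS = {"より", "まで", "前半", "後半", "以上", "以下",
              "時", "頃", "ごろ", "ころ", "から"}
# Four-, two-, or one-digit number, a slash, then a two-digit number.
DATE_PATTERN = re.compile(r"(?:\d{4}|\d{1,2})/\d{2}")


def tag_tokens(tokens):
    """Apply the rules to (surface, pos, detailed_pos) tokens.

    Returns a list of (text, label) pairs in order of appearance.
    """
    entities = []
    i = 0
    while i < len(tokens):
        surface, pos, detail = tokens[i]
        following = tokens[i + 1][0] if i + 1 < len(tokens) else None
        is_masked = bool(surface) and all(ch in MASK_SYMBOLS for ch in surface)
        is_candidate = (pos == "noun" and detail != "non-autonomous") or is_masked
        label, end = None, i + 1
        if surface in HOSPITAL_KEYWORDS:
            label = "hospital"
        elif is_candidate and following in HOSPITAL_SUFFIXES:
            label, end = "hospital", i + 2
        elif surface in SEX_WORDS:
            label = "sex"
        elif detail == "number" and DATE_PATTERN.fullmatch(surface):
            label = "time"
        elif detail == "number" and following in AGE_SUFFIXES:
            label, end = "age", i + 2
            # Absorb trailing words such as 以上 or 頃 + から into the span.
            while end < len(tokens) and tokens[end][0] in EXTENSIONS:
                end += 1
        if label is not None:
            entities.append(("".join(t[0] for t in tokens[i:end]), label))
        i = end
    return entities


# Hand-made token sequence for "65歳以上の男性、近医より◯病院に紹介。2010/03".
tokens = [
    ("65", "noun", "number"), ("歳", "noun", "suffix"),
    ("以上", "noun", "non-autonomous"), ("の", "particle", ""),
    ("男性", "noun", "general"), ("、", "symbol", ""),
    ("近医", "noun", "general"), ("より", "particle", ""),
    ("◯", "symbol", "general"), ("病院", "noun", "general"),
    ("に", "particle", ""), ("紹介", "noun", "verbal"),
    ("。", "symbol", ""), ("2010/03", "noun", "number"),
]
print(tag_tokens(tokens))
```

On this hand-made input the sketch marks 65歳以上 as age, 男性 as sex, 近医 and ◯病院 as hospital, and 2010/03 as time. Compound extensions such as 頃から are covered because the loop absorbs consecutive extension tokens.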

CRF-based Method
As a classic machine learning baseline for sequence labeling, we employed CRF. Many teams in the MedNLP-1 de-identification task used CRF, including the second-best team and the baseline system. We used the mallet library 3 for our CRF implementation. We defined five training features for each token: part-of-speech (POS), detailed POS, character type (Hiragana, Katakana, Kanji, Number), whether the token is included in our user dictionary, and a binary feature indicating whether the token is at the beginning of a sentence.
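The five features can be illustrated with a short Python sketch. It is not the mallet-based implementation: the function names, the feature keys, the "other" fallback class, and the choice to decide the character type from the first character of the token are all assumptions made for this example.

```python
def char_type(surface):
    """Coarse character type, decided by the first character of the token."""
    if not surface:
        return "other"
    ch = surface[0]
    if ch.isdigit():
        return "number"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


def token_features(tokens, index, user_dictionary):
    """Build the five per-token features from (surface, pos, detailed_pos) triples."""
    surface, pos, detail = tokens[index]
    return {
        "pos": pos,
        "detailed_pos": detail,
        "char_type": char_type(surface),
        "in_user_dict": surface in user_dictionary,
        "is_sentence_start": index == 0,
    }


# Hand-made sentence; the POS labels and dictionary entry are hypothetical.
sentence = [
    ("当院", "noun", "general"), ("に", "particle", "case"),
    ("65", "noun", "number"), ("歳", "noun", "suffix"),
]
user_dictionary = {"当院"}
for i in range(len(sentence)):
    print(sentence[i][0], token_features(sentence, i, user_dictionary))
```

A CRF toolkit would receive one such feature set per token together with its IOB2 label; how the features are serialized depends on the toolkit.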

LSTM-based Method
We used a machine learning method that combines a bi-LSTM and a CRF using character-based and word-based embeddings, originally proposed by another group (Misawa, Taniguchi, Yasuhiro, & Ohkuma, 2017). In this method, both characters and words are embedded into feature vectors. Then a bi-LSTM is trained using these feature vectors. Finally, a CRF is trained on the output of the bi-LSTM, using character-level tags. The original method uses a skip-gram model to embed words and characters, trained on seven years of Mainichi newspaper articles containing almost 500 million words. However, we used GloVe 4 instead of the skip-gram model, because GloVe is reported to be more effective than skip-gram (Pennington, Socher, & Manning, 2014). We used existing word vectors 5 instead of the pretraining in the original method. Our training and prediction are word-based, while the original method is character-based. Our implementation is based on an open source API 6 .
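The difference between word-level and character-level tags can be made concrete with a small sketch. The function below is a hypothetical helper, not part of either implementation; it assumes IOB2 tags and expands a word-level tag sequence into the character-level sequence that a character-based model would predict.

```python
def word_tags_to_char_tags(words, tags):
    """Expand word-level IOB2 tags into one tag per character."""
    char_tags = []
    for word, tag in zip(words, tags):
        for offset in range(len(word)):
            if tag.startswith("B-") and offset > 0:
                # Only the first character of the word keeps the B- prefix.
                char_tags.append("I-" + tag[2:])
            else:
                char_tags.append(tag)
    return char_tags


words = ["65", "歳", "の", "男性"]
tags = ["B-age", "I-age", "O", "B-sex"]
print(word_tags_to_char_tags(words, tags))
```

Here the two-character word 男性 becomes B-sex followed by I-sex, and 65歳 becomes B-age, I-age, I-age. A word-based model emits one tag per word, so its entity boundaries always coincide with the boundaries produced by the morphological analyzer.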

Data
Our datasets are derived from two different sources. We used the MedNLP-1 de-identification task data to compare with previous work. This data includes pseudo EHRs of 50 patients. Although both training data and test data were provided, the test data is no longer publicly available, which makes direct comparison with previous work impossible. However, the training and test data were written by the same writer and were originally a single piece of data. Therefore, we assume that the training data has almost the same characteristics as the test data.
The other source is our dummy EHRs. We built our own dummy EHRs of 32 patients, assuming that the patients are hospitalized. The documents of our dummy EHRs were written by medical professionals (doctors). We manually annotated them for de-identification ourselves, following the guideline of the MedNLP-1 task.
All of these data are annotated with five types of de-identification tags: age, hospital, sex, time, and person. The MedNLP-1 data includes 2244 sentences and our dummy EHRs include 8327 sentences. In both sources, the writers hold doctor's licenses and described pseudo medical records for fictitious patients. However, the descriptions are not similar between the two sources, probably because the writers are different.

Evaluation method
Our evaluation followed MedNLP-1, using IOB2 tagging (Tjong Kim Sang & Veenstra, 1999). We applied four-fold cross-validation; the rule-based method does not require training data. From the two sources described above, we derived three datasets: MedNLP-1, dummy EHRs, and the combination of MedNLP-1 and dummy EHRs (mixture). We trained CRF and LSTM on this mixture data. We divided each data source separately into folds so that every fold keeps the same balance of the two sources. Our evaluation metric is strict match of named entities.
Table 2 shows the evaluation results. The best F1 score is achieved by the rule-based method, because the rules were tuned for the MedNLP-1 data. In both datasets, CRF and LSTM are not significantly different from the rule-based method. LSTM performed best for the hospital tag and the time tag, probably because these tags follow typical patterns with little variation. In the MedNLP-1 dataset, occurrences of sex are very few and there are no occurrences of person.
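The strict-match metric and the source-balanced fold assignment described in the evaluation method can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the official MedNLP-1 scorer: the function names are hypothetical, an entity counts as correct only if its start, end, and label all match, and folds are filled round-robin within each source.

```python
def extract_entities(tags):
    """Return the set of (start, end, label) spans in an IOB2 sequence (end exclusive)."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                entities.add((start, i, label))
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return entities


def strict_f1(gold_tags, predicted_tags):
    """Precision, recall, and F1 where only exactly matching spans count."""
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def split_folds_by_source(items_by_source, n_folds=4):
    """Assign items to folds round-robin per source to keep the source balance."""
    folds = [[] for _ in range(n_folds)]
    for source, items in items_by_source.items():
        for index, item in enumerate(items):
            folds[index % n_folds].append((source, item))
    return folds


gold = ["B-age", "I-age", "O", "B-sex", "O", "B-hospital", "I-hospital"]
pred = ["B-age", "I-age", "O", "B-sex", "O", "B-hospital", "O"]
print(strict_f1(gold, pred))

folds = split_folds_by_source({"mednlp": list(range(8)), "dummy": list(range(4))})
print([len(fold) for fold in folds])
```

In the example, the predicted hospital span is one token too short, so it counts as both a false positive and a false negative, and precision, recall, and F1 are all 2/3. A partial-match metric would give credit for that span; strict match does not.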

Result of Dummy-EHR dataset
The results are shown in Table 3. The best score is achieved by LSTM trained on the mixture dataset. Although the data size is about four times that of MedNLP-1, the result is only a little better. For CRF, training with the mixture dataset is worse than training with the dummy EHR dataset only. This is not the case for LSTM, which shows better results when trained on the mixture dataset.

Overall
We trained CRF and LSTM on the mixture dataset and evaluated them on the MedNLP-1, dummy-EHR, and mixture datasets individually. These results are shown in Table 4 and Table. For CRF, there is a 26-point difference on average between the evaluations on the MedNLP-1 and dummy-EHR datasets. In contrast, LSTM shows a 7-point difference on average. These results suggest that the datasets are quite different, but that LSTM absorbed these differences well.

Conclusion and Future Work
We implemented three different de-identification methods for Japanese EHRs. We applied these methods to three datasets derived from two different pseudo EHR sources with manually annotated de-identification tags. Our results show that LSTM is better than the other methods and is also more robust across different sources than CRF. Machine learning methods could extract named entities for de-identification at a level comparable to the rule-based method, which is manually tuned to specific target data. However, machine learning methods are still weak for expressions with few occurrences. Combining LSTM with the rule-based method is possible future work.
Because the current performance is high enough among publicly available Japanese de-identification tools, we plan to apply our system to actual de-identification tasks in hospitals.
Although it is still difficult to make real EHRs publicly available, we could use the large amount of EHRs inside our hospitals. Increasing the annotated dataset for such internal use is another direction for future work.

Acknowledgement
This work was partially supported by Japanese Health Labour Sciences Research Grant and JST CREST.