Exploring Word Segmentation and Medical Concept Recognition for Chinese Medical Texts

Chinese word segmentation (CWS) and medical concept recognition are two fundamental tasks for processing Chinese electronic medical records (EMRs) and play important roles in downstream tasks for understanding Chinese EMRs. One challenge for these tasks is the lack of medical-domain datasets with high-quality annotations, especially medical-related tags that reveal the characteristics of Chinese EMRs. In this paper, we collect a Chinese EMR corpus, named ACEMR, with human annotations for Chinese word segmentation and EMR-related tags. On the ACEMR corpus, we run well-known models (i.e., BiLSTM, BERT, and ZEN) and existing state-of-the-art systems (e.g., WMSeg and TwASP) for CWS and medical concept recognition. Experimental results demonstrate the necessity of building a dedicated medical dataset and show that models leveraging extra resources achieve the best performance on both tasks, which provides guidance for future studies on model selection in the medical domain.


Introduction
Medical language processing (MLP), i.e., natural language processing (NLP) for electronic medical records (EMRs), has drawn significant attention over the past few decades (Rector et al., 1991; Friedman et al., 2004; Stevenson et al., 2012; Koleck et al., 2021). An EMR normally records the entire process of a patient's examination, diagnosis, and treatment by clinicians in the hospital, and contains a large amount of medical information which, if extracted properly, can be used to train machine learning models as automated tools for auxiliary diagnosis and treatment, forming the foundation of the wise information technology of medicine. Chinese word segmentation (CWS) and medical concept recognition are two important and related tasks for Chinese MLP that have received much attention in previous studies (Xing et al., 2018). The first task (i.e., CWS) aims to segment Chinese text (i.e., a character sequence) into words, which is a necessary step for MLP because the meaning of many medical terms cannot be simply inferred from their component characters. For example, it is hard to infer the meaning of "扁桃体" (tonsil) from its components "扁" (flat), "桃" (peach), and "体" (body). The second task (i.e., medical concept recognition) assigns an EMR-related tag (e.g., Organism and Group) to the segmented words. It is worth noting that the medical concepts in this paper include not only standard medical named entities but also other categories that are useful for medical text analysis. For example, "Time" is a medical concept that can be used to represent disease history, and "Probability" is a possible medical concept tag for "考虑" (consider) in EMRs.

† Corresponding author.
The resources in this paper are released at https://github.com/cuhksz-nlp/ACEMR.
To perform CWS and medical concept recognition on Chinese EMRs, researchers face the challenge that existing training data for these tasks is either publicly unavailable or of poor quality. Although one possible solution is to apply models trained on the general domain to medical text, such models often fail to perform well because many domain-specific medical terms rarely occur in the general domain. To address these challenges, we collect and annotate a new Chinese EMR corpus, named ACEMR, where texts from 500 EMRs (7K sentences) are annotated with CWS and medical concept recognition labels. In addition, we test several state-of-the-art models for CWS and medical concept recognition on the collected ACEMR corpus. Experimental results show the necessity of constructing an informative Chinese medical corpus and provide guidance for model selection in the medical domain.

Related Work
NLP for medical text has drawn much attention in recent years (Xue et al., 2012; Li et al., 2019; Tian et al., 2019; Chen et al., 2020b), especially for EMR texts. Among the tasks for processing Chinese EMR texts, CWS and medical concept recognition are two fundamental ones that have drawn much attention in previous studies. Because of the dramatic performance drop when applying models trained on general-domain corpora to the medical field, previous studies (Xu et al., 2014; Li et al., 2015; Zhang et al., 2016; He et al., 2017) usually constructed their own Chinese medical datasets and tested their models on them. However, most datasets constructed for CWS are relatively small, containing only roughly 100 Chinese EMRs. Besides, the medical concept types in most existing datasets are limited to named entities (e.g., "Disease" and "Symptoms and Signs") and fail to cover other medical concept types (e.g., "Time") in EMRs that are potentially helpful for Chinese EMR text analysis.

Data Collection
We collected 500 Chinese EMRs from five departments (i.e., Respiratory, Gastroenterology, Urology, Gynecology, and Cardiology) of a local hospital, where one EMR specifically refers to the First Course Record in the inpatient record of one patient. A First Course Record is the first course record written by the treating physician or on-duty physician within eight hours after the patient is admitted to the hospital. It contains seven fields, namely, department, ward, basic information, case characteristics, preliminary diagnosis, differential diagnosis, and treatment plan, where the last five fields are illustrated in Table 1. We extract the texts in those fields and clean them by anonymizing the text and removing invalid or garbled characters.
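The character-cleaning step described above can be sketched as follows. The exact filtering rules used for ACEMR are not specified in the paper, so this regex (control characters and the Unicode replacement character as "invalid or garbled") is an illustrative assumption, and the anonymization step is omitted.

```python
import re

# Assumed definition of "invalid or garbled": ASCII control characters
# (except tab and newline) and the Unicode replacement character U+FFFD,
# which commonly appears in mis-decoded text.
INVALID = re.compile(r"[\x00-\x08\x0b-\x1f\x7f\ufffd]")

def clean_text(text: str) -> str:
    """Remove invalid or garbled characters from extracted EMR text."""
    return INVALID.sub("", text)

print(clean_text("患者\ufffd入院\x07治疗"))
```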

CWS and Medical Concept Annotation
Four specialists participated in developing the annotation guideline: two of them are junior doctors, and the other two are PhD students in NLP. For the CWS guideline, we refer to the segmentation guidelines of the Chinese Treebank (Xia, 2000) for the general domain as well as the annotation guideline proposed by He et al. (2017) for the medical domain. For the medical concept annotation guideline, we refer to the medical taxonomy defined by the Unified Medical Language System (UMLS) semantic groups (Lindberg et al., 1993) and define 7 major medical concept classes with 20 sub-classes, which are elaborated in Table 2. Compared to existing medical taxonomies, our proposed medical concept classes are simple and clear, with fine-grained medical concepts focusing on the characteristics of Chinese EMR texts. Note that, for segmentation, we do not further segment a word if it is a defined medical concept. Following the annotation guideline, the two junior doctors annotated the 500 EMRs independently and resolved their disagreements through discussion. The consistency between the two annotators is evaluated by the F value (Hripcsak and Rothschild, 2005): we treat the labeling result of one annotator (A1) as the standard answer and calculate the F value of the labeling result of the other annotator (A2) against it. The annotation agreement between the two annotators, evaluated by the F value, is 0.9409 for CWS and 0.9360 for medical concept tagging. We name the annotated corpus Annotated Chinese Electronic Medical Record (ACEMR) and report its statistics in Table 3, where the lengths are computed based on Chinese characters. In addition, the number of medical concepts in each sub-class of ACEMR is reported in the last column of Table 2. Table 4 shows two annotated example sentences, where Chinese words are split by white spaces. The medical concept tag attached to a specific word is highlighted in red ("/" is a delimiter between a word and its medical concept tag).
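The inter-annotator agreement computation described above can be sketched as follows: one annotator's segmentation is treated as the gold standard and the other is scored against it with span-level precision, recall, and F1. The function and variable names here are illustrative, not taken from the paper's released code.

```python
def to_spans(words):
    """Convert a word sequence to a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def agreement_f(a1_words, a2_words):
    """F value of annotator A2's segmentation against A1's (the 'gold')."""
    gold, pred = to_spans(a1_words), to_spans(a2_words)
    tp = len(gold & pred)                      # spans both annotators agree on
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy example: the annotators disagree on one word boundary.
a1 = ["患者", "老年", "女性"]
a2 = ["患者", "老年女性"]
print(round(agreement_f(a1, a2), 4))
```

The same span-matching logic applies to medical concept tagging once each span also carries its concept label.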

The Corpus Properties
ACEMR is an informative Chinese medical dataset. It contains 500 Chinese EMR texts that are annotated with CWS labels and medical concepts from 20 sub-classes. Due to space limitations, among the 20 sub-classes, we introduce three (i.e., Group, Health Behavior, and Qualitative) in the following. Group includes the patient's gender, age, and name. It generally appears at the beginning of Chinese EMRs as part of the basic information, indicating the group the patient belongs to. In addition, it can also denote a participant in medical and health activities (i.e., patients and doctors). Health Behavior means medical-related behaviors. It mainly includes examination behaviors, diagnostic behaviors, and broad non-specific treatment behaviors, e.g., "予" (given) and "入院治疗" (admission to hospital for treatment). Qualitative emphasizes a qualitative description of something, rather than a direct measurement, and can be used to describe the body, abnormalities, etc., e.g., "胃肠型感冒" (gastrointestinal cold), where "胃肠型" (gastrointestinal) is a Qualitative medical concept.

患者/Gr 老年/Gr 女性/Gr ， 慢性/Ql 病程/Di ， 急性/Ql 加重/SOS 。 患者/Gr 主/Ql 因/CE " 反复/Ql 咳嗽/SOS 、 咳痰/SOS , 加重/SOS 3天/T " 入院/E 。
Patient/Gr elderly/Gr female/Gr , chronic/Ql course/Di , acute/Ql exacerbation/SOS . The main/Ql cause/CE of the patient/Gr was " repeated/Ql cough/SOS and sputum/SOS , which became worse/SOS for *3 days*/T " and was *admitted to the hospital*/E .
Table 4: An example of an annotated medical sentence in ACEMR with the corresponding English translation. The abbreviations of the tags are used for annotation.

Methods
A good text representation is highly important for achieving promising performance in many NLP tasks (Song et al., 2017; Liu and Lapata, 2018). Therefore, we select several well-known models for the CWS and medical concept recognition tasks and test them on the ACEMR corpus.

Medical Concept Recognition
Similarly, for medical concept recognition, we regard it as a character-based sequence labeling task and perform it in a similar way to named entity recognition, where the medical concept tags for the input characters follow the "BIOES" scheme. For example, "支气管" (bronchus) has the medical tag sub-class "BP", and thus the tags for its three characters are "B-BP", "I-BP", and "E-BP", respectively. We try BiLSTM, BERT, and ZEN, as well as TwASP (Tian et al., 2020b), with a CRF decoder for medical concept recognition. TwASP is a model that leverages auto-generated syntactic information (e.g., POS tags (POS), dependency relations (Dep.), and syntactic constituents (Syn.)) through a two-way attention mechanism to improve performance on sequence labeling tasks. To obtain the syntactic information of the input sentence required by TwASP, we use the Stanford CoreNLP toolkit (Manning et al., 2014) to produce the POS tags, the dependency tree, and the constituency tree. Figure 1 shows an example sentence (with English translation) and the three types of auto-generated syntactic information.
Figure 1: The auto-generated syntactic information (i.e., POS labels, dependency relations, and syntactic constituents) of a sentence with English translation.
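The conversion from word-level concept tags to character-level "BIOES" labels described above can be sketched as follows; the sub-class abbreviations follow the paper's example, while the function name and input format are illustrative assumptions.

```python
def to_bioes(words_with_tags):
    """Convert (word, tag) pairs to character-level BIOES labels.

    words_with_tags: list of (word, sub_class) pairs, where sub_class is
    None for words that are not medical concepts.
    """
    labels = []
    for word, tag in words_with_tags:
        if tag is None:                          # not a medical concept
            labels.extend(["O"] * len(word))
        elif len(word) == 1:                     # single-character concept
            labels.append("S-" + tag)
        else:                                    # multi-character concept
            labels.append("B-" + tag)
            labels.extend(["I-" + tag] * (len(word) - 2))
            labels.append("E-" + tag)
    return labels

# "支气管" (bronchus) with sub-class "BP" yields B-BP, I-BP, E-BP.
print(to_bioes([("支气管", "BP"), ("炎", None)]))
```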

Experiments
In the experiments, we use two datasets. The first is the in-domain ACEMR corpus introduced in Section 3; the second is CTB6 (Xue et al., 2005), a benchmark CWS dataset of general-domain text. We split the ACEMR corpus into training/test sets and report the statistics in Table 5. For all experiments, we use precision (Prec.), recall, and F1 scores to evaluate the models. For BiLSTM, we use character embeddings from Tencent Embedding, with the number of training epochs, batch size, and learning rate set to 50, 32, and 0.001, respectively. For BERT, ZEN, and WMSeg, we use the official settings (e.g., 768-dimensional hidden vectors with 12 self-attention heads for BERT), where the number of training epochs is 50, the batch size is 16, and the learning rate is 1e-5.

Performance on Word Segmentation
The experimental results of CWS are presented in Table 6 under three different settings (i.e., CTB Only, CTB+ACEMR, and ACEMR Only). It is observed that WMSeg, which leverages wordhood information, improves the performance on CWS. In addition, models with the ZEN encoder achieve higher performance than the ones with BERT, which may result from the fact that ZEN leverages n-gram information during pre-training and can thus obtain better contextual representations. Moreover, if we train the models on ACEMR only (i.e., the ACEMR Only setting), models with the ZEN encoder can be further improved. This observation is not surprising because the general-domain texts in CTB6 could introduce noise into the model.

Performance on Concept Recognition
For the medical concept recognition (MCR) task, where the gold CWS results are given, the results of the BiLSTM, BERT, and ZEN encoders with a CRF decoder are reported in Table 7, where ZEN-CRF achieves the highest performance. In addition, we rank the F1 scores of all sub-class labels obtained by ZEN-CRF and present the top and bottom 3 in Table 8, where the number of medical concepts belonging to each sub-class in the training set, as well as the rate of out-of-vocabulary (OOV) medical concepts in the test set, is also reported. We observe that the model does not perform well on sub-classes with fewer training instances and a higher OOV rate (e.g., Body Substance), which suggests that the OOV issue is a challenge for Chinese medical concept recognition.
Table 9: The results of TwASP on medical concept recognition with auto-generated POS labels, dependencies (Dep.), and syntactic constituents (Syn.).
In addition, we run TwASP with the three different types of auto-generated syntactic information (i.e., POS labels, dependency relations, and syntactic constituents). The results are reported in Table 9, where we find that MCR benefits from syntactic information and obtains improvements in most cases, although the BERT-CRF and ZEN-CRF baselines already achieve outstanding performance.
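The per-sub-class OOV rate analysis described above can be sketched as follows: a test-set concept counts as out-of-vocabulary if its surface form never occurs as a concept of that sub-class in the training set. The data and function name below are illustrative; the paper does not release this analysis script.

```python
from collections import defaultdict

def oov_rates(train_concepts, test_concepts):
    """Compute the OOV rate per sub-class.

    Each argument is an iterable of (surface_form, sub_class) pairs.
    Returns a dict mapping sub_class -> fraction of test concepts of that
    sub-class whose surface form is unseen in training.
    """
    seen = defaultdict(set)
    for form, cls in train_concepts:
        seen[cls].add(form)
    total, oov = defaultdict(int), defaultdict(int)
    for form, cls in test_concepts:
        total[cls] += 1
        if form not in seen[cls]:
            oov[cls] += 1
    return {cls: oov[cls] / total[cls] for cls in total}

# Toy example with two sub-class abbreviations from the paper's tag set.
train = [("咳嗽", "SOS"), ("咳痰", "SOS"), ("血液", "BS")]
test = [("咳嗽", "SOS"), ("胸闷", "SOS"), ("痰液", "BS")]
print(oov_rates(train, test))
```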

Conclusion
In this paper, we collect a new Chinese medical corpus, named ACEMR, which contains 500 EMRs from a local hospital, and annotate the corpus with CWS and medical concept labels. ACEMR features rich medical concept types, with 20 annotated sub-classes. We test several state-of-the-art models for CWS and medical concept recognition on the annotated ACEMR. The results on CWS show that models trained on a general-domain dataset (i.e., CTB6) do not perform well on the medical domain, which confirms the necessity of constructing the ACEMR corpus. Furthermore, WMSeg with wordhood information and TwASP with auto-generated syntactic information outperform strong baselines on word segmentation and medical concept recognition, respectively, which demonstrates the benefit of leveraging extra resources (i.e., wordhood information and syntactic information) for CWS and medical concept recognition.