On Learning Better Embeddings from Chinese Clinical Records: Study on Combining In-Domain and Out-Domain Data

High-quality word embeddings are of great significance for advancing applications of biomedical natural language processing. In recent years, interest in how to learn good embeddings and evaluate embedding quality on English medical text has grown markedly; however, only a limited number of studies have been performed on Chinese medical text, particularly Chinese clinical records. Herein, we propose a novel approach to improving the quality of learned embeddings by using out-domain data as a supplement when Chinese clinical records are limited. Moreover, embedding quality was evaluated with a method based on the Medical Conceptual Similarity Property. The experimental results revealed that selecting good training samples is necessary, and that collecting the right amount of out-domain data and trading off embedding quality against training time are essential factors for better embeddings.


Introduction
Word embeddings, or embeddings for short, have been widely used in various natural language processing tasks, such as language modeling (Bengio et al., 2003; Sundermeyer et al., 2012; Adams et al., 2017), syntactic parsing (Grefenstette et al., 2014; Tu et al., 2017) and part-of-speech tagging (Yang and Eisenstein, 2016). Owing to the advantage of embeddings in boosting performance, interest in applying embeddings has grown markedly, with numerous encouraging results in biomedical applications, e.g. disease prediction (Miotto et al., 2016), clinical event prediction (Choi et al., 2016a), medical concept disambiguation (Tulkens et al., 2016), and biomedical information retrieval (Mohan et al., 2017).
Learning embeddings from English medical texts has been a hot topic in recent years and has been studied extensively thanks to the availability of open datasets, such as UMLS of NLM (Bodenreider, 2004), medical journal abstracts from PubMed (Choi et al., 2016a), and some released clinical data (Finlayson et al., 2014; Stubbs and Uzuner, 2015). These datasets have been widely used as gold standards for learning embeddings in biomedical natural language processing (De Vine et al., 2014; Choi et al., 2016b).
However, learning embeddings from Chinese medical texts, especially from Chinese clinical records, has fallen far behind. Due to privacy concerns, the Chinese clinical records available for use are generally limited. Learning good embeddings with neural network architectures, for instance the widely used skip-gram model (Mikolov et al., 2013a), usually requires a large amount of training data. As a result, the embeddings learned from Chinese clinical records are not good enough.
Moreover, to the best of our knowledge, only a limited number of studies focus on learning embeddings from Chinese clinical records, not to mention on evaluating them. Many methods have been developed to learn embeddings from English medical texts; however, Chinese medical texts, especially clinical records, have their own particular language features. Therefore, the approaches for learning embeddings from English medical texts need to be adapted for Chinese clinical records.
In this paper, we focused on learning embeddings from Chinese clinical records, and our major contributions were as follows:
• We proposed an in-domain and out-domain data combination method for learning better embeddings from Chinese clinical records with the skip-gram model when only limited Chinese clinical records are available.
• Referring to the evaluation method for medical concept embeddings proposed by Choi et al. (2016b), which is based on the medical conceptual similarity property, we proposed a method for distantly evaluating the embeddings learned from Chinese clinical records using an additional standard medical terminology dataset.
• We found that selecting good training samples is necessary, and that collecting the right amount of out-domain data and trading off embedding quality against training time are essential factors for better embeddings.

Skip-Gram Model for Learning Embeddings
The skip-gram model is one of the most popular methods for learning embeddings from texts. The training objective of the skip-gram model is to find an embedding that is useful for predicting the context words of a target word in a sequence. The sequence usually refers to a sentence in a specific task. In the skip-gram model, if two different target words $w_i$ and $w_j$ have (very) similar context words, then the embeddings of $w_i$ and $w_j$ learned by the model will be (very) similar, because a common output weight matrix is used (Mikolov et al., 2013b). In other words, if we want to clearly distinguish the embeddings of two target words, we can provide more informative context words that differentiate the target words.
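The point about shared context words can be illustrated with a minimal sketch. The sentences below are toy English stand-ins for segmented clinical text, and the helper names `skipgram_pairs` and `context_sets` are hypothetical. The code only enumerates the (target, context) training pairs that a skip-gram model would see; it does not train a model.

```python
from collections import defaultdict


def skipgram_pairs(sentence, window=5):
    """Yield (target, context) pairs for one tokenized sentence."""
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, sentence[j]


def context_sets(sentences, window=5):
    """Map each target word to the set of context words it is paired with."""
    contexts = defaultdict(set)
    for sentence in sentences:
        for target, context in skipgram_pairs(sentence, window):
            contexts[target].add(context)
    return contexts


# Hypothetical toy data: two records that follow the same fixed pattern.
toy = [
    ["patient", "reports", "abdomen", "pain"],
    ["patient", "reports", "eye", "pain"],
]
ctx = context_sets(toy, window=2)

# Both targets are trained against exactly the same context words, so the
# model receives no signal that separates them.
print(ctx["abdomen"] == ctx["eye"])
```

When two targets share their whole context set, as here, only additional and more varied contexts can pull their embeddings apart.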
The skip-gram model has been used in various domains to learn embeddings from different types of texts, and there have also been various attempts to learn embeddings from medical texts with it. Most works directly applied the model to medical corpora to complete this domain-specific task (Giménez et al., 2013). In this paper, we continued the previous work by using the skip-gram model to learn embeddings from Chinese clinical records, and further explored a data combination method for improving the quality of the learned domain-specific embeddings.

Observation
The content of Chinese clinical records is usually brief, the occurrences of symptoms and diseases are correlated, and doctors follow certain habits when questioning patients and writing records. These domain-specific characteristics make learning embeddings from Chinese clinical records challenging, because they give general-domain words a high probability of having context words similar or even identical to those of medical words. For example, in Figure 1, the general-domain word meaning "sometimes" and the medical term meaning "eye" (a body part) both have context words similar to those of the medical word meaning "abdomen" (a body part), and "sometimes" shares more context words with "abdomen" than "eye" does. Moreover, certain medical problems tend to be described with a fixed pattern. As a result, the learned embeddings of "sometimes" and "abdomen" would be more similar than those of "eye" and "abdomen", although "abdomen" and "eye" belong to the same type of medical concept (i.e. body part). In summary, the main challenge of learning better embeddings from Chinese clinical records is to let the skip-gram model make a clearer distinction between medical words and general-domain words.

Usage of Out-Domain Data
As mentioned earlier, making a clearer distinction between the embeddings of two target words learned by the skip-gram model requires more evidence, i.e. more diverse context words that illustrate the difference between the two target words. Therefore, we hypothesized that adding general-domain Chinese texts (the out-domain data) to Chinese clinical records (the in-domain data) would facilitate the learning of embeddings from Chinese clinical records. The intuition is that medical words have domain-specific usage in Chinese clinical records but are not widely used in the out-domain data, whereas general-domain words have a wide range of usage in the out-domain data. Combining out-domain data with Chinese clinical records can therefore increase the diversity of the context words of general-domain words without the side effect of impairing the contexts of medical words. Better embeddings, in turn, can be learned from the combined data.
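The intuition can be made concrete with a small sketch that measures how much the context words of a general-domain word and a medical word overlap, before and after out-domain sentences are added. The sentences are toy English stand-ins, the helper names are hypothetical, and the Jaccard index is only an illustrative proxy chosen here; it is not a measure used in the experiments.

```python
def context_set(sentences, word, window=5):
    """Collect all words within `window` positions of `word`."""
    ctx = set()
    for s in sentences:
        for i, w in enumerate(s):
            if w == word:
                ctx.update(s[max(0, i - window):i])
                ctx.update(s[i + 1:i + 1 + window])
    return ctx


def jaccard(a, b):
    """Jaccard index of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy corpora.
in_domain = [
    ["sometimes", "pain", "worse", "after", "meals"],
    ["abdomen", "pain", "worse", "after", "meals"],
]
out_domain = [
    ["sometimes", "we", "walk", "to", "the", "park"],
    ["the", "train", "is", "sometimes", "late"],
]


def overlap(corpus):
    return jaccard(context_set(corpus, "sometimes"),
                   context_set(corpus, "abdomen"))


before = overlap(in_domain)
after = overlap(in_domain + out_domain)
print(before, after)
```

In this toy setting the out-domain sentences add contexts only for the general-domain word, so the overlap drops while the contexts of the medical word are left unchanged.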

Learning Process and Embedding Quality Evaluation Method
Chinese clinical records were segmented into words with the latest version of the Stanford CoreNLP tool (https://nlp.stanford.edu/software/segmenter.shtml) using default settings, except that adjacent words appearing in our prepared standard medical word dataset were kept together rather than segmented (Zhang et al., 2016). Punctuation was removed. The out-domain data went through the same process but without the second step, since we assumed that the out-domain data contain no medical words. We directly applied the skip-gram model implemented in DeepLearning4J (https://deeplearning4j.org/) to learn embeddings. Hierarchical softmax was used in training, and the context window size and embedding dimensionality were set to 5 and 200, respectively (Choi et al., 2016b). We used an intrinsic evaluation method, named Chinese Medical Concept Similarity Measure (CMCSM), to distantly measure the quality of the learned embeddings. In the definition of CMCSM, $N$ is the number of groups of medical words at the same level of a prepared medical word dataset $D$, $G_k \in D$ is one group of medical words, and $m_i$ and $m_j$ are the $i$th and $j$th terms in $G_k$. $\mathrm{sim}(m_i, m_j)$ is any commonly used embedding similarity measure (Levy et al., 2015). In this paper, we used the cosine measure.
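A group-based similarity score of this kind can be sketched as follows. The aggregation used here, the mean over groups of the mean pairwise cosine similarity within each group, is an assumption made for illustration and may differ from the exact CMCSM definition. The function name `group_similarity_score` and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_similarity_score(groups, embeddings, sim=cosine):
    """Mean over groups of the mean pairwise similarity inside each group.

    ASSUMPTION: this aggregation is illustrative, not the paper's formula.
    Terms without an embedding and groups with fewer than two embedded
    terms are skipped.
    """
    group_means = []
    for terms in groups.values():
        vectors = [embeddings[t] for t in terms if t in embeddings]
        pairs = list(combinations(vectors, 2))
        if pairs:
            group_means.append(sum(sim(u, v) for u, v in pairs) / len(pairs))
    return sum(group_means) / len(group_means) if group_means else 0.0


# Hypothetical toy terminology groups and 2-dimensional embeddings.
groups = {"body_part": ["abdomen", "eye"], "symptom": ["pain", "cough"]}
embeddings = {
    "abdomen": [1.0, 0.0],
    "eye": [1.0, 0.0],
    "pain": [0.0, 1.0],
    "cough": [1.0, 0.0],
}
score = group_similarity_score(groups, embeddings)
print(score)
```

A higher score means that terms from the same terminology group lie closer together in the embedding space, which is the property the intrinsic evaluation is meant to reward.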

Experimental Data
To validate the performance of the proposed method, three experimental datasets were used in this paper: a Chinese clinical records dataset (CCRD) collected from the Teaching Hospital of Chengdu University of Traditional Chinese Medicine, a large-scale out-domain dataset (ODD) obtained from the NLPCC 2018 Shared Task 4 (footnote 3), and a standard medical terminology dataset (SMTD) obtained from WHO (footnote 4). Medical terms in SMTD are organized into a two-layer tree structure. The index of the second layer defines the group id for medical words, and medical words in the same group are more similar to each other. SMTD was used as the prepared medical word dataset mentioned previously. Detailed information on these datasets is listed in Table 1.

Experimental Results
Firstly, we applied the skip-gram model to learn embeddings from CCRD, and the learned embeddings were evaluated by CMCSM. We sampled 5 sub-datasets from CCRD in order to assess the effect of dataset size on the quality of the learned embeddings. The sizes of the sampled datasets were 80%, 60%, 40%, 20% and 10% of the instances in the original CCRD. The sampling process was recursive sampling without replacement. This implied that more data would mean more stable embedding results. Moreover, we ran the above process 10 times to further assess the stability of the results. The results, shown in Table 2, were used as the baseline.
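One reading of recursive sampling without replacement is that each smaller sub-dataset is drawn from the previous, larger one, so that the samples are nested. The sketch below follows that reading, which is an assumption; the function name `nested_samples` and the seed handling are hypothetical.

```python
import random


def nested_samples(records, fractions=(0.8, 0.6, 0.4, 0.2, 0.1), seed=0):
    """Draw nested sub-datasets of the given fractions of `records`.

    ASSUMPTION: each subset is sampled without replacement from the
    previous (larger) subset. `fractions` must be in decreasing order.
    """
    rng = random.Random(seed)
    total = len(records)
    current = list(records)
    subsets = {}
    for frac in fractions:
        size = round(total * frac)
        current = rng.sample(current, size)  # sampling without replacement
        subsets[frac] = current
    return subsets


# Hypothetical stand-in for record ids; one seed per repetition.
records = list(range(100))
runs = [nested_samples(records, seed=run) for run in range(10)]
print([len(runs[0][f]) for f in (0.8, 0.6, 0.4, 0.2, 0.1)])
```

Using a different seed for each of the 10 repetitions gives 10 independent families of nested sub-datasets, from which the spread of the evaluation score can be estimated.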
We found in Table 2 that the more Chinese clinical records were used for learning embeddings, the smaller the variance of CMCSM tended to be. Moreover, an interesting result was that using all Chinese clinical records did not necessarily result in the highest quality of embeddings. This implies that if we only use in-domain data to learn embeddings, we should collect as much training data as possible and also select helpful samples from the collected data.
Secondly, we applied the skip-gram model to learn embeddings from combinations of CCRD and ODD with different combination ratios. The results, listed in Table 3, indicate that combining ODD into CCRD dramatically improved the quality of the learned embeddings under the different conditions. In general, the more ODD data was combined into CCRD, the better the learned embeddings. In the best case (combining the "Time 2-60%" dataset with the "ODD-ALL" dataset), CMCSM increased by 3.8 times.
Footnote 3. URL: http://tcci.ccf.org.cn/conference/2018/cfpt.php.
Footnote 4. We filtered the terminologies which do not appear in CCRD. URL: http://www.wpro.who.int/publications/who_i-strm_file.pdf?ua=1.
Notably, the highest quality of the learned embeddings in each row of Table 3 was not always achieved when all data in ODD was used. This was consistent with the result mentioned earlier, indicating that we should collect as much training data as possible but also choose training samples carefully. In addition, the results showed that optimal embeddings were achieved when the amount of ODD was 1000 times the basis size of CCRD.
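The combination conditions can be sketched as below, using the basis size of 2505 instances and the "ODD-n" naming from the Table 3 caption. How the ODD instances were selected for each condition is not stated, so taking the first n × basis-size instances is an assumption, and the function name `combine` is hypothetical.

```python
BASIS_SIZE = 2505  # basis size of CCRD given in the Table 3 caption


def combine(ccrd_subset, odd, n=None, basis_size=BASIS_SIZE):
    """Return the CCRD subset followed by n * basis_size ODD instances.

    n=None corresponds to "ODD-ALL". ASSUMPTION: the first instances of
    ODD are taken; the actual selection rule may differ.
    """
    if n is None:
        odd_part = list(odd)
    else:
        odd_part = list(odd[: n * basis_size])
    return list(ccrd_subset) + odd_part


# Hypothetical toy corpora with a small basis size to keep the example short.
ccrd_toy = ["c1", "c2"]
odd_toy = ["o%d" % i for i in range(10)]
odd_2 = combine(ccrd_toy, odd_toy, n=2, basis_size=3)
odd_all = combine(ccrd_toy, odd_toy)
print(len(odd_2), len(odd_all))
```

Each combined corpus would then be passed to the same skip-gram training and evaluation steps as the in-domain baseline.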
Moreover, the results suggested that, in practice, the trade-off between embedding quality and training time should be considered. Figure 2 shows that as the amount of combined ODD increased, the growth rate of CMCSM for embeddings learned from the basis size of CCRD decreased sharply. Furthermore, when the amount of combined ODD was more than 50 times the basis size, the growth rate had almost converged. At the same time, the more data is used for learning embeddings with the skip-gram model, the more training time is consumed. We should therefore consider whether it is worthwhile to spend a lot of training time in exchange for very little quality improvement. Moreover, a small quality improvement may sometimes not improve the performance of downstream biomedical applications.
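One simple way to operationalize this trade-off is to stop adding out-domain data once the relative gain in the evaluation score falls below a chosen threshold. The rule, the threshold, and the function name `stop_point` are hypothetical choices made for illustration, and the scores below are synthetic toy values, not results from Table 3 or Figure 2.

```python
def stop_point(scores, min_rel_gain=0.05):
    """Return the ODD multiple after which further data stops paying off.

    `scores` is a list of (odd_multiple, score) pairs sorted by multiple.
    The first multiple whose successor improves the score by less than
    `min_rel_gain` (relative to the current score) is returned.
    """
    for (n_prev, s_prev), (_, s_next) in zip(scores, scores[1:]):
        rel_gain = (s_next - s_prev) / abs(s_prev) if s_prev else float("inf")
        if rel_gain < min_rel_gain:
            return n_prev
    return scores[-1][0]


# SYNTHETIC toy values for illustration only.
toy_scores = [(1, 0.10), (2, 0.16), (4, 0.20), (8, 0.205), (16, 0.207)]
print(stop_point(toy_scores))
```

The threshold expresses how much extra training time one is willing to spend for a given relative improvement, and would need to be set with the downstream application in mind.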

Discussion
This paper conducted only intrinsic evaluation; further research involving extrinsic evaluations is required. High-quality embeddings under intrinsic evaluation are also essential for enhancing performance in downstream applications.
The experimental results in this paper also shed light on improving the quality of embeddings learned from English clinical records. Most of the existing studies on how to train good embeddings are based on data within the same domain (Chiu et al., 2016; Lai et al., 2016).
Further exploration is needed in many aspects. One example is how to thoroughly understand the learning of embeddings via complicated neural networks, which is one of the current major research hotspots. Only when the complex back-

Conclusions
This paper presented a study on how to learn better embeddings from Chinese clinical records by supplementing limited in-domain data with out-domain data. Building on the Medical Conceptual Similarity Measure (Choi et al., 2016b), we applied it to distantly evaluate the quality of the embeddings. The experimental results showed that the combined use of out-domain and in-domain data could potentially improve the quality of the learned embeddings, and that collecting the right amount of out-domain data, trading off embedding quality against training time, and choosing good training samples were all essential factors for learning better embeddings. Our results also showed that more data did not necessarily bring more satisfying results, which was consistent with the results of Chiu et al. (2016).

Table 3: CMCSM Results of the Embeddings Learned from the Combinations of CCRD and ODD by the Skip-Gram Model. "Tn-X%" means that the dataset is the X% sample of CCRD from which the highest-quality embeddings in Table 2 were learned at run Tn, and "CCRD-ALL" means that all instances in CCRD are used. "ODD-n" means that the size of ODD currently used is n × 2505, and "ODD-ALL" means that all samples in ODD are used. 2505 is the basis size of CCRD, which is approximately equal to 10% of CCRD.