Adapting Pre-trained Word Embeddings For Use In Medical Coding

Word embeddings are a crucial component in modern NLP. Pre-trained embeddings released by different groups have been a major reason for their popularity. However, they are trained on generic corpora, which limits their direct use for domain-specific tasks. In this paper, we propose a method to add task-specific information to pre-trained word embeddings, improving their utility. We add information from medical coding data, as well as the first level of the hierarchy of the ICD-10 medical code set, to different pre-trained word embeddings. We adapt the CBOW algorithm from the word2vec package for this purpose. We evaluated our approach on five different pre-trained word embeddings. Both the original word embeddings and their modified versions (those with added information) were used for automated review of medical coding. The modified word embeddings improve the f-score by 1% in a 5-fold evaluation on a private medical claims dataset. Our results show that adding extra information is both possible and beneficial for the task at hand.


Introduction
Word embeddings are a recent addition to an NLP researcher's toolkit. They are dense, real-valued vector representations of words that capture interesting properties among them. Word embeddings are learned from raw corpora. Usually, the larger the corpus, the better the quality of the embeddings learned. However, a larger corpus also demands more resources and time for training. Thus, different groups release their learned embeddings publicly. Such pre-trained embeddings are a primary reason for the inclusion of word embeddings in mainstream NLP. However, pre-trained embeddings are usually learned on generic corpora. Using them in a particular domain, such as the medical domain, leads to the following problems:
• No embeddings for domain-specific words. For example, phenacetin is not present in the pre-trained vectors released by Google.
• Even words that do have embeddings may have poor-quality embeddings, because they carry different senses, some of which belong to other domains.
It is difficult to obtain large amounts of domain-specific data. However, many NLP applications have benefited from adding information from a small domain-specific corpus to that obtained from a large generic corpus (Ito et al., 1997). This raises the following questions:
• Can we use additional domain-specific data to learn the missing embeddings?
• Can we use additional domain-specific data to improve the quality of already available embeddings?
In this paper, we address the second question: given pre-trained word embeddings and domain-specific data, we tune the pre-trained word embeddings so that they achieve better performance. We tune the embeddings for, and evaluate them on, the automated review of medical coding.
The rest of the paper is organized as follows: Section 2 provides some background on different notions used later in the paper. Section 3 motivates our approach through examples. Section 4 explains our approach in detail. Section 5 enlists the experimental setup. Section 6 details the results and analysis, followed by conclusion and future work.

Word Embeddings
Word embeddings are a crucial component of modern NLP. They are learned in an unsupervised manner from large amounts of raw corpora. Bengio et al. (2003) were the first to propose neural word embeddings. Many word embedding models have been proposed since then (Collobert and Weston, 2008; Huang et al., 2012; Mikolov et al., 2013; Levy and Goldberg, 2014). The central idea behind word embeddings is the distributional hypothesis, which states that words that are similar in meaning occur in similar contexts (Rubenstein and Goodenough, 1965). Consider the Continuous Bag of Words model of Mikolov et al. (2013), where the following problem is posed to a neural network: given the context, predict the word that comes in between. The weights of the network are the word embeddings. Training the model over running text brings the embeddings of words with similar meanings closer together.
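To make the mechanics concrete, here is a minimal NumPy sketch of the CBOW forward pass described above (the vocabulary size, dimensions, and weight values are toy assumptions, not the actual word2vec implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 8  # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))   # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # output (target) embeddings

def cbow_probs(context_ids):
    """Given the context, predict the word in between: average the context
    embeddings, score every vocabulary word, and softmax the scores."""
    h = W_in[context_ids].mean(axis=0)  # hidden layer = mean of context vectors
    scores = W_out @ h                  # one score per vocabulary word
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

p = cbow_probs([1, 3, 5])  # distribution over candidate center words
```

Training adjusts `W_in` and `W_out` so that the observed center word gets high probability; the rows of `W_in` are then used as the word embeddings.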

Medical Coding
Medical coding is the process of assigning predefined alphanumeric medical codes to information contained in patient medical records. Babre et al. (2010) show a typical medical coding pipeline. Note that the coding (whether automatic or manual) is followed by a manual review. This is due to the critical nature of the coding process and the high cost incurred by any errors. However, any human involvement increases cost, both in terms of time and money. Thus, in order to reduce human involvement in the review process, an automatic review component can be inserted just before the human review. Automated reviewing is a binary classification problem. Instances rejected by the automated review component can be sent back directly for recoding, whereas instances it accepts are sent to human reviewers for further checking. Such a modification decreases the load on the human reviewer, thereby reducing the cost of the overall pipeline.
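The routing logic described above can be sketched as follows (the function names and record format are hypothetical; `auto_review` stands in for any binary classifier over coded records):

```python
def route(record, auto_review):
    """Route a coded record through the modified pipeline: records the
    automated reviewer rejects go straight back for recoding; records it
    accepts still go to a human reviewer for a final check."""
    if auto_review(record) == "reject":
        return "recode"
    return "human_review"
```

The human reviewer thus only sees the subset of records the automated component accepted, which is where the cost saving comes from.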
Given the textual nature of medical data, many natural language processing challenges manifest themselves while performing either automated medical coding or automated review of medical coding. Common challenges include, but are not limited to:
• Synonymy: Multiple terms can have the same meaning. For instance, High Blood Sugar and Diabetes refer to the same condition.
• Abbreviation: Medical staff, in their hurry, often abbreviate words and phrases. For instance, hypertension can be written as HTN.
In both cases, the automated system needs to understand that the two strings ultimately mean the same thing.
One can note that in the cases of both synonyms and abbreviations, the contexts will be almost the same. Thus, word embeddings are well suited to handle both these challenges.
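As an illustration, cosine similarity over toy vectors (the values below are invented for illustration; they are not learned embeddings) shows how a term and its abbreviation, having near-identical contexts, end up close together:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the standard closeness measure for embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for learned embeddings (illustrative values only).
hypertension  = np.array([0.90, 0.10, 0.20])
htn           = np.array([0.85, 0.15, 0.25])  # abbreviation: similar context
liver_failure = np.array([0.10, 0.90, 0.10])  # unrelated condition

cosine(hypertension, htn)            # high
cosine(hypertension, liver_failure)  # low
```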

Motivation
Consider the following medical terms (the abbreviations in parentheses will be used to refer to the terms later):
- High Blood Pressure (HBP)
- Low Blood Pressure (LBP)
- High Blood Sugar (HBS)
- Liver Failure (LF)
- Diabetes (D)
- Hypertension
- HTN
We would ideally like the embeddings of the terms to be learned such that the following constraints hold:
• Similarity(HBP, HBS) should be higher than Similarity(HBP, LBP), which in turn should be higher than Similarity(HBP, LF) (as per medical knowledge).
• Similarity(HBS, D) should be high (as they are synonyms).
• Similarity(Hypertension, HTN) should be high (as HTN is the abbreviation of hypertension).
Information about such relations might not be available in the generic corpora on which most pre-trained embeddings are trained. However, it might be available in domain-specific corpora, or even in labeled data, such as that used in medical claims. Approaches that can add this information to pre-trained embeddings will improve their utility.
We adapt the Continuous Bag Of Words (CBOW) approach (Mikolov et al., 2013) to our situation. Given labeled medical claims data, we consider the terms in the transcripts as context words and the corresponding code as the target word. Our data contains both positive and negative samples, so we have both the normal samples and the negative samples needed for applying negative sampling.
Figure 1: Network architecture of our approach
Figure 1 shows the network architecture of our approach. The inputs to the network are a bag-of-words representation of the medical terms and a one-hot representation of the corresponding code. The output of the network is a binary value indicating whether the input code is accepted for the corresponding input medical terms.
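A rough NumPy sketch of this forward pass, under our own assumptions about the unstated details (toy sizes, and that the term and code projections are concatenated before a sigmoid output layer):

```python
import numpy as np

rng = np.random.default_rng(1)
T, C, D = 50, 20, 16  # assumed: number of terms, number of codes, projection size

W_terms = rng.normal(scale=0.1, size=(T, D))  # term embeddings (bag-of-words input)
W_codes = rng.normal(scale=0.1, size=(C, D))  # code embeddings (one-hot input)
w_out = rng.normal(scale=0.1, size=2 * D)     # binary output layer

def accept_prob(term_ids, code_id):
    """P(code accepted | terms): project both inputs, combine, apply sigmoid."""
    h_terms = W_terms[term_ids].mean(axis=0)      # bag-of-words projection
    h_code = W_codes[code_id]                     # one-hot projection = one row
    z = w_out @ np.concatenate([h_terms, h_code])
    return 1.0 / (1.0 + np.exp(-z))
```

Training this network against the accept/reject labels is what tunes the term embeddings, which can be initialized from the pre-trained vectors.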

Exploiting ICD10 Code hierarchy
Another piece of information that can be included is the hierarchical nature of the ICD-10 code set. Currently, the network treats the error of misclassifying codes in the same subcategory, say F32.9 and F11.20, the same as the error of misclassifying codes belonging to different subcategories, say F32.9 and 30233N1. Ideally, error(F32.9, F11.20) should be less than error(F32.9, E87.1), which in turn should be less than error(F32.9, 30233N1). Such hierarchical information can be encoded by a network like the one in Figure 2. Due to resource and time constraints, we have currently considered only the top-level hierarchy, i.e., whether the code is ICD-10 Diagnosis or ICD-10 Procedural.
The learned weights between Proj 1 and the codes input in the hierarchy network (Figure 2) are used to initialize the weights between Proj 2 and the codes in the original network (Figure 1). Then the original network is trained as usual.
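The weight transfer can be sketched as follows (random values stand in for the hierarchy network's trained weights; sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
C, D = 20, 16  # assumed: number of codes, projection size

# Step 1 (hierarchy network): learn code-projection weights by predicting the
# top-level class (ICD-10 Diagnosis vs. Procedural) from the one-hot code.
# A random matrix stands in here for those trained weights.
W_codes_hierarchy = rng.normal(scale=0.1, size=(C, D))

# Step 2: initialize the main network's code projection from the hierarchy
# network (instead of from random values), then train the main network as usual.
W_codes_main = W_codes_hierarchy.copy()
```

The effect is that codes sharing a top-level class start out close together in the main network, so confusions within a class cost less than confusions across classes.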

Dataset
We used a private medical claims review dataset, which we cannot release publicly due to privacy concerns. The dataset consists of 280k records, each containing medical terms along with a code. Each entry is labeled as accept or reject, depending on whether the entry has the correct code or was sent back for recoding.

Pre-trained word embeddings
We used five different pre-trained word embeddings. The first is the one released along with Google's word2vec toolkit. The remaining four are medical-domain specific and were released by Pyysalo et al. (2013). They are as follows:
• PMC: trained on 4 million full articles from PubMed Central.
• PubMed: trained on 26 million abstracts and citations in PubMed.
• PubMed PMC: trained on the combination of the previous two resources.
• Wikipedia PubMed PMC: trained on the combination of Wikipedia, PubMed, and PMC resources.
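A practical detail when using such embeddings is handling out-of-vocabulary terms like phenacetin. The sketch below uses a toy dictionary as a stand-in for a loaded model (with gensim, the Google vectors would be loaded via `KeyedVectors.load_word2vec_format(..., binary=True)`); the zero-vector fallback is one simple convention, not the paper's stated method:

```python
import numpy as np

DIM = 300  # dimensionality of the Google News word2vec vectors

# Toy stand-in for a loaded pre-trained model.
pretrained = {"blood": np.ones(DIM), "pressure": np.ones(DIM)}

def embed(word):
    """Look a word up, falling back to a zero vector for OOV terms."""
    return pretrained.get(word, np.zeros(DIM))
```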

Classifiers
Once we tune the embeddings, we use them to learn a binary classifier. For our experiments, we report the results obtained using logistic regression.
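A minimal scikit-learn sketch of this final step, on synthetic features (how records are featurized from the tuned embeddings is our assumption; the paper does not spell it out):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic stand-in features for accept (1) and reject (0) records; in the
# real setup these would be built from the tuned term and code embeddings.
X = np.vstack([rng.normal(loc=1.0, size=(50, 8)),
               rng.normal(loc=-1.0, size=(50, 8))])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression().fit(X, y)  # the binary accept/reject classifier
```

Replacing the original embeddings with the tuned ones only changes the feature construction; the classifier itself stays the same, which isolates the effect of the tuning.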

Related Work
Word embeddings have proved useful for various tasks, such as Part-of-Speech Tagging (Collobert and Weston, 2008), Named Entity Recognition, Sentence Classification (Kim, 2014), Sentiment Analysis (Liu et al., 2015), and Sarcasm Detection (Joshi et al., 2016). Medical-domain-specific pre-trained word embeddings have been released by different groups, such as Pyysalo et al. (2013) and Brokos et al. (2016). Wu et al. (2015) apply word embeddings to clinical abbreviation disambiguation.

Conclusion and Future Work
In this paper, we proposed a modification of the CBOW algorithm to add task- and domain-specific information to pre-trained word embeddings. We added information from a medical claims dataset and the ICD-10 code hierarchy to improve the utility of the pre-trained word embeddings. We obtained an improvement of approximately 1% in f-score using the modified word embeddings compared to the original word embeddings. This improvement was achieved by including only the top-level hierarchy. We hypothesize that using the full hierarchy will lead to further improvements, which we shall investigate in the future.