Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition

Using pre-trained word embeddings in conjunction with Deep Learning models has become the “de facto” approach in Natural Language Processing (NLP). While this usually yields satisfactory results, off-the-shelf word embeddings tend to perform poorly on texts from specialized domains such as clinical reports. Moreover, training specialized word representations from scratch is often either impossible or ineffective due to the lack of large enough in-domain data. In this work, we focus on the clinical domain for which we study embedding strategies that rely on general-domain resources only. We show that by combining off-the-shelf contextual embeddings (ELMo) with static word2vec embeddings trained on a small in-domain corpus built from the task data, we manage to reach and sometimes outperform representations learned from a large corpus in the medical domain.

The success of these methods can arguably be attributed to the availability of large general-domain corpora like Wikipedia, Gigaword (Graff et al., 2003) or the BooksCorpus (Zhu et al., 2015). Unfortunately, similar corpora are often unavailable for specialized domains, leaving the NLP practitioner with only two choices: either using general-domain word embeddings that are probably not fit for the task at hand, or training new embeddings on the available in-domain corpus, which is likely to be too small and to result in poor performance.
Python code for reproducing our experiments is available at: https://github.com/helboukkouri/acl_srw_2019
In this paper, we focus on the clinical domain and explore several ways to improve pre-trained embeddings built from a small corpus in this domain by using different kinds of general-domain embeddings. More specifically, we make the following contributions:
• we show that word embeddings trained on a small in-domain corpus can be improved using off-the-shelf contextual embeddings (ELMo) from the general domain. We also show that this combination performs better than the contextual embeddings alone and improves upon static embeddings trained on a large in-domain corpus;
• we define two ways of combining contextual and static embeddings and conclude that the naive concatenation of vectors is consistently outperformed by the addition of the static representation directly into the internal linear combination of ELMo;
• finally, we show that ELMo models can be successfully fine-tuned on a small in-domain corpus, bringing significant improvements to strategies involving contextual embeddings.

Related Work
Former work by Roberts (2016) analyzed the trade-off between corpus size and similarity when training word embeddings for a clinical entity recognition task. The author's conclusion was that while embeddings trained with word2vec on in-domain texts generally performed better, a combination of both in-domain and general-domain embeddings [...]
While interesting for the clinical domain, these strategies may not always be applicable to other specialized fields since large in-domain corpora like MIMIC-III will rarely be available. To deal with this issue, we explore embedding combinations. In this respect, we consider both static forms of combination explored in (Yin and Schütze, 2016; Muromägi et al., 2017; Bollegala et al., 2018) and more dynamic modes of combination that can be found in (Peters et al., 2018) and (Kiela et al., 2018). In this work, we show in particular how a combination of general-domain contextual embeddings, fine-tuning, and in-domain static embeddings trained on a small corpus can be employed to reach a similar performance using resources that are available for any domain.
Evaluation Task: i2b2/VA 2010 Clinical Concept Detection
We evaluate our embedding strategies on the Clinical Concept Detection task of the 2010 i2b2/VA challenge (Uzuner et al., 2011).

Data
The data consists of discharge summaries and progress reports from three different institutions: Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center. These documents are labeled and split into 394 training files and 477 test files for a total of 30,946 + 45,404 ≈ 76,000 sequences.

Task and Model
The goal of the Clinical Concept Detection task is to extract three types of medical entities: problems (e.g. the name of a disease), treatments (e.g. the name of a drug) and tests (e.g. the name of a diagnostic procedure). Table 1 shows examples of entity mentions.
To solve this task, we choose a bi-LSTM-CRF, as is usual in entity recognition tasks (Lample et al., 2016; Chalapathy et al., 2016; Habibi et al., 2017). Our particular architecture uses 3 bi-LSTM layers with 256 units and a dropout rate of 0.5, and is implemented using the AllenNLP framework (Gardner et al., 2018). During training, the exact span F1 score is monitored on 5,000 randomly sampled sequences for early stopping.
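To make the monitored metric concrete, the following is a minimal sketch of an exact span F1 computation, assuming entities are encoded with BIO tags (the tagging scheme and the function names `bio_to_spans` and `exact_span_f1` are assumptions made for illustration, not taken from the paper's code). A predicted entity counts as correct only if its boundaries and its type both match a gold entity.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    An "I-" tag that does not continue a span of the same type is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def exact_span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where only exact (boundaries + type) matches count."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example: one of two gold entities is recovered exactly.
gold = [["B-problem", "I-problem", "O", "B-treatment"]]
pred = [["B-problem", "I-problem", "O", "O"]]
print(round(exact_span_f1(gold, pred), 3))  # 0.667
```

With this definition, a prediction that covers only part of a gold mention contributes nothing to the score, which is what makes the measure strict.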

Embedding Strategies
We focus on two kinds of embedding algorithms: static embeddings (word2vec) and contextualized embeddings (ELMo). The first kind assigns a fixed representation to each token (hence the name "static") and is relatively fast to train, but it handles neither out-of-vocabulary words nor polysemy. The second kind, on the other hand, produces a contextualized representation. As a result, the word embedding is adapted dynamically to the context and polysemy is handled. Moreover, in the particular case of ELMo, word embeddings are character-level, which implies that the model is able to produce vectors whether or not the word is part of the training vocabulary.
Although contextualized embeddings usually perform better than static embeddings, they still require large amounts of data to be trained successfully. Since this data is often unavailable in specialized domains, we explore strategies that combine off-the-shelf contextualized embeddings with static embeddings trained on a small in-domain corpus.

Static Embeddings
First, we use word2vec to train embeddings on a small corpus built from the task data:
i2b2 (2010): the 394 documents of the training set, to which we added 826 more files from a set of unlabeled documents. This is a small (1 million tokens) in-domain corpus. Similar corpora will often be available in other specialized domains, as it is always possible to build a corpus from the training documents.
Then, we also train embeddings on each of two general-domain corpora:
Wikipedia (2017): encyclopedia articles from the 01/10/2017 data dump. This is a large (2 billion tokens) corpus from the general domain that has limited coverage of the medical field.
Gigaword (2003): newswire text data from many sources including the New York Times. This is a large (2 billion tokens) corpus from the general domain with almost no coverage of the medical field.

Contextualized Embeddings
We use two off-the-shelf ELMo models:
ELMo small: a general-domain model trained on the 1 Billion Word Benchmark corpus (Chelba et al., 2013). This is the small version of ELMo that produces 256-dimensional embeddings.
ELMo original: the original ELMo model. This is a general-domain model trained on a mix of Wikipedia and newswire data. It produces 1024-dimensional embeddings.
Additionally, we also build embeddings by fine-tuning each model on the i2b2 corpus. The fine-tuning is achieved by resuming the training of the ELMo language model on the new data (i2b2). At each epoch, the validation perplexity is monitored and ultimately the best model is chosen:
ELMo small finetuned: the result of fine-tuning ELMo small for 10 epochs.
ELMo original finetuned: the result of fine-tuning ELMo original for 5 epochs.
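As an illustration of the selection step, the sketch below computes perplexity from per-token log-probabilities and picks the epoch with the lowest validation perplexity. It assumes that "best model" means lowest validation perplexity; the helper names and the perplexity values are hypothetical and are not taken from the experiments.

```python
import math
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def best_epoch(val_perplexities: List[float]) -> int:
    """Return the 1-based index of the epoch with the lowest validation perplexity."""
    best_index = min(range(len(val_perplexities)), key=val_perplexities.__getitem__)
    return best_index + 1


# Hypothetical validation perplexities, one per fine-tuning epoch.
hypothetical_curve = [41.0, 35.5, 33.2, 33.9]
print(best_epoch(hypothetical_curve))  # 3
```

Lower perplexity means the language model assigns higher probability to held-out in-domain text, which is why it is a natural criterion for choosing among fine-tuning checkpoints.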

Embedding Combinations
There are many possible ways to combine embeddings. In this work, we explore two methods:
Concatenation: a simple concatenation of vectors coming from two different embeddings. This is denoted X ⊕ Y (e.g. i2b2 ⊕ Wikipedia).
Mixture: in the particular case where ELMo embeddings are combined with word2vec vectors, we can directly add the word2vec embedding in the linear combination of ELMo.
The mixture method generalizes the way ELMo representations are combined. Given a word w, if we denote the three internal representations produced by ELMo (i.e. the CharCNN, first bi-LSTM and second bi-LSTM representations) by $h_1$, $h_2$, $h_3$, we recall that the model computes the word's embedding as:
$$h_{\mathrm{ELMo}} = \gamma \sum_{i=1}^{3} \alpha_i h_i$$
where $\gamma$ and $\{\alpha_i, i = 1, 2, 3\}$ are tunable task-specific coefficients. Given $h_{\mathrm{w2v}}$, the word2vec representation of the word w, we compute a "mixture" representation as:
$$h_{\mathrm{mix}} = \gamma \left( \sum_{i=1}^{3} \alpha_i h_i + \beta \, h_{\mathrm{w2v}} \right)$$
where $\beta$ is a new tunable coefficient.
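The two combination methods can be sketched on plain Python lists as follows. This is only an illustration under stated assumptions: the mixture is taken to be the ELMo weighted sum with one extra term for the word2vec vector, scaled by the same global coefficient; the layer weights are used as given (any normalization is assumed to have been applied beforehand); and the function names and the toy vectors are hypothetical. Note that the mixture requires the word2vec vector to have the same dimension as the ELMo layers, whereas concatenation does not.

```python
from typing import List, Sequence

Vector = List[float]


def concatenate(x: Vector, y: Vector) -> Vector:
    """Concatenation: the two vectors are placed side by side."""
    return x + y


def elmo_combination(layers: Sequence[Vector], alphas: Sequence[float], gamma: float) -> Vector:
    """Weighted sum of the internal ELMo layers, scaled by gamma."""
    dim = len(layers[0])
    return [gamma * sum(a * h[d] for a, h in zip(alphas, layers)) for d in range(dim)]


def mixture(layers: Sequence[Vector], alphas: Sequence[float], gamma: float,
            h_w2v: Vector, beta: float) -> Vector:
    """Mixture: the static vector joins the weighted sum with its own weight beta."""
    if len(h_w2v) != len(layers[0]):
        raise ValueError("the static vector must have the same dimension as the ELMo layers")
    base = elmo_combination(layers, alphas, gamma)
    return [b + gamma * beta * v for b, v in zip(base, h_w2v)]


# Toy 2-dimensional example with made-up values.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = [0.5, 0.25, 0.25]
h_w2v = [1.0, -1.0]

print(elmo_combination(layers, alphas, gamma=2.0))          # [1.5, 1.0]
print(mixture(layers, alphas, 2.0, h_w2v, beta=0.5))        # [2.5, 0.0]
print(len(concatenate(elmo_combination(layers, alphas, 2.0), h_w2v)))  # 4
```

Setting the extra weight to zero recovers the plain ELMo combination, which is the sense in which the mixture generalizes it.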

Results and Discussion
We run each experiment with 10 different random seeds and report the mean and standard deviation (std) of the performance. Values are expressed in terms of the strict F1 measure, which we compute using the official script from the i2b2/VA 2010 challenge.
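The aggregation over seeds can be sketched as below. The scores are made-up placeholders, not results from this work, and the choice of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, since the text does not specify which is used.

```python
import statistics
from typing import Sequence


def summarize(scores: Sequence[float]) -> str:
    """Format a list of per-seed F1 scores as 'mean ± std'."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumption)
    return f"{mean:.2f} ± {std:.2f}"


# Hypothetical F1 scores from three seeds (the paper uses 10 seeds).
hypothetical_scores = [80.0, 82.0, 84.0]
print(summarize(hypothetical_scores))  # 82.00 ± 2.00
```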

Using General-domain Resources
Table 3 shows the results we obtain using general-domain resources only. The top part of the table shows the performance of word2vec embeddings trained on i2b2 as well as on two general-domain corpora: Wikipedia and Gigaword. We see that i2b2 performs the worst despite being trained on in-domain data. This explicitly showcases the challenge faced by specialized domains and confirms that training embeddings on small in-domain corpora tends to perform poorly. As for the general-domain embeddings, we can observe that Wikipedia is slightly better than Gigaword. This can be explained by the fact that the former has some medical-related articles, which implies a better coverage of the clinical vocabulary compared to the newswire corpus Gigaword. We can also see that combining general-domain word2vec embeddings with i2b2 results in weak improvements that are slightly higher for Gigaword, probably for the same reason.
The middle part of the table shows the results we obtain using off-the-shelf contextualized representations. Looking at the embeddings alone, we see that ELMo small performs worse than i2b2 while ELMo original is better than all word2vec embeddings. Again, the reason for the small model's performance might be related to the different training corpora. In fact, ELMo original, aside from being a larger model, was trained on Wikipedia articles, which may include some medical articles. Another interesting point is that both the mean and variance of the performance when using off-the-shelf ELMo models improve notably when combined with word2vec embeddings trained on i2b2. This improvement is even greater for the small model, probably because it has less coverage of the medical domain. Furthermore, we see that the performance improves again, although to a lesser extent, when the word2vec embedding is mixed with ELMo instead of combined through concatenation.
The bottom part of the table shows the results obtained after fine-tuning both ELMo models. We see that fine-tuning improves all the results (but to varying extents), with the best performance being achieved using combinations (either concatenation or mixture) of i2b2's word2vec and the larger fine-tuned ELMo.
Two points are worth noting. First, it is interesting to see that we achieve good results with a model that only uses an off-the-shelf model and a small in-domain corpus built from the task data. This is a valuable insight since the same strategy could be applied to any specialized domain. Second, we see that the smaller 256-dimensional ELMo model, which initially performed very poorly (≈ 80 F1), improved drastically (≈ +6 F1) using our best strategy and does not lag very far behind the original 1024-dimensional model. This is also valuable since many practitioners do not have the computational resources that are required for using the larger versions of recent models like ELMo or BERT.

Using In-domain Resources
It is natural to wonder how our results fare against models trained on large in-domain corpora. Fortunately, there are two such corpora in the clinical domain:
MIMIC III (2016): a collection of medical notes from a large database of Intensive Care Unit encounters at a large hospital (Johnson et al., 2016). This is a large (1 billion tokens) in-domain corpus.
PubMed (2018): a collection of scientific article abstracts in the biomedical domain. This is a large (4 billion tokens) corpus from a close but somewhat different domain.
Both Zhu et al. (2018) and Si et al. (2019) trained ELMo (original) on MIMIC, with the former resorting to only a part of MIMIC mixed with some curated Wikipedia articles. Table 4 reports their results, to which we add the performance of strategies using word2vec embeddings trained on MIMIC and PubMed, and an open-source ELMo model trained on PubMed.
We can see yet again that word2vec embeddings perform less well than ELMo models trained on the same corpora. We also see that combining the two kinds of embeddings still brings some improvement (see ELMo (PubMed) ++ MIMIC). More importantly, we observe that by using only general-domain resources, we perform very close to the ELMo models trained on a large in-domain corpus (MIMIC), with a maximum difference in F1 measure of ≈ 1.5 points.

Using GloVe and fastText
In order to make sure that the observed phenomena are not the result of using the word2vec method in particular, we reproduce the same experiments using GloVe and fastText. The corresponding results are reported in Table 5 and Table 6. We can see that GloVe and fastText are always outperformed by word2vec when trained on a single corpus only. This is no longer true when combining these embeddings with representations from ELMo. In fact, in this case, the results are mostly comparable to the performance obtained when using word2vec, with a slight improvement when using fastText. This small improvement may be explained by the fact that the fastText method is able to handle out-of-vocabulary tokens while GloVe and word2vec are not.
More importantly, these additional experiments validate the initial results obtained with word2vec: static embeddings pre-trained on a small in-domain corpus (i2b2) can be combined with general-domain contextual embeddings (ELMo), through either one of the proposed methods, to reach a performance that is comparable to the state of the art.
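The out-of-vocabulary point made above for fastText can be illustrated with a toy sketch. fastText represents a word through its character n-grams (with `<` and `>` as boundary markers), so a vector can be composed for a word never seen in training. In the sketch below, the n-gram vectors are deterministic pseudo-random stand-ins for learned parameters and the n-gram vectors are averaged for simplicity; the function names and the small vocabulary are hypothetical.

```python
import zlib
from typing import Dict, List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def ngram_vector(ngram: str, dim: int) -> List[float]:
    """Toy deterministic vector for an n-gram (stands in for a learned one)."""
    return [((zlib.crc32(f"{ngram}|{d}".encode("utf-8")) % 2001) - 1000) / 1000
            for d in range(dim)]


def subword_vector(word: str, dim: int = 4) -> List[float]:
    """Average the n-gram vectors; works for any string, seen or unseen."""
    grams = char_ngrams(word) or [f"<{word}>"]  # very short words: use the padded word
    vectors = [ngram_vector(g, dim) for g in grams]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# A static lookup table has no entry for an unseen word...
static_table: Dict[str, List[float]] = {"headache": [0.1, 0.2, 0.3, 0.4]}
print(static_table.get("oxycodone"))  # None

# ...whereas a subword model can still compose a vector for it.
print(char_ngrams("mri", 3, 3))  # ['<mr', 'mri', 'ri>']
print(len(subword_vector("oxycodone")))  # 4
```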

Limitations
We can list the following limitations for this work:
• we tested only one specialized domain on one task using one NER architecture. Although the results look promising, they should be validated by a wider set of experiments;
• our best strategies use the task corpus (i2b2) to adapt general off-the-shelf embeddings to the target domain, then combine two different types of embeddings as an ensemble to boost performance. This may not work if the task corpus is really small (we recall that our corpus is ≈ 1 million tokens).

Conclusion and Future Work
While embedding methods are improving on a regular basis, specialized domains still lack large enough corpora to train these embeddings successfully. We address this issue and propose embedding strategies that only require general-domain resources and a small in-domain corpus. In particular, we show that using a combination of general-domain ELMo, fine-tuning and word2vec embeddings trained on a small in-domain corpus, we achieve a performance that is not very far behind that of models trained on large in-domain corpora. Future work may investigate other contextualized representations such as BERT, which has proven to be superior to ELMo (at least on our task) in the recent work by Si et al. (2019). Another interesting research direction could be exploiting external knowledge (e.g. ontologies) that may be easier to find in specialized fields than large corpora.
The patient had headache that was relieved only with oxycodone .
A CT scan of the head showed microvascular ischemic changes .
A followup MRI which also showed similar changes .
This was most likely due to her multiple myeloma with hyperviscosity .
Echocardiogram on **DATE[Nov 6 2007] , showed ejection fraction of 55% , mild mitral insufficiency , and 1+ tricuspid insufficiency with mild pulmonary hypertension .
DERMOPLAST TOPICAL TP Q12H PRN Pain
DOCUSATE SODIUM 100 MG PO BID PRN Constipation
IBUPROFEN 400-600 MG PO Q6H PRN Pain

Table 1: Examples of entity mentions (Problem, Treatment, and Test) from the i2b2 2010 dataset.
Table 2 shows the distribution of each entity type in the training and test sets.

Table 2: Distribution of medical entity types.

Table 3: Performance of various strategies involving a general-domain resource and a small in-domain corpus (i2b2). The values are exact span F1 scores given as mean ± std (bold: best result for each kind of combination).