How to Train Good Word Embeddings for Biomedical NLP

The quality of word embeddings depends on the input corpora, model architectures, and hyper-parameter settings. Using the state-of-the-art neural embedding tool word2vec and both intrinsic and extrinsic evaluations, we present a comprehensive study of how the quality of embeddings changes according to these features. Apart from identifying the most influential hyper-parameters, we also observe one that creates contradictory results between intrinsic and extrinsic evaluations. Furthermore, we find that bigger corpora do not necessarily produce better biomedical domain word embeddings. We make our evaluation tools and resources as well as the created state-of-the-art word embeddings available under open licenses from https://github.com/cambridgeltl/BioNLP-2016.


Introduction
As one of the main inputs of many NLP methods, word representations have long been a major focus of research. Recently, the embedding of words into a low-dimensional space using neural networks was suggested (Bengio et al., 2003; Collobert and Weston, 2008; Turian et al., 2010; Mikolov et al., 2013b; Pennington et al., 2014). These approaches represent each word as a dense vector of real numbers, where words that are semantically related to one another map to similar vectors. Among neural embedding approaches, the skip-gram model of Mikolov et al. (2013a) has achieved cutting-edge results in many NLP tasks, including sentence completion, analogy and sentiment analysis (Mikolov et al., 2013a; Mikolov et al., 2013b; Fernández et al., 2014).
Although word embeddings have been studied extensively in recent work (e.g. Lapesa and Evert (2014)), most such studies only involve general domain texts and evaluation datasets, and their results do not necessarily apply to biomedical NLP tasks. In the biomedical domain, Stenetorp et al. (2012) studied the effect of corpus size and domain on various word clustering and embedding methods, and Muneeb et al. (2015) compared two state-of-the-art word embedding tools, word2vec and Global Vectors (GloVe), on a word-similarity task. They showed that skip-gram significantly outperforms other models and that its performance can be further improved by using higher dimensional vectors. The word2vec tool was also used to create biomedical domain word representations by Pyysalo et al. (2013) and Kosmopoulos et al. (2015).
Given that word2vec has been shown to achieve state-of-the-art performance that can be further improved with parameter tuning, we focus on its performance on biomedical data with different inputs and hyper-parameters. We use all available biomedical scientific literature for learning word embeddings using models implemented in word2vec. For intrinsic evaluation, we use the standard UMNSRS-Rel and UMNSRS-Sim datasets (Pakhomov et al., 2010), which enable us to measure similarity and relatedness separately. For extrinsic evaluation, we apply a neural network-based named entity recognition (NER) model to two standard benchmark NER tasks, JNLPBA (Kim et al., 2004) and the BioCreative II Gene Mention task (Smith et al., 2008).
Apart from showing that the optimization of hyper-parameters boosts the performance of vectors, we also find that one such parameter leads to contradictory results between intrinsic and extrinsic evaluations. We further observe that a larger corpus does not necessarily guarantee better results.

Corpora and Pre-processing
We use two corpora to create word vectors: the PubMed Central Open Access subset (PMC) and PubMed. PMC is a digital archive of biomedical and life science literature, which contains more than 1 million full-text Open Access articles. The PubMed database has more than 25 million citations that cover the titles and abstracts of biomedical scientific publications. A version of PMC articles is distributed in text format, whereas PubMed is distributed in XML. Thus, we use a PubMed text extractor to extract title and abstract texts from the PubMed source XML. Both PubMed and PMC were pre-processed with the Genia Sentence Splitter (GeniaSS) (Saetre et al., 2007), which is optimized for biomedical text. We further tokenize the sentences with the Treebank Word Tokenizer provided by the NLTK Python library (Bird, 2006). The corpus statistics are shown in Table 1.

Word vectors
Factors that affect the performance of word representations include the training corpora, the model architectures, and the hyper-parameters. To assess the effect of corpora, we generate three variants of each set of word vectors: one from PubMed, one from PMC, and one from the combination of the two (PMC-PubMed). To study how preprocessing affects word vectors, we create vectors from the original text corpora, lower-cased variants, and variants where sentences are shuffled in random order. We further generate two sets of vectors, one by applying the skip-gram model and one by applying the CBOW model, both built with the default hyper-parameter values of word2vec. We first evaluate these vectors to determine the better-performing model architecture. Using the better model, we then build vectors while varying a single hyper-parameter in isolation, and we repeat the process for every hyper-parameter under examination. We then report the results of these sets of vectors in our intrinsic and extrinsic evaluations.
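The procedure of repeating the process for every hyper-parameter can be sketched roughly as follows. This is a minimal illustration rather than the scripts used in this study: the flag names are those of the word2vec command-line tool, while the starting values in BASE, the candidate values in GRID, the executable name and the file names are assumptions made purely for the example. The abbreviations (neg, samp, min-count, alpha, dim, win) are those defined in the Hyper-parameters section.

```python
from typing import Dict, Iterator, List, Tuple, Union

Value = Union[int, float]

# Paper abbreviation -> word2vec command-line flag.
FLAG: Dict[str, str] = {
    "neg": "-negative", "samp": "-sample", "min-count": "-min-count",
    "alpha": "-alpha", "dim": "-size", "win": "-window",
}

# Illustrative starting values only (assumed; not the paper's Table 2).
BASE: Dict[str, Value] = {
    "neg": 5, "samp": 1e-3, "min-count": 5, "alpha": 0.025, "dim": 100, "win": 5,
}

# Hypothetical candidate values for two of the six hyper-parameters.
GRID: Dict[str, List[Value]] = {"neg": [1, 2, 3, 5, 8], "win": [1, 2, 5, 10, 30]}


def one_at_a_time(base: Dict[str, Value],
                  grid: Dict[str, List[Value]]) -> Iterator[Tuple[str, Value, Dict[str, Value]]]:
    """Yield configurations that differ from `base` in a single hyper-parameter."""
    for name, values in grid.items():
        for value in values:
            cfg = dict(base)      # copy so that `base` is never modified
            cfg[name] = value
            yield name, value, cfg


def word2vec_command(corpus: str, output: str, cfg: Dict[str, Value],
                     skipgram: bool = True) -> List[str]:
    """Build an argument list for the word2vec tool (-cbow 0 selects skip-gram)."""
    cmd = ["word2vec", "-train", corpus, "-output", output,
           "-cbow", "0" if skipgram else "1"]
    for name, flag in FLAG.items():
        cmd += [flag, str(cfg[name])]
    return cmd


if __name__ == "__main__":
    for name, value, cfg in one_at_a_time(BASE, GRID):
        # "corpus.txt" and the output name are placeholder file names.
        print(" ".join(word2vec_command("corpus.txt", f"vec.{name}-{value}.bin", cfg)))
```

Each generated command could then be passed to a process runner; keeping the sweep separate from the command construction makes it easy to change the grid without touching the flag mapping.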

Hyper-parameters
We test the following key hyper-parameters:
Negative sample size (neg): the representation of a word is learned by maximizing its predicted probability to co-occur with its context words, while minimizing the probability for others. However, the normalisation of this probability involves a denominator deriving from co-occurrences between words and all their contexts in the corpus, which is time-consuming to compute. To address this issue, negative sampling only calculates the probability with reference to a set number of other randomly chosen negative words (neg).
Sub-sampling (samp): Sub-sampling refers to the process of reducing occurrences of frequent words. It selects words appearing with a ratio higher than the threshold samp, and ignores each occurrence with a given probability. The process is used to minimise the effect of non-informative frequent words in training. Very frequent words (e.g. in) are less informative because they co-occur with most words in the corpus. For example, a model can benefit more from seeing an occurrence of p16 with CDKN2 than an instance of the frequent co-occurrence of p16 with in.
Minimum-count (min-count): The minimum count defines the minimum number of occurrences required for a word to be included in the word vectors. This parameter allows control over the size of the vocabulary and, consequently, the resulting word embedding matrix.
Learning Rate (alpha): neural networks are trained by gradually updating weight vectors along a gradient to minimize an objective function. The magnitude of these updates is controlled by the learning rate.
Table 3: Corpus statistics of the baseline vectors.
Vector                            Token
PMC-PubMed (Pyysalo et al.)       5,487,486,225 (total)
PMC (Pyysalo et al.)              2,591,137,744 (total)
PubMed (Pyysalo et al.)           2,896,348,481 (total)
PubMed (Kosmopoulos et al.)       1,701,632 (distinct)
Vector dimension (dim): The vector dimension is the size of the learned word vector. While a higher dimension tends to capture better word representations, such vectors are more computationally costly to train and produce a larger word embedding matrix.
Context window size (win): The size of the context window defines the range of words to be included as the context of a target word. For instance, a window size of 5 takes five words before and after a target word as its context for training.
We refer to Mikolov et al. (2013a) and Levy et al. (2015) for further details regarding these parameters.

Baseline Vectors
As baselines, we include the biomedical domain vectors created by Pyysalo et al. (2013) and Kosmopoulos et al. (2015). Their corpus statistics are shown in Table 3. All of these vectors are built with the skip-gram model with the default parameter values (see Table 2).

Intrinsic Evaluation
A standardized intrinsic measure for word representations in the biomedical domain is the UMNSRS word similarity dataset (Pakhomov et al., 2010). We use its UMNSRS-Sim (Sim) and UMNSRS-Rel (Rel) subsets as our references. They have 566 and 587 word pairs for measuring similarity and relatedness (respectively), whose degree of association was rated by participants from the University of Minnesota Medical School. In UMNSRS, the human evaluation of every word pair is converted to a score that indicates its degree of similarity, with a higher score implying a more similar pair. The score is on an arbitrary scale. While UMNSRS provides reference scores for each word pair, we obtain our own estimate by calculating the cosine similarity of the learned word vectors for each pair. We then compare the two sets of scores using Spearman's correlation coefficient (ρ), which is a standard metric in word similarity tasks for comparing rankings between variables regardless of scale. We systematically ignore words that appear only in the reference but not in our models.
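The intrinsic protocol described above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions and not the evaluation tool released with this work: word pairs are assumed to be given as (word1, word2, reference score) tuples, vectors as a dictionary from words to lists of floats, ties are handled with average ranks, and the toy words and scores in the usage part are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs: List[Tuple[str, str, float]],
             vectors: Dict[str, Sequence[float]]) -> Tuple[float, int]:
    """Return (rho, number of pairs used), skipping out-of-vocabulary pairs."""
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, predicted), len(gold)


if __name__ == "__main__":
    toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
    toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "unseen", 7.0)]
    print(evaluate(toy_pairs, toy_vectors))
```

Returning the number of pairs actually used alongside ρ makes it visible when a score is based on only part of the reference set.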

Extrinsic Evaluation
Given that the ultimate evaluation for word vectors is their performance in downstream applications, we also assess the quality of the vectors by performing NER using two well-established biomedical reference standards: the BioCreative II Gene Mention task corpus (BC2) (Smith et al., 2008) and the JNLPBA corpus (PBA) (Kim et al., 2004). Both of these corpora consist of approximately 20,000 sentences from PubMed abstracts manually annotated for mentions of biomedical entity names. Following the window approach architecture with word-level likelihood proposed by Collobert and Weston (2008), we apply a tagger built on a simple feed-forward neural network, with a window of five words, one hidden layer of 300 neurons and a hard sigmoid activation, leading to a Softmax output layer. Our word vectors are used as the embedding layer of the network, with the only other input being a low-dimensional binary vector of word surface features. To emphasize the effect of the input word vectors on performance, we avoid fine-tuning the word vectors during training as well as introducing any external resources such as entity name dictionaries. While this causes the performance of the method to fall notably below the state of the art, we believe this minimal approach to be an effective way to focus on the quality of the word vectors as they are created by the tool (word2vec). For parameter selection, we estimate the extrinsic performance of word vectors on the development sets of the two corpora using mention-level F-score. For the final experiment with selected parameters we apply the test sets and evaluation scripts of the two tasks in accordance with their original evaluation protocols.

Results

Skip-grams vs. CBOW
Tables 4 and 5 (first two rows) show results comparing the skip-gram and CBOW models with default hyper-parameter values in intrinsic and extrinsic evaluation, respectively. In general, the skip-gram vectors show better results than CBOW in both the word similarity task and in entity mention tagging. In CBOW, the representations of a group of context words are learned through predicting one focus word, with the prediction error back-propagated and averaged over all context words. By contrast, in skip-gram, the representation of a focus word is learned by predicting every other context word in the window separately, with the prediction error of each context word back-propagated to the target word. This may allow better vectors to be learned as a focus word is trained over more data, but with less smoothing over contexts. Our result is consistent with that of many previous studies, including that of Muneeb et al. (2015), who compared model architectures on different vector dimensions and reported that skip-gram outperforms CBOW in biomedical domain tasks.
From Tables 4 and 5, we see that most vectors benefit from lower-casing and shuffling the corpus sentences. Since the learning rate in word2vec is decayed as training progresses, text appearing early has a larger effect on the model. Shuffling makes the effect of all text (roughly) equivalent. Lower-casing, on the other hand, ensures that different casings of the same word, such as protein, Protein and PROTEIN, are normalised (indexed as one term) for training. Although the shuffled-lower vectors perform better, in the following we report further results based on the unshuffled-text vectors to preserve the comparability of results.
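The two preprocessing variants discussed above, lower-casing and sentence shuffling, are simple to reproduce. The following is a minimal sketch that assumes the corpus fits in memory as a list of tokenized sentence strings; the function name, the fixed seed and the example sentences are illustrative choices and not part of the pipeline used in this study.

```python
import random
from typing import Iterable, List


def preprocess(sentences: Iterable[str], lowercase: bool = True,
               shuffle: bool = True, seed: int = 0) -> List[str]:
    """Return a lower-cased and/or sentence-shuffled copy of the corpus."""
    out = [s.lower() if lowercase else s for s in sentences]
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible across runs.
        random.Random(seed).shuffle(out)
    return out


if __name__ == "__main__":
    corpus = ["Protein binds DNA .", "PROTEIN is expressed .", "The protein folds ."]
    for line in preprocess(corpus):
        print(line)
```

For corpora of the size used here, an in-memory shuffle may not be practical, and an external shuffling tool operating on one sentence per line would be the more likely choice.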

Hyper-Parameters
We next show that four of the six hyper-parameters notably improve performance only in the intrinsic task and not in the extrinsic one, while one boosts figures in both tasks to a great extent. Lastly, one of them shows opposite effects on intrinsic and extrinsic evaluations.
Intuitively, larger values of the neg parameter could be expected to benefit the training process by providing more (negative) examples, but we can only see a benefit in the intrinsic result (Figure 1). The performance of word vectors on the intrinsic task generally improves as neg increases from 1 to 8 (Table 6), whereas extrinsic task performance remains approximately the same (Table 7). We refer to Levy et al. (2015) for further analysis of the effect of this skip-gram parameter in a general domain context.
Regarding sub-sampling, a lower threshold gives more words a probability of being downsampled. From Figure 2, it appears that sub-sampling also has a large effect on the intrinsic task, where most figures increase substantially before samp = 1e-6 [...] continuously, a substantial number of informative frequent words are downsampled, leading to ineffective learning of the representation.
Words occurring fewer than min-count times will be completely removed from the corpus, resulting in fewer words in the word vectors. From Figure 3, most of the results show limited effect for this parameter, except for a notable increase for PubMed vectors in the intrinsic task (Table 10). However, our intrinsic evaluations, following the standard protocol, ignore words that are excluded by min-count. Hence, for PubMed vectors, when min-count = 400, only about half of the assessment items are used in intrinsic evaluation. This implies that the results for min-count > 400 only reflect the representation of frequent words. By contrast, as the out-of-vocabulary rate in extrinsic tasks is about 2.6%, its influence is less notable.
The learning process will be unstable if the learning rate is too large and will be slow if it is too small. From Tables 12 and 13, alpha = 0.05 appears to be an optimal value, for which most of the vectors have their best or second-best results in both evaluations.
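The interaction between min-count and the intrinsic protocol noted above, where raising min-count shrinks the set of word pairs that are actually scored, can be made concrete with a small sketch. The function names and the toy corpus are hypothetical; the only behaviour taken from the text is that words occurring fewer than min-count times are dropped and that pairs containing a dropped word are ignored.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def vocabulary(tokens: Iterable[str], min_count: int) -> Set[str]:
    """Words that survive min-count filtering (count >= min_count)."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


def coverage(pairs: List[Tuple[str, str]], vocab: Set[str]) -> float:
    """Fraction of evaluation pairs whose two words are both in the vocabulary."""
    kept = [pair for pair in pairs if pair[0] in vocab and pair[1] in vocab]
    return len(kept) / len(pairs)


if __name__ == "__main__":
    toy_tokens = ["gene"] * 5 + ["protein"] * 3 + ["kinase"]
    toy_pairs = [("gene", "protein"), ("gene", "kinase")]
    for min_count in (1, 2, 4):
        print(min_count, coverage(toy_pairs, vocabulary(toy_tokens, min_count)))
```

Reporting this coverage next to each correlation helps separate genuine gains in vector quality from the effect of evaluating on frequent words only.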

Vector Dimension (dim)
The effect of vector dimension on our vectors is notable in all tasks (Figure 5). In Tables 14 and 15, we see a large improvement in all evaluations when the vector dimension grows. Although the improvement for extrinsic measures stops when dim > 200, it is evident that an increase from low

Context Window Size (win)
We find contradictory results from changing the size of the context window (Figure 6). All three sets of vectors show a notable increase in the intrinsic measures when the context window size grows (Table 16). However, the extrinsic evaluation shows the opposite pattern (Table 17): all results in extrinsic tasks have an early performance peak with a narrow window (e.g. win = 1), followed by a gradual decrease as the window size increases. One possible explanation may be that a larger window emphasizes the learning of domain/topic similarity between words, while a narrow context window leads the representa- [...], 2012). It is possible that for intrinsic evaluation datasets such as UMNSRS it is more important to model topical rather than functional similarity. Conversely, it is intuitively clear that for tasks such as named entity recognition the modeling of functional similarity such as co-hyponymy is centrally important. For further discussion on the effect of the context window size parameter, we refer to Hill et al. (2015) and Levy et al. (2015).

Comparative evaluation
Based on the parameter selection experiments covering three corpora (PMC, PubMed and both), various preprocessing options (normal text, sentence-shuffled text, lower-cased text), two model architectures (skip-gram vs. CBOW) and six hyper-parameters, we selected the best-performing options for comparative evaluation against the baseline vectors (Table 18). Since the size of the context window (win) showed contradictory results between the intrinsic and extrinsic tasks, we created vectors for two different values of this parameter. Note that for this comparative evaluation we use the test sets and test evaluation scripts of the two extrinsic tasks. Table 19 summarizes the results of the comparative evaluation. For our intrinsic tasks, our vectors with win = 30 show the best performance, clearly outperforming the baselines as well as our otherwise identically created vectors with win = 2. This further supports the suggestion that a larger context window facilitates the learning of domain similarity for the intrinsic task. For extrinsic tasks, while the difference to the baselines is smaller, our vectors with win = 2 show the best results for JNLPBA and the second best in BC2GM, while the vectors with win = 30 are clearly less competitive.
Table 19: Intrinsic and extrinsic evaluation with comparison to baseline vectors
The comparative evaluation on test set data thus confirms the indications from parameter selection that the context window size has opposite effects on the intrinsic and extrinsic metrics and indicates that our experiments have succeeded in creating a pair of word embeddings that show state-of-the-art performance when applied to tasks appropriate for each.

Discussion
In this study, we have created vectors with PubMed, PMC and the combination of the two with a large variety of different model, preprocessing and parameter combinations. While in theory a larger corpus is expected to benefit the learning of word representations, we find that in many cases this does not hold, in particular with the combination of PubMed and PMC showing lower results than PubMed alone. We offer two possible explanations for this surprising finding, which contradicts some prior in-domain results. First, we used PMC texts recently introduced by PubMed Central using an incompletely documented extraction process, and preliminary examination suggests that the proportion of non-prose text in this material may be quite high, potentially affecting learning. An alternative explanation may be that the word2vec implementation has a (somewhat hidden) "reduce-vocab" function that triggers rare-word removal when the size of the corpus crosses certain thresholds: the larger the corpus size, the more aggressive the trimming. Preliminary results suggest that this functionality may have affected PMC-PubMed, our largest corpus, to a larger extent than the other corpora. We leave the resolution of this question for future work.

Conclusion and future work
In this study, we show how the performance of word vectors changes with different corpora, preprocessing options (normal text, sentence-shuffled text, lower-cased text), model architectures (skip-gram vs. CBOW) and hyper-parameter settings (negative sample size, sub-sampling rate, min-count, learning rate, vector dimension, context window size). For corpora, sentence-shuffled PubMed texts appear to produce the best performance, exceeding that of the notably larger combination with PMC texts.
For hyper-parameter settings, it is evident that performance can be notably improved over the default parameters, but the effects of the different hyper-parameters on performance are mixed and sometimes counterintuitive. We have previously found a similar result in general domain work (with Wikipedia text) (Chiu et al., 2016).
Several directions remain open for future work. First, in addition to tuning individual parameters in isolation, we can study the effect of tuning two or more parameters simultaneously. In addition, the number of training iterations was not considered in the experiments here, and careful tuning of this parameter both separately and jointly with associated parameters such as alpha may offer further opportunities for improvement.