Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization

Word2vec embeddings are limited to computing vectors for in-vocabulary terms and do not take into account sub-word information. Character-based representations, such as fastText, mitigate such limitations. We optimize and compare these representations for the biomedical domain. fastText was found to consistently outperform word2vec in named entity recognition tasks for entities such as chemicals and genes. This is likely due to information gained from computed out-of-vocabulary term vectors, as well as the word compositionality of such entities. In contrast, performance varied on intrinsic datasets. Optimal hyper-parameters were intrinsic dataset-dependent, likely due to differences in term type distributions. This indicates that embeddings should be chosen based on the task at hand. We therefore provide a number of optimized hyper-parameter sets and pre-trained word2vec and fastText models, available at https://github.com/dterg/bionlp-embed.


Introduction
word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) models are a popular choice for word embeddings, representing words by vectors for downstream natural language processing. Optimization of word2vec has been thoroughly investigated by Chiu et al. (2016a) for biomedical text. However, word2vec has two main limitations: i) out-of-vocabulary (OOV) terms cannot be represented, losing potentially useful information; and ii) training is based on co-occurrence of terms, not taking into account sub-word information. With new entities such as genetic variants, pathogens, chemicals and drugs, these limitations can be critical in biomedical NLP.
Sub-word information has played a critical role in improving NLP task performance, but has predominantly depended on feature engineering. More recently, character-based neural networks for tasks such as named entity recognition have been developed and evaluated on biomedical literature (Gridach, 2017). These have achieved state-of-the-art performance but are limited by the quantity of supervised training data.
Character-based representation models such as fastText (Bojanowski et al., 2017; Mikolov et al., 2018) and MIMICK (Pinter et al., 2017) exploit word compositionality to learn distributional embeddings, allowing vectors to be computed for OOV words. Briefly, fastText uses a feed-forward architecture to learn n-gram and word embeddings, whereas MIMICK uses a Bi-LSTM architecture to learn character-based embeddings in the same space as another set of pre-trained embeddings, such as word-based word2vec.
Here we evaluate and optimize pre-trained character-based word representations with the fastText implementation for biomedical terms. To compare with word2vec models, we also optimize word2vec by extending the work by Chiu et al. (2016a). We report that fastText outperforms word2vec in all named entity recognition tasks of feature-rich entities such as chemicals and genes. However, in intrinsic evaluation, results and optimal hyper-parameters vary. This is likely due to different entity type distributions within the intrinsic standards. This indicates representations should be selected and optimized based on the task at hand and the entities of interest. We evaluate and provide optimized generalized fastText and word2vec models and models optimized on individual datasets, outperforming a number of current state-of-the-art embeddings.

Data and pre-processing
PubMed 2018 baseline abstracts and titles were parsed using PubMed parser (Achakulvisut and Acuna, 2016). Each article was represented as a single line, and any new line characters within an article were replaced by a whitespace. Preprocessing was performed using the NLPre module (He and Chen, 2018): all-upper-case sentences were lower-cased, text was de-dashed, parenthetical phrases were identified, acronyms were replaced with full term phrases (e.g. "Chronic obstructive pulmonary disease (COPD)" was changed to "Chronic obstructive pulmonary disease (Chronic_obstructive_pulmonary_disease)"), URLs were removed, and single-character tokens were removed. Tokenization was carried out on whitespace. Punctuation was retained. This resulted in a training dataset of 3.4 billion tokens and a vocabulary of up to 19 million terms (Supp. Table 1).

Embeddings and hyper-parameters
Word embeddings were trained on the preprocessed PubMed articles using Skip-Gram word2vec and fastText implementations in gensim (Řehůřek and Sojka, 2010). As in Chiu et al. (2016a), we tested the effect of hyper-parameter selection on embedding performance for each hyper-parameter: negative sample size, subsampling rate, minimum word count, learning rate (alpha), dimensionality, and window size. Extended parameter ranges were tested for some hyper-parameters, such as window size. Additionally, we tested the range of character n-grams for the fastText models, as originally performed for language models (Bojanowski et al., 2017). Due to the computational cost, especially since fastText models can be up to 7.2x slower to train than word2vec (Supp. Figure 1), we modified one hyper-parameter at a time, while keeping all other hyper-parameters constant. Performance was measured both intrinsically and extrinsically on a number of datasets.
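The one-at-a-time search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration used in this work: the baseline values and ranges are hypothetical, and the key names merely mirror gensim-style keyword names. Each generated configuration differs from the baseline in at most one hyper-parameter, so the number of models to train grows with the sum of the range sizes rather than their product.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical baseline and ranges, for illustration only; these are NOT the
# values used in the paper. Keys mirror gensim-style keyword names.
BASELINE: Dict[str, float] = {
    "negative": 5,
    "sample": 1e-4,
    "min_count": 5,
    "alpha": 0.05,
    "vector_size": 200,
    "window": 5,
}

RANGES: Dict[str, List[float]] = {
    "negative": [1, 5, 10],
    "window": [2, 5, 30],
    "vector_size": [100, 200, 400],
}


def one_at_a_time(
    baseline: Dict[str, float], ranges: Dict[str, List[float]]
) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (varied_name, config) pairs, changing one hyper-parameter per config."""
    for name, values in ranges.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value
            yield name, config


configs = list(one_at_a_time(BASELINE, RANGES))
print(len(configs))  # 9 configurations, versus 27 for a full grid over RANGES
```

Each yielded configuration could then be passed to an embedding trainer and scored on the intrinsic and extrinsic datasets; interactions between hyper-parameters are not explored by this scheme.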

Intrinsic Evaluation
Intrinsic evaluation of word embeddings is commonly performed by correlating the cosine similarity between term pairs, as determined by the trained embeddings, with a reference list. We use the manually curated UMNSRS covering disorders, symptoms, and drugs (Pakhomov et al., 2016), and compute graph-based similarity and relatedness using the human disease ontology graph (Schriml et al., 2012) (HDO) and the Xenopus anatomy and development ontology graph (Segerdell et al., 2008) (XADO). 1 million pairwise combinations of entities and ontologies were randomly computed from each graph; entities which did not map to the ontology map or were multi-token were not considered. Similarity between a pair of terms was computed using the Wu and Palmer (1994) similarity metric, and relatedness was determined by a simplified Lesk algorithm (Lesk, 1986). In the latter, token intersection (excluding stopwords) was calculated between definitions and normalized by the maximum definition length. Pairs which did not have definition statements for any of the terms were excluded. As with UMNSRS, the computed similarity and relatedness scores were correlated with the cosine similarity determined by the embedding models. As word2vec is not capable of representing OOV words, term pairs containing words which are not in the vocabulary are commonly excluded from evaluation in the literature. To allow for comparison between the word2vec and fastText models, we represent OOV words as null vectors, as originally performed by Bojanowski et al. (2017). However, to determine the difference in performance of in-vocabulary word embeddings and OOV word embeddings, we measure correlation both with only in-vocabulary terms, and with OOV term pairs included and null-imputed for word2vec.
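The reference scores and the null-vector handling described above can be illustrated with a short sketch. Several points are assumptions made for the example rather than details confirmed by the text: the ontology is simplified to a tree (real ontology graphs may give a term several parents), the root is assigned depth 1, the stopword list is a small hypothetical one, "maximum definition length" is read as the token count of the longer definition after stopword removal, and the cosine similarity involving a null vector is taken to be 0. The toy ontology and definitions are invented.

```python
import math
from typing import Dict, List, Optional, Sequence

# Toy child -> parent map (hypothetical; a real ontology is a DAG, not a tree).
TOY_PARENT: Dict[str, Optional[str]] = {
    "disease": None,
    "cancer": "disease",
    "infection": "disease",
    "carcinoma": "cancer",
    "sarcoma": "cancer",
}

# Small illustrative stopword list (assumption; the paper does not list one).
STOPWORDS = {"a", "an", "the", "of", "in", "and", "or", "that", "is"}


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return [node, parent, ..., root]; its length is the node's depth (root = 1)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def wu_palmer(a: str, b: str, parent: Dict[str, Optional[str]]) -> float:
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    path_a, path_b = path_to_root(a, parent), path_to_root(b, parent)
    on_path_a = set(path_a)
    # The first node on b's upward path that also lies on a's path is the
    # least common subsumer (deepest shared ancestor).
    lcs = next(node for node in path_b if node in on_path_a)
    return 2 * len(path_to_root(lcs, parent)) / (len(path_a) + len(path_b))


def lesk_relatedness(def_a: str, def_b: str) -> float:
    """Shared non-stopword tokens, normalized by the longer definition's length."""
    tokens_a = [t for t in def_a.lower().split() if t not in STOPWORDS]
    tokens_b = [t for t in def_b.lower().split() if t not in STOPWORDS]
    if not tokens_a or not tokens_b:
        return 0.0
    overlap = len(set(tokens_a) & set(tokens_b))
    return overlap / max(len(tokens_a), len(tokens_b))


def cosine_or_zero(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; a null (all-zero) vector yields 0.0 by assumption."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


print(round(wu_palmer("carcinoma", "sarcoma", TOY_PARENT), 3))           # 0.667
print(wu_palmer("carcinoma", "infection", TOY_PARENT))                   # 0.4
print(lesk_relatedness("a tumor of the lung", "a tumor of the liver"))   # 0.5
print(cosine_or_zero([0.0, 0.0], [1.0, 2.0]))                            # 0.0
```

With null imputation, every pair containing an OOV term contributes a cosine of 0 to the word2vec correlation, which is why a larger share of OOV terms is expected to lower its measured performance.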

Extrinsic evaluation
Intrinsic evaluation by itself may provide limited insights and may not represent the true downstream performance (Faruqui et al., 2016; Chiu et al., 2016b). Therefore, we perform extrinsic evaluation using 3 named entity recognition corpora: (i) the BioCreative II Gene Mention task corpus (BC2GM) (Smith et al., 2008) for genes; (ii) the JNLPBA corpus (Kim et al., 2004) annotating proteins, cell lines, cell types, DNA, and RNA; and (iii) the CHEMDNER corpus (Krallinger et al., 2015) which annotates drugs and chemicals, as made available from Luo et al. (2017). Each of these corpora is originally split into train, development, and test sets; the same splits and sentence ordering were retained here.
The state-of-the-art BiLSTM-CRF neural network architecture (Lample et al., 2016), as implemented in the anago package, was used to train NER models and predict the development set of each corpus for each parameter. Performance was measured by the F-score. Each model was run for up to 10 epochs and the best score on the development set was recorded.
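As an illustration of the metric, the following sketch computes precision, recall, and F-score over predicted entity mentions. It assumes exact-match, entity-level scoring, which is the usual convention for these corpora but is not stated explicitly here; the span tuples and labels are hypothetical.

```python
from typing import Set, Tuple

# (sentence_id, start_token, end_token, label); hypothetical representation.
Span = Tuple[int, int, int, str]


def f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact span+label matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {(0, 0, 2, "GENE"), (0, 5, 6, "GENE"), (1, 3, 4, "GENE")}
predicted = {(0, 0, 2, "GENE"), (1, 3, 5, "GENE")}  # second span has a wrong boundary

precision, recall, f1 = f_score(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.5 0.333 0.4
```

Under exact matching, a prediction with a boundary error counts as both a false positive and a false negative, so long multi-token entity names are penalized heavily when any token is missed.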

Optimized Embeddings
Hyper-parameters achieving the highest performance for each extrinsic corpus and intrinsic standard were determined for word2vec and fastText. Corpus-specific word2vec and fastText models were trained with the set of optimal hyper-parameters for each corpus, as each corpus annotates different entity classes. For a generalized optimal model, we also trained embeddings on optimal hyper-parameters determined across all corpora (Table 6).

General trends: word2vec hyperparameter selection
Overall, intrinsic and extrinsic performance of word2vec models (Figures 1-12) followed similar trends to Chiu et al. (2016a) for the same corpora/standards (i.e. UMNSRS, BC2GM, and JNLPBA); we therefore refer to Chiu et al. (2016a) for further discussion of these trends. Minor differences were recorded for minimum word count (Figures 7-8) and window size (Figures 1-2), where both UMNSRS similarity and relatedness [...] Table 9). [...] Chiu et al. (2016a) this was only the case for relatedness. In intrinsic evaluation of window size, particularly with UMNSRS (Figure 1), performance consistently increased with increasing window size. This trend was also reported by Chiu et al. (2016a), where the maximum window size of 30 obtained the highest similarity and relatedness. We reasoned that, as abstracts generally concern a single topic, increasing the window size to the average abstract length would capture more relevant information. This was indeed the case, obtaining 0.675 and 0.639 for UMNSRS similarity and relatedness respectively, compared to the 0.627 and 0.584 reported by Chiu et al. (2016a) for PubMed. As higher intrinsic performance was obtained in our results for similar window sizes, the difference in performance is likely also attributable to the increase in training data and the different preprocessing.
In the case of extrinsic evaluation, the best performance was generally obtained with smaller window sizes, a trend similar to that reported by Chiu et al. (2016a).

General trends: fastText hyperparameter selection
Except for the character n-gram hyper-parameter, fastText models share the same hyper-parameters as word2vec models. Overall, similar trends in both intrinsic and extrinsic performance were obtained for word2vec and fastText embeddings (Figures 1-12). However, optimal parameters were not necessarily identical, as discussed below.

Intrinsic evaluation
While the overall performance trends across hyper-parameters for fastText are similar to those obtained by word2vec, we report a number of notable differences. When intrinsically evaluated with UMNSRS, word2vec representations consistently achieved higher similarity and relatedness than fastText across hyper-parameters such as window size, dimensions and negative sampling, irrespective of the selected values. However, when evaluating with the HDO and XADO intrinsic datasets, results were more variable. fastText tended to perform similarly to or outperform word2vec across the negative sampling size, dimensions and window size hyper-parameter ranges.
Differences in performance between datasets may be a result of differences in: (i) number of OOV terms; (ii) rarity of terms; and (iii) term types. As UMNSRS is a manually curated reference list of term pairs with the vocabulary of multiple corpora, including PubMed Central, only up to 9 total tokens were OOV (1.3%; Supp. Table 2). HDO contained up to 5% OOV terms. As OOV terms are represented by null vectors for word2vec models, a decrease in performance with an increase in OOV terms is expected. Skipping OOV term pairs from evaluation (rather than imputing) obtained similar performance trends across datasets, indicating that OOV is not the major contributing factor in such intrinsic performance differences. However, this may also imply that fastText degrades the performance for in-vocabulary terms of the UMNSRS dataset. Similar results were reported by the original authors when assessed on the English WS353 dataset (Bojanowski et al., 2017).
Despite terms being in-vocabulary, the frequency with which they occur in the training dataset may vary. This is indeed the case for UMNSRS and HDO, where UMNSRS has a median rank in-vocabulary frequency 4 times higher than HDO. This may indicate that fastText provides better representations for rarer terms. XADO, however, has a median rank in-vocabulary frequency within 1.3 times of UMNSRS. This implies there are additional contributing factors to such performance differences, potentially including differences in the quality of the ontology graph.
As the intrinsic standards contain various entity classes, differences in the representation models' performance (and optimal hyper-parameters) may be dependent on the distribution of entity types. The fastText authors reported that fastText outperforms word2vec in languages like German, Arabic, Russian and in rare English words (Bojanowski et al., 2017). This indicates that the performance of word2vec and fastText is dependent on compositionality and word character features, and may therefore be expected to vary between biomedical entity classes.
Biomedical text generally contains terms such as chemicals, genes, proteins and cell-lines which are rich in features such as punctuation, special characters, digits, and mixed-case characters. Such orthographic features have been manually extracted in traditional machine learning methods, or more recently combined with word embeddings, and have been shown to have discriminating power in tasks such as named entity recognition (Galea et al., 2018).

Comparison of representations: Extrinsic evaluation
When performing named entity recognition as extrinsic evaluation of the word representation models, fastText consistently outperformed word2vec at any hyper-parameter value, and consistently across all 3 corpora (Figures 2, 4, 6, 7, 10, 12). With 9-13% total OOV tokens, and 14-34% OOV entity tokens (Supp. Table 3, Supp. Figs. 3, 4), this indicates the overall likely positive [...] In terms of the specific corpora, the largest performance difference was recorded for genes (BC2GM) and chemical names (CHEMDNER). As these two corpora each tag only one entity type, entity variation is lower than in JNLPBA, which tags 5 entity classes; this may contribute to the differing performance gaps between the corpora.
In addition to the rich and unique features, the stronger performance of fastText in extrinsic evaluation may also be attributed to the standardized nomenclature used in biomedical entities, which provides additional within-token structure. For example, systematic chemical names follow the IUPAC nomenclature. Prefixes such as mono-, di-, and tri- indicate the number of identical substituents in a compound. Similarly, residual groups are represented by prefixes such as methyl- and bromo-. Additionally, the backbone structure of the molecule is assigned a suffix that indicates structural features (e.g. simple hydrocarbon molecules use suffixes to indicate the number of single, double or more bonds, where -ane indicates single bonds, -ene double bonds, -yne triple bonds, etc.).
With such structure, as fastText is a character-level model, for chemicals such as 1,2-dichloromethane, most similar words include chemicals which share the substituents and their specific position, defined by the 1,2-dichloro- prefix (Table 1). Therefore, fastText provides more structurally-similar chemicals, whereas word2vec would treat 1,2-dichloromethane and 2-dichloromethane as two completely different/unrelated terms (when excluding context or setting a small window size).
As chemicals can be synthesized and named, very specific and large molecules such as 1-(dimethylamino)-2-methyl-3,4-diphenylbutane-1,3-diol are likely to be OOV. This is a major advantage of character-level embeddings, which still allow a representation to be computed.
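To make the shared sub-word structure concrete, the sketch below extracts character n-grams in the manner described by Bojanowski et al. (2017), with the word wrapped in the boundary symbols < and >, and compares the n-gram sets of two chemical names. It is an illustration only: the 3-6 range is chosen for the example, the Jaccard overlap is our own stand-in for similarity, and the actual fastText model hashes n-grams and sums their learned vectors rather than comparing sets.

```python
from typing import Set


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> Set[str]:
    """Return all character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f"<{word}>"  # boundary symbols mark prefixes and suffixes
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


full_name = char_ngrams("1,2-dichloromethane")
related = char_ngrams("2-dichloromethane")
unrelated = char_ngrams("ethanol")

print("ane>" in full_name)                                  # True: suffix n-gram
print(jaccard(full_name, related) > jaccard(full_name, unrelated))  # True
```

Because most n-grams of 2-dichloromethane also occur in 1,2-dichloromethane, a model that builds word vectors from n-gram vectors places the two names close together even when one of them is rare or unseen, whereas a word-level model has no such shared units.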
Given the highly standardized and structured nomenclature of chemicals, we briefly observed that fastText models are also able to recall structural analogs when performing analogy tasks. For example, methanol → methanal is an oxidation reaction where an alcohol is converted to an aldehyde, specifically the -OH group is converted to a =O group. Given ethanol and performing analogy task vector arithmetic, the aldehyde ethanal is returned. Similar results were observed for sulfuric_acid - sulfur + phosphorous, giving phosphoric_acid. Formal evaluation on analogy tasks is required to assess how character-based embeddings perform compared to word2vec.
Genes and proteins have full names as well as short symbolic identifiers which are usually acronymic abbreviations. These are less structured than chemical names; however, as the root portion of the symbols represents a gene family, this accounts for the similarity performance of character-based embeddings. ZNF560 is an example of an OOV protein that was assigned a vector close to ZNF* genes (Table 1) as well as SOX1. While SOX1 does not share character n-grams with ZNF560, similarity was determined based on co-occurrence of ZNF genes and SOX1, genes which are associated with adenocarcinomas (Chang et al., 2015).
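The vector arithmetic behind such an analogy query can be sketched as follows. The three-dimensional vectors are hand-made toy values chosen so that the arithmetic is easy to follow (one axis for chain length, one for the alcohol group, one for the aldehyde group); they are not taken from any trained model, and the function name is our own.

```python
import math
from typing import Dict, List, Sequence

# Hand-made toy vectors: [carbon count, alcohol feature, aldehyde feature].
# Purely illustrative; NOT output from a trained word2vec or fastText model.
TOY_VECTORS: Dict[str, List[float]] = {
    "methanol": [1.0, 1.0, 0.0],
    "methanal": [1.0, 0.0, 1.0],
    "ethanol": [2.0, 1.0, 0.0],
    "ethanal": [2.0, 0.0, 1.0],
    "propanol": [3.0, 1.0, 0.0],
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def analogy(a: str, b: str, c: str, vectors: Dict[str, List[float]]) -> str:
    """Return the word closest to vector(b) - vector(a) + vector(c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# methanol is to methanal as ethanol is to ...?
print(analogy("methanol", "methanal", "ethanol", TOY_VECTORS))  # ethanal
```

Excluding the three query words from the candidates follows common practice for analogy queries; without it, the nearest neighbour of the target vector is often one of the inputs.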

Effect of n-grams size
Intrinsic evaluation shows high variability in the range of n-grams between the different standards (Table 2 & Supp. Table 25). UMNSRS achieves the highest performance (in terms of similarity) with 6-7 n-grams, whereas XADO achieves best results with 3-4 n-grams, and HDO achieves equal performance with the ranges 5-{6,7,8}, 4-6 and 6-8. This indicates the heterogeneity of the terms, both within the reference standards for HDO and XADO, and between standards. This further supports the view that the differences between the representation models are due to entity type differences.
In contrast, extrinsic evaluation showed high consistency in n-gram ranges, with all corpora recording highest performance for the ranges 3-7 and 3-8. Within standard error (Supp. Tables 23, 24), high performance was also obtained for ranges with a lower limit of 2 or 3. Such ranges indicate that both short and long n-grams provide relevant information, consistent with the previous discussion and examples for gene nomenclature and chemical naming conventions.

Optimized Models
Word embeddings trained on individual reference standards' optimal hyper-parameters (Supp. [...] Chiu et al. (2016a), and the more recent 0.681/0.635 by Yu et al. (2017) achieved by retrofitting representations with knowledge bases, but not 0.75/0.73 by MeSH2Vec using prior knowledge (Jha et al., 2017). We expect further improvement to our models by retrofitting and augmenting with prior knowledge.
Corpus-optimized fastText embeddings outperformed word2vec across all extrinsic corpora, recording 79.33%, 73.30% and 90.54% for BC2GM, JNLPBA, and CHEMDNER respectively (Supp. Table 26). This outperforms Chiu et al. (2016a), Pyysalo et al. (2013) and Kosmopoulos et al. (2015), although differences are also due to differ[...] (Luo et al., 2017), the best performance reported in the literature to date. Optimizing word2vec and fastText representations across all corpora and standards (Supp. Table 28) decreased the performance difference in NER between word2vec and fastText. This is due to the differences in the optimal hyper-parameters between intrinsic and extrinsic data (Supp. Table 29). Based on these differences, and as it has been shown that intrinsic results are not reflective of extrinsic performance (Chiu et al., 2016b), we generated separate word2vec and fastText models optimized on intrinsic and extrinsic datasets separately (Table 3). Again, fastText outperforms word2vec in all NER tasks but only outperforms word2vec for the HDO intrinsic dataset, possibly due to similarity implied from disease suffixes captured by n-grams.
Table 3. Intrinsic and extrinsic performance for word2vec and fastText models optimized on optimum hyper-parameters from intrinsic (int) and extrinsic (ex) datasets (Supp. Table 27).

Conclusion and future directions
We show that fastText consistently outperforms word2vec in named entity recognition of entities such as chemicals and genes. This is likely attributable to the ability of character-based representations to compute vectors for OOV terms, and to the highly structured, standardized and feature-rich nature of such entities.
Intrinsic evaluation indicated that the optimal hyper-parameter set, and hence optimal performance, is highly dataset-dependent. While the number of OOV terms and the rarity of in-vocabulary terms may contribute to such differences, further investigation is required to determine how the different entity types within the corpora are affected. Similarly, for named entity recognition, investigating the performance differences for each entity class would provide a more fine-grained insight into which classes benefit most from fastText, and why.
Empirically, we observed a trade-off between character sequence similarity and context in word2vec and fastText models. It would be interesting to assess how embedding models such as MIMICK, where the word2vec space can be preserved while still being able to generate character-based vectors for OOV terms, compare.