Generalizing Word Embeddings using Bag of Subwords

We approach the problem of generalizing pre-trained word embeddings beyond fixed-size vocabularies without using additional contextual information. We propose a subword-level word vector generation model that views words as bags of character n-grams. The model is simple, fast to train and provides good vectors for rare or unseen words. Experiments show that our model achieves state-of-the-art performances in English word similarity task and in joint prediction of part-of-speech tag and morphosyntactic attributes in 23 languages, suggesting our model’s ability in capturing the relationship between words’ textual representations and their embeddings.


Introduction
Word embeddings have been an essential part of neural-network based approaches for natural language processing tasks (Goldberg, 2016). However, many popular word embeddings techniques have a fixed vocabulary (Mikolov et al., 2013;Pennington et al., 2014), i.e., they can only provide vectors over a finite set of common words that appear frequently in a given corpus. Such methods fail to generate vectors for rare words and words not present in the training corpus, but appearing in the test corpus or downstream task texts, raising difficulty for any methods relying on word vectors to efficiently extract useful features from text. This is often referred to as the out-ofvocabulary (OOV) word problem. We aim to address this problem by inferring vectors for OOV words with only access to pre-trained vectors over a fixed vocabulary of common words and the OOV word itself without context.
The motivations come from both linguistics and natural language processing applications. First, from a linguistic view a word can be decomposed into multiple morphemes: stems, affixes, modifiers and etc. This is more often the case for rare words. In some field such as chemistry and agglutinative languages such as Turkish, there exists a systematic way of composing words from morphemes. Some can even be arbitrarily long.
Apart from the explicit and systematic way of making words, we can also observe the ability of a language speaker to infer the meaning of an unseen word. For instance, one can guess that "pre-EMNLP" means "before EMNLP", even without the presence of any context, suggesting that it is part of our implicit linguistic knowledge to infer meaning of an unseen word solely from its lexical form. This observation, together with the morpheme decomposition of many rare words, implies the feasibility of inferring their vectors from those for common words, and also raises the algorithmic question of how to compute them efficiently.
Second, there are many NLP applications where estimating word embeddings of OOV is critical. For instance, in the case of analyzing Twitter data, while there exists pre-trained word embeddings with giant vocabularies trained on massive number of tweets, such as GloVe vectors (Pennington et al., 2014), this would still not cover new words coined by users everyday. In such cases, it is more prudent to extend the available pre-trained vectors trained on very large corpora, so that we can estimate embeddings for OOV words, instead of retraining a new word / subword level embedding model on the new extended data corpus.
OOV words have always been a problem for methods that assume fixed vocabularies. A common workaround is to view all OOV words as a special UNK token and use the same vector for all of them. This would restrict any downstream models from accessing distinct features of those words. Thus, we would like a method to provide vectors that capture semantic and grammatical fea-tures even for OOV words. We also would like such method to maximally rely on the word itself, instead of its context, as contextual information is already used later with sentence level models stacking over word vectors.
To achieve this, we aim to build a word embedding model that generalizes pre-trained word embeddings to OOV words. First, given word embeddings for a fixed vocabulary, our model learns the relationship between the subwords present in each word and its corresponding pre-trained word vector. Then, using the learned subword information, our model can generate word embeddings for any word, regardless if it is OOV or not.
Contribution We propose a simple yet effective subword-level word embedding method that can be efficiently trained given pre-trained word vectors for a limited number of words. Once trained, our embedding model takes the characters n-grams in a word as input and gives its word vector as output. 1 Our experiments on word similarity tasks in English and POS tagging in a variety of languages suggests that the proposed word embedder is able to mimic and generalize consistently the word vectors from in-vocabulary words to out-of-vocabulary words, and achieves state-ofthe-art scores for the tasks compared to previous subword-level word embedders trained under the same setting. This gives evidence that such a simple model is capable of capturing language speaker's morphological knowledge, and also provides an easy way to generate word vectors for rare or unseen (OOV) words with potential application to various natural language processing tasks.
Related work There exist a large body of works that try to incorporate morphological information into word representations, e.g., (Alexandrescu and Kirchhoff, 2006;Luong et al., 2013a;Qiu et al., 2014;Botha and Blunsom, 2014;Cotterell and Schütze, 2015;Soricut and Och, 2015). These approaches typically rely on the morphological decomposition of words. Some other approaches using subword information do not rely on morphological decomposition but requires context information from large text corpus (Schütze, 1993;Santos and Zadrozny, 2014;Ling et al., 2015;Wieting et al., 2016).
In particular, Bojanowski et al. (2017) introduced fastText, a word embedding method enhanced with subword (character n-gram) embeddings. They are able to generate vectors for OOV words, which has been shown useful for text classification (Joulin et al., 2016), but the model is to be trained over large text corpus. Pinter et al. (2017) use a character-level bidirectional LSTM model called MIMICK, mapping from word strings to word vectors. The idea of using character-level recurrent neural networks (RNNs) for word vectors is not new (Ling et al., 2015;Plank et al., 2016), but as per authors' knowledge, they are by far the only attempt to the exact task of generalizing word vectors from only pre-trained vectors with a fixed vocabulary, i.e. with no access to contextual information.

Bag-of-Substring Model
Our Bag-of-Substring (BoS) word vector generation model views a word as a bag of its substrings, or character n-grams. Specifically, we maintain a vector lookup table for each possible substrings (or character n-grams) of length between l min and l max . A word vector is then formed as the average of vectors of all its substrings with lengths in the range. Let Σ be the finite set of characters in the language, subs b a (s) = {t is substring of s : a ≤ |t| ≤ b} for string s ∈ Σ * be the set of substrings of s whose length is between a and b inclusive, and <s> be the concatenation of character <, string s and character > where <, > ∈ Σ. The BoS embedding for a string/word s can be expressed as where V ∈ R d×(|Σ| l min +···+|Σ| lmax ) are the parameters which stores the embeddings of dimension d for each possible substring of length between l min and l max , v t is the vector in V indexed by t, S <s> is a shorthand for subs lmax l min (<s>). Special characters <, > ∈ Σ are used to mark the start and the end of the word and thus help the model to distinguish homographic morphemes that occur at different word parts, e.g. prefixes or suffixes. An example BoS representation for word infix is subs 4 3 (<infix>) = {<in, <inf, inf, infi, nfi, nfix, fix, fix>, ix>}.
fastText (Bojanowski et al., 2017) uses the same idea for their word vector generation part. How-ever, unlike them, we train the model directly towards pre-trained vectors, instead of via context prediction over text corpora.
Training Given pre-trained vectors for a set of common words, our model views them as targets and is trained to fit these targets. Once the parameters (the vectors v t for the substrings) are learned, the model can then be used to infer vectors for rare words. Let U ∈ R d×|W | be the target vectors of the same dimension d over finite vocabulary W ⊂ Σ * . Our model is trained by minimizing the overall loss between the generated and the given vectors for each word: where the loss function l(v, u) = 1 2 v − u 2 2 , namely the mean squared loss.
After training, one can use the learned V and Eqn (1) to compute the vector for any given word, even if it is OOV.
Hyperparameters We set the following hyperparameters for all the experiments. For BoS model, l min = 3 and l max = 6 following Bojanowski et al. (2017). Note that under this setting, S <s> can never be empty for non-empty string s. For optimization, stochastic gradient descent with learning rate 1 for 100 epochs. The dimension of the word vectors is not a hyperparameters here as it needs to agree with the target vector.

Word Similarity
We run experiments to quantitatively evaluate the our model's generalizability towards OOV words.
The word similarity task asks to predict word similarity between a pair of two words. Given a set of pairs of words and gold labels for their similarities, the performance of word embeddings is measured by the correlation between the gold similarities and the similarities induced by the generated embeddings. And we can thus imply how good our model is at generating word vectors. The word similarity here is computed using the cosine distance between the two word vectors, and the correlation is computed using Spearman's ρ.   so we use it to access our model's ability to generalize word embeddings to OOV words. WS is composed of mostly common words and we use it to test if our subword-level models successfully mimic the target vectors.

Target vectors
We train our BoS model over the English Polyglot vectors 2 to establish a direct comparison with results from MIMICK (Pinter et al., 2017), and as well as the Google word2vec vectors 3 which are popularly used in NLP tasks. Polyglot (Al-Rfou et al., 2013) is a multilingual NLP dataset, which also provides pre-trained word vectors over each language's corpus with a vocabulary of 100,000 most frequent words. For Google vectors, most of their vocabulary consists of nonwords such as URLs and phrases, so we normalize tokens into ASCII characters by taking off all the diacritics and take only tokens consisting of a single word with all lower letters. Statistics of the processed vectors are summarized in Table 1, along with their word similarity task scores (for in-vocabulary words only) and OOV rate over the aforementioned evaluation sets.
Baselines We compare the scores with other subword-level models (fastText and MIMICK) and word similarity induced by non-parametric edit distance (EditDist). fastText (Bojanowski et al., 2017) uses the same subword-level character n-gram model but is to be trained via context prediction over large text corpora (here English Wikipedia dump 4 ). MIM-ICK (Pinter et al., 2017) is a character-level bidirectional LSTM word embedder trained against pre-trained word vectors (here Polyglot vectors 5 ).
Edit distance is defined between two strings as the smallest number of modifications: adding, deleting and changing one character, needed to turn one string into the other. It can be computed using dynamic programming in O(|s 1 | × |s 2 |) time. The word similarity between w 1 and w 2 here is the edit distance normalized by the length of the longer word: where d edit is edit distance.
Results Results are summarized in Table 2. When trained over Polyglot vectors, our BoS model works better than EditDist and MIMICK. When trained on Google vectors, the correlation scores are almost as good as those of fastText, the state-of-the-art subword level word embedder. However, unlike fastText, our model does not have access to word contexts in a large text corpus for training. In both cases, the significant differences of scores compared to those of EditDist, suggest that our model indeed learns to capture semantic similarities between words, rather than superficial similarities in spelling.
Comparing to MIMICK, our model is able to fill up 81% (14 to 36 against 41) and 73% (12 to 36 against 45) of the gaps in scores over RW and WS respectively. This improvement is more significant on RW with most (58%) of its words are OOV for the PloyGlot vectors, suggesting our model's power in generating consistent word vectors for OOV words. Surprisingly MIMICK performs no better than the edit distance baseline when evaluated on RW. Combined with the fact that it does no better for WS which has a near-zero OOV rate, it suggests MIMICK's limited power of generalizing word vectors towards OOV words, or even reproduce consistent word vector for in-vocabulary words. As a sanity check, we see that all of the embedder models scores obviously better than Ed-itDist when evaluated over common words (WS), showing that all of them are able to at least remember or mimic the word vectors for in-vocabulary words.
Also note that our model is fast to train. With a naive single-thread CPU-only Python implemen-5 https://github.com/yuvalpinter/Mimick tation, it can finish 100 epochs of training over English PolyGlot vectors within 352 seconds on a machine with an Intel Core i7-6700 (3.4 GHz) CPU, 32GB memory and 1TB SSD. Compared to fastText which, with a fast multithread C++ implementation, takes hours to be trained over giga bytes of text corpus, our method provides a cheap way to generalize reasonably good word vectors for OOV words.

Joint Prediction of Part-of-Speech Tags and Morphosyntactic Attributes
Besides word similarity, we try to access our embedders' ability of capturing words' syntactic and semantic features by evaluating with the task of predicting part-of-speech (POS) tags and morphosyntactic attributes for words in a sentence. For each word in a given sentence, the task asks for a POS tag and a label for each applicable morphosyntactic category, such as gender, case or tense.
Dataset We use Universal Dependencies (UD) dataset (Petrov et al., 2012) for this task. UD is an open-community effort to build consistent annotated treebank cross many languages. We pick the specific version 1.4 to enable a direct comparison with Pinter et al. (2017). Since we use Poly-Glot vectors to train our word embedders, we conduct experiments on the 23 languages that appear in both Polyglot and UD 1.4.
Model We adopt the same sentence-level bidirectional LSTM model from Pinter et al. (2017) for the joint prediction of both labels. Given a sentence as a sequence of words, we first embed each word using the word embedder we choose and then fed the embeddings into the LSTM. The output of LSTM is then used to predict POS and morphosyntactic tags. We emphasize the difference in the setting that we fix the word embeddings during the training, as to better evaluate the ability and consistency of the embeddings in capturing words' semantics and syntactics, rather than LSTM's ability to memorize words and infer the role of words from their context.
We use the same set of hyperparameters for the LSTM model as Sec. 5.3 in Pinter et al. (2017) and train the model for 20 epochs for each language. The BoS and MIMICK word embedders  are trained beforehand with PolyGlot dataset using the same way described earlier.

Results
The POS tagging accuracies and micro F1 scores for morphosyntactic attributes are reported in Table 3 with word vectors generated by different models. The BoS and MIMICK model here are trained against Polyglot vectors. As a comparison, we include the results using random word vectors of the same dimension (64). Our BoS model shows steady and significant gain compared to MIMICK embeddings for both tasks in all languages. We especially observe the greatest margins for agglutinative languages such as Turkish and Indonesian, and in Germanic languages English, Swedish and Danish, suggesting that our model learns stable representations for morphemes to consistent word type signal.

Conclusion
We proposed a subword-level word embedding model and a word vector generalization method that enables extending pre-trained word embeddings with fixed size vocabularies to estimate word embeddings for out-of-vocabulary words. Intrinsic evaluation on word similarity tasks and extrinsic evaluation on POS tagging task demonstrate that our model captures morphological knowledge and generates good estimates of word vectors for OOV words.