Subword-level Composition Functions for Learning Word Embeddings

Subword-level information is crucial for capturing the meaning and morphology of words, especially for out-of-vocabulary entries. We propose CNN- and RNN-based subword-level composition functions for learning word embeddings, and systematically compare them with popular word-level and subword-level models (Skip-Gram and FastText). Additionally, we propose a hybrid training scheme in which a pure subword-level model is trained jointly with a conventional word-level embedding model based on lookup tables. This increases the fitness of all types of subword-level word embeddings; the word-level embeddings can be discarded after training, leaving only compact subword-level representations with a much smaller data volume. We evaluate these embeddings on a set of intrinsic and extrinsic tasks, showing that subword-level models have an advantage on tasks related to morphology and on datasets with a high OOV rate, and can be combined with other types of embeddings.


Introduction
Word embeddings are used in many natural language processing tasks (Collobert et al., 2011; Kim, 2014). In word embedding models, words are mapped or "embedded" into low-dimensional real-valued vectors. Such mapping is based, implicitly or explicitly, on word co-occurrence statistics (Levy and Goldberg, 2014b).
Naturally, frequent words provide a better representation of their distributional properties; thus the quality of word embeddings is in direct relation to word frequency (Drozd et al., 2015). However, even in large corpora, most words occur very few times. For example, Baroni (2009) shows that words occurring three times or fewer constitute almost 70% of the vocabulary. Consequently, most of the in-vocabulary words (for a given task/corpus) have to be discarded or embedded into low-quality vectors. Therefore, word-level models suffer from data sparsity. Another issue with word-level models is that they do not make use of morphological information: different forms of the same word are treated as completely unrelated entities. For example, as shown in Section 4.2, we find that the words "physicist" and "physicists" are not close to each other in the well-known Skip-Gram word embedding model (Mikolov et al., 2013).
These two issues are addressed by the emerging methodology of subword-level representations, as discussed in Section 2. The most notable example of such representations is FastText (Bojanowski et al., 2017). It represents each word as a bag of character n-grams. Representations for character n-grams, once they are learned, can be combined (via simple summation) to represent out-of-vocabulary (OOV) words.
This paper contributes to the discussion of composition functions for constructing subword-level embeddings and their evaluation. We propose and evaluate several models (including convolutional and recurrent neural networks) that can embed arbitrary character sequences into vectors. Our models do not rely on any external resources. We also propose a hybrid training scheme that integrates these neural networks directly into the Skip-Gram model. We train two sets of word embeddings simultaneously: one from a lookup table, as in traditional Skip-Gram, and another from a convolutional or recurrent neural network. The former is better at capturing semantic similarity; the latter is more focused on morphology and can learn embeddings for OOV words. We conduct experiments on five tasks, and compare our models with the original Skip-Gram and the state-of-the-art FastText.

Morphology-based Models
Morphology has long been considered an important feature for word representations. For example, Lazaridou et al. (2013) investigate several algebraic composition functions (e.g. addition or multiplication) for morphologically complex words, which generate better representations than traditional distributional semantic models. Luong et al. (2013) train a recursive neural network for morphological composition, and show its effectiveness on the (rare) word similarity task. Qiu et al. (2014) propose Morpheme CBOWs for word similarity and word analogy tasks, which improve on the CBOW model (Mikolov et al., 2013) by learning morphology embeddings and word embeddings simultaneously. Alexandrescu and Kirchhoff (2006) take morphological tags as features (one-hot representations) for training a language model, which reduces perplexity in rare-word language modeling scenarios.
For both language modeling and machine translation, Botha and Blunsom (2014) show with their LBL++ model the effectiveness of summing morphology vectors in the log-bilinear model (Mnih and Hinton, 2007) on six morphologically rich languages. Similarly, Morph-LBL (Cotterell and Schütze, 2015) improves on the LBL model by predicting both context words and words' morphological tags in a semi-supervised fashion, and outperforms both Word2Vec and LBL on German morphological analysis.
However, all the above models rely on prior morphological knowledge, obtained either from morphological analysis tools such as Morfessor (Creutz and Lagus, 2007), or from annotated morphology corpora such as CELEX (Baayen et al., 1995) and TIGER (Brants et al., 2004).

Subword-level Word Embeddings
Another line of work focuses on end-to-end word embedding learning based on subword-level information. FastText (Bojanowski et al., 2017) is probably the most influential and effective recent model. It represents each word as a bag of character n-grams. The models proposed in this paper are conceptually derived from FastText, i.e. we also operate on the character n-gram level and predict context words from the target word, as in the Skip-Gram approach.
Similarly to our proposed RNN, Cao and Rei (2016) train a bidirectional LSTM based on subword information. Instead of using character n-grams, their model feeds the word's prefixes and suffixes into each direction of the LSTM respectively. This model is mainly designed to solve the morphological boundary recovery task, on which it performs comparably with dedicated morphological analyzers. Pinter et al. (2017) also utilize a BiLSTM to construct word embeddings. However, their model relies on pre-trained word embeddings, which it mimics by minimizing the squared Euclidean distance. In contrast, our proposed RNN requires only a plain text corpus.

Other Subword-level Models
There are also various task-specific NLP models that utilize character-level information for training deep neural networks in an end-to-end fashion. They often surpass word-level baselines on language modeling (Mikolov et al., 2012; Sperr et al., 2013; Bojanowski et al., 2015; Kim et al., 2016), part-of-speech tagging (Ling et al., 2015; dos Santos and Zadrozny, 2014), text classification (Zhang et al., 2015), machine translation (Sennrich et al., 2016; Luong and Manning, 2016), etc. However, these models do not produce representations that could be reused in other tasks.

Skip-Gram
Due to its popularity, simplicity, and state-of-the-art performance on a range of linguistic tasks, Skip-Gram (Mikolov et al., 2013) has been widely used as a baseline in the word embedding literature (Levy and Goldberg, 2014a; Faruqui et al., 2015; Bojanowski et al., 2017). In particular, we use Skip-Gram with the negative sampling technique (Figure 1-a). For a vocabulary V of size |V|, Skip-Gram learns two sets of vectors W, C ∈ R^{|V|×N}, namely word vectors and contextual word vectors, where N is the dimension of the vectors. Given a training corpus, Skip-Gram iterates through all words w and their contexts c, and maximizes the objective function log p(c|w), which is defined as:

log p(c|w) = log σ(w · c) + Σ_{k=1}^{K} log σ(−w · n_k),   n_k ∼ N_{w,c}

where σ is the sigmoid function, w ∈ W and c ∈ C are the vectors for word w and context c respectively, K is the negative sampling size, and n_k is a negative example sampled from the vocabulary V according to the negative sampling distribution N_{w,c}. The negative sampling probability is empirically defined as the unigram probability of a word raised to the power of 3/4.
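The negative sampling objective above can be sketched in a few lines of numpy. This is a minimal illustration with toy dimensions and random vectors, not the paper's Chainer implementation; `sgns_objective` and the vector names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w_vec, c_vec, negative_vecs):
    """Skip-Gram negative-sampling objective for one (word, context) pair.

    w_vec:         target word vector, shape (N,)
    c_vec:         context word vector, shape (N,)
    negative_vecs: K sampled negative context vectors, shape (K, N)
    Returns the log-likelihood term that training maximizes.
    """
    positive = np.log(sigmoid(w_vec @ c_vec))
    negative = np.sum(np.log(sigmoid(-negative_vecs @ w_vec)))
    return positive + negative

# Toy usage with random 10-dimensional vectors and K = 5 negatives
rng = np.random.default_rng(0)
w, c = rng.normal(size=10), rng.normal(size=10)
negs = rng.normal(size=(5, 10))
loss = -sgns_objective(w, c, negs)   # the quantity minimized by gradient descent
```

In practice only the rows of W and C touched by the current pair and its negatives receive gradient updates, which is what makes negative sampling efficient.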

Utilizing Subword Information
In order to make use of subword information, we first generalize the objective function of Skip-Gram by replacing the word vector w with a composition function f(w), which takes the word w as input and outputs a vector of length N. The objective function of generalized Skip-Gram is thus defined as:

log p(c|w) = log σ(f(w) · c) + Σ_{k=1}^{K} log σ(−f(w) · n_k)

In the original Skip-Gram model, the function f(w) is simply a lookup table, which projects the word w to its corresponding vector w in the table. Depending on its definition, f(w) could also take the subword information of w into consideration. Naturally, richer composition functions, especially neural networks, can be considered.
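The point of the generalization is that f(w) becomes a pluggable component. A minimal sketch, with hypothetical names: any callable that maps a word to an N-dimensional vector can be dropped into the objective, and a plain dictionary lookup recovers the original Skip-Gram.

```python
import numpy as np

def generalized_sg_term(f, word, c_vec, negative_vecs):
    """One term of the generalized Skip-Gram objective: the lookup of the
    target vector is replaced by an arbitrary composition function f."""
    v = f(word)                      # any callable returning an N-dim vector
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return np.log(sig(v @ c_vec)) + np.sum(np.log(sig(-negative_vecs @ v)))

# f as a plain lookup table recovers the original Skip-Gram (toy vectors)
table = {"bigger": np.array([1.0, 0.0])}
f_lookup = lambda w: table[w]
term = generalized_sg_term(f_lookup, "bigger",
                           np.array([1.0, 0.0]),      # context vector
                           np.array([[0.0, 1.0]]))    # one negative sample
```

Swapping `f_lookup` for a summation over n-gram vectors, a CNN, or an RNN yields the models discussed in the following sections.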

A Hybrid Training scheme for Subword-level Models
Intuitively, compared to Skip-Gram, subword-level models should be better suited for capturing morphology than similarity. In order to take advantage of both representations, we incorporate Skip-Gram into the subword-level models. Formally, in our hybrid training scheme we define the objective function of subword-level models as the sum of the word-level and subword-level terms:

log σ(w · c) + log σ(f(w) · c) + Σ_{k=1}^{K} [log σ(−w · n_k) + log σ(−f(w) · n_k)]

In this case, each word has two embeddings: one from the lookup table (as in Skip-Gram), and another from the composition function f. These two types of embeddings are learned simultaneously. We denote the embedding model from the word-level lookup table as Model word, and the one from the composition function as Model subword. As a baseline we additionally train embeddings using only the subword-level composition function; these models are referred to as Model vanilla.

FastText (Summation)
Probably the simplest and most intuitive way of utilizing subword information is to sum all the vectors of the characters and character n-grams belonging to a word (Figure 1-b), an approach pioneered by Bojanowski et al. (2017). Formally, the composition function is defined as f(w) = Σ_{g∈G_w} g, where g is a character n-gram and g is its corresponding n-gram vector of length N. G_w is the set of character n-grams for the word w. For example, when n = 3, G_w for the word "bigger" is <bi, big, igg, gge, ger, er>. The angle brackets mark padding at the start and end of the word.
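The n-gram extraction and summation can be sketched directly; the helper names below are hypothetical, and `ngram_vectors` stands in for the learned n-gram lookup table.

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word with '<' and '>' padding, FastText-style."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def fasttext_compose(word, ngram_vectors, n=3):
    """f(w) = sum of the vectors of w's character n-grams.
    `ngram_vectors` maps n-gram strings to N-dimensional vectors."""
    return sum(ngram_vectors[g] for g in char_ngrams(word, n))

# The running example from the text: trigrams of "bigger"
print(char_ngrams("bigger"))   # ['<bi', 'big', 'igg', 'gge', 'ger', 'er>']
```

Because unseen words still decompose into (mostly) seen n-grams, the same summation produces embeddings for OOV words after training.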
Note that the original FastText (FastText vanilla) does not use the hybrid training scheme. As later shown in our experiments, FastText word from the hybrid training scheme works better than FastText vanilla on some semantic relatedness datasets. Moreover, the hybrid training scheme is essential for the other types of composition functions.
For a fair comparison, the following models use the same padding and the same character n-gram vector length.

Convolutional Neural Network
Despite its simplicity and efficiency, there is no clear evidence that simple summation, as in FastText, is the best choice for composing subword information.
This paper investigates two neural networks. We first consider the Convolutional Neural Network (CNN) as the composition function f(w). CNNs (LeCun et al., 1998) are able to capture local features automatically, and have been applied to a wide range of NLP tasks (Kim et al., 2016; Zhang et al., 2015; Luong and Manning, 2016).
The CNN architecture introduced in this paper is inspired by the model used for language modeling by Kim et al. (2016). As illustrated in Figure 1-c, similarly to FastText, the vectors of the characters are first extracted from a lookup table. Those vectors form a matrix of size N * L, where L is the number of characters. 1D convolution filters are used to extract local features: we apply 1D convolutions of width ranging from 1 to 7 in parallel, perform max-pooling over positions, and concatenate the outputs. Each of the convolutions uses 200 filters. The output of the model is a fully connected layer with the number of units corresponding to the desired embedding size. The resulting vector is used to predict the contextual words with negative sampling, as in Skip-Gram.
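The forward pass of this composition can be sketched in numpy. This is a shape-level illustration only: the weights are random stand-ins for learned parameters, and the toy sizes below (widths 1-3, 5 filters, 10-dim output) replace the paper's widths 1-7, 200 filters, and 300-dim output.

```python
import numpy as np

def cnn_compose(char_matrix, filters, proj):
    """CNN composition sketch for one word (cf. Figure 1-c).

    char_matrix: (L, N_char) character vectors for a word of L characters
    filters:     {width: (n_filters, width * N_char)} convolution weights
    proj:        (embed_dim, n_filters * len(filters)) fully connected layer
    """
    pooled = []
    for width, W in filters.items():
        # 1D convolution: apply each filter to every window of `width` chars
        windows = np.stack([char_matrix[i:i + width].ravel()
                            for i in range(char_matrix.shape[0] - width + 1)])
        pooled.append((windows @ W.T).max(axis=0))   # max-pool over positions
    return proj @ np.concatenate(pooled)             # final word embedding

# Toy usage: a 6-character word with 8-dim character vectors
rng = np.random.default_rng(0)
L, n_char, n_filters = 6, 8, 5
filters = {w: rng.normal(size=(n_filters, w * n_char)) for w in (1, 2, 3)}
proj = rng.normal(size=(10, n_filters * 3))
vec = cnn_compose(rng.normal(size=(L, n_char)), filters, proj)
```

Max-pooling over positions is what makes the output length-independent, so the same network embeds words of any length.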

Recurrent Neural Network
Another neural network worth considering is the Recurrent Neural Network (RNN). It takes a sequence of arbitrary length as input, and outputs a vector that represents this sequence. Among the different variations, the Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) and its bidirectional version (Schuster and Paliwal, 1997) are easier to train and better capture long-distance information. As illustrated in Figure 1-d, for each direction, an LSTM runs over all the character vectors of the word w. The hidden states at each position are then summed together, and the resulting vector is fed into a fully connected layer to form the final vector w. We empirically set the hidden layer size of the LSTM to N * 2.
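The bidirectional composition can be sketched as follows. All weights are random stand-ins for learned parameters, the dimensions are toy-sized, and the function names are hypothetical; the point is only the data flow of Figure 1-d (run each direction, sum the per-position hidden states, project).

```python
import numpy as np

def lstm_states(char_vecs, Wx, Wh, b, hidden):
    """Hidden state at every position of a single-direction LSTM."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x in char_vecs:
        z = Wx @ x + Wh @ h + b                      # stacked gates: i, f, o, g
        i, f, o = (1.0 / (1.0 + np.exp(-z[k * hidden:(k + 1) * hidden]))
                   for k in range(3))
        g = np.tanh(z[3 * hidden:])
        c = f * c + i * g
        h = o * np.tanh(c)
        states.append(h)
    return states

def bilstm_compose(char_vecs, fwd, bwd, proj):
    """Sum forward and backward hidden states at each position, sum over
    positions, then project to the final word vector (cf. Figure 1-d)."""
    back = lstm_states(char_vecs[::-1], *bwd)[::-1]  # align with forward order
    summed = [hf + hb for hf, hb in zip(lstm_states(char_vecs, *fwd), back)]
    return proj @ np.sum(summed, axis=0)

# Toy usage: a 5-character word, 4-dim characters, hidden size 6
rng = np.random.default_rng(1)
hidden, n_char = 6, 4
weights = lambda: (rng.normal(size=(4 * hidden, n_char)),
                   rng.normal(size=(4 * hidden, hidden)),
                   rng.normal(size=4 * hidden), hidden)
chars = [rng.normal(size=n_char) for _ in range(5)]
vec = bilstm_compose(chars, weights(), weights(), rng.normal(size=(8, hidden)))
```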

Implementation Details
We implemented all the models described in Section 3 using the Chainer deep learning framework (Tokui et al., 2015). Since the proposed CNN and RNN architectures significantly increase the computational requirements for training, we choose the relatively small TEXT8 corpus [1] for this evaluation. This corpus contains the first 10^8 bytes of the English Wikipedia dump from Mar. 3, 2006. The word embedding size N is set to 300, the batch size to 1000, the negative sampling size to 5, and the window size to 2. Following Chainer's original word2vec implementation, we use Adam (Kingma and Ba, 2014) as the optimizer. Words which appear fewer than five times are discarded, which results in a vocabulary of size 71,290. For character n-grams, we follow FastText's best configuration and use 5-grams. We also discard character n-grams which appear fewer than five times, which results in a character n-gram vocabulary of size 143,207. Word embedding models are trained for 5 epochs on Nvidia Tesla K80 or P100 GPUs. We also download the state-of-the-art FastText embeddings (denoted as FastText external) [2], which are trained on the much larger full Wikipedia corpus. Note that since CNN and RNN would require approximately 45 and 65 days of training on a K80, we did not train them on this corpus.
For a fair comparison, we ensure that all embeddings used in the evaluations have exactly the same vocabulary. Unlike most benchmarks, where only the embeddings of the encountered words affect the resulting accuracy, the analogical reasoning benchmark is sensitive to the entire vocabulary, both its size and the embeddings of individual words. For example, it is hard to make a mistake when looking for "Paris" as the pair for "France" if the whole vocabulary contains only these two words. Furthermore, accuracy depends on how close the target word is to the source words (Rogers et al., 2017), which can also be affected by a larger vocabulary.

[1] http://mattmahoney.net/dc/textdata.html
[2] https://github.com/facebookresearch/fastText
This issue is especially pronounced for embeddings with a dynamic vocabulary, such as the subword-level models evaluated in this study. In our pilot experiments, models with a large vocabulary, such as FastText external, result in poor performance on word analogy tasks, since the large vocabulary increases the number of options. Note that the sizes of these models are different, as shown in Table 2. Due to the large number of character n-grams, FastText requires the most data among these models, while CNN needs only a few megabytes.

Qualitative Analysis
Before looking into the performance of the models on specific tasks, we first conduct a qualitative analysis. We choose several target words and analyze their nearest neighbors in Skip-Gram, FastText subword, CNN subword, and RNN subword. We find that subword-level models, especially the CNN- and RNN-based ones, tend to cluster words with the same morpheme together. Taking the target word "physicists" as an example (Table 1), the word "physicist" is within the top-10 nearest neighbors in the subword-level models, but not in Skip-Gram. Moreover, subword models tend to cluster words with the same morphological form (affix) together, especially RNN and CNN.

Table 3: Results on word similarity and word analogy datasets. For the hybrid training scheme, we denote the embeddings that come from the word vector lookup table as "Model word", and the embeddings that come from the composition function as "Model subword". We denote the vanilla (non-hybrid) models as "Model vanilla". "FastText external" denotes the publicly available FastText embeddings, which are trained on the full Wikipedia corpus. We also test versions where OOV words are expanded, denoted as "Model +OOV". Model combinations are denoted as gray rows, and the best results among them are marked in bold. The Rare Words dataset (blue column) has a 43.3% OOV rate, while the other word similarity datasets have at most a 4.6% OOV rate. Morphology-related categories are denoted as almond columns. The model-combination results for the word similarity task are simply the averages of the results of the single models, and are not listed in this table.

Word Similarity
The word similarity task aims at producing semantic similarity scores for word pairs, which are compared with human scores using Spearman's correlation. Cosine similarity is used to generate the similarity score between two word vectors. In order to test the effectiveness of capturing similarity for rare words, we choose the Rare Words dataset (Luong et al., 2013). For a systematic comparison, we also test our models on the WordSim353 (WS) dataset (Finkelstein et al., 2001), divided into similarity (sem.) and relatedness (rel.) categories (Zesch et al., 2008; Agirre et al., 2009), the Sim 999 dataset (Hill et al., 2016), the MEN dataset (Bruni et al., 2012), and the Mech Turk dataset (Radinsky et al., 2011). Table 3 shows that on the word similarity tasks the FastText models perform the best on all datasets except Sim 999. CNN subword and RNN subword are more focused on word morphology, and thus do not perform as well on word similarity.
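The evaluation pipeline described above reduces to two small computations. A minimal sketch (the `spearman` helper below ignores tied ranks, unlike library implementations):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def spearman(x, y):
    """Spearman's rank correlation (no tie handling) between model
    similarity scores x and human scores y."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a))
        return r
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    n = len(rx)
    return 1 - 6 * np.sum((rx - ry) ** 2) / (n * (n ** 2 - 1))
```

For each dataset, `cosine` is computed per word pair and `spearman` compares the resulting scores against the human judgments.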
However, compared to Skip-Gram, CNN word and RNN word (the versions with a word vector lookup table) achieve comparable or even better results. The word vector lookup tables in these models are affected by the composition function, which results in better performance. Note that 43.3% of the words in the Rare Words dataset are OOV. On this dataset, the vocabulary-expanded models (FastText subword+OOV, CNN subword+OOV, and RNN subword+OOV) perform substantially better than the others. This highlights the necessity of expanding the vocabulary and the effectiveness of subword-level models.

Word Analogy
The word analogy task aims at answering questions generalized as "a is to a' as b is to ?", such as "London is to Britain as Tokyo is to Japan". We follow the evaluation protocol of Drozd et al. (2016), who proposed the LRCos method of solving word analogies, which significantly improves on the traditional vector offset method. We use the Google analogy dataset (Mikolov et al., 2013) along with the much bigger and more balanced BATS analogy dataset (Gladkova et al., 2016).
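For reference, the traditional vector offset baseline that LRCos improves on can be sketched as follows (a hedged illustration with a toy vocabulary; LRCos itself instead trains a classifier for the target word class):

```python
import numpy as np

def solve_analogy(a, a_prime, b, vocab):
    """Vector-offset (3CosAdd) baseline: answer "a : a' :: b : ?" with the
    vocabulary word whose vector is closest to b + (a' - a) in cosine
    similarity, excluding the three question words."""
    target = vocab[b] + vocab[a_prime] - vocab[a]
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vocab if w not in (a, a_prime, b)]
    return max(candidates, key=lambda w: cos(vocab[w], target))

# Toy vocabulary arranged so the offset points at "japan"
vocab = {"london": np.array([1.0, 0.0]), "britain": np.array([1.0, 1.0]),
         "tokyo": np.array([0.0, 1.0]), "japan": np.array([0.4, 1.4]),
         "paris": np.array([5.0, 0.0])}
answer = solve_analogy("london", "britain", "tokyo", vocab)
```

This also illustrates why the benchmark is vocabulary-sensitive: every added word enlarges the candidate set over which the `max` is taken.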
On the word analogy datasets (Table 3), the inflectional and derivational morphology categories demonstrate the effectiveness of subword-level models. This is especially obvious in the derivational morphology category, where Skip-Gram only achieves 9.6% accuracy while the subword-level models achieve at least 57.8% accuracy (excluding the lookup table versions). Furthermore, when the vocabulary is expanded, the minimum accuracy of the subword-level models reaches 82.4%.
Morphology-related analogy questions also benefit considerably from model combination. Notably, Concat subword+OOV achieves accuracies of 96.2% and 96.0% on inflectional and derivational morphology, by far the highest on these two categories. We also observe that CNN is less sensitive to semantic word analogies, while performing the best on derivational word analogies.

Affix Prediction
In this section we test the ability of subword-level embeddings to predict which affix is present in a morphologically complex word. We use the dataset gathered by Lazaridou et al. (2013), which contains 6,549 stem and derived word pairs, such as "name"-"rename" and "sparse"-"sparsity". There are 18 affixes, such as "re-" and "-ity", and the task is to predict which one is present in a given word. We use the embeddings of the derived words as input, and feed them to a logistic regression classifier that predicts their affixes. Accuracy, recall, and F1-score are used as metrics. We also follow Lazaridou et al. (2013) in using the default training/test data split. Figure 3 shows a t-SNE projection of the words with different affixes in the dataset. It is clear that both CNN and RNN are able to distinguish different derivation types, with an advantage for the former. This also confirms the good performance of CNN on the derivational analogy task. Note that FastText does not fare much better than Skip-Gram, although it is a subword-level model. This partially explains its low accuracy compared to the other subword-level models on the morphological analogy categories.
The prediction results in Table 4 reflect the cluster visualization in Figure 3. Moreover, as in the word analogy task, the concatenation (especially the expanded-vocabulary version) performs best among all models.

Sequence Labeling
The sequence labeling task consists of assigning labels to the elements of a text. We evaluate the word embedding models on Part-of-Speech tagging (POS), Chunking, and Named Entity Recognition (NER) tasks. Following the evaluation protocol used in Kiros et al. (2015), we restrict the prediction model to a logistic regression classifier. The classifier's input for predicting the label of word w i is simply the concatenation of the word vectors w i−2, w i−1, w i, w i+1, w i+2. This ensures that the quality of the embedding models is directly evaluated, and that their strengths and weaknesses are easily observed. Subword-level models on sequence labeling tasks clearly demonstrate the effectiveness of expanding to OOV words: as shown in Table 4, expanding the vocabulary boosts performance by a large margin.
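The window-feature construction described above can be sketched directly (the function name and the zero-padding choice for positions beyond the sentence boundary are assumptions of this sketch):

```python
import numpy as np

def window_features(word_vectors, i, pad):
    """Classifier input for labeling token i: the concatenation of the
    embeddings of tokens i-2 .. i+2. `pad` is substituted for positions
    falling outside the sentence."""
    seq = [word_vectors[j] if 0 <= j < len(word_vectors) else pad
           for j in range(i - 2, i + 3)]
    return np.concatenate(seq)

# Toy sentence of 4 tokens with 3-dim embeddings
vecs = [np.full(3, float(i)) for i in range(4)]
feat = window_features(vecs, 1, np.zeros(3))   # shape (15,) = 5 tokens * 3 dims
```

The resulting fixed-size vector is what the logistic regression classifier consumes, so the evaluation isolates embedding quality from model capacity.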

Text Classification
For the text classification task, we choose the movie review sentiment (MR) (Pang and Lee, 2005), customer product reviews (CR) (Nakagawa et al., 2010), subjectivity/objectivity classification (SUBJ) (Pang and Lee, 2004), and IMDB movie review (IMDB) (Maas et al., 2011) datasets. The classification is performed by a logistic regression classifier. The input of this classifier is the sum of the word embeddings that belong to the text.
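The bag-of-embeddings input can be sketched in a few lines (the function name is hypothetical, and skipping OOV tokens is an assumption of this sketch; with the vocabulary-expanded subword models, OOV tokens would instead receive composed vectors):

```python
import numpy as np

def text_vector(tokens, embeddings, dim=300):
    """Sum-of-embeddings representation of a text: add up the vectors of
    all in-vocabulary tokens; out-of-vocabulary tokens are skipped."""
    vec = np.zeros(dim)
    for t in tokens:
        if t in embeddings:
            vec += embeddings[t]
    return vec
```

This representation discards word order entirely, which is one reason the choice of input embeddings has limited impact on these benchmarks.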
As shown in Table 5, the input word embeddings do not considerably affect the final accuracy. This is especially obvious when comparing the subword and subword+OOV models. It is hard to draw insightful conclusions from this experiment. This is in line with previous observations that Skip-Gram, CBOW, and GloVe trained with different context types perform similarly on the text classification task.

Discussion
The evaluation showed that despite being trained on a relatively small corpus, the CNN- and RNN-based (model-based) approaches outperform conventional word-level embeddings as well as FastText embeddings, which are claimed to better capture morphological information (Bojanowski et al., 2017). In some cases, the performance of the model-based embeddings is even higher than that of the FastText embeddings trained on a much larger corpus.
Moreover, such performance is achieved with very compact representations: it is possible to generate an embedding for any given word on demand using only the network weights, which require just a few megabytes of data (Table 2). Naturally, these compact representations do not have enough capacity to capture semantic information well. However, besides indicating a possibility for improvement over more "heavy-weight" models (like the original Skip-Gram or FastText), the model-based embeddings can be used together with other approaches to improve their sensitivity to morphological information.
One such approach is to combine embeddings from different models after training, as demonstrated in our experiments (the "concat" lines in Table 3). Simple concatenation of lookup-table-based and model-based embeddings maintains high accuracy on morphology-related benchmarks, while elevating performance on semantic tasks to a comparable level.
Additionally, subword-level models can be trained jointly with models based on lookup tables, which improves their performance on different tasks. After training, either part can be used independently or jointly (e.g. by concatenation) in downstream tasks.

Conclusion
We have implemented and evaluated several types of composition functions for subword-level elements (characters and character n-grams) in the context of training word embeddings in a Skip-Gram-like model.
We have shown that morphological information can be captured efficiently by extremely compact models. Embeddings generated dynamically from just a few megabytes of parameters significantly outperform conventional (word2vec and FastText) models on morphology-related tasks. This also indicates how limited the ability of conventional models to capture morphological information is.
To model both morphological and semantic information, we propose two methods for combining the strengths of compact subword-level and lookup-table-based models: merging trained embeddings and training jointly. The resulting embeddings achieve high accuracy on a range of benchmarks and are particularly promising for datasets with a high OOV rate.
The source code of those models, along with the pretrained word embeddings, has been integrated into an open-source project, and will be publicly available.