Incorporating Subword Information into Matrix Factorization Word Embeddings

The positive effect of adding subword information to word embeddings has been demonstrated for predictive models. In this paper we investigate whether similar benefits can also be derived from incorporating subwords into counting models. We evaluate the impact of different types of subwords (n-grams and unsupervised morphemes), with results confirming the importance of subword information in learning representations of rare and out-of-vocabulary words.


Introduction
Low dimensional word representations (embeddings) have become a key component in modern NLP systems for language modeling, parsing, sentiment classification, and many others. These embeddings are usually derived by employing the distributional hypothesis: that similar words appear in similar contexts (Harris, 1954).
The models that perform the word embedding can be divided into two classes: predictive, which learn a target or context word distribution, and counting, which use a raw, weighted, or factored word-context co-occurrence matrix (Baroni et al., 2014). The most well-known predictive model, which has become eponymous with word embedding, is word2vec (Mikolov et al., 2013a). Popular counting models include PPMI-SVD (Levy et al., 2014), GloVe (Pennington et al., 2014), and LexVec (Salle et al., 2016b).
These models all learn word-level representations, which presents two main problems: 1) learned information is not explicitly shared among the representations, as each word has an independent vector; 2) there is no clear way to represent out-of-vocabulary (OOV) words.
fastText (Bojanowski et al., 2017) addresses these issues in the Skip-gram word2vec model by representing a word as the sum of a unique vector and a set of shared character n-gram vectors (hereafter simply n-grams). This addresses both issues above: learned information is shared through the n-gram vectors, and OOV word representations can be constructed from them.
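As a concrete illustration of this scheme, fastText-style character n-gram extraction (word boundaries marked with angled brackets, n-gram lengths 3 to 6) can be sketched as follows; the function name is illustrative, not fastText's API:

```python
def char_ngrams(word, nmin=3, nmax=6):
    # fastText wraps the word in boundary markers before extracting
    # all character n-grams of length nmin..nmax.
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]
```

For a short word like "cat", the marked form "&lt;cat&gt;" yields n-grams that overlap heavily with the full word, which is what lets rare and OOV words share parameters with frequent ones.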
In this paper we propose incorporating subword information into counting models using a strategy similar to fastText.
We use LexVec as the counting model as it generally outperforms PPMI-SVD and GloVe on intrinsic and extrinsic evaluations (Salle et al., 2016a; Cer et al., 2017; Wohlgenannt et al., 2017; Konkol et al., 2017), but the method proposed here should transfer to GloVe unchanged.
The LexVec objective is modified such that a word's vector is the sum of all its subword vectors.
We compare 1) the use of n-gram subwords, like fastText, and 2) unsupervised morphemes identified using Morfessor (Virpioja, 2013) to learn whether more linguistically motivated subwords offer any advantage over simple n-grams.
To evaluate the impact subword information has on in-vocabulary (IV) word representations, we run intrinsic evaluations consisting of word similarity and word analogy tasks. The incorporation of subword information results in similar gains (and losses) to those of fastText over Skip-gram. Whereas incorporating n-gram subwords tends to capture more syntactic information, unsupervised morphemes better preserve semantics while also improving syntactic results. Given that intrinsic performance can correlate poorly with performance on downstream tasks (Tsvetkov et al., 2015), we also conduct evaluation using the VecEval suite of tasks (Nayak et al., 2016), in which all subword models, including fastText, show no significant improvement over word-level models.
We verify the model's ability to represent OOV words by qualitatively evaluating nearest-neighbors. Results show that, like fastText, both LexVec n-gram and (to a lesser degree) unsupervised morpheme models give coherent answers.

Related Work
Word embeddings that leverage subword information were first introduced by Schütze (1993), who represented a word as the sum of four-gram vectors obtained by running an SVD of a four-gram to four-gram co-occurrence matrix. Our model differs in that the subword vectors and resulting word representation are learned jointly, as weighted factorization of a word-context co-occurrence matrix is performed.
There are many models that use character-level subword information to form word representations (Ling et al., 2015; Cao and Rei, 2016; Kim et al., 2016; Wieting et al., 2016; Verwimp et al., 2017), as well as fastText (the model on which we base our work). Closely related are models that use morphological segmentation in learning word representations (Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014; Mitchell and Steedman, 2015; Cotterell and Schütze, 2015; Bhatia et al., 2016). Our model also uses n-grams and morphological segmentation, but it performs explicit matrix factorization to learn subword and word representations, unlike these related models, which mostly use neural networks.
Finally, Cotterell et al. (2016) and Vulić et al. (2017) retrofit morphological information onto pre-trained models. These differ from our work in that we incorporate morphological information at training time, and in that only Cotterell et al. (2016) is able to generate embeddings for OOV words.
Subword LexVec

LexVec performs weighted factorization of the PPMI matrix:

PPMI_wc = max(0, log(M_wc M_**) - log(M_w* M_*c))    (1)

where M is the word-context co-occurrence matrix constructed by sliding a window of fixed size centered over every target word w in the subsampled (Mikolov et al., 2013a) training corpus and incrementing cell M_wc for every context word c appearing within this window (forming a (w, c) pair), M_w* and M_*c are row and column sums of M, and M_** is the total count. LexVec adjusts the PPMI matrix using context distribution smoothing (Levy et al., 2014).
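A minimal dense-matrix sketch of PPMI with context distribution smoothing, assuming the commonly used smoothing exponent alpha = 0.75 (the real model operates on sparse counts):

```python
import numpy as np

def ppmi(M, alpha=0.75):
    """PPMI of a co-occurrence matrix M, with context distribution
    smoothing: context counts are raised to alpha before normalizing."""
    M = np.asarray(M, dtype=float)
    total = M.sum()
    pw = M.sum(axis=1) / total           # P(w), row marginals
    ctx = M.sum(axis=0) ** alpha
    pc = ctx / ctx.sum()                 # smoothed P(c), column marginals
    pwc = M / total                      # joint P(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(pwc / np.outer(pw, pc))
    pmi[~np.isfinite(pmi)] = 0.0         # unseen pairs get 0, not -inf
    return np.maximum(pmi, 0.0)          # positive PMI
```

Smoothing the context distribution with alpha &lt; 1 boosts the probability mass of rare contexts, which reduces PMI's known bias toward rare events.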
With the PPMI matrix calculated, the sliding window process is repeated and the following loss functions are minimized for every observed (w, c) pair and target word w:

L_wc = 1/2 (u_w^T v_c - PPMI_wc)^2    (2)

L_w = 1/2 Σ_{i=1}^{k} E_{c_i ~ P_n(c)} [(u_w^T v_{c_i} - PPMI_{wc_i})^2]    (3)

where u_w and v_c are d-dimensional word and context vectors. The second loss function describes how, for each target word, k negative samples (Mikolov et al., 2013a) are drawn from the smoothed context unigram distribution P_n(c).

Given a set of subwords S_w for a word w, we follow fastText and replace u_w in eqs. (2) and (3) by u'_w:

u'_w = u_w + Σ_{s ∈ S_w} q_{hash(s)}    (4)

such that a word is the sum of its word vector and its d-dimensional subword vectors q_x. The number of possible subwords is very large, so the function hash(s) hashes a subword to the interval [1, buckets]. For OOV words, the word vector u_w is unavailable and the representation is built from the subword vectors alone.

We compare two types of subwords: simple n-grams (like fastText) and unsupervised morphemes. For example, given the word "cat", we mark the beginning and end with angled brackets and use all n-grams of length 3 to 6 as subwords, yielding S_cat = {"<ca", "cat", "at>", "<cat", "cat>", "<cat>"}. Morfessor (Virpioja, 2013) is used to probabilistically segment words into morphemes. The Morfessor model is trained on raw text, so it is entirely unsupervised. For the word "subsequent", we get S_subsequent = {"<sub", "sequent>"}.
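The hashed subword lookup and the composed word vector can be sketched as follows. The names are illustrative: bucket is a deterministic stand-in for the paper's unspecified hash function, and Q is assumed to be a (buckets x d) matrix of shared subword vectors:

```python
import hashlib
import numpy as np

def bucket(subword, buckets=2_000_000):
    # Stand-in for hash(s): any deterministic string hash into
    # [0, buckets) serves for this sketch.
    h = int(hashlib.md5(subword.encode("utf-8")).hexdigest(), 16)
    return h % buckets

def compose(u_w, subwords, Q):
    """u'_w: the word vector plus its shared subword vectors.
    Q is the (buckets x d) subword embedding matrix."""
    return u_w + sum(Q[bucket(s, Q.shape[0])] for s in subwords)
```

Hashing trades exactness for memory: distinct subwords may collide in a bucket and share a vector, but the table stays a fixed size regardless of how many n-grams the corpus contains.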

Materials
Our experiments aim to measure whether the incorporation of subword information into LexVec results in similar improvements as observed in moving from Skip-gram to fastText, and whether unsupervised morphemes offer any advantage over n-grams. For IV words, we perform intrinsic evaluation via word similarity and word analogy tasks, as well as downstream tasks. OOV word representation is tested through qualitative nearest-neighbor analysis.
All models are trained using a 2015 dump of Wikipedia, lowercased and using only alphanumeric characters. Vocabulary is limited to words that appear at least 100 times, for a total of 303,517 words. Morfessor is trained on this vocabulary list.
All five models (Skip-gram (SG), fastText (FT), word-level LexVec (LV), and subword LexVec using n-grams (LV-N) or unsupervised morphemes (LV-M)) are run for 5 iterations over the training corpus and generate 300-dimensional word representations. LV-N, LV-M, and FT use 2,000,000 buckets when hashing subwords.
For word similarity evaluations, we use the WordSim-353 Similarity (WS-Sim) and Relatedness (WS-Rel) (Finkelstein et al., 2001) and SimLex-999 (SimLex) (Hill et al., 2015) datasets, and the Rare Word (RW) (Luong et al., 2013) dataset to verify whether subword information improves rare word representation. Word relationships are measured using the Google semantic (GSem) and syntactic (GSyn) analogies (Mikolov et al., 2013a) and the Microsoft syntactic analogies (MSR) dataset (Mikolov et al., 2013b).
We also evaluate all five models on downstream tasks from the VecEval suite (Nayak et al., 2016), using only the tasks for which training and evaluation data is freely available: chunking, sentiment and question classification, and natural language identification (NLI). The default settings from the suite are used, but we run only the fixed setting, in which the embeddings themselves are not tunable parameters of the models, forcing the system to use only the information already in the embeddings.
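Word similarity benchmarks such as these score a model by the Spearman correlation between the model's cosine similarities and human ratings. A minimal implementation of the correlation (no tie correction, which suffices when scores are distinct):

```python
def spearman(xs, ys):
    """Spearman rho: the Pearson correlation of the rank sequences.
    Assumes no tied values (no tie correction is applied)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Rank correlation is preferred over Pearson here because human similarity ratings are ordinal: only the ordering of word pairs is meaningful, not the rating scale itself.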

Results
Results for IV evaluation are shown in table 1, and for OOV in table 2.
As in FT, the use of subword information in both LV-N and LV-M results in 1) better representation of rare words, as evidenced by the increase in RW correlation, and 2) significant improvement on the GSyn and MSR tasks, evidence of subwords encoding information about a word's syntactic function (the suffix "ly", for example, suggests an adverb).
There seems to be a trade-off between capturing semantics and syntax: in both LV-N and FT, gains on the GSyn and MSR tasks are accompanied by a decrease on the GSem tasks. Morphological segmentation in LV-M appears to favor syntax less strongly than simple n-grams do.
On the downstream tasks, we only observe statistically significant (p < .05 under a random permutation test) improvement on the chunking task, and it is a very small gain. We attribute this to both regular and subword models having very similar quality on frequent IV word representation. Statistically, these are the words most likely to appear in downstream task instances, so the superior representation of rare words has, due to their rarity, little impact on overall accuracy. Because all tasks map OOV words to the "unk" token, the subword models are not being used to the fullest, and in future work we will investigate whether generating representations for all words improves task performance.
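For reference, a paired random permutation test of the kind used for this significance check can be sketched as follows, assuming per-instance 0/1 correctness vectors for the two systems (names and setup are illustrative):

```python
import random

def permutation_test(scores_a, scores_b, n_iter=10_000, seed=0):
    """Two-sided paired permutation test: under the null hypothesis the
    two systems are exchangeable, so randomly swapping each paired score
    should produce differences at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    count = 0
    for _ in range(n_iter):
        da = db = 0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap this pair's labels
                a, b = b, a
            da += a
            db += b
        if abs(da - db) >= observed:
            count += 1
    return count / n_iter            # estimated p-value
```

The test makes no distributional assumptions, which suits accuracy comparisons over a fixed evaluation set.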
In OOV representation (table 2), LV-N and FT work almost identically, as is to be expected. Both find highly coherent neighbors for the words "hellooo", "marvelicious", and "rereread". Interestingly, the misspelling "louisana" leads to coherent name-like neighbors, although none is the expected correct spelling "louisiana". All models stumble on the made-up prefix "tuz". A possible fix would be to down-weight very rare subwords in the vector summation. LV-M is less robust than LV-N and FT on this task, as it is highly sensitive to incorrect segmentation, exemplified by the "hellooo" example.
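The OOV procedure behind table 2 can be sketched as follows, assuming hypothetical ngram_vecs and vocab_vecs dictionaries mapping strings to NumPy vectors:

```python
import numpy as np

def oov_vector(word, ngram_vecs, nmin=3, nmax=6):
    """Build an OOV word vector by summing the vectors of its known
    character n-grams (assumes at least one n-gram is in ngram_vecs)."""
    w = "<" + word + ">"
    grams = [w[i:i + n]
             for n in range(nmin, nmax + 1)
             for i in range(len(w) - n + 1)]
    vecs = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    return np.sum(vecs, axis=0)

def nearest(v, vocab_vecs, topn=5):
    """Cosine nearest neighbors among in-vocabulary word vectors."""
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in vocab_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:topn]
```

Down-weighting very rare subwords, as suggested above, would amount to scaling each term of the sum in oov_vector before adding it.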
Finally, we see that nearest-neighbors are a mixture of similarly prefixed and suffixed words. If these affixes are semantic, the neighbors are semantically related; if syntactic, they have similar syntactic function. This suggests that it should be possible to obtain tunable representations, driven more by semantics or by syntax, through a weighted summation of subword vectors, provided we can identify whether an affix is semantic or syntactic in nature and weight it accordingly. This might be possible without supervision using corpus statistics, since syntactic subwords are likely to be more frequent and could thus be down-weighted for more semantic representations. This is something we will pursue in future work.

Conclusion and Future Work
In this paper, we incorporated subword information (simple n-grams and unsupervised morphemes) into the LexVec word embedding model and evaluated its impact on the resulting IV and OOV word vectors. Like fastText, subword LexVec learns better representations for rare words than its word-level counterpart. All models generated coherent representations for OOV words, with simple n-grams demonstrating more robustness than unsupervised morphemes. In future work, we will verify whether using OOV representations in downstream tasks improves performance. We will also explore the trade-off between semantics and syntax when subword information is used.

Table 2:
We generate vectors for OOV words using subword information and search for the nearest (cosine distance) words in the embedding space. The LV-M segmentation for each word is: {hell, o, o, o}, {marvel, i, cious}, {louis, ana}, {re, re, read}, {tu, z, read}. We omit the LV-N and FT n-grams as they are trivial and too numerous to list.