Morphological Priors for Probabilistic Neural Word Embeddings

Word embeddings allow natural language processing systems to share statistical information across related words. These embeddings are typically based on distributional statistics, making it difficult for them to generalize to rare or unseen words. We propose to improve word embeddings by incorporating morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, we combine morphological and distributional information in a unified probabilistic framework, in which the word embedding is a latent variable. The morphological information provides a prior distribution on the latent word embeddings, which in turn condition a likelihood function over an observed corpus. This approach yields improvements on intrinsic word similarity evaluations, and also in the downstream task of part-of-speech tagging.


Introduction
Word embeddings have been shown to improve many natural language processing applications, from language models (Mikolov et al., 2010) to information extraction (Collobert and Weston, 2008), and from parsing (Chen and Manning, 2014) to machine translation (Cho et al., 2014). Word embeddings leverage a classical idea in natural language processing: use distributional statistics from large amounts of unlabeled data to learn representations that allow sharing across related words (Brown et al., 1992). While this approach is undeniably effective, the long-tail nature of linguistic data ensures that there will always be words that are not observed in even the largest corpus (Zipf, 1949). There will be many other words which are observed only a handful of times, making the distributional statistics too sparse to accurately estimate the 100- or 1000-dimensional dense vectors that are typically used for word embeddings. These problems are particularly acute in morphologically rich languages like German and Turkish, where each word may have dozens of possible inflections.
* The first two authors contributed equally. Code is available at https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings.
Recent work has proposed to address this issue by replacing word-level embeddings with embeddings based on subword units: morphemes (Luong et al., 2013; Botha and Blunsom, 2014) or individual characters (Santos and Zadrozny, 2014; Kim et al., 2016). Such models leverage the fact that word meaning is often compositional, arising from subword components. By learning representations of subword units, it is possible to generalize to rare and unseen words.
But while morphology and orthography are sometimes a signal of semantics, there are also many cases where similar spellings do not imply similar meanings: better-batter, melon-felon, dessert-desert, etc. If each word's embedding is constrained to be a deterministic function of its characters, as in prior work, then it will be difficult to learn appropriately distinct embeddings for such pairs. Automated morphological analysis may be incorrect: for example, really may be segmented into re+ally, incorrectly suggesting a similarity to revise and review. Even correct morphological segmentation may be misleading. Consider that incredible and inflammable share a prefix in-, which exerts the opposite effect in these two cases. Overall, a word's observed internal structure gives evidence about its meaning, but it must be possible to override this evidence when the distributional facts point in another direction.
We formalize this idea using the machinery of probabilistic graphical models. We treat word embeddings as latent variables (Vilnis and McCallum, 2014), which are conditioned on a prior distribution that is based on word morphology. We then maximize a variational approximation to the expected likelihood of an observed corpus of text, fitting variational parameters over latent binary word embeddings. For common words, the expected word embeddings are largely determined by the expected corpus likelihood, and thus, by the distributional statistics. For rare words, the prior plays a larger role. Since the prior distribution is a function of the morphology, it is possible to impute embeddings for unseen words after training the model.
We model word embeddings as latent binary vectors. This choice is based on linguistic theories of lexical semantics and morphology. Morphemes are viewed as adding morphosyntactic features to words: for example, in English, un- adds a negation feature (unbelievable), -s adds a plural feature, and -ed adds a past tense feature (Halle and Marantz, 1993). Similarly, the lexicon is often viewed as organized in terms of features: for example, the word bachelor carries the features HUMAN, MALE, and UNMARRIED (Katz and Fodor, 1963). Each word's semantic role within a sentence can also be characterized in terms of binary features (Dowty, 1991; Reisinger et al., 2015). Our approach is more amenable to such theoretical models than traditional distributed word embeddings. However, we can also work with the expected word embeddings, which are vectors of probabilities, and can therefore be expected to hold the advantages of dense distributed representations (Bengio et al., 2013).

Model
The modeling framework is illustrated in Figure 1, focusing on the word sesquipedalianism. This word is rare, but its morphology indicates several of its properties: the -ism suffix suggests that the word is a noun, likely describing some abstract property; the sesqui- prefix refers to one and a half, and so on. If the word is unknown, we must lean heavily on these intuitions, but if the word is well attested then we can rely instead on its examples in use.
It is this reasoning that our modeling framework aims to formalize. We treat word embeddings as latent variables in a joint probabilistic model. The prior distribution over a word's embedding is conditioned on its morphological structure. The embedding itself then participates, as a latent variable, in a neural sequence model over a corpus, contributing to the overall corpus likelihood. If the word appears frequently, then the corpus likelihood dominates the prior, which is equivalent to relying on the word's distributional properties. If the word appears rarely, then the prior distribution steps in, and gives a best guess as to the word's meaning.
Before describing these component pieces in detail, we first introduce some notation. The representation of word $w$ is a latent binary vector $b_w \in \{0, 1\}^k$, where $k$ is the size of each word embedding. As noted in the introduction, this binary representation is motivated by feature-based theories of lexical semantics (Katz and Fodor, 1963). Each word $w$ is constructed from a set of $M_w$ observed morphemes, $\mathcal{M}_w = (m_{w,1}, m_{w,2}, \ldots, m_{w,M_w})$. Each morpheme is in turn drawn from a finite vocabulary of size $v_m$, so that $m_{w,i} \in \{1, 2, \ldots, v_m\}$. Morphemes are obtained from an unsupervised morphological segmenter, which is treated as a black box. Finally, we are given a corpus, which is a sequence of words, $x = (x_1, x_2, \ldots, x_N)$, where each word $x_t \in \{1, 2, \ldots, v_w\}$, with $v_w$ equal to the size of the vocabulary, including the token UNK for unknown words.

Prior distribution
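The key differentiating property of this model is that rather than estimating word embeddings directly, we treat them as a latent variable, with a prior distribution reflecting the word's morphological properties. To characterize this prior distribution, each morpheme $m$ is associated with an embedding of its own, $u_m \in \mathbb{R}^k$, where $k$ is again the embedding size. Then for position $i$ of the word embedding $b_w$, we have the following prior,

$$b_{w,i} \sim \mathrm{Bernoulli}\Big(\sigma\Big(\sum_{m \in \mathcal{M}_w} u_{m,i}\Big)\Big),$$

where $\sigma(\cdot)$ indicates the sigmoid function. The prior log-likelihood for a set of word embeddings is,

$$\log P(b; \mathcal{M}, u) = \sum_{w=1}^{v_w} \sum_{i=1}^{k} \log P(b_{w,i}; \mathcal{M}_w, u) = \sum_{w=1}^{v_w} \sum_{i=1}^{k} \Big[ b_{w,i} \log \sigma\Big(\sum_{m \in \mathcal{M}_w} u_{m,i}\Big) + (1 - b_{w,i}) \log \Big(1 - \sigma\Big(\sum_{m \in \mathcal{M}_w} u_{m,i}\Big)\Big) \Big].$$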
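The prior described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the Bernoulli parameter for position $i$ is the sigmoid of the summed morpheme embeddings, $\sigma(\sum_{m \in \mathcal{M}_w} u_{m,i})$, as the text here and in the conclusion indicates. The function names, the toy segmentation, and all numeric values are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def prior_probs(morphemes, u):
    """Return P(b_{w,i} = 1) for every position i of one word's embedding.

    morphemes: list of morpheme keys for the word (its segmentation).
    u: dict mapping each morpheme to a k-dimensional list of floats.
    """
    k = len(next(iter(u.values())))
    return [sigmoid(sum(u[m][i] for m in morphemes)) for i in range(k)]


def prior_log_likelihood(b_w, morphemes, u):
    """Log-probability of a binary embedding b_w under the morphological prior."""
    probs = prior_probs(morphemes, u)
    return sum(
        math.log(p) if bit else math.log(1.0 - p)
        for bit, p in zip(b_w, probs)
    )


# Toy morpheme embeddings with k = 3 (values invented for illustration).
u = {
    "un": [2.0, -1.0, 0.0],
    "believ": [-0.5, 1.5, 0.0],
    "able": [0.5, 0.5, 0.0],
}
segmentation = ["un", "believ", "able"]
probs = prior_probs(segmentation, u)
log_p = prior_log_likelihood([1, 1, 0], segmentation, u)
```

Because the prior depends only on the segmentation, `prior_probs` can be called for a word that never occurred in the training corpus, which is how embeddings are imputed for unseen words.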

Expected likelihood
The corpus likelihood is computed via a recurrent neural network language model (RNNLM; Mikolov et al., 2010), which is a generative model of sequences of tokens. In the RNNLM, the probability of each word is conditioned on all preceding words through a recurrently updated state vector. This state vector in turn depends on the embeddings of the previous words, through the following update equations:

$$h_t = f(b_{x_t}, h_{t-1})$$

$$P(x_{t+1} \mid h_t) = \frac{\exp(V_{x_{t+1}}^\top h_t)}{\sum_{j} \exp(V_j^\top h_t)}$$

The function $f(\cdot)$ is a recurrent update equation; in the RNN, it corresponds to $\sigma(\Theta h_{t-1} + b_{x_t})$, where $\sigma(\cdot)$ is the elementwise sigmoid function. The matrix $V \in \mathbb{R}^{v \times k}$ contains the "output embeddings" of each word in the vocabulary. We can then define the conditional log-likelihood of a corpus $x = (x_1, x_2, \ldots, x_N)$ as,

$$\log P(x \mid b) = \sum_{t=1}^{N} \log P(x_t \mid h_{t-1}, b).$$

Since $h_{t-1}$ is deterministically computed from $x_{1:t-1}$ (conditioned on $b$), we can equivalently write the log-likelihood as,

$$\log P(x \mid b) = \sum_{t=1}^{N} \log P(x_t \mid x_{1:t-1}, b).$$

The LSTM update equations are well known, so we focus on the more concise RNN notation, but we employ LSTMs in all experiments due to their better ability to capture long-range dependencies.
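As a concrete illustration of the conditional log-likelihood, the sketch below runs the simple RNN recurrence $h_t = \sigma(\Theta h_{t-1} + b_{x_t})$ over a token sequence and scores each token with a softmax over the output embeddings $V$. It is a toy under stated assumptions rather than the paper's code: the initial state $h_0$ is taken to be all zeros, the plain RNN stands in for the LSTM used in the experiments, and every name and number is hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def log_softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def corpus_log_likelihood(x, emb, theta, V):
    """Sum over t of log P(x_t | h_{t-1}) for binary input embeddings emb.

    x: list of word ids; emb[w]: k-dim binary vector b_w;
    theta: k x k recurrence matrix; V: v x k output embeddings.
    """
    h = [0.0] * len(theta)  # assumed initial state h_0 = 0
    total = 0.0
    for token in x:
        total += log_softmax(matvec(V, h))[token]
        pre = [r + b for r, b in zip(matvec(theta, h), emb[token])]
        h = [1.0 / (1.0 + math.exp(-z)) for z in pre]  # h_t
    return total


# Toy model: vocabulary of 3 words, embedding size k = 2.
emb = [[1, 0], [0, 1], [1, 1]]
theta = [[0.5, -0.3], [0.2, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
ll = corpus_log_likelihood([0, 1, 2], emb, theta, V)
```

Since each step is a normalized softmax, the probabilities of all sequences of a fixed length sum to one, which is a quick sanity check on any implementation of this likelihood.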

Variational approximation
Inference on the marginal likelihood $P(x_{1:N}) = \int P(x_{1:N}, b)\, db$ is intractable. We address this issue by making a variational approximation. The variational distribution $Q(b)$ is defined using a fully factorized mean field approximation,

$$Q(b; \gamma) = \prod_{w=1}^{v_w} \prod_{j=1}^{k} q(b_{w,j}; \gamma_{w,j}).$$

The variational distribution is a product of Bernoullis, with parameters $\gamma_{w,j} \in [0, 1]$. In the evaluations that follow, we use the expected word embeddings $q(b_w)$, which are dense vectors in $[0, 1]^k$. We can then use $Q(\cdot)$ to place a variational lower bound on the marginal log-likelihood, involving the expected conditional likelihood,

$$\log P(x) \geq \mathbb{E}_q\left[\log P(x \mid b) + \log P(b) - \log Q(b)\right].$$

Even with this variational approximation, the expected log-likelihood is still intractable to compute. In recurrent neural network language models, each word $x_t$ is conditioned on the entire prior history, $x_{1:t-1}$; indeed, this is one of the key advantages over fixed-length n-gram models. However, this means that the individual expected log probabilities involve not just the word embedding of $x_t$ and its immediate predecessor, but rather, the embeddings of all words in the sequence $x_{1:t}$. We therefore make a further approximation by taking a local expectation over the recurrent state,

$$\mathbb{E}_q[h_t] \approx f(\mathbb{E}_q[b_{x_t}], \mathbb{E}_q[h_{t-1}]), \qquad \mathbb{E}_q[\log P(x_{t+1} \mid x_{1:t}, b)] \approx \log P(x_{t+1} \mid \mathbb{E}_q[h_t]).$$

This approximation means that we do not propagate uncertainty about $h_t$ through the recurrent update or through the likelihood function, but rather, we use local point estimates. Alternative methods such as variational autoencoders (Chung et al., 2015) or sequential Monte Carlo (de Freitas et al., 2000) might provide better and more principled approximations, but this direction is left for future work.
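To see what the local expectation gives up, the following sketch compares two quantities for a single recurrent step: the exact expected state, obtained by enumerating all $2^k$ binary embeddings under independent Bernoulli($\gamma_j$) distributions, and the point estimate obtained by feeding the expected embedding $\gamma$ directly into the update. It assumes the simple RNN update $\sigma(\Theta h + b)$ and uses hypothetical names and toy values; enumeration is only feasible here because $k$ is tiny.

```python
import itertools
import math


def step(h_prev, b, theta):
    """One simple-RNN update: sigmoid(theta @ h_prev + b), elementwise."""
    return [
        1.0 / (1.0 + math.exp(-(sum(t * h for t, h in zip(row, h_prev)) + b_i)))
        for row, b_i in zip(theta, b)
    ]


def local_estimate(h_prev, gamma, theta):
    """Point estimate: plug the expected embedding gamma into the update."""
    return step(h_prev, gamma, theta)


def exact_expectation(h_prev, gamma, theta):
    """Exact E[h] under independent Bernoulli(gamma_j) bits, by enumeration."""
    expected = [0.0] * len(gamma)
    for bits in itertools.product([0, 1], repeat=len(gamma)):
        prob = 1.0
        for bit, g in zip(bits, gamma):
            prob *= g if bit else 1.0 - g
        h = step(h_prev, bits, theta)
        expected = [e + prob * h_i for e, h_i in zip(expected, h)]
    return expected


h_prev = [0.0, 0.0]
theta = [[0.0, 0.0], [0.0, 0.0]]
gamma = [0.5, 1.0]  # first bit uncertain, second bit certain
exact = exact_expectation(h_prev, gamma, theta)
local = local_estimate(h_prev, gamma, theta)
```

The two agree wherever $\gamma_j$ is 0 or 1 and differ slightly elsewhere, because the sigmoid is nonlinear; the point estimate trades this small bias for a cost that no longer grows exponentially in $k$ or in the sequence length.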

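Variational bounds in the form of Equation 13 can generally be expressed as a difference between an expected log-likelihood term and a term for the Kullback-Leibler (KL) divergence between the prior distribution $P(b)$ and the variational distribution $Q(b)$ (Wainwright and Jordan, 2008). Incorporating the approximation in Equation 19, the resulting objective is,

$$\mathcal{L} = \sum_{t=1}^{N} \log P(x_t \mid \mathbb{E}_q[h_{t-1}]) - D_{KL}(Q(b) \,\|\, P(b)).$$

The KL divergence is equal to,

$$D_{KL}(Q(b) \,\|\, P(b)) = \sum_{w=1}^{v_w} \sum_{i=1}^{k} \left[ \gamma_{w,i} \log \frac{\gamma_{w,i}}{\sigma\big(\sum_{m \in \mathcal{M}_w} u_{m,i}\big)} + (1 - \gamma_{w,i}) \log \frac{1 - \gamma_{w,i}}{1 - \sigma\big(\sum_{m \in \mathcal{M}_w} u_{m,i}\big)} \right].$$

Each term in the variational bound can be easily constructed in a computation graph, enabling automatic differentiation and the application of standard stochastic optimization techniques.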

Implementation
The objective function is given by the variational lower bound in Equation 20, using the approximation to the conditional likelihood described in Equation 19. This function is optimized in terms of several parameters:
• the morpheme embeddings, $\{u_m\}_{m \in 1 \ldots v_m}$;
• the variational parameters on the word embeddings, $\{\gamma_w\}_{w \in 1 \ldots v_w}$;
• the output word embeddings $V$;
• the parameters of the recurrence function, $\Theta$.
Each of these parameters is updated via the RMSProp online learning algorithm (Tieleman and Hinton, 2012). The model and baseline (described below) are implemented in blocks (van Merriënboer et al., 2015). In the remainder of the paper, we refer to our model as VAREMBED.
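For readers unfamiliar with RMSProp, the update can be sketched as follows. This is the textbook form of the rule, not the Blocks implementation: each gradient is divided by a running root-mean-square of recent gradients. The decay rate of 0.9 and the epsilon are assumed values, and the quadratic objective is a hypothetical stand-in for the variational bound.

```python
import math


def rmsprop_step(params, grads, cache, lr=0.01, decay=0.9, eps=1e-8):
    """Apply one in-place RMSProp update to a flat list of parameters.

    cache holds the running average of squared gradients, one per parameter.
    """
    for i, g in enumerate(grads):
        cache[i] = decay * cache[i] + (1.0 - decay) * g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)


# Toy problem: minimize (theta - 3)^2, whose gradient is 2 * (theta - 3).
params = [0.0]
cache = [0.0]
for _ in range(2000):
    grads = [2.0 * (params[0] - 3.0)]
    rmsprop_step(params, grads, cache)
```

Because the step is normalized by gradient magnitude, parameters with very different gradient scales, such as rarely updated morpheme embeddings and frequently updated recurrence weights, move at comparable rates.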

Data and preprocessing
All embeddings are trained on 22 million tokens from the North American News Text (NANT) corpus (Graff, 1995). We use an initial vocabulary of 50,000 words, with a special UNK token for words that are not among the 50,000 most common. We then perform downcasing and convert all numeric tokens to a special NUM token. After these steps, the vocabulary size decreases to 48,986. Note that the method can impute word embeddings for out-of-vocabulary words under the prior distribution $P(b; \mathcal{M}, u)$; however, it is still necessary to decide on a vocabulary size to determine the number of variational parameters $\gamma$ and output embeddings to estimate. Unsupervised morphological segmentation is performed using Morfessor (Creutz and Lagus, 2002), with a maximum of sixteen morphemes per word. This results in a total of 14,000 morphemes, which includes stems for monomorphemic words. We do not rely on any labeled information about morphological structure, although the incorporation of gold morphological analysis is a promising topic for future work.
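The vocabulary steps described above can be sketched as a small pipeline. The order follows the text (frequency cutoff with UNK first, then downcasing and NUM replacement), which is why the final vocabulary ends up smaller than the initial one: case variants and distinct numbers collapse together. The regular expression for "numeric token" and the function names are assumptions, not the authors' definitions.

```python
import re
from collections import Counter

# Assumed definition of a numeric token: digits with optional sign,
# thousands separators, and a decimal part.
NUMERIC = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?$")


def build_vocab(tokens, max_size):
    """Keep the max_size most frequent raw tokens."""
    return {word for word, _ in Counter(tokens).most_common(max_size)}


def normalize(token, vocab):
    """Map a raw token to its final vocabulary entry."""
    if token not in vocab:
        return "UNK"
    if NUMERIC.match(token):
        return "NUM"
    return token.lower()


tokens = "The The the the cat cat 42 42 3.5 3.5 dog".split()
vocab = build_vocab(tokens, max_size=5)
normalized = [normalize(t, vocab) for t in tokens]
final_vocab = set(normalized)
```

In this toy run the five raw vocabulary entries shrink to four final entries including UNK, mirroring on a small scale the drop from 50,000 to 48,986 reported above.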

Learning details
The LSTM parameters are initialized uniformly in the range [−0.08, 0.08]. The word embeddings are initialized using pre-trained word2vec embeddings. We train the model for 15 epochs, with an initial learning rate of 0.01, a decay of 0.97 per epoch, and minibatches of size 25. We clip the norm of the gradients (normalized by minibatch size) at 1, using the default settings in the RMSprop implementation in blocks. These choices are motivated by prior work (Zaremba et al., 2014). After each iteration, we compute the objective function on the development set; when the objective does not improve beyond a small threshold, we halve the learning rate.
Training takes roughly one hour per iteration using an NVIDIA 670 GTX, which is a commodity graphics processing unit (GPU) for gaming. This is nearly identical to the training time required for our reimplementation of the algorithm of Botha and Blunsom (2014), described below.
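The learning-rate schedule and gradient clipping described in this section can be sketched as below. How the per-epoch decay of 0.97 and the halving rule interact is not spelled out in the text, so this sketch assumes both are applied at the end of an epoch; the improvement threshold of 1e-3 and the function names are likewise hypothetical.

```python
import math


def clip_by_norm(grads, max_norm=1.0):
    """Rescale a flat gradient list so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


def learning_rates(dev_objectives, initial=0.01, decay=0.97, threshold=1e-3):
    """Return the learning rate used in each epoch.

    dev_objectives[i] is the development-set objective after epoch i
    (higher is better). After every epoch the rate decays; it is also
    halved when the objective fails to improve by at least threshold.
    """
    rates = []
    lr = initial
    best = float("-inf")
    for objective in dev_objectives:
        rates.append(lr)
        lr *= decay
        if objective - best < threshold:
            lr *= 0.5
        best = max(best, objective)
    return rates


clipped = clip_by_norm([3.0, 4.0])
rates = learning_rates([-10.0, -9.0, -9.0, -8.5])
```

Here the third epoch brings no improvement on the development set, so the fourth epoch runs at half of the otherwise decayed rate.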

Baseline
The most comparable approach is that of Botha and Blunsom (2014). In their work, embeddings are estimated for each morpheme, as well as for each in-vocabulary word. The final embedding for a word is then the sum of these embeddings, e.g.,

$$\text{greenhouse} = \mathit{greenhouse} + \mathit{green} + \mathit{house}, \qquad (24)$$

where the italicized elements represent learned embeddings.
We build a baseline that is closely inspired by this approach, which we call SUMEMBED. The key difference is that while Botha and Blunsom (2014) build on the log-bilinear language model (Mnih and Hinton, 2007), we use the same LSTM-based architecture as in our own model implementation. This enables our evaluation to focus on the critical difference between the two approaches: the use of latent variables rather than summation to model the word embeddings. As with our method, we used pre-trained word2vec embeddings to initialize the model.
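The additive composition used by the baseline can be sketched as follows. This is an illustration of the sum described above, not the SUMEMBED code: it assumes that an out-of-vocabulary word simply contributes no word-level term, so that its embedding is the sum of its morpheme embeddings. Names and values are hypothetical.

```python
def sum_embed(word, morphemes, word_emb, morph_emb, k):
    """Word embedding as the sum of a word-level vector and morpheme vectors.

    Out-of-vocabulary words have no word-level vector (assumed zero here).
    """
    vector = list(word_emb.get(word, [0.0] * k))
    for morpheme in morphemes:
        vector = [a + b for a, b in zip(vector, morph_emb[morpheme])]
    return vector


# Toy embeddings with k = 2 (values invented for illustration).
word_emb = {"greenhouse": [1.0, 0.0]}
morph_emb = {"green": [0.5, 0.5], "house": [0.0, 2.0], "s": [0.1, 0.0]}

in_vocab = sum_embed("greenhouse", ["green", "house"], word_emb, morph_emb, k=2)
unseen = sum_embed("greenhouses", ["green", "house", "s"], word_emb, morph_emb, k=2)
```

The contrast with the latent-variable model is that this composition is deterministic: the morpheme vectors always contribute in full, whereas a prior can be outweighed by corpus evidence for frequent words.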

Number of parameters
The dominant terms in the overall number of parameters are the (expected) word embeddings themselves. The variational parameters of the input word embeddings, $\gamma$, are of size $k \times v_w$. The output word embeddings are of size $\#|h| \times v_w$. The morpheme embeddings are of size $k \times v_m$, with $v_m \ll v_w$. In our main experiments, we set $v_w$ = 48,896 (see above), $k = 128$, and $\#|h| = 128$. After including the character embeddings and the parameters of the recurrent models, the total number of parameters is roughly 12.8 million. This is identical to the number of parameters in the SUMEMBED baseline.

Evaluation
Our evaluation compares the following embeddings:
WORD2VEC: We train the popular word2vec CBOW (continuous bag of words) model (Mikolov et al., 2013), using the gensim implementation.
SUMEMBED: We compare against the baseline described in § 3.3, which can be viewed as a reimplementation of the compositional model of Botha and Blunsom (2014).
VAREMBED: For our model, we take the expected embeddings $\mathbb{E}_q[b]$, and then pass them through an inverse sigmoid function to obtain values over the entire real line.
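The inverse sigmoid (logit) transform applied to the expected embeddings can be sketched as below. The clipping constant is an assumption added so that probabilities of exactly 0 or 1 do not map to infinite values; the text does not say how such cases are handled.

```python
import math


def inverse_sigmoid(p: float, eps: float = 1e-6) -> float:
    """Logit of p, clipped away from 0 and 1 to keep the result finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def to_real_line(expected_embedding):
    """Map a vector of probabilities in [0, 1] to unbounded real values."""
    return [inverse_sigmoid(p) for p in expected_embedding]


# Toy expected embedding (variational parameters for one word).
gamma_w = [0.5, 0.8, 0.0, 1.0]
real_valued = to_real_line(gamma_w)
```

The transform spreads out values near 0 and 1, so that confident bits are well separated from uncertain ones when the vectors are compared by cosine similarity or fed to a downstream network.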

Word similarity
Our first evaluation is based on two classical word similarity datasets: Wordsim353 (Finkelstein et al., 2001) and the Stanford "rare words" (rw) dataset (Luong et al., 2013). We report Spearman's ρ, a measure of rank correlation, evaluating on both the entire vocabulary and the subset of in-vocabulary words.
As shown in Table 1, VAREMBED consistently outperforms SUMEMBED on both datasets. On the wordsim words that are in the NANT vocabulary, WORD2VEC gives slightly better results, but it is not applicable to the complete dataset, since it has no embeddings for out-of-vocabulary words. On the rare words dataset, WORD2VEC performs considerably worse than both morphology-based models, matching the findings of Luong et al. (2013) and Botha and Blunsom (2014) regarding the importance of morphology for doing well on this dataset.
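The word similarity evaluation can be sketched as follows: score each word pair by the cosine similarity of its embeddings, then compute Spearman's ρ between those scores and the human ratings. Using cosine similarity is an assumption (it is the conventional choice for these datasets), and the embeddings and ratings below are invented toy values.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var) if var > 0 else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


emb = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0], "bus": [0.2, 0.9]}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5), ("cat", "car", 2.0), ("dog", "bus", 2.5)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [score for _, _, score in pairs]
rho = spearman(model_scores, human_scores)
```

Because only ranks matter, the metric is insensitive to the scale of the embeddings, which makes it suitable for comparing models whose vectors live in different ranges.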

Alignment with lexical semantic features
Recent work questions whether these word similarity metrics are predictive of performance on downstream tasks (Faruqui et al., 2016). The QVEC statistic is another intrinsic evaluation method, which has been shown to be better correlated with downstream tasks (Tsvetkov et al., 2015). This metric measures the alignment between word embeddings and a set of lexical semantic features. Specifically, we use the semcor noun verb supersenses oracle provided at the qvec github repository. As shown in Table 2, VAREMBED outperforms SUMEMBED on the full lexicon, and gives similar performance to WORD2VEC on the set of in-vocabulary words. We also consider the morpheme embeddings alone. For SUMEMBED, this means that we construct the word embedding from the sum of the embeddings for its morphemes, without the additional embedding per word. For VAREMBED, we use the expected embedding under the prior distribution $\mathbb{E}[b \mid c]$. The results for these representations are shown in the bottom half of Table 2, revealing that VAREMBED learns much more meaningful embeddings at the morpheme level, while much of the power of SUMEMBED seems to come from the word embeddings.

Part-of-speech tagging
Our final evaluation is on the downstream task of part-of-speech tagging, using the Penn Treebank. We build a simple classification-based tagger, using a feedforward neural network. (This is not intended as an alternative to state-of-the-art tagging algorithms, but as a comparison of the syntactic utility of the information encoded in the word embeddings.) The inputs to the network are the concatenated embeddings of the five-word neighborhood $(x_{t-2}, x_{t-1}, x_t, x_{t+1}, x_{t+2})$; as in all evaluations, 128-dimensional embeddings are used, so the total size of the input is 640. This input is fed into a network with two hidden layers of size 625, and a softmax output layer over all tags. We train using RMSProp (Tieleman and Hinton, 2012). Results are shown in Table 3. Both morphologically-informed embeddings are significantly better than WORD2VEC (p < .01, two-tailed binomial test), but the difference between SUMEMBED and VAREMBED is not significant at p < .05. Figure 2 breaks down the errors by word frequency. As shown in the figure, the tagger based on WORD2VEC performs poorly for rare words, which is expected because these embeddings are estimated from sparse distributional statistics. SUMEMBED is slightly better on the rarest words (the 0-100 group accounts for roughly 10% of all tokens). In this case, it appears that this simple additive model is better, since the distributional statistics are too sparse to offer much improvement. The probabilistic VAREMBED embeddings are best for all other frequency groups, showing that this model effectively combines morphology and distributional statistics.
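The tagger's input representation can be sketched as below: for each position, the embeddings of the five-word window are concatenated into a single vector of size 5 × 128 = 640. Padding positions beyond the sentence boundary with zero vectors is an assumption, since the text does not state how edges are handled, and the `embed` function here is a hypothetical stand-in for a trained embedding lookup.

```python
def window_features(sentence, t, embed, k):
    """Concatenate embeddings for positions t-2 .. t+2 of a sentence.

    Positions outside the sentence are padded with zero vectors (assumed).
    """
    features = []
    for offset in (-2, -1, 0, 1, 2):
        i = t + offset
        if 0 <= i < len(sentence):
            features.extend(embed(sentence[i]))
        else:
            features.extend([0.0] * k)
    return features


K = 128


def embed(word):
    # Hypothetical stand-in: a constant vector whose value is the word length.
    return [float(len(word))] * K


sentence = ["the", "model", "tags", "words"]
first = window_features(sentence, 0, embed, K)
middle = window_features(sentence, 2, embed, K)
```

Each resulting vector would then be passed to the feedforward network, so that the tag for the center word is predicted from its own embedding and those of its four neighbors.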

Related work
Adding side information to word embeddings An alternative approach to incorporating additional information into word embeddings is to constrain the embeddings of semantically-related words to be similar. Such work typically draws on existing lexical semantic resources such as WordNet. For example, Yu and Dredze (2014) define a joint training objective, in which the word embedding must predict not only neighboring word tokens in a corpus, but also related word types in a semantic resource; a similar approach is taken by Bian et al. (2014). Alternatively, pre-trained word embeddings can be "retrofitted" over a semantic network. Both retrofitting and our own approach treat the true word embeddings as latent variables, from which the pretrained word embeddings are stochastically emitted. However, a key difference from our approach is that the underlying representation in these prior works is relational, and not generative. These methods can capture similarity between words in a relational lexicon such as WordNet, but they do not offer a generative account of how (approximate) meaning is constructed from orthography or morphology.

Word embeddings and morphology
The SUMEMBED baseline is based on the work of Botha and Blunsom (2014), in which words are segmented into morphemes using MORFESSOR (Creutz and Lagus, 2002), and then word representations are computed through addition of morpheme representations. A key modeling difference from this prior work is that rather than computing word embeddings directly and deterministically from subcomponent embeddings (morphemes or characters, as in Kim et al., 2016), we use these subcomponents to define a prior distribution, which can be overridden by distributional statistics for common words. Other work exploits morphology by training word embeddings to optimize a joint objective over distributional statistics and rich, morphologically-augmented part of speech tags (Cotterell and Schütze, 2015). This can yield better word embeddings, but does not provide a way to compute embeddings for unseen words, as our approach does. Recent work by Cotterell et al. (2016) extends the idea of retrofitting, which was based on semantic similarity, to a morphological framework. In this model, embeddings are learned for morphemes as well as for words, and each word embedding is conditioned on the sum of the morpheme embeddings, using a multivariate Gaussian. The covariance of this Gaussian prior is set to the inverse of the number of examples in the training corpus, which has the effect of letting the morphology play a larger role for rare or unseen words. Like all retrofitting approaches, this method is applied in a pipeline fashion after training word embeddings on a large corpus; in contrast, our approach is a joint model over the morphology and corpus. Another practical difference is that Cotterell et al. (2016) use gold morphological features, while we use an automated morphological segmentation.
Latent word embeddings Word embeddings are typically treated as a parameter, and are optimized through point estimation (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2010). Current models use word embeddings with hundreds or even thousands of parameters per word, yet many words are observed only a handful of times. It is therefore natural to consider whether it might be beneficial to model uncertainty over word embeddings. Vilnis and McCallum (2014) propose to model Gaussian densities over dense vector word embeddings. They estimate the parameters of the Gaussian directly, and, unlike our work, do not consider using orthographic information as a prior distribution. This is easy to do in the latent binary framework proposed here, which is also a better fit for some theoretical models of lexical semantics (Katz and Fodor, 1963; Reisinger et al., 2015). This view is shared by Kruszewski et al. (2015), who induce binary word representations using labeled data of lexical semantic entailment relations, and by Henderson and Popa (2016), who take a mean field approximation over binary representations of lexical semantic features to induce hyponymy relations.
More broadly, our work is inspired by recent efforts to combine directed graphical models with discriminatively trained "deep learning" architectures. The variational autoencoder (Kingma and Welling, 2014), neural variational inference (Mnih and Gregor, 2014; Miao et al., 2016), and black box variational inference (Ranganath et al., 2014) all propose to use a neural network to compute the variational approximation. These ideas are employed by Chung et al. (2015) in the variational recurrent neural network, which places a latent continuous variable at each time step. In contrast, we have a dictionary of latent variables (the word embeddings), which introduce uncertainty over the hidden state $h_t$ in a standard recurrent neural network or LSTM. We train this model by employing a mean field approximation, but these more recent techniques for neural variational inference may also be applicable. We plan to explore this possibility in future work.

Conclusion and future work
We present a model that unifies compositional and distributional perspectives on lexical semantics, through the machinery of Bayesian latent variable models. In this framework, our prior expectations of word meaning are based on internal structure, but these expectations can be overridden by distributional statistics. The model is based on the very successful long short-term memory (LSTM) for sequence modeling, and while it employs a Bayesian justification, its inference and estimation are little more complicated than a standard LSTM. This demonstrates the advantages of reasoning about uncertainty even when working in a "neural" paradigm.
This work represents a first step, and we see many possibilities for improving performance by extending it. Clearly we would expect this model to be more effective in languages with richer morphological structure than English, and we plan to explore this possibility in future work. From a modeling perspective, our prior distribution merely sums the morpheme embeddings, but a more accurate model might account for sequential or combinatorial structure, through a recurrent, recursive (Luong et al., 2013), or convolutional architecture (Kim et al., 2016). There appears to be no technical obstacle to imposing such structure in the prior distribution. Furthermore, while we build the prior distribution from morphemes, it is natural to ask whether characters might be a better underlying representation: character-based models may generalize well to nonword tokens such as names and abbreviations, they do not require morphological segmentation, and they require a much smaller number of underlying embeddings. On the other hand, morphemes encode rich regularities across words, which may make a morphologically-informed prior easier to learn than a prior which works directly at the character level. It is possible that this tradeoff could be transcended by combining characters and morphemes in a single model.
Another advantage of latent variable models is that they admit partial supervision. If we follow Tsvetkov et al. (2015) in the argument that word embeddings should correspond to lexical semantic features, then an inventory of such features could be used as a source of partial supervision, thus locking dimensions of the word embeddings to specific semantic properties. This would complement the graph-based "retrofitting" supervision proposed in prior work, by instead placing supervision at the level of individual words.