Disambiguated skip-gram model

We present disambiguated skip-gram: a neural-probabilistic model for learning multi-sense distributed representations of words. Disambiguated skip-gram jointly estimates a skip-gram-like context word prediction model and a word sense disambiguation model. Unlike previous probabilistic models for learning multi-sense word embeddings, disambiguated skip-gram is end-to-end differentiable and can be interpreted as a simple feed-forward neural network. We also introduce an effective pruning strategy for the embeddings learned by disambiguated skip-gram. This allows us to control the granularity of representations learned by our model. In experimental evaluation, disambiguated skip-gram improves state-of-the-art results on several word sense induction benchmarks.


Introduction
Distributed representations of words find applications in a broad range of tasks, from natural language parsing (Socher et al., 2013) to image captioning (Karpathy and Fei-Fei, 2015). Their usefulness led to a renewed interest in word embedding algorithms. The most popular algorithms of this kind learn word vectors in an unsupervised manner, e.g., from word contexts (Mikolov et al., 2013a) or from statistics of word cooccurrence (Pennington et al., 2014). Unsupervised learning of word embeddings has a clear advantage: both general and domain-specific text corpora are available for a number of languages, which greatly reduces the cost of training. That said, unsupervised learning of word embeddings comes with its own challenges. One of the most important is word ambiguity: words in a natural language often have more than one meaning. The word mouse, for example, may mean a pointing device or an animal. Word embedding algorithms often do not recognize this language feature and estimate only one vector representation per word. This may lead to suboptimal word representations.
The main contribution of this work is disambiguated skip-gram: a neural-probabilistic model for learning distributed representations of words that capture word ambiguity. Disambiguated skip-gram builds upon the skip-gram model introduced by Mikolov et al. (2013a,b). Skip-gram constructs word embeddings via an auxiliary prediction task: given a word in a sentence, skip-gram attempts to predict the surrounding words. To this end, skip-gram defines a simple softmax model for the conditional probability of observing a context word c given the center word w:

$$p(c \mid w) = \frac{\exp\left(u_c^\top v_w\right)}{\sum_{d \in D} \exp\left(u_d^\top v_w\right)}, \quad (1)$$

where D is the vocabulary. This log-bilinear model assigns two embedding vectors to every word w ∈ D: an input embedding vector v_w and an output embedding vector u_w. Skip-gram defines the training objective for a single example as the log-probability log p(c | w). By maximizing this objective, skip-gram estimates input and output vectors that reflect semantic relations between words that occur in similar contexts. The input vectors are then used as word embeddings.

The main idea behind disambiguated skip-gram is to jointly learn to disambiguate words and predict their contexts. We therefore extend skip-gram with a parametric word sense disambiguation model. This allows us to discover word senses in an unsupervised manner, while preserving the simplicity of the skip-gram approach. In particular, unlike previous probabilistic models for multi-sense word embeddings, disambiguated skip-gram can be seen as a simple feed-forward neural network amenable to end-to-end training with back-propagation. Furthermore, disambiguated skip-gram admits an effective pruning strategy for the learned word sense embeddings. In particular, we control the granularity of the learned representations by penalizing the entropy of the probability distributions learned by the disambiguation model. We then marginalize word sense probabilities over the training examples and prune embeddings with low marginal probability.
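To make the base model concrete, the following is a minimal numpy sketch of the skip-gram softmax in Eq. 1. The vocabulary size, dimensionality and random initialization are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 1000, 50                          # toy vocabulary size and dimension
v = rng.normal(scale=0.1, size=(V, dim))   # input embeddings v_w
u = rng.normal(scale=0.1, size=(V, dim))   # output embeddings u_d

def log_p_context(c, w):
    """log p(c | w) under the skip-gram softmax (Eq. 1)."""
    scores = u @ v[w]                      # u_d^T v_w for every d in D
    scores = scores - scores.max()         # subtract max for numerical stability
    return scores[c] - np.log(np.exp(scores).sum())

# skip-gram maximizes log p(c | w) for each observed (center, context) pair
loss = -log_p_context(c=42, w=7)
```

A real implementation would avoid the full softmax (e.g., via negative sampling), but the normalized form above matches the probabilistic definition in Eq. 1.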
We have carried out an extensive experimental evaluation of disambiguated skip-gram. Our results demonstrate that the multi-sense word embeddings learned by disambiguated skip-gram improve state-of-the-art results in the word sense induction task.

Disambiguated skip-gram model
Let X = [w_1, ..., w_n], where each w_i ∈ D, be a sequence of words from a vocabulary D. By C_{w_i} we denote the context of the word w_i in X. For example, C_{w_i} can be a set of words that are no further than l positions from w_i and are in the same sentence as w_i. To simplify notation we will usually omit the sequence index i and write w ∈ X for an element of the input sequence X and C_w for its context. We will also use the notation y_w for a vector y (from some set of vectors indexed by the vocabulary words) corresponding to the word w. Note that in this case we disregard the position of w in X and use just the word as the index. In particular, if some word w occurs multiple times in X, all occurrences share a single vector y_w.
Similarly to skip-gram, the disambiguated skip-gram model constructs word embeddings by learning to predict context words c ∈ C_w given the center word w ∈ X. However, disambiguated skip-gram explicitly accounts for word ambiguity. To this end, we represent each word d ∈ D by a set of k sense embedding vectors v_{dz}, indexed by z ∈ {1, ..., k}, and an output embedding vector u_d. We then parametrize the conditional probability p(c | w, z = j) of observing a word c in the context of the word w in its j-th sense with a softmax model similar to the original skip-gram parametrization:

$$p(c \mid w, z = j) = \frac{\exp\left(u_c^\top v_{wj}\right)}{\sum_{d \in D} \exp\left(u_d^\top v_{wj}\right)}. \quad (2)$$

Furthermore, in this work we assume that the sense of the word w ∈ X can be guessed from its context C_w, i.e. that:

$$p(z_w = j \mid w, X) = p(z_w = j \mid w, C_w), \quad (3)$$

where z_w is the index of the sense of the word w ∈ X. Given this assumption, we parametrize the probability distribution for z_w using a softmax model similar to Eq. 2. That is, for each word d ∈ D we introduce k sense disambiguation vectors q_{ds}, s ∈ {1, ..., k}, and a context embedding vector r_d. The conditional probability that the word w ∈ X occurs in its j-th sense is then modelled as:

$$p(z = j \mid w, C_w) = \frac{\exp\left(q_{wj}^\top \bar{r}_w\right)}{\sum_{s=1}^{k} \exp\left(q_{ws}^\top \bar{r}_w\right)}, \quad (4)$$

where \bar{r}_w is a vector representation of C_w. We represent C_w by an average of context embedding vectors:

$$\bar{r}_w = \frac{1}{|C_w|} \sum_{c \in C_w} r_c. \quad (5)$$

We can now define the training objective for a single word w ∈ X as the expected negative log-likelihood of observing the context C_w under the distribution of senses of the center word w:

$$\mathcal{L}(w) = \mathbb{E}_{z \sim p(z \mid w, C_w)} \left[ -\sum_{c \in C_w} \log p(c \mid w, z) \right], \quad (6)$$

with the parameters φ = {v_{dz}, u_d, q_{dz}, r_d | d ∈ D, z = 1, ..., k}.
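The forward pass described above can be sketched in a few lines of numpy. This is an illustrative re-implementation of Eqs. 2, 4, 5 and 6 with toy sizes; the helper names (`sense_probs`, `expected_nll`) and the full-softmax computation are assumptions for clarity, not the paper's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, k = 1000, 50, 3                      # toy vocabulary, dimension, senses
v = rng.normal(scale=0.1, size=(V, k, dim))  # sense embeddings v_{dz}
u = rng.normal(scale=0.1, size=(V, dim))     # output embeddings u_d
q = rng.normal(scale=0.1, size=(V, k, dim))  # disambiguation vectors q_{ds}
r = rng.normal(scale=0.1, size=(V, dim))     # context embeddings r_d

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sense_probs(w, context):
    """Eq. 4: p(z = j | w, C_w), a softmax over q_{wj}^T r_bar_w."""
    r_bar = r[context].mean(axis=0)          # Eq. 5: average context embedding
    return softmax(q[w] @ r_bar)             # shape (k,)

def expected_nll(w, context):
    """Eq. 6: expectation over senses of -sum_c log p(c | w, z)."""
    p_z = sense_probs(w, context)
    total = 0.0
    for j in range(k):
        log_p = np.log(softmax(u @ v[w, j]))  # Eq. 2: log p(. | w, z = j)
        total += p_z[j] * -log_p[context].sum()
    return total
```

Note that the expectation in Eq. 6 is computed exactly here by enumerating all k senses; the relaxation discussed below replaces this with a differentiable one-sample estimate.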
The objective in Eq. 6 is inconvenient for gradient-based optimization, because the expectation is taken with respect to a probability distribution that is a function of model parameters. In principle, we can estimate the gradient of this objective with the score function estimator (Glynn, 1990; Williams, 1992). Unfortunately, the score function estimator suffers from high variance, even when used with a control variate. One can also derive a low-variance unbiased gradient estimator for certain probability distributions, by expressing the samples as a differentiable function of model parameters and a random variable from some independent fixed distribution (Kingma and Welling, 2013). This approach is not directly applicable to our case, because categorical distributions do not admit reparametrization with a differentiable function. That said, a simple and effective biased gradient estimator for an expectation with respect to a categorical distribution was recently proposed by Jang et al. (2016) and Maddison et al. (2016). The basic idea is to reparametrize the samples from the categorical distribution with the Gumbel-Max trick (Gumbel, 1954) and then approximate the non-differentiable max operator with a softmax function with a temperature hyper-parameter. This can be seen as a reparametrization trick for a continuous relaxation of discrete samples from the categorical distribution.
In our case, the samples from p(z = j | w, C_w) take the form:

$$z_{wj} = \begin{cases} 1 & \text{if } j = \arg\max_s \left( \log p(z = s \mid w, C_w) + \xi_s \right), \\ 0 & \text{otherwise,} \end{cases} \quad (7)$$

where ξ_s are i.i.d. samples from the standard Gumbel distribution G(0, 1). Note that the samples z_w = [z_{w1}, ..., z_{wk}] are now one-hot encoded. The continuous relaxation of z_w is:

$$\tilde{z}_{wj} = \frac{\exp\left( \left( \log p(z = j \mid w, C_w) + \xi_j \right) / \tau \right)}{\sum_{s=1}^{k} \exp\left( \left( \log p(z = s \mid w, C_w) + \xi_s \right) / \tau \right)},$$

where τ is the temperature hyper-parameter. When τ → 0, we recover the samples from p(z = j | w, C_w). However, for τ > 0 the samples \tilde{z}_w are no longer discrete. In this case we consider a relaxed sense embedding vector:

$$\tilde{v}_w = \sum_{j=1}^{k} \tilde{z}_{wj} v_{wj}, \quad (8)$$

and model the conditional probability of observing a word c in the context of the word w as:

$$p(c \mid w, \tilde{z}_w) = \frac{\exp\left(u_c^\top \tilde{v}_w\right)}{\sum_{d \in D} \exp\left(u_d^\top \tilde{v}_w\right)}. \quad (9)$$

Our training objective for a single word w ∈ X then takes the form:

$$\tilde{\mathcal{L}}(w) = \mathbb{E}_{\xi} \left[ -\sum_{c \in C_w} \log p(c \mid w, \tilde{z}_w) \right]. \quad (10)$$

The relaxed objective in Eq. 10 is tractable and differentiable with respect to the parameters φ. When τ → 0, it becomes equivalent to the objective in Eq. 6. In practice, we approximate the expectation in Eq. 10 with a one-sample Monte Carlo estimator. In this setting disambiguated skip-gram can be seen as a simple feed-forward neural network pictured in Fig. 1. During training, the network jointly estimates a sense disambiguation model (Eq. 4) and a context word prediction model (Eq. 2), which we use to construct multi-sense word embeddings.
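The Gumbel-softmax relaxation described above can be sketched as follows. This is a minimal illustration of the sampling step; the sense distribution used here is a made-up example, not one learned by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(log_probs, tau):
    """Relaxed one-hot sample from a categorical distribution
    (Jang et al., 2016; Maddison et al., 2016)."""
    xi = rng.gumbel(size=log_probs.shape)   # i.i.d. Gumbel(0, 1) noise
    y = (log_probs + xi) / tau              # perturb and scale by temperature
    y = y - y.max()                         # numerical stability
    e = np.exp(y)
    return e / e.sum()

log_p = np.log(np.array([0.7, 0.2, 0.1]))   # p(z | w, C_w) for k = 3 senses
z_soft = gumbel_softmax(log_p, tau=0.5)     # relaxed sample, sums to 1
z_hard = np.argmax(log_p + rng.gumbel(size=3))  # exact Gumbel-Max sample (Eq. 7)
# relaxed sense embedding (Eq. 8): v_tilde = z_soft @ v_w, for v_w of shape (k, dim)
```

As τ shrinks, the relaxed sample concentrates on one coordinate and approaches the one-hot Gumbel-Max sample, which is exactly the τ → 0 behaviour stated above.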

Pruning word senses
The disambiguated skip-gram model is parametric, i.e. it allocates a fixed number of sense embedding vectors to each word, even though different words have different numbers of discernible senses. That said, we can prune the sense embedding vectors by considering their probabilities according to the learned disambiguation model. In particular, after training we estimate the marginal probability:

$$p(d, j) = \frac{1}{m_d} \sum_{w \in X : \, w = d} p(z = j \mid w, C_w) \quad (11)$$

for each word d ∈ D and each sense index j ∈ {1, ..., k}. We then prune sense embedding vectors with low marginal probability, e.g., p(d, j) < 0.05. The normalizing factor m_d in Eq. 11 is the number of occurrences of the word d in X.
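The pruning step can be sketched as a single pass over the corpus. Here `sense_probs` stands in for the learned disambiguation model (Eq. 4) and is an assumed callable; the function names are illustrative, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def marginal_sense_probs(corpus, contexts, sense_probs, k):
    """Eq. 11: average p(z = j | w, C_w) over the occurrences of each word."""
    totals = defaultdict(lambda: np.zeros(k))
    counts = defaultdict(int)               # m_d: occurrences of word d in X
    for w, ctx in zip(corpus, contexts):
        totals[w] += sense_probs(w, ctx)
        counts[w] += 1
    return {d: totals[d] / counts[d] for d in totals}

def prune(marginals, threshold=0.05):
    """Keep only sense indices with marginal probability >= threshold."""
    return {d: [j for j, p in enumerate(m) if p >= threshold]
            for d, m in marginals.items()}
```

With the 0.05 threshold from the text, a word whose disambiguation model consistently assigns, say, 3% of the mass to one sense would have that sense embedding discarded.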
The above pruning technique can be extended to allow for explicit control over the granularity of the learned sense representations. To this end, we use an entropy regularization term similar to the one studied by Pereyra et al. (2017) in classification networks. In disambiguated skip-gram the granularity of the learned representations is controlled by the disambiguation model (Eq. 4). Therefore, we extend the objective of our model (Eq. 10) by adding to it the entropy S of the probability distribution p(z = j | w, C_w):

$$\tilde{\mathcal{L}}_\gamma(w) = \tilde{\mathcal{L}}(w) + \gamma S, \quad (12)$$

where:

$$S = -\sum_{j=1}^{k} p(z = j \mid w, C_w) \log p(z = j \mid w, C_w). \quad (13)$$

The hyper-parameter γ, which we further call the entropy cost, controls the strength of the regularization and, in turn, the granularity of the learned sense representations. In particular, γ > 0 encourages the model to learn more coarse-grained sense representations, whereas γ < 0 increases the granularity of the learned senses.

Related work
Algorithms for learning distributed multi-sense representations of words have been a focus of several recent works. Initial approaches to this task relied on clustering word contexts. One of the first algorithms of this kind was proposed by Huang et al. (2012). They learn multi-sense word representations in three steps. First, they estimate vector representations of words using a feed-forward neural language model. Next, they calculate the average word vector for each context in the training corpus, cluster these context representations and relabel each word in the corpus to a word sense represented by the nearest cluster. Finally, they train the language model on the relabelled corpus and obtain vector representations for word senses. Neelakantan et al. (2014) proposed the Multi-Sense Skip-gram (MSSG) model, which jointly learns context cluster prototypes and word sense embeddings. Their model extends skip-gram by maintaining context clusters for every word in the vocabulary. Given a training example with a center word w and its context representation c, they infer the word sense for w by a hard assignment of the context representation c to the cluster with the nearest centroid. Afterwards, they perform a skip-gram-like update on the vector representation of the selected word sense and the output vectors of the context words. Neelakantan et al. also proposed a non-parametric version of MSSG (NP-MSSG), in which the number of clusters, and in turn the number of word senses, increases during training. They use a simple heuristic to determine the number of word senses: NP-MSSG allocates a new sense for the center word w when the similarity between the context representation c and the nearest cluster centroid falls below some predefined threshold. Neelakantan et al. demonstrated that MSSG and NP-MSSG outperform the Huang et al. algorithm on a contextual word similarity task.
A disadvantage of the Huang et al. and Neelakantan et al. algorithms is that they do not follow a principled statistical approach, but instead rely on hard clustering of context vectors. This has been addressed in more recent algorithms, which learn multi-sense word representations in a probabilistic framework. Concurrently with the work of Neelakantan et al., Tian et al. (2014) proposed a probabilistic Multi-Prototype Skip-Gram (MPSG) model. MPSG extends the skip-gram model by adding to each position in the input text a latent variable that encodes the index of the word sense at that position. Furthermore, for each word in the vocabulary MPSG maintains a fixed number of sense embedding vectors and a single output vector. These parameters define a softmax model for the conditional probability of observing a context word given the center word and the latent sense index. Finally, MPSG models the conditional probability of observing a context word given the center word with a mixture model whose components correspond to the senses of the center word. Tian et al. derived an expectation maximization algorithm for estimation of the softmax parameters and prior sense probabilities in their model. MPSG was evaluated in a contextual word similarity task, where its performance was similar to that of the Huang et al. algorithm. Bartunov et al. (2016) proposed the AdaGram model, which can be seen as a non-parametric version of MPSG. Similarly to MPSG, AdaGram introduces latent variables for word sense indexes in the input text. However, unlike MPSG, AdaGram does not assume a fixed number of word senses.
Instead, it defines the prior over word senses via a Dirichlet process. As a result, AdaGram automatically learns the number of senses for all words in the vocabulary. Unfortunately, defining the prior over word senses via a Dirichlet process gives an intractable model likelihood. Bartunov et al. therefore optimize a variational lower bound on the AdaGram model likelihood using a stochastic variational inference algorithm. Bartunov et al. evaluated AdaGram's performance on several word sense induction benchmarks. They demonstrated that AdaGram consistently outperforms the MSSG, NP-MSSG and MPSG models in these benchmarks. AdaGram has recently been extended to handle parallel multilingual text corpora (Upadhyay et al., 2017).
For the prediction of context words (Eq. 2) disambiguated skip-gram adopts the softmax model used in MPSG. However, in contrast to the previous works, disambiguated skip-gram learns a parametric model for the conditional probability distribution over senses of the center word given the context words (Eq. 4). This allows us to define the training objective for disambiguated skip-gram as the expected negative log-likelihood of observing the context words under the distribution of senses of the center word. We use a biased low-variance gradient estimator for this objective, which enables stable end-to-end training with backpropagation.
The main goal of the AdaGram model is to automatically discover the number of word senses for the vocabulary words. This does not mean, however, that the number of senses learned by AdaGram is independent of model hyper-parameters. On the contrary, the number of senses learned by AdaGram is directly controlled by the hyper-parameter α in the Dirichlet process used to define the prior over word meanings (Bartunov et al., 2016). Disambiguated skip-gram controls the number of learned senses by penalizing the entropy of the conditional probability distribution in the sense disambiguation model (Eq. 12). The entropy cost γ in this approach performs a function similar to the hyper-parameter α in AdaGram.
In addition to the works discussed above, word ambiguity was also modelled using topic models (Liu et al., 2015), large bi-directional language models (Peters et al., 2018) or subword information (Athiwaratkun et al., 2018). Also, Li and Jurafsky (2015) evaluated multi-sense embeddings in several downstream tasks. They found that multi-sense embeddings improve performance in tasks such as POS tagging or identification of semantic relations. They also identify downstream tasks which do not benefit from sense disambiguation. In sentiment analysis, for example, word sentiment usually does not depend on the inferred sense.

Experiments
We conducted a number of experiments to evaluate the quality of multi-sense word embeddings learned by disambiguated skip-gram. This section reports results from our evaluation. First, we report qualitative results from our model for several polysemous words. We then compare the performance of disambiguated skip-gram with several competing algorithms on a set of word sense induction tasks. Finally, we evaluate the effect of the entropy cost on the learned representations.
It is worth noting that the quality of multi-sense word embeddings was formerly assessed in contextual word similarity experiments. However, Bartunov et al. (2016) demonstrated that contextual word similarity experiments do not reflect the quality of multi-sense representations. In particular, the best performance in the contextual word similarity task is often achieved by the baseline skip-gram model, which does not recognize word senses. This can be attributed to the fact that the skip-gram objective directly optimizes similarity of vector representations of words that appear in similar contexts. Multi-sense models, on the other hand, solve a harder task: they disambiguate words in contexts and then model the similarities between the discovered senses. Bartunov et al. focus, therefore, on the performance of multi-sense embeddings in the word sense induction task. We adopt their evaluation methodology in this work.
We trained our disambiguated skip-gram models on the Westbury Lab Wikipedia corpus (Shaoul and Westbury, 2010). We optimized the models using mini-batch stochastic gradient descent with momentum.

Qualitative results
We begin our evaluation by presenting senses discovered by disambiguated skip-gram for several ambiguous words. For the demonstration we trained four 300-dimensional models with three sense embedding vectors allocated to each word and the entropy cost γ ranging from 0.0 to 0.5. For each sense embedding of the evaluated words we calculated the cosine similarity to the remaining words and selected the 5 nearest neighbours. The results are reported in Tab. 1.

Table 1: Nearest neighbors and marginal probabilities p of word sense embedding vectors discovered by the disambiguated skip-gram model for several ambiguous words. Sense embedding vectors with a marginal probability p < 0.05 are pruned from the learned model.
Disambiguated skip-gram discovered the main meanings of our test words. For example, the meanings discovered for the word fox correspond to a broadcasting company, an animal and a family name. The meanings discovered for the word mouse correspond to a cartoon character, a computer mouse and an animal.
Results in Tab. 1 also demonstrate that disambiguated skip-gram will often expose an internal structure in a word meaning, if that meaning appears in different contexts. For example, disambiguated skip-gram learned two embeddings for the word plant corresponding to its factory meaning: one related to heavy industry and one related to a farm or a plant nursery. This is a consequence of the fact that disambiguated skip-gram discovers word senses using only the information about the contexts in which these words occur. In particular, it does not employ any supervision from an external knowledge base. Bartunov et al. (2016) refer to a related phenomenon in the AdaGram embeddings as the semantic resolution of the model.

Word-sense induction experiments
To compare disambiguated skip-gram with state-of-the-art competing algorithms we assessed its performance in a set of word sense induction tasks. In this evaluation we follow the experimental setup from (Bartunov et al., 2016), allowing for direct comparison with the results reported therein. In particular, we evaluated disambiguated skip-gram on the datasets from the SemEval-2007 Task 2 competition (SE-2007) (Jurgens and Klapaftis, 2013) and followed the preprocessing steps reported in (Bartunov et al., 2016).
We use a simple procedure for resolving word senses. That is, we average all sense embedding vectors of all context words and select the sense z_w of the center word w whose embedding vector is most similar to the average vector:

$$z_w = \arg\max_{j \in \{1, \ldots, k\}} \frac{v_{wj}^\top \bar{c}_w}{\|v_{wj}\| \, \|\bar{c}_w\|}, \quad (14)$$

where:

$$\bar{c}_w = \frac{1}{k |C_w|} \sum_{c \in C_w} \sum_{s=1}^{k} v_{cs}. \quad (15)$$

The intuition behind this procedure is that we expect the average to preserve a shared component in the embedding vectors, namely embeddings for senses related to the sense of the center word, and cancel out embeddings of unrelated senses. In addition to averaging sense embedding vectors we also experimented with averaging output vectors of context words. However, this approach usually gave slightly worse results. Following Bartunov et al. (2016), we use the adjusted Rand index (Hubert and Arabie, 1985) to compare ground truth senses for a given word with the senses inferred from disambiguated skip-gram embeddings. The final performance on a benchmark task is the average of adjusted Rand index values over all test words in the task.
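The resolution rule above can be sketched as follows. A minimal illustration, assuming a sense embedding array `v` of shape (V, k, dim); the toy values at the bottom are made up to make the expected choice obvious.

```python
import numpy as np

def resolve_sense(v, w, context, k):
    """Pick the sense of w whose embedding is most similar (cosine)
    to the average of all sense embeddings of all context words."""
    c_bar = v[context].reshape(-1, v.shape[-1]).mean(axis=0)  # average c_bar_w
    sims = [v[w, j] @ c_bar /
            (np.linalg.norm(v[w, j]) * np.linalg.norm(c_bar))
            for j in range(k)]
    return int(np.argmax(sims))

# toy example: word 0 has two senses; the context word's sense vectors
# all point along the first axis, so sense 0 of word 0 should be selected
v = np.zeros((2, 2, 2))
v[0, 0] = [1.0, 0.0]   # sense 0 of word 0
v[0, 1] = [0.0, 1.0]   # sense 1 of word 0
v[1, 0] = [1.0, 0.0]   # context word 1, sense 0
v[1, 1] = [1.0, 0.0]   # context word 1, sense 1
```

Averaging over all k sense vectors of every context word avoids having to disambiguate the context words themselves before resolving the center word.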
For this evaluation we trained a 300-dimensional disambiguated skip-gram model with 5 sense embedding vectors allocated to each word and no entropy cost (γ = 0.0). The comparison between our model and MSSG, NP-MSSG, MPSG and AdaGram is reported in Tab. 2.

Table 2: Performance of multi-sense word embedding methods in word sense induction tasks. The reported performance metric is the adjusted Rand index averaged over all test words in the benchmark task. Results for all models except the disambiguated skip-gram (Disamb. skip-gram) are taken from (Bartunov et al., 2016).

Effect of the entropy cost
Results reported in Tab. 1 demonstrate that the entropy cost γ (Eq. 12) indeed allows for pruning senses learned by disambiguated skip-gram. In particular, when the entropy cost increases, disambiguated skip-gram allocates more of the marginal probability mass (Eq. 11) to the frequent meanings of the modelled words and, in effect, learns coarser representations.
For a quantitative evaluation of the effect of the entropy cost on the learned representations we trained 50-dimensional disambiguated skip-gram models with γ ranging from 0.0 to 1.0. All models allocate 5 sense embedding vectors to every word in the vocabulary. In Tab. 3 we report the average number of senses per word with marginal probability p ≥ 0.05, depending on the value of the entropy cost. In Fig. 2 we also report histograms of marginal probabilities for selected entropy cost values.

Histograms in Fig. 2 confirm our observation from the qualitative evaluation: when the entropy cost increases, disambiguated skip-gram learns more peaked distributions for the conditional sense probability p(z = j | w, C_w). This translates to coarser sense representations. In particular, the model with no entropy cost learned an average of 4.7 senses per word with marginal probability p ≥ 0.05 (Tab. 3). This number decreases with an increasing entropy cost, reaching an average of 2.5 senses per word for γ = 1.0.
We also evaluated 50- and 300-dimensional models with different entropy costs in the word sense induction tasks. In each case we pruned senses with marginal probability p < 0.05. Results from this evaluation (Tab. 4) indicate that the desired granularity of the learned sense representations depends on the underlying task. In the WWSI benchmark the best performing models had no entropy cost, while in the SemEval tasks a small entropy cost usually improved results. The results agree for both model dimensionalities.

Table 4: Performance of disambiguated skip-gram models with different entropy costs in the word sense induction tasks. The reported performance metric is the adjusted Rand index averaged over all test words in the benchmark task.

Conclusions
In this work we developed disambiguated skip-gram: a novel neural-probabilistic model for learning multi-sense distributed representations of words. Unlike previous probabilistic models for multi-sense word embeddings, disambiguated skip-gram is a simple feed-forward neural network and can be trained end-to-end with back-propagation. In experimental evaluation disambiguated skip-gram improved over the state-of-the-art results in three out of four benchmark datasets and ranked second in the fourth. Disambiguated skip-gram optimizes the expected log-likelihood of the context prediction model under the distribution of word senses parametrized by the disambiguation model. We chose to optimize this objective with a biased but low-variance gradient estimator. However, in parallel to this work there has been significant progress in gradient-based training of models with discrete latent variables. Specifically, Tucker et al. (2017) proposed an unbiased low-variance gradient estimator, called REBAR, that is applicable to models with categorical latent variables. REBAR may allow us to efficiently optimize the original disambiguated skip-gram objective (Eq. 6), instead of the relaxed objective (Eq. 10). This may further improve the quality of embeddings learned with our approach.