A Mixture Model for Learning Multi-Sense Word Embeddings

Word embeddings are now a standard technique for inducing meaning representations for words. To obtain good representations, it is important to take into account the different senses of a word. In this paper, we propose a mixture model for learning multi-sense word embeddings. Our model generalizes previous work in that it allows inducing different weights for the different senses of a word. Experimental results show that our model outperforms previous models on standard evaluation tasks.


Introduction
Word embeddings have been shown to be useful in various NLP tasks such as sentiment analysis, topic models, script learning, machine translation, sequence labeling and parsing (Sutskever et al., 2014; Modi and Titov, 2014; Nguyen et al., 2015a,b; Modi, 2016; Ma and Hovy, 2016; Nguyen et al., 2017; Modi et al., 2017). A word embedding captures the syntactic and semantic properties of a word by representing the word in the form of a real-valued vector (Mikolov et al., 2013a,b; Pennington et al., 2014; Levy and Goldberg, 2014).
However, word embedding models usually do not take lexical ambiguity into account. For example, the word bank is usually represented by a single vector for all of its senses, including sloping land and financial institution. Recently, approaches have been proposed to learn multi-sense word embeddings, where each sense of a word corresponds to a sense-specific embedding.
Reisinger and Mooney (2010), Huang et al. (2012) and Wu and Giles (2015) proposed methods that cluster the contexts of each word and then use the cluster centroids as vector representations for word senses. Neelakantan et al. (2014), Tian et al. (2014), Li and Jurafsky (2015) and  extended Word2Vec models (Mikolov et al., 2013a,b) to learn a vector representation for each sense of a word. , Iacobacci et al. (2015) and Flekova and Gurevych (2016) performed word sense induction using external resources (e.g., WordNet, BabelNet) and then learned sense embeddings using the Word2Vec models. Rothe and Schütze (2015) and Pilehvar and Collier (2016) presented methods that use pre-trained word embeddings to learn embeddings from WordNet synsets. , Liu et al. (2015b), Liu et al. (2015a) and Zhang and Zhong (2016) directly adopt the Word2Vec Skip-gram model (Mikolov et al., 2013b) for learning the embeddings of words and topics on a topic-assigned corpus.
One issue with these previous works is that they assign the same weight to every sense of a word. The central assumption of our work is that each sense of a word, given a context, should correspond to a mixture of weights reflecting the different degrees of association of the word with multiple senses in that context. The mixture weights help to model word meaning better.
In this paper, we propose a new model for learning Multi-Sense Word Embeddings (MSWE). Our MSWE model learns vector representations of a word based on a mixture of its sense representations. The key difference between MSWE and other models is that we induce the weights of senses while jointly learning the word and sense embeddings. Specifically, we train a topic model (Blei et al., 2003) to obtain the topic-to-word and document-to-topic probability distributions, which are then used to infer the weights of topics. We use these weights to define a compositional vector representation for each target word to predict its context words. MSWE thus differs from the topic-based models (Liu et al., 2015b,a; Zhang and Zhong, 2016) in that we do not use the topic assignments when jointly learning vector representations of words and topics. Here we not only learn vectors based on the most suitable topic of a word given its context, but we also take into consideration all possible meanings of the word.
The main contributions of our study are: (i) We introduce a mixture model for learning word and sense embeddings (MSWE) by inducing mixture weights of word senses. (ii) We show that MSWE performs better than the baseline Word2Vec Skip-gram and other embedding models on the word analogy task (Mikolov et al., 2013a) and the word similarity task (Reisinger and Mooney, 2010).

The mixture model
In this section, we present the mixture model for learning multi-sense word embeddings. Here we treat topics as senses. The model learns a representation for each word using a mixture of its topical representations.
Given a number of topics and a corpus $D$ of documents $d = \{w_{d,1}, w_{d,2}, ..., w_{d,M_d}\}$, we apply a topic model (Blei et al., 2003) to obtain the topic-to-word $\Pr(w \mid t)$ and document-to-topic $\Pr(t \mid d)$ probability distributions. We then infer a weight for the $m$-th word $w_{d,m}$ with topic $t$ in document $d$:
$$\lambda_{d,m,t} = \Pr(w_{d,m} \mid t) \times \Pr(t \mid d) \tag{1}$$
We define two MSWE variants: MSWE-1 learns vectors for words based on the most suitable topic given document $d$, while MSWE-2 marginalizes over all senses of a word to take into account all possible senses of the word:
$$\text{MSWE-1:} \quad s_{w_{d,m}} = \frac{v_{w_{d,m}} + \lambda_{d,m,t'} \times v_{t'}}{1 + \lambda_{d,m,t'}}$$
$$\text{MSWE-2:} \quad s_{w_{d,m}} = \frac{v_{w_{d,m}} + \sum_{t=1}^{T} \lambda_{d,m,t} \times v_{t}}{1 + \sum_{t=1}^{T} \lambda_{d,m,t}}$$
where $s_{w_{d,m}}$ is the compositional vector representation of the $m$-th word $w_{d,m}$ and the topics in document $d$; $v_w$ is the target vector representation of a word type $w$ in vocabulary $V$; $v_t$ is the vector representation of topic $t$; $T$ is the number of topics; $\lambda_{d,m,t}$ is defined as in Equation 1, and in MSWE-1 we define $t' = \arg\max_t \lambda_{d,m,t}$.
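The weighting and composition steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the weight is the product Pr(w|t) × Pr(t|d) and that the composed vector is a weighted sum normalized by one plus the total weight. The function names and toy inputs are hypothetical.

```python
def mixture_weights(p_word_given_topic, p_topic_given_doc):
    """Assumed weight per topic t: Pr(w | t) * Pr(t | d)."""
    return [pw * pt for pw, pt in zip(p_word_given_topic, p_topic_given_doc)]


def compose_mswe1(v_word, topic_vecs, lambdas):
    """MSWE-1 style: mix the word vector with the single highest-weight topic."""
    t_best = max(range(len(lambdas)), key=lambdas.__getitem__)
    lam = lambdas[t_best]
    return [(x + lam * y) / (1.0 + lam) for x, y in zip(v_word, topic_vecs[t_best])]


def compose_mswe2(v_word, topic_vecs, lambdas):
    """MSWE-2 style: mix the word vector with every topic vector."""
    mixed = list(v_word)
    for lam, v_topic in zip(lambdas, topic_vecs):
        for i, y in enumerate(v_topic):
            mixed[i] += lam * y
    total = sum(lambdas)
    return [x / (1.0 + total) for x in mixed]


# Toy example with T = 2 topics and 2-dimensional vectors (illustrative values).
topics = [[0.0, 1.0], [1.0, 1.0]]
lambdas = mixture_weights([0.5, 0.1], [0.4, 0.6])
print(compose_mswe1([1.0, 0.0], topics, lambdas))
print(compose_mswe2([1.0, 0.0], topics, lambdas))
```

Under these assumptions, setting every weight to zero returns the word vector unchanged, so the composed representation reduces to an ordinary single-vector word embedding.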
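We learn representations by minimizing the following negative log-likelihood function:
$$\mathcal{L} = -\sum_{d \in D} \sum_{m=1}^{M_d} \sum_{\substack{-k \le j \le k \\ j \ne 0}} \log \Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}}) \tag{2}$$
where the $m$-th word $w_{d,m}$ in document $d$ is a target word while the $(m+j)$-th word $w_{d,m+j}$ in document $d$ is a context word of $w_{d,m}$, and $k$ is the context size. In addition, $\tilde{v}_w$ is the context vector representation of the word type $w$. The probability $\Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}})$ is defined using the softmax function as follows:
$$\Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}}) = \frac{\exp(\tilde{v}_{w_{d,m+j}}^{\top} s_{w_{d,m}})}{\sum_{c' \in V} \exp(\tilde{v}_{c'}^{\top} s_{w_{d,m}})}$$
Because the normalization over the vocabulary $V$ is expensive to compute, the log-probability is approximated with the following negative-sampling objective (Mikolov et al., 2013b):
$$\log \Pr(\tilde{v}_{w_{d,m+j}} \mid s_{w_{d,m}}) \approx \log \sigma(\tilde{v}_{w_{d,m+j}}^{\top} s_{w_{d,m}}) + \sum_{i=1}^{K} \log \sigma(-\tilde{v}_{c_i}^{\top} s_{w_{d,m}}) \tag{3}$$
where $\sigma$ is the sigmoid function and each word $c_i$ is sampled from a noise distribution. In fact, MSWE can be viewed as a generalization of the well-known Word2Vec Skip-gram model with negative sampling (Mikolov et al., 2013b), which corresponds to the special case where all the mixture weights $\lambda_{d,m,t}$ are set to zero. The models are trained using Stochastic Gradient Descent (SGD).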
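To make the negative-sampling objective concrete, the following sketch scores one (target, context) pair. It assumes the usual Skip-gram form: the log-sigmoid of the dot product with the true context vector, plus the log-sigmoid of the negated dot product for each of K sampled noise words. The helper names and the toy vectors are hypothetical, and sampling from the noise distribution is left out.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neg_sampling_objective(s_target, v_context, noise_vecs):
    """Objective for one target/context pair; training maximizes this value.

    s_target   -- composed representation of the target word
    v_context  -- context vector of the observed context word
    noise_vecs -- context vectors of the K sampled noise words
    """
    positive = log_sigmoid(dot(v_context, s_target))
    negative = sum(log_sigmoid(-dot(v_noise, s_target)) for v_noise in noise_vecs)
    return positive + negative


# Toy example: an aligned context word and one opposed noise word score highly.
print(neg_sampling_objective([1.0, 0.0], [5.0, 0.0], [[-5.0, 0.0]]))
```

The value is always negative and approaches zero as the target agrees with its true context and disagrees with the noise words; an SGD step would follow the gradient of this quantity with respect to the word, topic and context vectors.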

Experiments
We evaluate MSWE on two different tasks: word similarity and word analogy. We also provide experimental results obtained by the baseline Word2Vec Skip-gram model and other previous works.
Note that not all previous results are mentioned in this paper for comparison, because the training corpora used in most previous work are much larger than ours (Li and Jurafsky, 2015; Schwartz et al., 2015; Levy et al., 2015). There are also differences in the pre-processing steps that could affect the results. We could improve our results by using a larger training corpus, but this is not the central point of our paper. The objective of our paper is to show that the embeddings of topics and words can be combined into a single mixture model, leading to good improvements as established empirically.

Experimental Setup
Following Huang et al. (2012) and Neelakantan et al. (2014), we use the Westbury Lab Wikipedia corpus (Shaoul and Westbury, 2010), containing over 2M articles with about 990M words, for training. In the pre-processing step, texts are lowercased and tokenized, numbers are mapped to 0, and punctuation marks are removed. We extract a vocabulary of the 200,000 most frequent word tokens from the pre-processed corpus. Words not occurring in the vocabulary are mapped to a special token UNK, and we use the embedding of UNK for unknown words in the benchmark datasets.
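The pre-processing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the exact tokenizer and the exact rule for mapping numbers to 0 are not specified in the text, so the regular expressions here are guesses, and the function names are hypothetical.

```python
import re
from collections import Counter

UNK = "UNK"


def preprocess(text):
    """Lowercase, map each number to 0, drop punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"\d+(?:[.,]\d+)*", "0", text)  # assumed number pattern
    text = re.sub(r"[^\w\s]", " ", text)          # assumed punctuation rule
    return text.split()


def build_vocab(tokens, size):
    """Keep the `size` most frequent tokens (200,000 in the setup above)."""
    return {word for word, _ in Counter(tokens).most_common(size)}


def map_unknown(tokens, vocab):
    """Replace out-of-vocabulary tokens with the special UNK token."""
    return [tok if tok in vocab else UNK for tok in tokens]


tokens = preprocess("The bank lent $1,000 in 2010!")
vocab = build_vocab(tokens, 3)
print(tokens)
print(map_unknown(tokens, vocab))
```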
We first use a small subset extracted from the WS353 dataset (Finkelstein et al., 2002) to tune the hyper-parameters of the baseline Word2Vec Skip-gram model for the word similarity task (see Section 3.2 for the task definition). We then directly use the tuned hyper-parameters for our MSWE variants. Vector size is also a hyper-parameter. While some approaches use a higher number of dimensions to obtain better results, we fix the vector size at 300, as used by the baseline, for a fair comparison. The vanilla Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003) does not scale to a very large corpus, so we explore faster online topic models developed for large corpora. We train the online LDA topic model (Hoffman et al., 2010) on the training corpus, and use the output of this topic model to compute the mixture weights as in Equation 1. We also use the same WS353 subset to tune the number of topics T ∈ {50, 100, 200, 300, 400}. We find that the most suitable numbers are T = 50 and T = 200, which we then use for all our experiments. Here we learn 300-dimensional embeddings with the fixed context size k = 5 (in Equation 2) and K = 10 (in Equation 3). For training, we randomly initialize model parameters (i.e., word and topic embeddings) and then learn them using SGD with an initial learning rate of 0.01.

Word Similarity
The word similarity task evaluates the quality of word embedding models (Reisinger and Mooney, 2010). For a given dataset of word pairs, the evaluation is done by calculating the correlation between the similarity scores of the corresponding word embedding pairs and the human judgment scores. A higher Spearman's rank correlation (ρ) reflects a better word embedding model. We evaluate MSWE on standard datasets (as given in Table 1) for the word similarity evaluation task.
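As a reference for how this evaluation metric behaves, Spearman's rank correlation can be computed as the Pearson correlation of the two rank lists, with tied values receiving their average rank. The sketch below uses only the standard library; the function names and the toy scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman_rho(model_scores, human_scores):
    """Spearman's rho between model similarities and human judgments."""
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


# Toy example: the model orders three word pairs exactly as the annotators do.
print(spearman_rho([0.1, 0.4, 0.9], [1.0, 5.0, 9.0]))
```

Because only the ranks matter, the model's cosine scores and the human ratings do not need to share a scale.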
Following Reisinger and Mooney (2010), Huang et al. (2012) and Neelakantan et al. (2014), we compute the similarity scores for a pair of words $(w, w')$ with or without their respective contexts $(c, c')$ as:
$$\mathrm{GlobalSim}(w, w') = \cos(v_w, v_{w'})$$
$$\mathrm{AvgSim}(w, w') = \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \cos(v_{w,t}, v_{w',t'})$$
$$\mathrm{AvgSimC}(w, w') = \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \delta(v_{w,t}, v_c) \times \delta(v_{w',t'}, v_{c'}) \times \cos(v_{w,t}, v_{w',t'})$$
where $v_w$ is the vector representation of the word $w$, $v_{w,t}$ is the multiple representation of the word $w$ and the topic $t$, and $v_c$ is the vector representation of the context $c$ of the word $w$. Here $\cos(v, v')$ is the cosine similarity between two vectors $v$ and $v'$. For our experiments, we set $v_{w,t} = v_w \oplus (\Pr(w \mid t) \times v_t)$ and $v_c = \frac{1}{|c|} \sum_{w \in c} v_w \oplus (\sum_{t} \Pr(t \mid c) \times v_t)$, in which $\oplus$ is the concatenation operation and $\Pr(t \mid c)$ is inferred from the topic models by considering context $c$ as a document. GlobalSim only regards word embeddings, while AvgSim considers multiple representations to capture different meanings (i.e., topics) and usages of a word. AvgSimC generalizes AvgSim by taking into account the likelihood $\delta(v_{w,t}, v_c)$ that word $w$ takes topic $t$ given context $c$. $\delta(v, v')$ is the inverse of the cosine distance from $v$ to $v'$ (Huang et al., 2012; Neelakantan et al., 2014).
Table 2: Spearman's rank correlation (ρ × 100) for the word similarity task when using GlobalSim. Subscripts 50 and 200 denote the online LDA topic model trained with T = 50 and T = 200 topics, respectively. ⋆ denotes that our best score is significantly higher than the score of the baseline (with p < 0.05, online toolkit from http://www.philippsinger.info/?p=347). Scores in bold and underline are the best and second best scores.
[...] words taking into account different meanings.
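The three similarity measures can be illustrated with a short sketch. It assumes AvgSim averages the cosine over all pairs of topic-specific representations and AvgSimC additionally weights each pair by δ for both words, with δ taken as 1 / (1 − cos). The small constant `eps`, which avoids division by zero for identical vectors, is an addition of this sketch, as are the function names and toy vectors.

```python
import math


def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sense_vector(v_word, p_word_given_topic, v_topic):
    """v_{w,t}: word vector concatenated with the Pr(w|t)-scaled topic vector."""
    return list(v_word) + [p_word_given_topic * x for x in v_topic]


def global_sim(v_w, v_w2):
    return cos(v_w, v_w2)


def avg_sim(senses_w, senses_w2):
    total = sum(cos(a, b) for a in senses_w for b in senses_w2)
    return total / (len(senses_w) * len(senses_w2))


def delta(v, v_context, eps=1e-8):
    """Inverse cosine distance; eps is an assumption to keep it finite."""
    return 1.0 / (1.0 - cos(v, v_context) + eps)


def avg_sim_c(senses_w, v_c, senses_w2, v_c2):
    total = sum(
        delta(a, v_c) * delta(b, v_c2) * cos(a, b)
        for a in senses_w
        for b in senses_w2
    )
    return total / (len(senses_w) * len(senses_w2))


# Toy example: two representations for w, one for w'.
senses_w = [[1.0, 0.0], [0.0, 1.0]]
senses_w2 = [[1.0, 0.0]]
print(avg_sim(senses_w, senses_w2))
print(avg_sim_c(senses_w, [1.0, 0.2], senses_w2, [1.0, 0.2]))
```

In the toy example, a context vector close to the first representation of w raises the AvgSimC score, because that representation also matches w'.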

Results for contextual word similarity
We evaluate our model MSWE by using AvgSim and AvgSimC on the benchmark SCWS dataset, which considers the effects of contextual information on the word similarity task. As shown in Table 3, MSWE scores better than the closely related model proposed by  and generally obtains good results for this context-sensitive dataset. Although we produce better scores than Neelakantan et al. (2014) and  when using GlobalSim, we are outperformed by them when using AvgSim and AvgSimC. Neelakantan et al. (2014) clustered the embeddings of the context words around each target word to predict its sense and  used pre-trained word embeddings to initialize vector representations of senses taken from WordNet, while we use a fixed number of topics as senses for words in MSWE.

Word Analogy
We evaluate the embedding models on the word analogy task introduced by Mikolov et al. (2013a). The task aims to answer questions in the form of "a is to b as c is to ?", denoted as "a : b → c : ?" (e.g., "Hanoi : Vietnam → Bern : ?"). There are 8,869 semantic and 10,675 syntactic questions grouped into 14 categories. Each question is answered by finding the most suitable word closest to $v_b - v_a + v_c$ as measured by the cosine similarity. The answer is correct only if the found closest word is exactly the same as the gold-standard (correct) one for the question.
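The answering procedure can be sketched as a nearest-neighbour search around the offset vector. The sketch below is illustrative only: the function names and the four-dimensional toy embeddings are hypothetical, and excluding the three question words from the candidates is an assumption borrowed from common practice rather than something stated above.

```python
import math


def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def answer_analogy(a, b, c, embeddings):
    """Return the word whose vector is closest (by cosine) to v_b - v_a + v_c."""
    target = [vb - va + vc for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    best_word, best_score = None, -math.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):  # assumption: question words are not candidates
            continue
        score = cos(target, vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def analogy_accuracy(questions, embeddings):
    """Fraction of (a, b, c, gold) questions answered with exactly the gold word."""
    correct = sum(answer_analogy(a, b, c, embeddings) == gold for a, b, c, gold in questions)
    return correct / len(questions)


# Toy embeddings; dimensions loosely mean (capital, country, Vietnam, Switzerland).
toy = {
    "hanoi": [1.0, 0.0, 1.0, 0.0],
    "vietnam": [0.0, 1.0, 1.0, 0.0],
    "bern": [1.0, 0.0, 0.0, 1.0],
    "switzerland": [0.0, 1.0, 0.0, 1.0],
    "france": [0.0, 1.0, 0.0, 0.0],
}
print(answer_analogy("hanoi", "vietnam", "bern", toy))
```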
We report accuracies in Table 4 and show that MSWE achieves better results than the baseline Word2Vec Skip-gram. In particular, MSWE reaches an accuracy of around 69.7%, which is higher than the accuracy of 68.6% obtained by Word2Vec Skip-gram.

Conclusions
In this paper, we described a mixture model for learning multi-sense embeddings. Our model induces mixture weights to represent a word, given its context, based on a mixture of its sense representations. The results show that our model scores better than Word2Vec and produces highly competitive results on the standard evaluation tasks. In future work, we will explore better methods for taking into account the contextual information. We also plan to explore different approaches to computing the mixture weights in our model. For example, if a large sense-annotated corpus is available for training, the mixture weights could be defined based on the frequency (sense-count) distributions, instead of using the probability distributions produced by a topic model. Furthermore, it is possible to treat the weights of senses as additional model parameters to be learned during training.