Word and Document Embedding with vMF-Mixture Priors on Context Word Vectors

Word embedding models typically learn two types of vectors: target word vectors and context word vectors. These vectors are normally learned such that they are predictive of some word co-occurrence statistic, but they are otherwise unconstrained. However, the words from a given language can be organized in various natural groupings, such as syntactic word classes (e.g. nouns, adjectives, verbs) and semantic themes (e.g. sports, politics, sentiment). Our hypothesis in this paper is that embedding models can be improved by explicitly imposing a cluster structure on the set of context word vectors. To this end, our model relies on the assumption that context word vectors are drawn from a mixture of von Mises-Fisher (vMF) distributions, where the parameters of this mixture distribution are jointly optimized with the word vectors. We show that this results in word vectors which are qualitatively different from those obtained with existing word embedding models. We furthermore show that our embedding model can also be used to learn high-quality document representations.


Introduction
Word embedding models are aimed at learning vector representations of word meaning (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017). These representations are primarily learned from co-occurrence statistics, where two words are represented by similar vectors if they tend to occur in similar linguistic contexts. Most models, such as Skip-gram (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014), learn two different vector representations w and w̃ for each word w, which we will refer to as the target word vector and the context word vector respectively. Apart from the constraint that w_i · w̃_j should reflect how often words w_i and w_j co-occur, these vectors are typically unconstrained.
As was shown in (Mu et al., 2018), after performing a particular linear transformation, the angular distribution of the word vectors obtained by standard models is essentially uniform. This isotropy property is convenient for studying word embeddings from a theoretical point of view (Arora et al., 2016), but it sits at odds with the fact that words can be organised in various natural groupings. For instance, we might expect words from the same part-of-speech class to be clustered together in the word embedding. Similarly, we might expect that organising word vectors in clusters that represent semantic themes would also be beneficial. In fact, a number of approaches have already been proposed that use external knowledge for imposing such a cluster structure, capturing the intuition that words which belong to the same category should be represented by similar vectors (Xu et al., 2014; Guo et al., 2015; Hu et al., 2015; Li et al., 2016c) or be located in a low-dimensional subspace (Jameel and Schockaert, 2016). Such models tend to outperform standard word embedding models, but it is unclear whether this is only because they can take advantage of external knowledge, or whether imposing a cluster structure on the word vectors is itself also inherently useful.
In this paper, we propose a word embedding model which explicitly aims to learn context vectors that are organised in clusters. Note that unlike the aforementioned works, our method does not rely on any external knowledge. We simply impose the requirement that context word vectors should be clustered, without prescribing how these clusters should be defined. To this end, we extend the GloVe model by imposing a prior on the context word vectors. This prior takes the form of a mixture of von Mises-Fisher (vMF) distributions, which is a natural choice for modelling clusters in directional data (Banerjee et al., 2005).
We show that this results in word vectors that are qualitatively different from those obtained using existing models, significantly outperforming them in syntax-oriented evaluations. Moreover, we show that the same model can be used for learning document embeddings, simply by viewing the words that appear in a given document as context words. We show that the vMF distributions in that case correspond to semantically coherent topics, and that the resulting document vectors outperform those obtained with existing topic modelling strategies.

Related Work
A large number of works have proposed techniques for improving word embeddings based on external lexical knowledge. Many of these approaches are focused on external knowledge about word similarity (Yu and Dredze, 2014; Faruqui et al., 2015; Mrksic et al., 2016), although some approaches for incorporating categorical knowledge have been studied as well, as already mentioned in the introduction. What is different about our approach is that we do not rely on any external knowledge. We essentially impose the constraint that some category structure has to exist, without specifying what these categories look like.
The view that the words which occur in a given document collection have a natural cluster structure is central to topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its non-parametric counterpart, the Hierarchical Dirichlet Process (HDP) (Teh et al., 2005), which automatically discovers the number of latent topics based on the characteristics of the data.
In recent years, several approaches have been proposed that combine the intuitions underlying topic models with word embeddings. For example, in (Das et al., 2015) it was proposed to replace the usual representation of topics as multinomial distributions over words by Gaussian distributions over a pre-trained word embedding, while (Batmanghelich et al., 2016) and (Li et al., 2016b) used von Mises-Fisher distributions for this purpose. Note that documents are still modelled as multinomial distributions over topics in these models. In (He et al., 2017) the opposite approach is taken: documents and topics are represented as vectors, with the aim of modelling topic correlations in an efficient way, while each topic is represented as a multinomial distribution over words. In this paper, we take a different approach for learning document vectors, by not considering any document-specific topic distribution. This allows us to represent document vectors and (context) word vectors in the same space and, as we will see, leads to improved empirical results.
Apart from using pre-trained word embeddings for improving topic representations, a number of approaches have also been proposed that use topic models for learning word vectors. For example, (Liu et al., 2015b) first uses the standard LDA model to learn a latent topic assignment for each word occurrence. These assignments are then used to learn vector representations of words and topics. Some extensions of this model have been proposed which jointly learn the topic-specific word vectors and the latent topic assignments (Li et al., 2016a; Shi et al., 2017). The main motivation for these works is to learn topic-specific word representations. They are thus similar in spirit to multi-prototype word embeddings, which aim to learn sense-specific word vectors (Neelakantan et al., 2014). Our method is clearly different from these works, as our focus is on learning standard word vectors (as well as document vectors).
Regarding word embeddings more generally, attention has recently shifted towards contextualized word embeddings based on neural language models (Peters et al., 2018). Such contextualized word embeddings serve a broadly similar purpose as the aforementioned topic-specific word vectors, but with far better empirical performance. Despite their recent popularity, however, it is worth emphasizing that state-of-the-art methods such as ELMo (Peters et al., 2018) rely on a concatenation of the output vectors of a neural language model with standard word vectors. For this reason, among others, the problem of learning standard word vectors remains an important research topic.

Model Description
The GloVe model (Pennington et al., 2014) learns for each word w a target word vector w and a context word vector w̃ by minimizing the following objective:

J = \sum_{i,j} f(x_{ij}) \left( w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log x_{ij} \right)^2

where x_{ij} is the number of times w_i and w_j co-occur in the given corpus, b_i and \tilde{b}_j are bias terms, and f(x_{ij}) is a weighting function aimed at reducing the impact of sparse co-occurrence counts. It is easy to see that this objective is equivalent to maximizing the following likelihood function:

P(D \mid \Omega) = \prod_{i,j} \mathcal{N}\left( \log x_{ij} ; \mu_{ij}, \sigma^2 \right)^{f(x_{ij})}

where \sigma^2 > 0 can be chosen arbitrarily, \mathcal{N} denotes the Normal distribution, and \mu_{ij} = w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j. Furthermore, D denotes the given corpus and \Omega refers to the set of parameters learned by the word embedding model, i.e. the word vectors w_i and \tilde{w}_j and the bias terms.
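To make the objective concrete, the weighted least-squares loss above can be sketched in a few lines of NumPy. This is an illustrative dense version only (real implementations iterate over the non-zero x_ij); the x_max and alpha defaults for the weighting function f follow the GloVe paper.

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective.

    W, W_tilde: (V, d) target and context word vectors.
    b, b_tilde: (V,) bias terms.
    X: (V, V) co-occurrence counts x_ij.
    """
    # f(x) caps the influence of very frequent co-occurrences
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)
    # log x_ij, guarded so zero counts do not produce -inf
    log_X = np.log(np.where(X > 0, X, 1.0))
    residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - log_X
    # only observed co-occurrences (x_ij > 0) contribute to the loss
    return np.sum(f * (X > 0) * residual ** 2)
```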
The advantage of this probabilistic formulation is that it allows us to introduce priors on the parameters of the model. This strategy was recently used in the WeMAP model (Jameel et al., 2019) to replace the constant variance \sigma^2 by a variance \sigma_j^2 that depends on the context word. In this paper, however, we will use priors on the parameters of the word embedding model itself. Specifically, we will impose a prior on the context word vectors w̃, i.e. we will maximize:

P(D \mid \Omega) \cdot \prod_i P(\tilde{w}_i)

Essentially, we want the prior P(\tilde{w}_i) to model the assumption that context word vectors are clustered. To this end, we use a mixture of von Mises-Fisher distributions. To describe this distribution, we begin with a single von Mises-Fisher (vMF) distribution (Mardia and Jupp, 2009; Hornik and Grün, 2014), which is a distribution over unit vectors in R^d that depends on a parameter \theta \in R^d, where d denotes the dimensionality of the word vectors. The vMF density for x \in S^d (with S^d the d-dimensional unit hypersphere) is given by:

f(x ; \theta) = \frac{e^{\theta \cdot x}}{\int_{S^d} e^{\theta \cdot z} \, dz}

where the denominator can be expressed in terms of the confluent hypergeometric function. Note, however, that we will not need to evaluate this denominator, as it simply acts as a scaling factor. The normalized vector \theta / \|\theta\|, for \theta \neq 0, is the mean direction of the distribution, while \|\theta\| is known as the concentration parameter. To estimate the parameter \theta from a given set of samples, we can use maximum likelihood (Hornik and Grün, 2014).
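As a minimal illustration (assuming the common reparameterization \theta = \kappa \mu with \|\mu\| = 1), the unnormalized log-density and the standard closed-form approximation to the maximum-likelihood estimate due to Banerjee et al. (2005) can be sketched as:

```python
import numpy as np

def vmf_log_density_unnorm(x, theta):
    """Unnormalized vMF log-density: log f(x; theta) = theta . x + const."""
    return x @ theta

def vmf_mle(samples):
    """Approximate MLE for a single vMF from unit-vector samples of shape (n, d).

    Uses the closed-form approximation of Banerjee et al. (2005):
    mu = mean direction, kappa ~= r (d - r^2) / (1 - r^2),
    where r is the norm of the sample mean.
    """
    n, d = samples.shape
    s = samples.sum(axis=0)
    r = np.linalg.norm(s) / n
    mu = s / np.linalg.norm(s)
    kappa = r * (d - r ** 2) / (1 - r ** 2)
    return mu, kappa
```

The more tightly the samples concentrate around one direction, the closer r gets to 1 and the larger the estimated concentration becomes.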
A finite mixture of vMFs, which we denote as movMF, is a distribution on the unit hypersphere of the following form (x \in S^d):

f(x ; \Theta) = \sum_{k=1}^{K} \psi_k \, f(x ; \theta_k)

where K is the number of mixture components, \psi_k \geq 0 for each k, \sum_k \psi_k = 1, and \Theta = (\theta_1, ..., \theta_K). The parameters of this movMF distribution can be computed using the Expectation-Maximization (EM) algorithm (Banerjee et al., 2005; Hornik and Grün, 2014).
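The EM procedure can be sketched as follows. This is a simplified illustration (random initialization from data points, the Banerjee et al. approximation for the concentration update, and small numerical guards), not the exact estimator of Hornik and Grün (2014):

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function

def log_vmf_const(kappa, d):
    """Log normalizer of a d-dimensional vMF with concentration kappa.
    Uses ive for numerical stability: log I_v(k) = log ive(v, k) + k."""
    v = d / 2 - 1
    return v * np.log(kappa) - (d / 2) * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)

def movmf_em(X, K, n_iter=25, seed=0):
    """Fit a K-component movMF to unit vectors X of shape (n, d) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mus = X[rng.choice(n, size=K, replace=False)]  # init mean directions from data
    kappas = np.ones(K)
    psi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        log_p = X @ (kappas[:, None] * mus).T + log_vmf_const(kappas, d) + np.log(psi)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: mixture weights, mean directions, approximate concentrations
        psi = np.maximum(gamma.mean(axis=0), 1e-12)
        S = gamma.T @ X                              # (K, d) weighted sums
        norms = np.linalg.norm(S, axis=1) + 1e-12
        mus = S / norms[:, None]
        r = np.minimum(norms / (n * psi), 1 - 1e-6)  # mean resultant length
        kappas = r * (d - r ** 2) / (1 - r ** 2)
    return psi, mus, kappas
```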
Note that movMF is a distribution on unit vectors, whereas context word vectors should not be normalized. We therefore define the prior on context word vectors as follows:

P(\tilde{w}) \propto \text{movMF}\left( \tilde{w} / \|\tilde{w}\| ; \Theta \right)

Furthermore, we use L2 regularization to constrain the norm \|\tilde{w}\|. We will refer to our model as CvMF.
In the experiments, following (Jameel et al., 2019), we will also consider a variant of our model in which we use a context-word specific variance \sigma_j^2. In that case, we maximize the following:

P(D \mid \Omega) \cdot \prod_i P(\tilde{w}_i) \cdot \prod_j P(\sigma_j^2)

where P(\sigma_j^2) is modelled as an inverse-gamma distribution (NIG). Note that in this variant we do not use the weighting function f(x_{ij}), as this was found to be unnecessary when using a context-word specific variance \sigma_j^2 in (Jameel et al., 2019). We will refer to this variant as CvMF(NIG).
Document embedding. The model described above can also be used to learn document embeddings. To this end, the target word vectors are simply replaced by document vectors, and the counts x_{ij} then reflect how often word j occurs in document i. Below we will experimentally compare this strategy with existing methods for learning document representations, focusing especially on approaches that are inspired by probabilistic topic models. Indeed, we can intuitively think of the vMF mixture components in our model as representing topics. While there have already been topic models that use vMF distributions in this way (Batmanghelich et al., 2016; Li et al., 2016b), our approach is different because we do not consider a document-level topic distribution, and because we do not rely on pre-trained word embeddings.
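For illustration, the document-word counts x_{ij} used in this variant can be built directly from tokenized documents. This is a minimal sketch with naive whitespace tokenization; the actual experiments use the preprocessing described in the experiments section.

```python
from collections import Counter

def doc_word_matrix(docs):
    """Build the count matrix x_ij, where row i corresponds to a document
    (taking the place of a target word) and column j to a vocabulary word."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in tokenized]
    for i, toks in enumerate(tokenized):
        for w, c in Counter(toks).items():
            X[i][index[w]] = c
    return X, vocab
```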

Experiments
In this section we assess the potential of our model both for learning word embeddings (Section 4.1) and for learning document embeddings (Section 4.2). Our implementation, along with trained vectors, is available online.

Word Embedding Results
In this section, we describe the word embedding results, where we directly compare our model with the following baselines: GloVe (Pennington et al., 2014), Skip-gram (Mikolov et al., 2013b) (denoted as SG), Continuous Bag-of-Words (Mikolov et al., 2013b) (denoted as CBOW), and the recently proposed WeMAP model (Jameel et al., 2019). We have used the Wikipedia dataset which was shared by Jameel et al. (2019), using the same vocabulary and preprocessing strategy. We report results for 300-dimensional word vectors, and we use K = 3000 mixture components for our model. As evaluation tasks, we use standard word analogy and similarity benchmarks.
Analogy. Table 1 shows word analogy results for three datasets. First, we show results for the Google analogy dataset (Mikolov et al., 2013a), which is available from the GloVe project and covers a mix of semantic and syntactic relations. These results are shown separately in Table 1 as Gsem and Gsyn respectively. Second, we consider the Microsoft syntactic word analogy dataset, which only covers syntactic relations and is referred to as MSR. Finally, we show results for the BATS analogy dataset, which covers four categories of relations: inflectional morphology (IM), derivational morphology (DM), encyclopedic semantics (ES) and lexicographic semantics (LS). The results in Table 1 clearly show that our model behaves substantially differently from the baselines: for the syntactic/morphological relationships (Gsyn, MSR, IM, DM), our model outperforms the baselines in a very substantial way. On the other hand, for the remaining, semantically oriented categories, the performance is less strong, with particularly weak results for Gsem. For ES and LS, it needs to be emphasized that the results are weak for all models, which is partially due to a relatively high number of out-of-vocabulary words. In Figure 1 we show the impact of the number of mixture components K on the performance for Gsem and Gsyn (for the NIG variant). This shows that the under-performance on Gsem is not due to the choice of K. Among others, we can also see that a relatively high number of mixture components is needed to achieve the best results.

Word similarity. The word similarity results are shown in Table 2, where we have considered the same datasets as Jameel et al. (2019). Note that we have removed multi-word expressions from the CA-660 dataset and consider only unigrams, which reduces this dataset to 484 records. In most of these datasets, our model does not outperform the baselines, which is to be expected given the conclusion from the analogy task that our model seems specialized towards capturing morphological and syntactic features. Interestingly, however, in the RW and CA-660 datasets, which focus on rare words, our model performs clearly better than the baselines. Intuitively, we may indeed expect that the use of a prior on the context words acts as a form of smoothing, which can improve the representation of rare words.
Qualitative analysis. To better understand how our model differs from standard word embeddings, Table 3 shows the ten nearest neighbors (Al-Rfou et al., 2013) for a number of words according to our CvMF(NIG) model and according to the GloVe model. What can clearly be seen is that our model favors words that are of the same kind. For instance, the top 5 neighbours of fastest are all speed-related adjectives. As another example, the top 7 neighbors of red are colors. To further explore the impact of our model on rare words, Table 4 shows the nearest neighbors for some low-frequency terms. These examples clearly suggest that our model captures the meaning of these words in a better way than the GloVe model. For example, the top neighbors of casio are highly relevant terms such as notebook and compute, whereas the neighbors obtained with the GloVe model seem largely unrelated. For comparison, Table 5 shows the nearest neighbors of some high-frequency terms. In this case we can see that the GloVe model obtains the best results, as e.g. moreover is found as a neighbor of neural for our model, and indeed is found as a neighbor of clouds. This supports the results from the similarity benchmarks that our model performs better than standard methods at modelling rare words but worse at modelling frequent words. Finally, Table 6 shows the effect that our model can have on ambiguous words, where due to the use of the prior, a different dominant sense is found.
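The nearest-neighbor lists in Tables 3-6 are based on similarity in the embedding space; a cosine-similarity version can be sketched as follows (a hypothetical helper for illustration, not the authors' code):

```python
import numpy as np

def nearest_neighbors(query, words, vectors, k=10):
    """Return the k words closest to `query` by cosine similarity.

    words: list of vocabulary words; vectors: (V, d) embedding matrix
    whose rows are aligned with `words`.
    """
    # normalize rows so that dot products equal cosine similarities
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = V[words.index(query)]
    sims = V @ q
    order = np.argsort(-sims)  # most similar first
    return [words[i] for i in order if words[i] != query][:k]
```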

Document Embedding Results
To evaluate the document embeddings, we focus on two downstream applications: categorization and document retrieval. As an intrinsic evaluation, we also evaluate the semantic coherence of the topics identified by our model.

Document Categorization. We have evaluated our document embeddings on four standard document classification benchmarks: 1) 20 Newsgroups (20NG), 2) OHSUMED-23 (OHS), 3) TechTC-300 (TechTC), and 4) Reuters-21578 (Reu). As baselines, we consider the following approaches: 1) a TF-IDF weighted bag-of-words representation, 2) LDA, 3) HDP, 4) the mixture-of-vMF clustering model (movMF), 5) Gaussian LDA (GLDA), 6) the spherical HDP model (sHDP), 7) GloVe (Pennington et al., 2014), 8) WeMAP (Jameel et al., 2019), and 9) the Skip-gram (SG) and Continuous Bag-of-Words (CBOW) (Mikolov et al., 2013b) models. In the case of the word embedding models, we create document vectors in the same way as we do for our model, by simply replacing the role of target word vectors with document vectors. In all the datasets, we removed punctuation and non-ASCII characters. We then segmented the sentences using Perl. In all models, parameters were tuned based on a development dataset. To this end, we randomly split our dataset into 60% training, 20% development and 20% testing. We report the results in terms of F1 score on the test set, using the Perf tool. The trained document vectors were used as input to a linear SVM classifier whose trade-off parameter C was tuned from a pool of {10, 50, 100}, which is a common setting in document classification tasks. Note that our experimental setup is inherently different from those setups where a word embedding model is evaluated on the text classification task using deep neural networks, as our focus is on methods that learn document vectors in an unsupervised way. We have therefore adopted a setting where document vectors are used as the input to an SVM classifier.
In our model, we have set the number of word embedding iterations to 50. The parameters of the vMF mixture model were re-computed after every 5 word embedding iterations. We tuned the dimensionality of the embedding from the pool {100, 150, 200} and the number of vMF mixture components from the pool {200, 500, 800}.
We used the default document-topic and word-topic priors in the LDA and HDP topic models. For the LDA model, we tuned the number of topics from the pool {50, 80, 100}, and the number of iterations of the sampler was set to 1000. We also verified in initial experiments that using more than 100 topics did not lead to better performance on the development data. The number of vMF mixtures of the comparative method, movMF, was tuned from the pool {200, 500, 800}. For GLDA, as in the original paper, we have used word vectors that were pre-trained using Skip-gram on the English Wikipedia. We have tuned the word vector size and the number of topics from the pools {100, 150, 200} and {50, 80, 100} respectively. The number of iterations of the sampler was again set to 1000. We have used the same pre-trained word embeddings for sHDP, where again the number of dimensions was automatically tuned. Interestingly, this model also uses von Mises-Fisher mixtures, but relies on a pre-trained word embedding.
Document Retrieval. Next we describe our document retrieval experiments. Specifically, we consider this problem as a learning-to-rank (LTR) task and use standard information retrieval (IR) tools to present our evaluation results.
We have adopted the same preprocessing strategy as for the categorization task, with the exception of OHSUMED, for which suitable LTR features are already given. For all other datasets we used the Terrier LTR framework to generate the six standard LTR document features as described in (Jameel et al., 2015). The document vectors were then concatenated with these six features. To perform the actual retrieval experiment, we used RankLib with a RankNet (Burges et al., 2005) model. Our results are reported in terms of NDCG@10, which is a common evaluation metric for this setting.
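For reference, NDCG@10 for a single ranked list of graded relevance labels can be computed as follows. This sketch uses the common exponential-gain formulation; it is an illustration, not the exact RankLib implementation:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: DCG of the given ranking divided by the DCG
    of the ideal (descending-relevance) ranking of the same labels."""
    def dcg(rels):
        # gain (2^r - 1) discounted by log2 of the 1-based rank + 1
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```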
Our training strategy is mostly the same as for the document categorization experiments, although for some parameters, such as the number of topics and vMF mixture components, we used larger values, reflecting the fact that the collections used in this experiment are substantially larger and tend to be more diverse (Wei and Croft, 2006). In particular, the word vector dimensionalities were chosen from a pool of {150, 200, 300} and the number of vMF mixtures from a pool of {300, 1000, 3000}. In the LDA model, we selected the number of topics from a pool of {100, 150, 200}. For GLDA we have used the same pool for the number of topics. All our results are reported for five-fold cross validation, where the parameters of the LTR model were automatically tuned, which is a common LTR experimental setting (Liu et al., 2015a).
The results are presented in Table 8, showing that our model consistently outperforms all other methods. Among the baselines, WeMAP achieves the best performance in this case, which is remarkable, as it is also a word embedding model.
Word Coherence. In traditional topic models such as LDA, the topics are typically labelled by the k words that have the highest probability in the topic. These words tend to reflect semantically coherent themes, which is an important reason for the popularity of topic models. Accordingly, measuring the coherence of the top-k words that are identified by a given topic model, for each topic, is a common evaluation measure (Shi et al., 2017). Using the configurations that performed best on the tuning data in the document categorization task above, we used Gensim (Řehůřek and Sojka, 2010) to compute the coherence of the top-20 words using the c_v metric (Röder et al., 2015). For our model, GLDA and sHDP, the mixture components that were learned were considered as topics for this experiment. For GloVe, WeMAP, SG, and CBOW, we used the von Mises-Fisher (vMF) soft clustering model (Banerjee et al., 2005) to determine the cluster memberships of the context words. For the TF-IDF results, we instead used hard vMF clustering (Hornik and Grün, 2014), as the movMF results are based on TF-IDF features as well. We tuned the number of clusters using the tuning data. The top-20 words after applying the clustering model were then selected based on the distance from the cluster centroid.
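The c_v metric combines sliding-window co-occurrence statistics with an indirect cosine confirmation measure; as a simpler illustration of the same idea, a document-level NPMI coherence over the top-k words of a topic can be sketched as follows (a simplified stand-in, not the metric used in the experiments):

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents, eps=1e-12):
    """Average normalized PMI over all pairs of top words, with
    probabilities estimated from document-level co-occurrence counts."""
    n = len(documents)
    docsets = [set(d) for d in documents]
    def p(*ws):
        # fraction of documents containing all of the given words
        return sum(all(w in ds for w in ws) for ds in docsets) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: minimal coherence
        else:
            pmi = math.log(p12 / (p(w1) * p(w2)))
            scores.append(pmi / (-math.log(p12) + eps))
    return sum(scores) / len(scores)
```

A topic whose top words tend to appear in the same documents scores close to 1, while a topic of unrelated words scores close to -1.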
The results in Table 9 show that the word clusters defined by our mixture components are more semantically coherent than the topics obtained by the other methods.

Conclusions
In this paper, we analyzed the effect of adding a prior to the GloVe word embedding model, encoding the intuition that words can be organized in various natural groupings. Somewhat surprisingly, perhaps, this leads to a word embedding model which behaves substantially differently from existing methods. Most notably, our model substantially outperforms standard word embedding models in analogy tasks that focus on syntactic/morphological relations, although this comes at the cost of lower performance in semantically oriented tasks such as measuring word similarity. We also found that the model performs better than standard word embedding models when it comes to modelling rare words.
Word embedding models can also be used to learn document embeddings, by replacing word-word co-occurrences by document-word co-occurrences. This allowed us to compare our model with existing approaches that use von Mises-Fisher distributions for document modelling. In contrast to our method, these models are based on topic models (e.g. they typically model documents as a multinomial distribution over topics). Surprisingly, we found that the document representations learned by our model outperform these topic-modelling-based approaches, even those that rely on pre-trained word embeddings and thus have an added advantage, considering that our model in this setting is only learned from the (often relatively small) given document collection. This finding puts into question the value of document-level topic distributions, which are used by many document embedding methods (being inspired by topic models such as LDA).

Figure 1 :
Accuracy vs. number of vMF mixtures on the Google word analogy dataset for our model.

Table 1 :
Word analogy accuracy results on different datasets.

Table 2 :
Word similarity results. We refer to EN-RW-Stanford as Stanf, EN-SIMLEX-999 as LEX, SimVerb3500 as Verb, EN-MTurk771 as Tr771, EN-MTurk287 as Tr287, EN-MENTR3K as TR3k, the RareWords dataset as RW, and the recently introduced Card-660 rare words dataset (Pilehvar et al., 2018) as CA-660.

Table 3 :
Nearest neighbors for selected words.

Table 4 :
Nearest neighbors for low-frequency words.

Table 5 :
Nearest neighbors for high-frequency words.

Table 6 :
Nearest neighbors for ambiguous words.
Table 7 summarizes our document classification results. It can be seen that our model outperforms all baselines, except for the TechTC dataset, where the results are very close. Among the baselines, sHDP achieves the best performance.

Table 9 :
Word coherence results (c_v) computed using Gensim.