Word2Sense: Sparse Interpretable Word Embeddings

We present an unsupervised method to generate Word2Sense word embeddings that are interpretable: each dimension of the embedding space corresponds to a fine-grained sense, and the non-negative value of the embedding along the j-th dimension represents the relevance of the j-th sense to the word. The underlying LDA-based generative model can be extended to refine the representation of a polysemous word in a short context, allowing us to use the embeddings in contextual tasks. On computational NLP tasks, Word2Sense embeddings compare well with other word embeddings generated by unsupervised methods. Across tasks such as word similarity, entailment, sense induction, and contextual interpretation, Word2Sense is competitive with the state-of-the-art method for each task. Word2Sense embeddings are at least as sparse and fast to compute as prior art.


Introduction
Several unsupervised methods such as SkipGram (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) have demonstrated that co-occurrence data from large corpora can be used to compute low-dimensional representations of words (a.k.a. embeddings) that are useful in computational NLP tasks. While not as accurate as semi-supervised methods such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018) that are trained on various downstream tasks, they do not require massive amounts of compute inaccessible to all but a few.
Nearly all such methods produce dense representations for words whose coordinates in themselves have no meaningful interpretation. The numerical values of a word's embedding are meaningful only in relation to representations of other words. A unitary rotation can be applied to many of these embeddings retaining their utility for computational tasks, and yet completely changing the values of individual coordinates. Can we design an interpretable embedding whose coordinates have a clear meaning to humans?
Ideally such an embedding would capture the multiple senses of a word, while being effective at computational tasks that use inter-word spacing of embeddings. Loosely, a sense is a set of semantically similar words that collectively evoke a bigger picture in the reader's mind than the individual words do. In this work, we mathematically define a sense to be a probability distribution over the vocabulary, just like topics in topic models. A human can relate to a sense through the words with maximum probability in the sense's distribution. Table 1 presents the top 10 words for a few senses.
We describe precisely such an embedding of words in a space where each dimension corresponds to a sense. Words are represented as probability distributions over senses, so that the magnitude of each coordinate represents the relative importance of the corresponding sense to the word. Such embeddings naturally capture the polysemous nature of words. For instance, the embedding for a word such as cell, with its many senses (e.g. "biological entity", "mobile phones", "excel sheet", "blocks", "prison" and "battery"; see Table 1), will have support over all such senses.
To recover senses from a corpus and to represent word embeddings as (sparse) probability distributions over senses, we propose a generative model (Figure 1) for the co-occurrence matrix: (1) associate with each word w a sense distribution θ_w with a Dirichlet prior; (2) form a context around a target word w by sampling senses z according to θ_w, and sampling words from the distribution of sense z. This allows us to use fast inference tools such as WarpLDA (Chen et al., 2016) to recover a few thousand fine-grained senses from large corpora and construct the embeddings.
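As an illustration, the two generative steps can be sketched in a few lines. All sizes and priors below are toy values chosen for the example, not the settings used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration): vocabulary, senses, half-window.
V, K, n = 50, 4, 5

alpha, gamma = 0.1, 0.001
beta = rng.dirichlet(np.full(V, gamma), size=K)   # beta[z]: word distribution of sense z
theta = rng.dirichlet(np.full(K, alpha), size=V)  # theta[w]: sense distribution of word w

def sample_context(w):
    """Generate the 2n context tokens around target word w per the model:
    draw a sense z ~ theta_w for each slot, then a token ~ beta_z."""
    senses = rng.choice(K, size=2 * n, p=theta[w])
    return [rng.choice(V, p=beta[z]) for z in senses]

ctx = sample_context(w=0)
```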
Word2Sense embeddings are extremely sparse despite residing in a higher-dimensional space (a few thousand dimensions): the number of nonzeros in an embedding is no more than 100. In comparison, Word2vec performs best on most tasks when computed in 500 dimensions.
These sparse single prototype embeddings effectively capture the senses a word can take in the corpus; they can outperform probabilistic embeddings (Athiwaratkun and Wilson, 2017) at tasks such as word entailment, and compete with Word2vec embeddings and multi-prototype embeddings (Neelakantan et al., 2015) on similarity and relatedness tasks.
Unlike prior work such as Word2vec and GloVe, our generative model has a natural extension for disambiguating the senses of a polysemous word in a short context. This allows refining the embedding of a polysemous word into a WordCtx2Sense embedding that better reflects the senses of the word relevant in the context. This is useful for tasks such as Stanford contextual word similarity (Huang et al., 2012) and word sense induction (Manandhar et al., 2010).
Our methodology does not suffer from the computational constraints of Word2GM (Athiwaratkun and Wilson, 2017) and MSSG (Neelakantan et al., 2015), which are limited to learning 2-3 senses per word. The key idea that gives us this advantage is that rather than constructing a per-word representation of senses, we construct a global pool of senses from which the senses a word takes in the corpus are inferred. Our methodology takes just 5 hours on one multicore processor to recover senses and embeddings from the concatenation of the UKWAC (2.5B tokens) and Wackypedia (1B tokens) co-occurrence matrices (Baroni et al., 2009), with a vocabulary of 255,434 words that occur at least 100 times.
Our major contributions include: • A single prototype word embedding that encodes information about the senses a word takes in the training corpus in a human-interpretable way. This embedding outperforms Word2vec on the rare-word similarity task and the word relatedness task and is within 2% on other similarity and relatedness tasks; it also outperforms Word2GM on the entailment task of (Baroni et al., 2012). • A generative model that allows for disambiguating the sense of a polysemous word in a short context, which outperforms the state-of-the-art unsupervised methods on Word Sense Induction for the Semeval-2010 (Manandhar et al., 2010) and MakeSense-2016 (Mu et al., 2017) datasets and is within 1% of the best models on the contextual word similarity task of (Huang et al., 2012).

Related Work
Several unsupervised methods generate dense single prototype word embeddings. These include Word2vec (Mikolov et al., 2013), which learns embeddings that maximize the cosine similarity of embeddings of co-occurring words, and GloVe (Pennington et al., 2014) and Swivel (Shazeer et al., 2016), which learn embeddings by factorizing the word co-occurrence matrix. (Dhillon et al., 2015; Stratos et al., 2015) use canonical correlation analysis (CCA) to learn word embeddings that maximize correlation with context. (Levy and Goldberg, 2014; Levy et al., 2015) showed that SVD-based methods can compete with neural embeddings. (Lebret and Collobert, 2013) use Hellinger PCA, and claim that Hellinger distance is a better metric than Euclidean distance in discrete probability spaces.
Multiple works have considered converting existing embeddings into interpretable ones. Murphy et al. (2012) use non-negative matrix factorization of the word-word co-occurrence matrix to derive interpretable word embeddings. (Sun et al., 2016; Han et al., 2012) change the loss function in GloVe to incorporate sparsity and non-negativity, respectively, to capture interpretability. (Faruqui et al., 2015) propose Sparse Overcomplete Word Vectors (SPOWV), solving an optimization problem in a dictionary-learning setting to produce a sparse non-negative high-dimensional projection of word embeddings. (Subramanian et al., 2018) use a k-sparse denoising autoencoder to produce a sparse non-negative high-dimensional projection of word embeddings, which they call SParse Interpretable Neural Embeddings (SPINE). However, all these methods lack a natural extension for disambiguating the sense of a word in a context.
In a different line of work, Vilnis and McCallum (2015) proposed representing words as Gaussian distributions, embedding uncertainty in the dimensions of the embedding to better capture concepts like entailment. However, Athiwaratkun and Wilson (2017) argued that such a single prototype model cannot capture multiple distinct meanings and proposed Word2GM to learn multiple Gaussian embeddings per word. The prototypes were generalized to elliptical distributions in (Muzellec and Cuturi, 2018). A major limitation of such an approach is the restriction on the number of prototypes that can be learned per word, which is limited to 2 or 3 due to computational constraints.
Many words, such as 'cell', can have more than 5 senses. Another open issue is that of disambiguating the senses of a polysemous word in a context: there is no obvious way to embed phrases and sentences with such embeddings.
Multiple works have proposed multi-prototype embeddings to capture the senses of a polysemous word. For example, Neelakantan et al. (2015) extend the skipgram model to learn multiple embeddings of a word, where the number of senses of a word is either fixed or learned through a non-parametric approach. Huang et al. (2012) learn multi-prototype embeddings by clustering the context-window features of a word. However, these methods cannot capture concepts like entailment. Tian et al. (2014) learn a probabilistic version of skipgram for multi-sense embeddings and hence can capture entailment. However, all these models suffer from computational constraints and either restrict the number of prototypes learned for each word to 2-3 or restrict the words for which multiple prototypes are learned to the top k most frequent words in the vocabulary.
Prior attempts at representing polysemy include (Pantel and Lin, 2002), who generate global senses by finding the best representative words for each sense from a co-occurrence graph, and (Reisinger and Mooney, 2010), who generate senses for each word by clustering the context vectors of the word's occurrences. Further attempts include Arora et al. (2018), who express single prototype dense embeddings, such as Word2vec and GloVe, as linear combinations of sense vectors. However, their underlying linearity assumption breaks down on real data, as shown by Mu et al. (2017). Further, the linear coefficients can be negative and have magnitudes far greater than 1, making them difficult to interpret. Neelakantan et al. (2015) and Huang et al. (2012) represent a context by the average of the embeddings of its words to disambiguate the sense of a target word present in the context. On the other hand, Mu et al. (2017) suggest representing sentences as hyperspaces rather than single vectors, and represent a word by the intersection of the hyperspaces representing the sentences it occurs in.
A number of works use the naïve Bayes method (Charniak et al., 2013) and topic models (Brody and Lapata, 2009; Yao and Van Durme, 2011; Pedersen, 2000; Lau et al., 2012, 2013, 2014) to learn senses from local contexts, treating each instance of a word within a context as a pseudo-document, and achieve state-of-the-art results on the WSI task (Manandhar et al., 2010). Since this approach requires training a separate topic model per target word, it does not scale to all the words in the vocabulary.
In a different line of work, (Tang et al., 2014; Guo and Diab, 2011; Wang et al., 2015; Tang et al., 2015; Xun et al., 2017) transform topic models to learn local context-level information through a sense latent variable, in addition to document-level information through the topic latent variable, producing more fine-grained topics from the corpus.

Notation
Let V = {w_1, w_2, ..., w_{|V|}} denote the set of unique tokens in the corpus (the vocabulary). Let C denote the word-word co-occurrence matrix constructed from the corpus, i.e., C_{ij} is the number of times w_j has occurred in the context of w_i. We define a context around a token w as the set of the n words to the left and the n words to the right of w; we denote the size of the context window by n. Typically n = 5.
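A minimal sketch of constructing C from a tokenized corpus with a symmetric window of size n (the toy corpus and window below are illustrative):

```python
import numpy as np

def cooccurrence(corpus, vocab, n=5):
    """C[i, j] = number of times w_j occurs within n tokens of w_i."""
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sent in corpus:
        ids = [idx[w] for w in sent if w in idx]
        for p, i in enumerate(ids):
            # Count every token within the window, excluding the target itself.
            for q in range(max(0, p - n), min(len(ids), p + n + 1)):
                if q != p:
                    C[i, ids[q]] += 1
    return C

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = ["the", "cat", "sat", "dog"]
C = cooccurrence(corpus, vocab, n=2)
```

Because the window is symmetric, the resulting C is symmetric as well.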
Our algorithm uses LDA to infer a sense model β, essentially a set of k probability distributions over V, from the corpus. It then uses the sense model to encode a word w as a k-dimensional µ-sparse vector θ_w. We use α and γ, respectively, to denote the Dirichlet priors of θ_w, the sense distribution of a word w, and of β_z, the context-word distribution of a sense z. JS is a k × k matrix that measures the similarity between senses. We denote the z-th row of a matrix M by M_z.

Recovering senses
To recover senses, we posit the following generative model for the words in a context of size n (see Figure 1).
1. For each word w ∈ V , generate a distribution over senses θ w from the Dirichlet distribution with prior α.
2. For each context c_w around a target word w, and for each of the 2n token slots in c_w: sample a sense z from θ_w, then sample a token from the sense's context-word distribution β_z.

Such a generative model will generate a co-occurrence matrix C that can also be generated by another model. C is a matrix whose columns C_w are interpreted as a document formed from the counts of all the tokens that have occurred in a context centered at w. Given a Dirichlet prior with parameter α on the sense distribution of C_w, and β, the distribution over context words for each sense, the document C_w (and thus the co-occurrence matrix C) is generated by drawing a sense for each token from the document's sense distribution and then drawing the token from that sense's word distribution. Based on this generative model, given the co-occurrence matrix C, we infer the matrix β and the maximum a posteriori estimate θ_w for each word using a fast inference tool such as WarpLDA (Chen et al., 2016).

Word2Sense embeddings

Word2Sense embeddings are probability distributions over senses. We discuss how to use the senses recovered by inference on the generative model above to construct word embeddings. We demonstrate that the embeddings so computed are competitive with various multi-modal embeddings in semantic similarity and entailment tasks.
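The paper runs WarpLDA for this inference; as a stand-in sketch, scikit-learn's variational LDA recovers β and θ from the pseudo-document matrix. The toy matrix and hyperparameter values below are illustrative assumptions; since the window is symmetric, each row of C can serve as the pseudo-document C_w:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
C = rng.integers(0, 20, size=(30, 30))   # toy co-occurrence counts, 30-word vocabulary

k = 5
lda = LatentDirichletAllocation(n_components=k,
                                doc_topic_prior=0.1,      # alpha
                                topic_word_prior=0.001,   # gamma
                                random_state=0)
doc_topic = lda.fit_transform(C)         # per-row (i.e., per-word) topic weights

# beta[z]: distribution of sense z over the vocabulary; theta[w]: sense distribution of w.
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
theta = doc_topic / doc_topic.sum(axis=1, keepdims=True)
```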

Computing W ord2Sense embeddings
Denote the probability of occurrence of a word in the corpus by p(w); we approximate it by its empirical estimate ||C_w||_1 / Σ_{w'∈V} ||C_{w'}||_1. We define the global probability p_Z(z) of a sense z as the probability that a randomly picked token in the corpus has that sense in its context window, and approximate it by the mixture p_Z(z) = Σ_{w∈V} p(w) θ_w[z].
Then, for each word w ∈ V, we compute p_c(·|w), its sense distribution when acting as a context word, by Bayes' rule: p_c(z|w) ∝ β_z[w] · p_Z(z).

Eliminating redundant senses. LDA returns a number of topics that are very similar to each other. Examples of such topics are given in Table 11 in the appendix. These topics need to be merged, since inferring two similar words against such senses can cause the words to be (predominantly) assigned to two different topic ids, making them look more dissimilar than they actually are. To eliminate redundant senses, we measure the similarity of topics by the Jensen-Shannon (JS) divergence. We construct the topic similarity matrix JS ∈ R^{k×k}, whose [i, j]-th entry JS[i, j] is the JS divergence between senses β_i and β_j. Recall that the JS divergence JSdiv(p, q) between two multinomial distributions p, q ∈ R^k is given by

JSdiv(p, q) = (1/2) KL(p, m) + (1/2) KL(q, m), where m = (p + q)/2. (1)

We run agglomerative clustering on the JS matrix to merge similar topics, using a distance between two clusters D_i and D_j computed from the entries JS[i', j'] with i' ∈ D_i and j' ∈ D_j. Let D_1, ..., D_k̄ denote the final set of k̄ clusters obtained after clustering. We approximate the occurrence probability of a merged cluster of senses D_i by p_Z(D_i) = Σ_{z∈D_i} p_Z(z). Table 11 in the appendix shows some clusters formed after clustering. Using the merged senses, we compute the embedding v_w of word w, a distribution over the merged senses indexed by z ∈ {1, ..., k̄}, as

v_w = Truncate_µ(Project(p_c(·|w))). (2)

Project is the function that maps v ∈ R^k to v̄ ∈ R^k̄ by merging the coordinates corresponding to the merged senses: v̄_i = Σ_{z∈D_i} v_z. Truncate_µ sparsifies the input by truncating it to the µ highest non-zeros in the vector.
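The merge-and-truncate steps can be sketched as follows. Toy sizes, average linkage, and the cluster count are assumptions for illustration (the paper's exact cluster-distance metric is not reproduced here); note that SciPy's `jensenshannon` returns the square root of the JS divergence, hence the squaring:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
k, V = 12, 40
beta = rng.dirichlet(np.full(V, 0.1), size=k)    # k sense distributions over the vocabulary

# Pairwise JS divergence between senses.
JS = np.array([[jensenshannon(beta[i], beta[j]) ** 2 for j in range(k)] for i in range(k)])

# Merge similar senses down to k_bar = 9 clusters.
Z = linkage(JS[np.triu_indices(k, 1)], method="average")
labels = fcluster(Z, t=9, criterion="maxclust")

def project(v, labels, k_bar):
    """Sum the coordinates of merged senses."""
    out = np.zeros(k_bar)
    for z, lab in enumerate(labels):
        out[lab - 1] += v[z]
    return out

def truncate(v, mu):
    """Keep the mu largest entries, zero the rest, renormalize."""
    out = np.zeros_like(v)
    top = np.argsort(v)[-mu:]
    out[top] = v[top]
    return out / out.sum()

p_c = rng.dirichlet(np.full(k, 0.5))             # stand-in sense distribution of a word
v_w = truncate(project(p_c, labels, 9), mu=5)    # Equation-(2)-style embedding
```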

Evaluation
We compare Word2Sense embeddings with the state-of-the-art on word similarity and entailment tasks as well as on benchmark downstream tasks.

Hyperparameters
We train Word2vec Skip-Gram embeddings with 10 passes over the data, using separate embeddings for the input and output contexts, 5 negative samples per positive example, window size n = 2, and the same sub-sampling and dynamic window procedure as in (Mikolov et al., 2013). For Word2GM, we make 5 passes over the data (due to the very long training time of the published code), using 2 modes per word, 1 negative sample per positive example, a spherical covariance model, window size n = 10, and the same sub-sampling and dynamic window procedure as in (Athiwaratkun and Wilson, 2017). Since there is no recommended dimension in these papers, we report the numbers for the best-performing embedding size: Word2vec at dimension 500 and Word2GM at dimension 400. We report the performance of SPOWV and SPINE on benchmark downstream tasks, using Word2vec as the base embeddings, with the recommended settings given in (Faruqui et al., 2015) and (Subramanian et al., 2018). We found k = 3000, α = 0.1 and γ = 0.001 to be good hyperparameters for WarpLDA to recover fine-grained senses from the corpus. A choice of k̄ ≈ (3/4)k, which merges k/4 senses, improved results. We use a context window size n = 5 and truncation parameter µ = 75. We think µ = 75 works best because we found the average sparsity of p_c(·|w) to be around 100; since we decrease the number of senses by a quarter after post-processing, the average sparsity reduces to close to 75. If a word is not present in the vocabulary, we use the uniform distribution on the simplex, with equal values in all dimensions.

Word Similarity
We evaluate our embeddings at scoring the similarity or relatedness of pairs of words on several benchmark datasets.

Table 2: Comparison of word embeddings on word similarity evaluation datasets. For MSSG, learned for the top 30K and 6K words, we report the similarity of the global vectors of words, which we find to be better than comparing all the local vectors of words. For Word2GM, we report numbers from our tuning as well as from the paper (in parentheses). Note that we report higher numbers in all cases, except on the WS353-S and WS353-R datasets; we attribute this to fewer passes over the data and possibly different pre-processing. (a: 0.353 with a different metric.)

We predict the similarity/relatedness score of a pair of words {w_1, w_2} by computing the JS divergence (see Equation 1) between the embeddings {v_{w_1}, v_{w_2}} as computed in Equation 2. For the other embeddings, we use the cosine similarity metric. The final prediction effectiveness of an embedding is given by the Spearman correlation between the predicted scores and the human-annotated scores. Table 2 compares our embeddings to the multimodal Gaussian mixture (Word2GM) model (Athiwaratkun and Wilson, 2017) and Word2vec (Mikolov et al., 2013). We extensively tuned the hyperparameters of prior work, often achieving better results than previously reported. We concluded from this exercise that SkipGram (Word2vec) is the best among all the unsupervised embeddings at similarity and relatedness tasks. While being interpretable and sparser than the 500-dimensional Word2vec, Word2Sense embeddings are competitive with Word2vec on all the datasets.
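The evaluation protocol can be sketched as follows. The embeddings, pairs, and gold scores below are hypothetical stand-ins; a JS-based similarity (here 1 minus the divergence) is scored against human judgments with Spearman correlation:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import spearmanr

def js_similarity(v1, v2):
    """Higher when two sense distributions are closer: 1 - JSdiv
    (scipy returns the JS distance, the square root of the divergence)."""
    return 1.0 - jensenshannon(v1, v2) ** 2

rng = np.random.default_rng(0)
emb = {w: rng.dirichlet(np.ones(8)) for w in ["cat", "dog", "car", "truck"]}
pairs = [("cat", "dog"), ("car", "truck"), ("cat", "car")]
gold = [8.0, 7.5, 2.0]                               # hypothetical human scores
pred = [js_similarity(emb[a], emb[b]) for a, b in pairs]
rho, _ = spearmanr(gold, pred)                       # final evaluation metric
```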

Word entailment
Given two words w_1 and w_2, w_2 entails w_1 (denoted w_1 |= w_2) if all instances of w_1 are w_2. We compare Word2Sense embeddings with Word2GM on the entailment dataset provided by (Baroni et al., 2012). We use the KL divergence to generate entailment scores between words w_1 and w_2. For Word2GM, we use both cosine similarity and KL divergence, as in the original paper. We report F1 scores and Average Precision (AP) scores to quantify the quality of prediction. Table 3 compares the performance of our embedding with Word2GM. We note that Word2Sense embeddings with µ = k (denoted Word2Sense-full in the table), i.e., with no truncation, yield the best results. We do not compare with hyperbolic embeddings (Tifrea et al., 2019; Dhingra et al., 2018) because these embeddings are designed mainly to perform well on entailment tasks, and are far from the performance of Euclidean embeddings on similarity tasks.
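A minimal sketch of KL-based entailment scoring. The two embeddings below are hypothetical: a "specific" word concentrated on few senses and a "general" word spread over many; a small KL(v_{w1} || v_{w2}) suggests the senses of w_1 are covered by those of w_2:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p, q) = sum_i p_i log(p_i / q_i), with smoothing for zero entries."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def entailment_score(v1, v2):
    """Score for w1 |= w2; negated KL so that higher means stronger entailment."""
    return -kl(v1, v2)

narrow = np.array([0.9, 0.1, 0.0, 0.0])   # hypothetical specific word
broad = np.array([0.4, 0.3, 0.2, 0.1])    # hypothetical general word
```

Thresholding such scores yields the binary entailment predictions behind the F1 and AP numbers.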

Downstream tasks
We compare the performance of Word2Sense with Word2vec, SPINE and SPOWV embeddings on the following downstream classification tasks: sentiment analysis (Socher et al., 2013), news classification, noun phrase chunking (Lazaridou et al., 2013) and question classification (Li and Roth, 2006). We do not compare with Word2GM and MSSG as there is no obvious way to compute sentence embeddings from multi-modal word embeddings. The sentence embedding needed for text classification is the average of the embeddings of the words in the sentence, as in (Subramanian et al., 2018). We pick the best among SVMs, logistic regression and random forest classifiers based on accuracy on the development set. Table 4 reports the accuracies on the test set. More details of the tasks are provided in Appendix E.
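The sentence-embedding step can be sketched as below. The toy embeddings and labels are hypothetical, and logistic regression stands in for the classifier sweep described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = {w: rng.dirichlet(np.ones(16)) for w in ["good", "great", "bad", "awful", "movie"]}

def sentence_embedding(tokens):
    """Average of the word embeddings, as used for the classification tasks."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

sentences = [["good", "movie"], ["great", "movie"], ["bad", "movie"], ["awful", "movie"]]
X = np.stack([sentence_embedding(s) for s in sentences])
y = [1, 1, 0, 0]                                  # hypothetical sentiment labels
clf = LogisticRegression().fit(X, y)
```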

Interpretability
We evaluate the interpretability of the Word2Sense embeddings against the Word2vec, SPINE and SPOWV models using the word intrusion test, following the procedure in (Subramanian et al., 2018). We select the 15k most frequent words in the intersection of our vocabulary and the Leipzig corpus (Goldhahn et al., 2012). We select a set H of 300 random dimensions (senses) from the 2250 senses. For each dimension h ∈ H, we sort the words in the 15k vocabulary by their weight in dimension h. We pick the top 4 words in the dimension and add to this set a random intruder word that lies in the bottom half of dimension h and in the top 10 percentile of some other dimension h' ∈ H \ {h} (Fyshe et al., 2014; Faruqui et al., 2015). For dimension h to be claimed interpretable, independent judges must be able to easily separate the intruder word from the top 4 words. We split the 300 senses into ten sets of 30 senses, and assigned 3 judges to annotate the intruder in each of the 30 senses in a set (we used a total of 30 judges). For each question, we take the majority-voted word as the predicted intruder. If a question receives 3 different annotations, we count that dimension as non-interpretable. Since we followed the same procedure as (Subramanian et al., 2018), we compare our performance with the results reported in their paper. Table 5 shows that Word2Sense is competitive with the best interpretable embeddings.

Table 5: Comparison of embeddings on the word intrusion task. The second column indicates the inter-annotator agreement: the first number is the fraction of questions for which at least 2 annotators agreed and the second is the fraction on which all three agreed. The last column is the precision of the majority vote.

Qualitative evaluation
We show the effectiveness of our embeddings at capturing the multiple senses of a polysemous word in Table 1. For example, "tie" can be used as a verb, meaning tying a rope or drawing a match, or as a noun, meaning an item of clothing. These three senses are captured in the top 3 dimensions of the Word2Sense embedding for "tie". Similarly, the embedding for "cell" captures the 5 senses discussed in section 1 within its top 15 dimensions. The remaining top senses capture fine-grained senses, such as different kinds of biological cells (e.g. bone marrow cells, liver cells, neurons), that a subject expert might relate to.

WordCtx2Sense embeddings
A word with several senses in the training corpus, when used in a context, takes on a narrower set of senses. It is therefore important to be able to refine the representation of a word according to its usage in a context. Note that the Word2vec and Word2GM models have no such mechanism. Here, we present an algorithm that generates an embedding for a target word ŵ in a short context T = {w_1, ..., w_N} that reflects the sense in which the target word is used in the context. For this, we suppose that the senses of the word ŵ in context T are an intersection of the senses of ŵ and of T. We therefore infer the sense distribution of T while restricting the support of the distribution to those senses ŵ can take.

Methodology
We suppose that the words in the context T were picked from a mixture of a small number of senses. Let S_k = {ψ = (ψ_1, ψ_2, ..., ψ_k) : ψ_z ≥ 0; Σ_z ψ_z = 1} be the unit positive simplex. The generative model is as follows: pick a ψ ∈ S_k, and let P = βψ, where β is the collection of sense probability distributions recovered by LDA from the corpus; then pick N words from P independently.
Here A is a vocabulary-sized vector containing the count of each word in T, normalized to sum to 1. We do not use the Dirichlet prior over the sense distribution as in the generative model of section 4, as we found its omission to be better at inferring the sense distribution of contexts. Given A and β, we want to infer the sense distribution ψ ∈ S_k that minimizes the log perplexity

f(ψ; A, β) = − Σ_{i=1}^{|V|} A_i log(βψ)_i

according to this generative model. The MWU (multiplicative weight update) algorithm (see Appendix A for details) is a natural choice to find such a distribution ψ, and has an added advantage: the MWU estimate of ψ after t iterations (denoted ψ^(t)) satisfies ψ^(t)_z = 0 whenever ψ^(0)_z = 0, for all z ∈ {1, ..., k} and all t ≥ 0; that is, it preserves the support of its initialization.
Therefore, to limit the set of possible senses in the inference of ψ to the µ senses thatŵ can take, we initialize ψ (0) to the embedding vŵ. We used the embedding obtained in Equation 2 without the P roject operator that adds probabilities of similar senses, to correspond with the use of the original matrix β for MWU.
Further, to keep the iterates close to the initial ψ^(0), we add a regularizer to the log perplexity. This is necessary to bias the final inference towards the senses on which the target word has higher weight. Thus the loss function on which we run MWU, with starting point ψ^(0), is

f(ψ; A, β) + λ · KL(ψ^(0), ψ),

where the second term is the KL divergence between the two distributions, scaled by a hyperparameter λ.
Recall that KL(p, q) = Σ_{i=1}^{k} p_i log(p_i/q_i) for two distributions p, q ∈ R^k. We use the final estimate ψ^(t) as the WordCtx2Sense distribution of a word in the context.
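A runnable sketch of the regularized inference. The paper runs MWU; here an EM-style multiplicative update (an assumption, not the paper's exact iteration) minimizes the same objective and likewise preserves the support of ψ^(0). All data below are toy values:

```python
import numpy as np

def infer_context_sense(A, beta, psi0, lam=0.01, iters=200):
    """Minimize f(psi) = -sum_i A_i log(beta psi)_i + lam * KL(psi0, psi)
    over the simplex, restricted to the support of psi0.
    beta: (k, |V|) sense-word distributions; A: length-|V| counts summing to 1."""
    psi = psi0.copy()
    for _ in range(iters):
        mix = psi @ beta                            # (beta psi)_i over the vocabulary
        r = beta @ (A / np.maximum(mix, 1e-12))     # per-sense responsibilities
        psi = (psi * r + lam * psi0) / (1.0 + lam)  # multiplicative pseudo-count update
        psi /= psi.sum()                            # stay on the simplex
    return psi

rng = np.random.default_rng(0)
k, V = 6, 25
beta = rng.dirichlet(np.ones(V), size=k)
psi0 = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])     # word embedding: support = senses 0, 1
A = 0.8 * beta[1] + 0.2 * beta[0]                   # context drawn mostly from sense 1
psi = infer_context_sense(A, beta, psi0)
```

Because zero coordinates of `psi0` contribute nothing to either term of the update, the support of the initialization is preserved exactly, mirroring the MWU property above.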

Evaluation
We demonstrate that the above construction of a word's representation, disambiguated in context, is useful by comparing with state-of-the-art unsupervised methods for polysemy disambiguation on two tasks: Word Sense Induction and contextual similarity. Specifically, we compare with MSSG, the K-Grassmeans model of (Mu et al., 2017), and the sparse coding method of (Arora et al., 2018).

Hyperparameters
We use the same hyperparameter values for α, γ, k and n as in section 5.2.1. We use µ = 100, since we do not merge senses in this construction. We tune the hyperparameter λ to the task at hand.

Word Sense Induction
The WSI task requires clustering a collection of (say 40) short texts, all of which share a common polysemous word, in such a way that each cluster uses the common word in the same sense. Two datasets for this task are Semeval-2010 (Manandhar et al., 2010) and MakeSense-2016 (Mu et al., 2017). The evaluation criteria are F-score (Artiles et al., 2009) and V-Measure (Rosenberg and Hirschberg, 2007). V-Measure measures the quality of a clustering as the harmonic mean of homogeneity and completeness, where homogeneity checks whether all the data points in a cluster belong to the same class, and completeness checks whether all the data points of the same class belong to a single cluster. F-score is the harmonic mean of precision and recall on the task of classifying whether the instances in a pair belong to the same cluster or not. F-score tends to be higher with a smaller number of clusters and V-Measure tends to be higher with a larger number of clusters, so it is important to report performance on both metrics.
For each text corresponding to a polysemous word, we learn a sense distribution ψ using the steps in section 7.1. We tuned the parameter λ and found the best performance at λ = 10^{-2}. We use hard decoding to assign a cluster label to each text, i.e., we assign the label k̂ = argmax_z ψ_z to a text with inferred sense vector ψ.
Suppose that this yields k̂ distinct clusters for the instances corresponding to a polysemous word. We cluster them using agglomerative clustering into a final set of K clusters, with the distance between two clusters D_i and D_j computed from the entries of the JS similarity matrix defined in section 5. Note that we report baseline numbers from the original papers. These papers trained their models on newer versions of the Wikipedia dump that contain more than 3 billion tokens (MSSG uses a 1 billion token corpus), while our model has been trained on a combined dataset of the wiki-2009 dump and ukWaC, which contains around 3B tokens. Hence, there might be minor differences when comparing our model to the baseline models.
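The hard-decode-then-merge procedure can be sketched as follows. The inferred sense vectors are constructed by hand for determinism, and complete linkage is an assumption standing in for the paper's exact cluster distance:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
k, K = 8, 3
beta = rng.dirichlet(np.full(30, 0.1), size=k)     # toy sense-word distributions

# Hypothetical inferred sense vectors psi for 6 instances of one polysemous word,
# peaked on senses 0, 0, 1, 2, 5, 5 respectively.
peaks = [0, 0, 1, 2, 5, 5]
psis = np.full((6, k), 0.02)
psis[np.arange(6), peaks] = 1.0 - 0.02 * (k - 1)

hard = psis.argmax(axis=1)                         # hard decoding: dominant sense per text
used = np.unique(hard)                             # the k-hat distinct initial clusters

# Merge the initial clusters down to K via agglomerative clustering on the
# JS divergences between the corresponding senses.
D = np.array([[jensenshannon(beta[i], beta[j]) ** 2 for j in used] for i in used])
Z = linkage(D[np.triu_indices(len(used), 1)], method="complete")
merged = fcluster(Z, t=K, criterion="maxclust")
final = np.array([merged[list(used).index(z)] for z in hard])
```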
Results. Table 6 shows the results of clustering on the WSI SemEval-2010 dataset. WordCtx2Sense outperforms (Arora et al., 2018) and (Mu et al., 2017) on both F-score and V-Measure by a considerable margin. We observe similar improvements on the MakeSense-2016 dataset.

Word Similarity in Context
The Stanford Contextual Word Similarity task (Huang et al., 2012) consists of 2000 pairs of words, along with the contexts the words occur in. Ten human raters were asked to rate the similarity of each pair of words according to their use in the corresponding contexts, and their average score (on a 1 to 10 scale) is provided as the ground-truth similarity score. The goal of a contextual embedding is to score these examples so as to maximize the correlation with this ground truth.
We compute the WordCtx2Sense embedding of each word in its respective context as in section 7.1. To compare the meanings of two words in context, we use the JS divergence between their WordCtx2Sense embeddings. We report the Spearman correlation between the ground truth and the WordCtx2Sense predictions under two settings of λ: (a) λ = 0.1 for all pairs, and (b) λ = 10^{-3} for inferring the contextual embedding of a word in pairs that contain the same target word, and λ = 0.1 for all other pairs. The idea is to reduce unnecessary bias when comparing the sense of a polysemous word in two different contexts. Table 7 shows that sense embeddings using context information perform better than all the existing models except the MSSG models (Neelakantan et al., 2015). Also, computing the embedding of a word using the contextual information improves results by approximately 0.025 compared to using the word embeddings directly.

Conclusion and future work
We presented an efficient unsupervised method to embed words, in and out of context, in a way that captures their multiple senses in a corpus in an interpretable manner. We demonstrated that such interpretable embeddings can be competitive with dense embeddings like Word2vec on similarity tasks and can capture entailment effectively. Further, the construction provides a natural mechanism to refine the representation of a word in a short context by disambiguating its senses. We have demonstrated the effectiveness of such contextual representations.
A natural extension to this work would be to capture the sense distribution of sentences using the same framework. This would make our model more comprehensive by enabling the embedding of words and short texts in the same space.

B Hyper-parameter tuning for Word2vec

We use the default hyperparameters for training Word2vec, as given in Mikolov et al. (2013). We tuned the embedding size to see if performance improves with an increasing number of dimensions. Table 8 shows that there is minor improvement in performance on different similarity and relatedness tasks as the embedding size is increased from 100 to 500.
C Hyper-parameter tuning for Word2GM

We use the default hyperparameters for training Word2GM, as given in Athiwaratkun and Wilson (2017). We tuned the embedding size to see if performance improves with an increasing number of dimensions. Table 9 shows that there is minor improvement in the performance of Word2GM when the embedding size is increased from 100 to 400.

D Hyper-parameter tuning for Word2Sense
For generating senses, we use WarpLDA, which has 3 hyperparameters: a) the number of topics k; b) α, the Dirichlet prior of the sense distribution of each word; and c) γ, the Dirichlet prior of the word distribution of each sense. We keep k fixed at 3000 and vary α and γ. We show a small subset of the hyperparameter space searched for α and γ. We report the performance of word embeddings computed by Equation 2, without the Project step, on different similarity tasks. Table 10 shows that the performance slowly decreases as we increase γ and stays roughly constant with α. Hence, we choose α = 0.1 and γ = 0.001 for our experiments.

E Benchmark downstream tasks
In this section, we discuss the different downstream tasks considered. We follow the same procedure as (Faruqui et al., 2015) and (Subramanian et al., 2018).
• Sentiment analysis This is a binary classification task on Sentiment Treebank dataset (Socher et al., 2013). The task is to give a sentence a positive or a negative sentiment label. We used the provided train, dev. and test splits of sizes 6920, 872 and 1821 sentences respectively.
• Noun phrase bracketing The NP bracketing task (Lazaridou et al., 2013) involves classifying a noun phrase of 3 words as left-bracketed or right-bracketed. The dataset contains 2,227 noun phrases split into 10 folds. We append the word vectors of the three words to get the feature representation (Faruqui et al., 2015). We report 10-fold cross-validation accuracy.
• Question classification Question classification task (Li and Roth, 2006) involves classifying a question into six different types, e.g., whether the question is about a location, about a person or about some numeric information. The training dataset consists of 5452 labeled questions, and the test dataset consists of 500 questions.
• News classification We consider three binary categorization tasks from the 20 Newsgroups dataset. Each task involves categorizing a document according to two related categories.

Table 11: Examples of clusters formed after agglomerative clustering. Each group of rows shows a randomly picked cluster, its size, and the top 10 words of 3 randomly picked senses from the cluster. The clusters represent U.S. states, generic words, video games, and soccer respectively.