Making Sense of Word Embeddings

We present a simple yet effective approach for learning word sense embeddings. In contrast to existing techniques, which either directly learn sense representations from corpora or rely on sense inventories from lexical resources, our approach can induce a sense inventory from existing word embeddings via clustering of ego-networks of related words. An integrated WSD mechanism enables labeling of words in context with learned sense vectors, which gives rise to downstream applications. Experiments show that the performance of our method is comparable to state-of-the-art unsupervised WSD systems.


Introduction
Term representations in the form of dense vectors are useful for many natural language processing applications. First of all, they enable the retrieval of semantically related words. Besides, they can be used to represent other linguistic units, such as phrases and short texts, reducing the inherent sparsity of traditional vector-space representations (Salton et al., 1975).
One limitation of most word vector models, including sparse (Baroni and Lenci, 2010) and dense (Mikolov et al., 2013) representations, is that they conflate all senses of a word into a single vector. Several architectures for learning multi-prototype embeddings have been proposed to address this shortcoming (Huang et al., 2012; Tian et al., 2014; Neelakantan et al., 2014; Nieto Piña and Johansson, 2015; Bartunov et al., 2016). Li and Jurafsky (2015) provide indications that such sense vectors improve the performance of text processing applications, such as part-of-speech tagging and semantic relation identification.
The contribution of this paper is a novel method for learning word sense vectors. In contrast to previously proposed methods, our approach relies on existing single-prototype word embeddings, transforming them into sense vectors via ego-network clustering. An ego-network consists of a single node (ego) together with the nodes it is connected to (alters) and all the edges among those alters. Our method is fitted with a word sense disambiguation (WSD) mechanism, so that words in context can be mapped to these sense representations. An advantage of our method is that one can use existing word embeddings and/or existing word sense inventories to build sense embeddings. Experiments show that our approach performs comparably to state-of-the-art unsupervised WSD systems.


Related Work

Word Sense Embeddings
Huang et al. (2012) were among the first to move from representing words to representing senses in dense vector spaces with neural networks. Their method proceeds in three steps. First, contexts are represented with word embeddings and clustered. Second, word occurrences are relabeled in the corpus according to the cluster they belong to. Finally, embeddings are re-trained on these sense-labeled tokens. Tian et al. (2014) introduced a probabilistic extension of the Skip-gram model (Mikolov et al., 2013) that learns multiple sense-aware prototypes weighted by their prior probability. These models use parametric clustering algorithms that produce a fixed number of senses per word.
Neelakantan et al. (2014) proposed a multi-sense extension of the Skip-gram model that was the first to learn the number of senses by itself: during training, a new sense vector is allocated if the current context's similarity to existing senses is below some threshold. Li and Jurafsky (2015) use a similar idea, integrating the Chinese Restaurant Process into the Skip-gram model. All sense embeddings mentioned above were evaluated on the contextual word similarity task, each improving upon the previous models. Nieto Piña and Johansson (2015) presented another multi-prototype modification of the Skip-gram model. Their approach outperforms that of Neelakantan et al. (2014), but requires the number of senses for each word as input. Li and Jurafsky (2015) show that sense embeddings can significantly improve the performance of part-of-speech tagging, semantic relation identification and semantic relatedness tasks, but yield no improvement for named entity recognition and sentiment analysis.
Bartunov et al. (2016) introduced AdaGram, a non-parametric method for learning sense embeddings based on a Bayesian extension of the Skip-gram model. The granularity of the learned sense embeddings is controlled by the parameter α. Comparisons of their approach to that of Neelakantan et al. (2014) on three SemEval word sense induction and disambiguation datasets show the advantage of their method. For this reason, we use AdaGram as a representative of the state-of-the-art methods in our experiments.
Several approaches rely on a knowledge base (KB) to provide sense information. Bordes et al. (2011) propose a general method to represent entities of any KB as dense vectors. Such representations help to integrate KBs into NLP systems. Another approach that uses sense inventories of knowledge bases was presented by Camacho-Collados et al. (2015). Rothe and Schütze (2015) combined word embeddings on the basis of WordNet synsets to obtain sense embeddings. Their approach is evaluated on lexical sample tasks by adding synset embeddings as features to an existing WSD system. They used a weighted pooling similar to the one we use, but their method is not able to find new senses in a corpus. Finally, Nieto Piña and Johansson (2016) used random walks on the Swedish Wordnet to generate training data for the Skip-gram model.

Word Sense Disambiguation (WSD)
Many different designs of WSD systems have been proposed; see (Agirre and Edmonds, 2007; Navigli, 2009) for surveys. Supervised approaches use an explicitly sense-labeled training corpus to construct a model, usually building one model per target word (Lee and Ng, 2002; Klein et al., 2002). These approaches demonstrate top performance in competitions, but require considerable amounts of sense-labeled examples.
Knowledge-based approaches do not learn a model per target word, but rather derive sense representations from information available in a lexical resource, such as WordNet. Examples of such systems include (Lesk, 1986; Banerjee and Pedersen, 2002; Pedersen et al., 2005; Moro et al., 2014).

[Figure 2: Visualization of the ego-network of "table" with furniture and data sense clusters. Note that the target "table" is excluded from clustering.]

Unsupervised WSD approaches rely neither on hand-annotated sense-labeled corpora nor on handcrafted lexical resources. Instead, they automatically induce a sense inventory from raw corpora. Such unsupervised sense induction methods fall into two categories: context clustering, such as (Pedersen and Bruce, 1997; Schütze, 1998; Reisinger and Mooney, 2010; Neelakantan et al., 2014; Bartunov et al., 2016), and word (ego-network) clustering, such as (Lin, 1998; Pantel and Lin, 2002; Widdows and Dorow, 2002; Biemann, 2006; Hope and Keller, 2013). Unsupervised methods use disambiguation clues from the induced sense inventory for word disambiguation. Usually, the WSD procedure is determined by the design of the sense inventory: it might be the highest overlap between the instance's context words and the words of the sense cluster, as in (Hope and Keller, 2013), or the smallest distance between context words and sense hubs in a graph sense representation, as in (Véronis, 2004).

Learning Word Sense Embeddings
Our method consists of the four main stages depicted in Figure 1: (1) learning word embeddings; (2) building a graph of nearest neighbours based on vector similarities; (3) induction of word senses using ego-network clustering; and (4) aggregation of word vectors with respect to the induced senses.
Our method can use existing word embeddings, sense inventories and word similarity graphs. To demonstrate such use cases and to study the performance of the method in different settings, we experiment with two additional setups, which are variants of the complete pipeline presented in Figure 1. First, we use an alternative approach to compute the word similarity graph that relies on dependency features and is expected to provide more accurate similarities (stage (2) is thus changed). Second, we use a sense inventory constructed via crowdsourcing (stages (2) and (3) are thus skipped). Below, we describe each stage of our method in detail.

Learning Word Vectors
To learn word vectors, we use the word2vec toolkit (Mikolov et al., 2013): namely, we train CBOW word embeddings with 100 or 300 dimensions, a context window size of 3 and a minimum word frequency of 5. We selected these parameters according to prior evaluations, e.g. (Baroni et al., 2014), and tested them on the development dataset (see Section 5.1). Initial experiments showed that this configuration is superior to others, e.g. the Skip-gram model, with respect to WSD performance.
For training, we modified the standard implementation of word2vec so that it also saves the context vectors needed for one of our WSD approaches. For the experiments, we use two corpora commonly used for training distributional models: Wikipedia and ukWaC (Ferraresi et al., 2008).
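For readers who prefer not to patch the C implementation, a comparable configuration can be trained with gensim. This is a minimal sketch under our own assumptions (a pre-tokenized corpus file with a hypothetical name), not the authors' modified toolkit:

```python
from gensim.models import Word2Vec

# "corpus.txt" is a hypothetical pre-tokenized file, one sentence per line.
model = Word2Vec(
    corpus_file="corpus.txt",
    vector_size=300,  # the paper also uses 100
    window=3,         # context window size of 3
    min_count=5,      # minimum word frequency of 5
    sg=0,             # CBOW, as in the paper
    workers=4,
)

word_vectors = model.wv.vectors  # input (word) embeddings
# With negative sampling, gensim keeps the output-layer weights in syn1neg;
# these play the role of the context vectors the authors patch word2vec to save.
context_vectors = model.syn1neg
```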

Calculating Word Similarity Graph
At this step, we build a graph of word similarities, such as (table, desk, 0.78). For each word we retrieve its 200 nearest neighbours. This number is motivated by prior studies (Biemann and Riedl, 2013; Panchenko, 2013), which observe that only a few words have more than this number of strongly semantically related words. This graph is computed either from the word embeddings learned during the previous step or using semantic similarities provided by the JoBimText framework (Biemann and Riedl, 2013).

Similarities using word2vec (w2v).
In this case, nearest neighbours of a term are terms with the highest cosine similarity of their respective vectors. For scalability reasons, we perform similarity computations via block matrix multiplications, using blocks of 1000 vectors.
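As an illustration, the blockwise neighbour computation might look as follows in numpy, assuming emb is the vocabulary-by-dimension embedding matrix (variable names are ours):

```python
import numpy as np

def top_neighbours(emb, k=200, block=1000):
    """Top-k cosine neighbours for every word via blockwise matrix products."""
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    result = []
    for start in range(0, norm.shape[0], block):
        sims = norm[start:start + block] @ norm.T              # (block, |V|)
        np.fill_diagonal(sims[:, start:start + block], -1.0)   # drop self-similarity
        idx = np.argpartition(-sims, k, axis=1)[:, :k]         # k best, unsorted
        # sort the k candidates by similarity, descending
        order = np.take_along_axis(-sims, idx, axis=1).argsort(axis=1)
        result.append(np.take_along_axis(idx, order, axis=1))
    return np.vstack(result)  # row i: indices of the k nearest neighbours of word i
```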
Similarities using JoBimText (JBT). In this unsupervised approach, every word is represented as a bag of sparse dependency-based features extracted using the Malt parser and collapsed using an approach similar to (Ruppert et al., 2015). Features are normalized using the LMI score (Church and Hanks, 1990) and further pruned down according to the recommended defaults: we keep 1000 features per word and 1000 words per feature. Similarity of two words is equal to the number of common features.
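The following sketch illustrates the LMI weighting and the overlap-based similarity under the simplifying assumption that features are raw (word, feature) co-occurrence counts; it is not the JoBimText implementation itself:

```python
import math

def lmi(pair_counts, word_counts, feature_counts, total_pairs):
    """LMI weighting of (word, feature) pairs: joint count times PMI.
    The actual JoBimText pipeline additionally prunes to 1000 features
    per word and 1000 words per feature."""
    scores = {}
    for (word, feature), n in pair_counts.items():
        pmi = math.log((n * total_pairs) /
                       (word_counts[word] * feature_counts[feature]))
        scores[(word, feature)] = n * pmi
    return scores

def jbt_similarity(features_of_a, features_of_b):
    """JBT-style similarity: the number of features two words share after pruning."""
    return len(set(features_of_a) & set(features_of_b))
```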
Multiple alternatives exist for computing semantic relatedness (Zhang et al., 2013). JBT has two advantages in our case: (1) accurate estimation of word similarities based on dependency features; (2) efficient computation of nearest neighbours for all words in a corpus. Besides, we observed that the nearest neighbours of word embeddings often tend to belong to the dominant sense, even if minor senses have significant support in the training corpus. We wanted to test whether the same problem persists with a principally different method of similarity computation.
Algorithm 1: Word sense induction.
input: T, the word similarity graph; N, the ego-network size; n, the ego-network connectivity; k, the minimum cluster size
output: for each term t ∈ T, a clustering S_t of its N most similar terms

foreach t ∈ T do
  V ← N most similar terms of t from T
  G ← graph with V as nodes and no edges E
  foreach v ∈ V do
    connect v in G to its n most similar terms from T if they belong to V
  S_t ← ChineseWhispers(G)
  S_t ← {s ∈ S_t : |s| ≥ k}

Word Sense Induction
We induce a sense inventory using a method similar to (Pantel and Lin, 2002) and (Biemann, 2006). A word sense is represented by a word cluster. For instance, the cluster "chair, bed, bench, stool, sofa, desk, cabinet" can represent the sense "table (furniture)". To induce senses, we first construct an ego-network G of a word t and then perform graph clustering of this network. The identified clusters are interpreted as senses (see Table 2). Words referring to the same sense tend to be tightly connected, while having fewer connections to words referring to different senses.
The sense induction procedure, presented in Algorithm 1, processes one word t of the word similarity graph T per iteration. First, we retrieve the nodes V of the ego-network G: these are the N most similar words of t according to T. The target word t itself is not part of the ego-network. Second, we connect the nodes in G to their n most similar words from T. Finally, the ego-network is clustered with the Chinese Whispers algorithm (Biemann, 2006). This method is parameter-free, thus we make no assumptions about the number of word senses.
The sense induction algorithm has three meta-parameters: the ego-network size N of the target word t; the ego-network connectivity n, i.e. the maximum number of connections a neighbour v is allowed to have within the ego-network; and the minimum cluster size k. The parameter n regulates the granularity of the inventory. In our experiments, we set N to 200, n to 50, 100 or 200, and k to 5 or 15 to obtain inventories of different granularities, cf. (Biemann, 2010).
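A compact Python rendering of Algorithm 1 is given below. The Chinese Whispers routine is our own re-implementation of the cited algorithm, and the representation of the similarity graph T as a dictionary of sorted neighbour lists is an assumption:

```python
import random
from collections import defaultdict
import networkx as nx

def chinese_whispers(G, iterations=20, seed=0):
    """Chinese Whispers (Biemann, 2006): every node repeatedly adopts the
    label with the highest total edge weight among its neighbours."""
    rng = random.Random(seed)
    labels = {node: i for i, node in enumerate(G)}
    nodes = list(G)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for node in nodes:
            scores = defaultdict(float)
            for nbr in G[node]:
                scores[labels[nbr]] += G[node][nbr].get("weight", 1.0)
            if scores:
                labels[node] = max(scores, key=scores.get)
    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return list(clusters.values())

def induce_senses(T, t, N=200, n=50, k=15):
    """Algorithm 1: cluster the ego-network of target word t.
    T: dict mapping each word to a list of (neighbour, similarity) pairs,
    sorted by decreasing similarity (the word similarity graph)."""
    V = [w for w, _ in T[t][:N]]            # ego-network nodes; t itself excluded
    G = nx.Graph()
    G.add_nodes_from(V)
    for v in V:
        for u, sim in T.get(v, [])[:n]:     # connect v to its n most similar terms
            if u in G and u != v:
                G.add_edge(v, u, weight=sim)
    return [c for c in chinese_whispers(G) if len(c) >= k]  # drop tiny clusters
```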
Each word in a sense cluster has a weight which is equal to the similarity score between this word and the ambiguous word t.

Pooling of Word Vectors
At this stage, we calculate sense embeddings for each sense in the induced inventory. We assume that a word sense is a composition of the words that represent the sense, and thus define a sense vector as a function of the word vectors of the cluster items. Let $W$ be the set of all words in the training corpus, let $\operatorname{vec}_w : W \to \mathbb{R}^m$ map words to their embeddings, and let $S_i \subset W$ be the cluster representing the $i$-th sense of the target word $t$. We experiment with two pooling strategies: the unweighted average of word vectors,

$$ s_i = \frac{1}{|S_i|} \sum_{u \in S_i} \operatorname{vec}_w(u), $$

and the weighted average of word vectors,

$$ s_i = \frac{\sum_{u \in S_i} \operatorname{sim}(t, u)\, \operatorname{vec}_w(u)}{\sum_{u \in S_i} \operatorname{sim}(t, u)}, $$

where $\operatorname{sim}(t, u)$ is the similarity of the cluster word $u$ to the ambiguous word $t$. Table 1 provides an example of weighted pooling results. While the original neighbours of the word "table" contain words related to both furniture and data, the neighbours of the sense vectors are related either to furniture or to data, but not to both at the same time. Besides, each neighbour of a sense vector has a sense identifier, as we calculate cosine similarity between sense vectors, not word vectors.
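Both pooling variants amount to a few lines of numpy. In this sketch, the cluster is a list of (word, similarity) pairs as produced by sense induction; the example values in the docstring are illustrative, not from the paper:

```python
import numpy as np

def pool_sense_vector(cluster, embeddings, weighted=True):
    """Compute a sense vector from a sense cluster.
    cluster: list of (word, similarity-to-target) pairs, e.g. the cluster
    [("chair", 0.78), ("bed", 0.74), ...] for "table (furniture)";
    embeddings: dict mapping words to their numpy word vectors."""
    vecs = np.stack([embeddings[word] for word, _ in cluster])
    if not weighted:
        return vecs.mean(axis=0)                 # unweighted average
    weights = np.array([sim for _, sim in cluster])
    return weights @ vecs / weights.sum()        # similarity-weighted average
```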

Word Sense Disambiguation
This section describes how sense vectors are used to disambiguate a word in context. Given a target word $w$ and its context words $C = \{c_1, \ldots, c_k\}$, we first map $w$ to the set of its sense vectors according to the inventory: $S = \{s_1, \ldots, s_n\}$. We use two strategies to choose the correct sense, taking vectors for context words either from the matrix of context embeddings or from the matrix of word embeddings. The first strategy is based on the probability of the sense in the given context:

$$ P(s_i \mid C) = \frac{1}{1 + e^{-\bar{c}_c \cdot s_i}}, $$

where $\bar{c}_c$ is the mean of the context embeddings, $\bar{c}_c = k^{-1} \sum_{i=1}^{k} \operatorname{vec}_c(c_i)$, and $\operatorname{vec}_c : W \to \mathbb{R}^m$ maps context words to context embeddings. Using the mean of context embeddings to calculate the sense probability is natural with CBOW, because this model optimizes exactly this mean to have a high scalar product with the embeddings of words occurring in the context and a low scalar product with those of random words (Mikolov et al., 2013).
The second disambiguation strategy is based on the similarity between the sense and the context:

$$ \operatorname{sim}(s_i, C) = \frac{s_i \cdot \bar{c}_w}{\|s_i\| \, \|\bar{c}_w\|}, $$

where $\bar{c}_w$ is the mean of the word embeddings: $\bar{c}_w = k^{-1} \sum_{i=1}^{k} \operatorname{vec}_w(c_i)$. The latter method uses only word vectors ($\operatorname{vec}_w$) and requires no context vectors ($\operatorname{vec}_c$). This is practical, as the standard implementation of word2vec does not save context embeddings, and thus most pre-computed models provide only word vectors.
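A sketch of both strategies, assuming embedding matrices indexed by word id and the sense vectors from the pooling stage; the sigmoid in the first function mirrors the probability formula above:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def disambiguate_prob(sense_vecs, context_ids, ctx_emb):
    """Strategy 1: probability-based, using the matrix of CONTEXT embeddings.
    context_ids: row indices of the context words."""
    c_bar = ctx_emb[context_ids].mean(axis=0)            # mean context embedding
    probs = [1.0 / (1.0 + np.exp(-(c_bar @ s))) for s in sense_vecs]
    return int(np.argmax(probs))

def disambiguate_sim(sense_vecs, context_ids, word_emb):
    """Strategy 2: similarity-based, using only the matrix of WORD embeddings."""
    c_bar = word_emb[context_ids].mean(axis=0)           # mean word embedding
    sims = [cosine(s, c_bar) for s in sense_vecs]
    return int(np.argmax(sims))
```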
[Table 3: Influence of context filtering on disambiguation in terms of F-score. The models were trained on the Wikipedia corpus; the w2v model is based on weighted pooling and similarity-based disambiguation. All differences between filtered and unfiltered models are significant (p < 0.05).]

To improve WSD performance, we also apply context filtering. Typically, only a few words in the context are relevant for sense disambiguation, as "chairs" and "kitchen" are for "table" in "They bought a table and chairs for the kitchen." For each word $c_j$ in the context $C = \{c_1, \ldots, c_k\}$, we calculate a score that quantifies how well it discriminates the senses, i.e. how much the value of $f(s_i, c_j)$ varies across the senses $s_i$ of the ambiguous word, where $f$ is one of our disambiguation strategies: either $P(c_j \mid s_i)$ or $\operatorname{sim}(s_i, c_j)$. The $p$ most discriminative context words are used for disambiguation.
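A sketch of the filter, with the caveat that the exact discriminativeness score is our assumption (we use the max-minus-min spread of f over the senses):

```python
import numpy as np

def filter_context(context_vecs, sense_vecs, p=2):
    """Keep the p context vectors that best discriminate the senses.
    Discriminativeness is taken here as the spread (max minus min) of the
    similarity f(s_i, c_j) over the senses (one plausible reading of the
    score described above, not necessarily the authors' exact formula)."""
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def spread(c):
        scores = [cosine(s, c) for s in sense_vecs]
        return max(scores) - min(scores)

    ranked = sorted(context_vecs, key=spread, reverse=True)
    return ranked[:p]
```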

Experiments
We evaluate our method on two complementary datasets: (1) a crowdsourced collection of sense-labeled contexts and (2) a commonly used SemEval dataset.

Evaluation on TWSI
The goal of this evaluation is to test different configurations of our approach on a large-scale dataset, i.e. it is used for development purposes.
Dataset. This test collection is based on a large-scale crowdsourced resource by Biemann (2012) that comprises 1,012 frequent nouns with an average polysemy of 2.26 senses per word. For these nouns, the dataset provides 145,140 annotated sentences sampled from Wikipedia. Besides, it is accompanied by an explicit sense inventory, where each sense is represented with a list of words that can substitute the target noun in a given sentence. The sense distribution across sentences in the dataset is skewed, with 79% of contexts assigned to the most frequent senses. Therefore, in addition to the full TWSI dataset, we also use a balanced subset that has no bias towards the Most Frequent Sense (MFS). This subset features 6,165 contexts with five contexts per sense, excluding monosemous words.
Evaluation metrics. To compute WSD performance, we create an explicit mapping between the system-provided sense inventory and the TWSI senses: senses are represented as bag-of-words vectors, which are compared using cosine similarity. Every induced sense gets assigned to at most one TWSI sense. Once the mapping is completed, we can calculate precision and recall of sense predictions with respect to the original TWSI labeling.
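The mapping can be sketched as follows, assuming every sense (induced or TWSI) is represented as a Counter over its cluster or substitution words; a similarity threshold, omitted here, would leave some induced senses unmapped:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def map_senses(induced, twsi):
    """Assign every induced sense to the best-matching TWSI sense.
    induced, twsi: dicts of sense id -> Counter over sense-representing words."""
    return {sid: max(twsi, key=lambda tid: bow_cosine(bow, twsi[tid]))
            for sid, bow in induced.items()}
```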
Performance of a disambiguation model depends on the quality of the sense mapping. The following baselines facilitate the interpretation of results:
• Upper bound of the induced inventory selects the correct sense for the context, but only if a mapping exists for this sense.
• MFS of the TWSI inventory assigns the most frequent sense in the TWSI dataset.
• MFS of the induced inventory assigns the identifier of the largest sense cluster.
• Random sense baseline of the TWSI and induced sense inventories.
Discussion of results. Table 2 presents examples of the senses induced via clustering of nearest neighbours generated by word embeddings (w2v) and JBT, as compared to the inventory produced via crowdsourcing (TWSI). The TWSI inventory contains more senses (2.26 per word on average), while the induced inventories have fewer senses (1.56 and 1.64, respectively). The senses in the table are arranged in the way they are mapped to TWSI senses during evaluation. Table 3 illustrates how the granularity of the inventory influences WSD performance. The more granular the sense inventory, the better the match between the TWSI and the induced inventory can be established (recall that we map every induced sense to at most one TWSI sense). Therefore, the upper bound of WSD performance is maximal for the most fine-grained inventories.
However, the relation of actual WSD performance to granularity is inverse: the lower the number of senses, the higher the WSD performance (in the limit, we converge to the strong MFS baseline). We therefore select a coarse-grained inventory (n = 200, k = 15) for our further experiments. Table 4 illustrates that context filtering positively impacts disambiguation performance, reaching optimal results when two context words are used.

[Table 4: Upper-bound and actual WSD performance on the sense-balanced TWSI dataset, as a function of the sense inventory used for unweighted pooling of word vectors.]

Finally, Figure 3 presents the results of our experiments on the full and sense-balanced TWSI datasets. First of all, our models significantly outperform the random sense baselines of both the TWSI and the induced inventories. Secondly, we observe that pooling vectors using similarity scores as weights is better than unweighted pooling: some clusters may contain irrelevant words, and their contribution should be discounted. Third, the similarity-based disambiguation mechanism yields better results than the probability-based one. Indeed, cosine similarity between embeddings has proven useful for semantic relatedness, yielding state-of-the-art results (Baroni et al., 2014), while there is less evidence of successful uses of CBOW as a language model. Fourth, we confirm our observation that filtering context words positively impacts WSD performance. Fifth, models based on JBT- and w2v-induced sense inventories yield comparable results; however, the JBT inventory shows higher performance (0.410 vs. 0.390) on the balanced TWSI, indicating the importance of a precise sense inventory. Finally, using the "gold" TWSI inventory significantly improves performance on the balanced dataset, outperforming the models based on induced inventories.

Evaluation on SemEval-2013 Task 13
The goal of this evaluation is to compare the performance of our method to state-of-the-art unsupervised WSD systems.
Dataset. The SemEval-2013 task 13 "Word Sense Induction for Graded and Non-Graded Senses" (Jurgens and Klapaftis, 2013) provides 20 nouns, 20 verbs and 10 adjectives in WordNet-sense-tagged contexts. It contains 20-100 contexts per word and 4,664 contexts in total, drawn from the Open American National Corpus. Participants were asked to cluster these 4,664 instances into groups, with each group corresponding to a distinct word sense.
Evaluation metrics. Performance is measured with three measures that require a mapping of sense inventories (Jaccard Index, Tau and WNDCG) and two cluster comparison measures (Fuzzy NMI and Fuzzy B-Cubed).
Discussion of results. We compare our approach to the SemEval participants and to the AdaGram sense embeddings. The AI-KU system (Baskaya et al., 2013) directly clusters test contexts using the k-means algorithm based on lexical substitution features. The Unimelb system (Lau et al., 2013) uses a hierarchical topic model to induce and disambiguate word senses. The UoS system (Hope and Keller, 2013) is the most similar to our approach: to induce senses, it builds an ego-network of a word using dependency relations, which is subsequently clustered with a simple graph clustering algorithm. The La Sapienza system (Agirre and Soroa, 2009) relies on WordNet to obtain word senses and perform disambiguation. Table 5 shows a comparative evaluation of our method on the SemEval dataset. As above, dependency-based (JBT) word similarities yield slightly better results than word-embedding similarities (w2v) for inventory induction. In addition to these two configurations, we also built a model based on the TWSI sense inventory (for nouns only, as the TWSI contains only nouns). This model significantly outperforms both the JBT- and w2v-based models, showing that precise sense inventories greatly improve WSD performance.
As one may observe, the performance of the best configurations of our method is comparable to that of the top-ranked SemEval participants, but does not systematically exceed their results. AdaGram sometimes outperforms our method and is sometimes on par, depending on the metric. We interpret these results as an indication that our method is comparable to state-of-the-art approaches.
Finally, note that none of the unsupervised WSD methods discussed in this paper, including the top-ranked SemEval submissions and AdaGram, were able to beat the most frequent sense baselines of the respective datasets (with the exception of the balanced version of TWSI). Similar results have been observed for other unsupervised WSD methods (Nieto Piña and Johansson, 2016).

Conclusion
We presented a novel approach for learning multi-prototype word embeddings. In contrast to existing approaches that learn sense embeddings directly from a corpus, our approach can operate on existing word embeddings, and it can either induce or reuse a word sense inventory. Experiments on two datasets, including a SemEval challenge on word sense induction and disambiguation, show that our approach performs comparably to the state of the art.
An implementation of our method with several pre-trained models is available online.