Context-Dependent Sense Embedding



Introduction
Distributed representation of words (aka word embedding) aims to learn continuous-valued vectors to represent words based on their context in a large corpus.* These vectors can serve as input features for algorithms in natural language processing (NLP) tasks. High-quality word embeddings have proven helpful in many NLP tasks (Collobert and Weston, 2008; Turian et al., 2010; Collobert et al., 2011; Maas et al., 2011; Chen and Manning, 2014). Recently, with the development of deep learning, many novel neural network architectures have been proposed for training high-quality word embeddings (Mikolov et al., 2013a; Mikolov et al., 2013b).

* The second author was supported by the National Natural Science Foundation of China (61503248).
However, since natural language is intrinsically ambiguous, learning one vector for each word may not cover all the senses of the word. In the case of a multi-sense word, the learned vector will be around the average of all the senses of the word in the embedding space, and therefore may not be a good representation of any of the senses. A possible solution is sense embedding which trains a vector for each sense of a word. There are two key steps in training sense embeddings. First, we need to perform word sense disambiguation (WSD) or word sense induction (WSI) to determine the senses of words in the training corpus. Then, we need to train embedding vectors for word senses according to their contexts.
Early work on sense embedding (Reisinger and Mooney, 2010; Huang et al., 2012; Neelakantan et al., 2014; Kageback et al., 2015; Li and Jurafsky, 2015) proposes context clustering methods, which determine the sense of a word by clustering aggregated embeddings of the words in its context. These methods are heuristic in nature and rely on external knowledge from lexicons such as WordNet (Miller, 1995).
Recently, sense embedding methods based on complete probabilistic models and well-defined learning objective functions (Tian et al., 2014; Bartunov et al., 2016; Jauhar et al., 2015) have become more popular. These methods regard the sense choices of the words in a sentence as hidden variables. Learning is therefore done with expectation-maximization style algorithms, which alternate between inferring word sense choices in the training corpus and learning sense embeddings.
A common problem with these methods is that they make the sense embedding of each center word dependent on the word embeddings of its context words. As we explained previously, the word embedding of a polysemous word is not a good representation and may negatively influence the quality of inference and learning. Furthermore, these methods choose the sense of each word in a sentence independently, ignoring the dependency that may exist between the sense choices of neighboring words. We argue that such dependency is important in word sense disambiguation and therefore helpful in learning sense embeddings. For example, consider the sentence "He cashed a check at the bank". Both "check" and "bank" are ambiguous here. Although the two words hint at banking-related senses, the hint is not decisive (under an alternative interpretation, they may denote a check mark at a river bank). Fortunately, "cashed" is not ambiguous and can help disambiguate "check". However, if we use a small context window in sense embedding, "cashed" cannot directly help disambiguate "bank"; we need to rely on the dependency between the sense choices of "check" and "bank" to disambiguate "bank".
In this paper, we propose a novel probabilistic model for sense embedding that takes into account the dependency between sense choices of neighboring words. We do not learn any word embeddings in our model and hence avoid the problem with embedding polysemous words discussed above. Our model has a similar structure to a high-order hidden Markov model. It contains a sequence of observable words and latent senses and models the dependency between each word-sense pair and between neighboring senses in the sequence. The energy of neighboring senses can be modeled using existing word embedding approaches such as CBOW and Skip-gram (Mikolov et al., 2013a;Mikolov et al., 2013b). Given the model and a sentence, we can perform exact inference using dynamic programming and get the optimal sense sequence of the sentence. Our model can be learned from an unannotated corpus by optimizing a max-margin objective using an algorithm similar to hard-EM.
Our main contributions are the following: 1. We propose a complete probabilistic model for sense embedding. Unlike previous work, we model the dependency between sense choices of neighboring words and do not learn sense embeddings dependent on problematic word embeddings of polysemous words.
2. Based on our proposed model, we derive an exact inference algorithm and a max-margin learning algorithm which do not rely on external knowledge from any knowledge base or lexicon (except that we determine the numbers of senses of polysemous words according to an existing sense inventory).
3. The performance of our model on contextual word similarity task is competitive with previous work and we obtain a 13% relative gain compared with previous state-of-the-art methods on the word sense induction task of SemEval-2013.
The rest of this paper is organized as follows. We introduce related work in section 2. Section 3 describes our models and algorithms in detail. We present our experiments and results in section 4. Section 5 concludes the paper.

Related Work
Distributed representation of words (aka word embedding) was proposed in 1986 (Hinton, 1986; Rumelhart et al., 1986). In 2003, Bengio et al. (2003) proposed a neural network architecture for training language models which produced word embeddings as a by-product. Mnih and Hinton (2007) replaced the global normalization layer of Bengio's model with a tree structure to accelerate training. Collobert and Weston (2008) introduced a max-margin objective function to replace the most computationally expensive max-likelihood objective function. The recently proposed Skip-gram, CBOW and GloVe models (Mikolov et al., 2013a; Mikolov et al., 2013b; Pennington et al., 2014) are more efficient than traditional models because they introduce a log-linear layer, making it possible to train word embeddings on large-scale corpora. With the development of neural networks and deep learning techniques, there has been a lot of work based on neural network models for obtaining word embeddings (Turian et al., 2010; Collobert et al., 2011; Maas et al., 2011; Chen and Manning, 2014). All of this work has shown that word embeddings are helpful in NLP tasks.
However, the models above assume that one word has only one vector as its representation, which is problematic for polysemous words. Reisinger and Mooney (2010) proposed a method for constructing multiple sense-specific representation vectors for one word by performing word sense disambiguation with context clustering. Huang et al. (2012) further extended this context clustering method and incorporated global context to learn multi-prototype representation vectors. Subsequent work extended the context clustering method and performed word sense disambiguation according to sense glosses from WordNet (Miller, 1995). Neelakantan et al. (2014) proposed an extension of the Skip-gram model combined with context clustering to estimate the number of senses for each word as well as to learn sense embedding vectors. Instead of performing word sense disambiguation, Kageback et al. (2015) proposed the instance-context embedding method based on context clustering for word sense induction. Li and Jurafsky (2015) introduced a multi-sense embedding model based on the Chinese Restaurant Process and applied it to several natural language understanding tasks.
Since the context clustering based models are heuristic in nature and rely on external knowledge, recent work tends to create probabilistic models for learning sense embeddings. Tian et al. (2014) proposed a multi-prototype Skip-gram model and designed an Expectation-Maximization (EM) algorithm to do word sense disambiguation and learn sense embedding vectors iteratively. Jauhar et al. (2015) extended the EM training framework and retrofitted embedding vectors to the ontology of WordNet. Bartunov et al. (2016) proposed a nonparametric Bayesian extension of Skip-gram to automatically learn the required numbers of representations for all words and perform word sense induction tasks.

Context-Dependent Sense Embedding Model
We propose the context-dependent sense embedding model for training high quality sense embeddings, which takes into account the dependency between sense choices of neighboring words. Unlike previous work, we do not learn any word embeddings in our model and hence avoid the problem with embedding polysemous words discussed previously. In this section, we introduce our model and describe our inference and learning algorithms.

Model
We begin with the notation of our model. In a sentence, let w_i be the i-th word of the sentence and s_i be the sense of the i-th word. S(w) denotes the set of all the senses of word w. We assume that the sense sets of different words do not overlap; therefore, in this paper a word sense can be seen as a lexeme of the word (Rothe and Schutze, 2015).

Our model can be represented as the Markov network shown in Figure 1. It is similar to a high-order hidden Markov model. The model contains a sequence of observable words (w_1, w_2, ...) and latent senses (s_1, s_2, ...). It models the dependency between each word-sense pair and between neighboring senses in the sequence. The energy function is formulated as follows:

E(w, s) = Σ_{i=1}^{l} E1(w_i, s_i) + Σ_{i=1}^{l} E2(s_{i−2k}, ..., s_{i−1}, s_i)    (1)

Here w = {w_i | 1 ≤ i ≤ l} is the set of words in a sentence of length l and s = {s_i | 1 ≤ i ≤ l} is the set of their senses. The function E1 models the dependency between a word-sense pair. As we assume that the sense sets of different words do not overlap, we can formulate E1 as follows:

E1(w_i, s_i) = 0 if s_i ∈ S(w_i), and +∞ otherwise    (2)

Here we assume that all the matched word-sense pairs have the same energy, but it would also be interesting to model the degrees of matching with different energy values in E1.

In Equation 1, the function E2 models the compatibility of neighboring senses in a context window with fixed size k. Existing embedding approaches like CBOW and Skip-gram (Mikolov et al., 2013a; Mikolov et al., 2013b) can be used here to define E2. The formulation using CBOW is as follows:

E2(s_{i−2k}, ..., s_{i−1}, s_i) = −log σ( V′(s_i)ᵀ Σ_{j=i−2k}^{i−1} V(s_j) )    (3)

Here V(s) and V′(s) are the input and output embedding vectors of sense s. The function σ is an activation function; we use the sigmoid function in our model. The formulation using Skip-gram can be defined in a similar way:

E2(s_{i−2k}, ..., s_{i−1}, s_i) = −Σ_{j=i−2k}^{i−1} log σ( V′(s_j)ᵀ V(s_i) )    (4)
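To make the CBOW-based energy concrete, the following sketch computes it from sense embeddings. This is a minimal illustration, not the paper's implementation: the integer sense ids, the dictionary-based embedding tables, and the negative log-sigmoid scoring form are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def e2_cbow(V_in, V_out, context_senses, center_sense):
    # Sum the input embeddings of the context senses, then score the
    # center sense's output embedding against that sum. Lower energy
    # means the center sense is more compatible with its context.
    context_sum = sum(V_in[s] for s in context_senses)
    return float(-np.log(sigmoid(np.dot(V_out[center_sense], context_sum))))
```

A sense whose output vector aligns with the summed context vectors receives lower energy than one pointing away from them.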

Inference
In this section, we introduce our inference algorithm. Given the model and a sentence w, we want to infer the most likely values of the hidden variables, i.e. the optimal sense sequence of the sentence, which minimizes the energy function in Equation 1:

s* = argmin_s E(w, s)    (5)

We use dynamic programming for inference, similar to the Viterbi algorithm for hidden Markov models. Specifically, for every valid assignment A_{i−2k}, ..., A_{i−1} of every subsequence of senses s_{i−2k}, ..., s_{i−1}, we define m(A_{i−2k}, ..., A_{i−1}) as the energy of the best sense sequence up to position i − 1 that is consistent with the assignment A_{i−2k}, ..., A_{i−1}. We start with m(A_1, ..., A_{2k}) = 0 and then recursively compute m in a left-to-right forward pass based on the update formula:

m(A_{i−2k+1}, ..., A_i) = min_{A_{i−2k} ∈ S(w_{i−2k})} [ m(A_{i−2k}, ..., A_{i−1}) + E2(A_{i−2k}, ..., A_{i−1}, A_i) ]    (6)

Once the forward pass finishes, we can retrieve the best sense sequence with a backward pass. The time complexity of the algorithm is O(n^{4k} l), where n is the maximal number of senses of a word. Because most words in a typical sentence have either a single sense or far fewer than n senses, the algorithm runs fast in practice.
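A minimal sketch of this Viterbi-style dynamic program is shown below, simplified to a first-order model: the paper's model conditions each sense on the previous 2k senses, but here each sense depends only on its immediate predecessor to keep the code short. The `senses_of` and `pair_energy` callables are illustrative stand-ins for the sense inventory and the E2 energy.

```python
def infer_senses(words, senses_of, pair_energy):
    """Return the minimum-energy sense sequence for `words`.

    senses_of(w)        -> list of candidate sense ids for word w
    pair_energy(s1, s2) -> energy of the adjacent sense pair (s1, s2)
    """
    # best[s] = (energy of the best sequence ending in sense s, that sequence)
    best = {s: (0.0, [s]) for s in senses_of(words[0])}
    for w in words[1:]:
        new_best = {}
        for s in senses_of(w):
            # Minimize over all senses of the previous word.
            e, path = min(
                (prev_e + pair_energy(prev_s, s), prev_path)
                for prev_s, (prev_e, prev_path) in best.items()
            )
            new_best[s] = (e, path + [s])
        best = new_best
    _, path = min(best.values())
    return path
```

With a toy energy that favors banking-related sense pairs, the chain of pairwise dependencies lets the unambiguous "cashed" pull both "check" and "bank" toward their banking senses, mirroring the example from the introduction.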

Learning
In this section, we introduce our unsupervised learning algorithm. In learning, we want to find all the input and output sense embedding vectors that optimize the following max-margin objective function:

min_Θ Σ_{w∈C} min_s Σ_{i=1}^{l} Σ_{s′∈S_neg(w_i)} max(0, 1 + E2(s_{i−2k}, ..., s_{i−1}, s_i) − E2(s_{i−2k}, ..., s_{i−1}, s′))    (7)

Here Θ is the set of all the parameters, including V and V′ for all the senses, and C is the set of training sentences. Our learning objective is similar to the negative sampling and max-margin objectives proposed for word embedding (Collobert and Weston, 2008). S_neg(w_i) denotes the set of negative sample senses of word w_i, which is defined with the following strategy: for a polysemous word w_i, S_neg(w_i) = S(w_i) \ {s_i}; for words with a single sense, S_neg(w_i) is a set of randomly selected senses of a fixed size.
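The hinge on energies can be sketched as follows. This is a minimal illustration patterned on Collobert and Weston's ranking objective; the margin value of 1 and the exact form of the loss are our assumptions.

```python
def margin_loss(e_pos, e_negs, margin=1.0):
    # The inferred sense's energy should be at least `margin` lower than
    # the energy of every negative sense; otherwise a penalty accrues.
    return sum(max(0.0, margin + e_pos - e_neg) for e_neg in e_negs)
```

When the positive sense's energy is already lower than each negative's by the margin, the loss is zero and no gradient flows.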
The objective in Equation 7 can be optimized by coordinate descent, which in our case is equivalent to the hard Expectation-Maximization (EM) algorithm. In the hard E-step, we run the inference algorithm with the current model parameters to get the optimal sense sequences of the training sentences. In the M-step, with the sense sequences s of all the sentences fixed, we learn the sense embedding vectors. Assume we use the CBOW model for E2 (Equation 3); then the M-step objective function is as follows:

min_Θ Σ_{w∈C} Σ_{i=1}^{l} Σ_{s′∈S_neg(w_i)} max(0, 1 + E2_CBOW(s_{i−2k}, ..., s_{i−1}, s_i) − E2_CBOW(s_{i−2k}, ..., s_{i−1}, s′))    (8)

Here E1 is omitted because the sense sequences produced by the E-step always have zero E1 value. Similarly, if we use the Skip-gram model for E2 (Equation 4), then the M-step objective function is:

min_Θ Σ_{w∈C} Σ_{i=1}^{l} Σ_{s′∈S_neg(w_i)} max(0, 1 + E2_SG(s_{i−2k}, ..., s_{i−1}, s_i) − E2_SG(s_{i−2k}, ..., s_{i−1}, s′))    (9)

We optimize the M-step objective function using stochastic gradient descent.
We use a mini-batch version of the hard EM algorithm. For each sentence in the training corpus, we run the E-step to infer its sense sequence and then immediately run the M-step (one iteration of stochastic gradient descent) to update the model parameters based on the senses in the sentence. The batch size of our algorithm therefore depends on the length of each sentence.

The advantage of using mini-batches is twofold. First, since our learning objective is highly non-convex (Tian et al., 2014), the randomness in mini-batch hard EM may help us avoid getting trapped in local optima. Second, the model parameters are updated more frequently in mini-batch hard EM, resulting in faster convergence.
Note that before running hard EM, we need to determine the size of S(w) for each word w. In our experiments, we used the sense inventory provided by the Coarse-Grained English All-Words Task of SemEval-2007 (Task 07) (Navigli et al., 2007) to determine the number of senses for each word. This sense inventory is a coarse-grained version of the WordNet sense inventory. We do not use the WordNet sense inventory directly because the senses in WordNet are too fine-grained and are difficult to distinguish even for human annotators (Edmonds and Kilgarriff, 2002). Since we do not link our learned senses to external sense inventories, our approach can be seen as performing WSI rather than WSD.

Experiments
This section presents our experiments and results. First, we describe our experimental setup including the training corpus and the model configuration. Then, we perform a qualitative evaluation on our model by presenting the nearest neighbors of senses of some polysemous words. Finally, we introduce two different tasks and show the experimental results on these tasks respectively.

Training Corpus
Our training corpus is the commonly used Wikipedia corpus. We used the October 2015 dump of Wikipedia, which contains 3.6 million articles. In our experiments, we removed infrequent words with fewer than 20 occurrences; the resulting training corpus contains 1.3 billion tokens.
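The frequency filtering described above can be sketched as follows. The threshold of 20 occurrences comes from the text; removing individual tokens in place (rather than, say, dropping whole sentences) is our assumption about the preprocessing.

```python
from collections import Counter

def filter_infrequent(sentences, min_count=20):
    # Count every token across the corpus, then drop tokens that occur
    # fewer than min_count times while keeping sentence boundaries.
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count]
            for sent in sentences]
```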

Configuration
In our experiments, we set the context window size to 5 (5 words before and after the center word) and the embedding vector size to 300. The size of the negative sample set for single-sense words is set to 5. We trained our model using AdaGrad stochastic gradient descent (Duchi et al., 2010) with the initial learning rate set to 0.025. Our configuration is similar to that of previous work.
Similar to Word2vec, we initialized our model with random sense embedding vectors. The number of senses of each word is determined with the sense inventory provided by the Coarse-Grained English All-Words Task of SemEval-2007 (Navigli et al., 2007), as explained in section 3.3.

Case Study
In this section, we give a qualitative evaluation of our model by presenting the nearest neighbors of the senses of some polysemous words. Table 1 shows the results of our qualitative evaluation. We list several polysemous words in the table and, for each word, pick some typical senses; the nearest neighbors of each sense are listed alongside. We used cosine distance between sense embedding vectors to find the nearest neighbors.
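The nearest-neighbor lookup behind Table 1 can be sketched as follows. The sense labels and the brute-force ranking are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def nearest_senses(query, embeddings, topn=2):
    # Rank all senses by cosine similarity to the query vector and
    # return the labels of the topn closest senses.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(embeddings.items(), key=lambda kv: -cos(query, kv[1]))
    return [label for label, _ in scored[:topn]]
```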
In Table 1, we can observe that our model produces good senses for polysemous words. For example, the word "bank" can be seen to have three different sense embedding vectors. The first one means the financial institution. The second one means the sloping land beside water. The third one means the action of tipping laterally.

Word Similarity in Context
This section gives a quantitative evaluation of our model on word similarity tasks. Word similarity tasks evaluate a model's performance with Spearman's rank correlation between the similarity scores of pairs of words given by the model and the manual labels. However, traditional word similarity tasks like Wordsim-353 (Finkelstein et al., 2001) are not suitable for evaluating sense embedding models, because these datasets do not include enough ambiguous words and there is no context information for the models to infer and disambiguate the senses of the words. To overcome this issue, Huang et al. (2012) released a new dataset named Stanford's Contextual Word Similarities (SCWS). The dataset consists of 2003 pairs of words along with human-labelled similarity scores and the sentences containing these words.
Given a pair of words and their contexts, we can perform inference with our model to disambiguate the questioned words. A similarity score can then be calculated as the cosine similarity between the embedding vectors of the two inferred senses. We also propose another method for calculating similarity scores: in the inference process, we compute the energy of each sense choice of the questioned word and treat the negative energy as the confidence of that sense choice. Then we calculate the cosine similarity between all pairs of senses of the questioned words and compute the average similarity weighted by the confidences of the senses. The first method is named HardSim and the second is named SoftSim.

Table 2 shows the results of our context-dependent sense embedding models on the SCWS dataset. In this table, ρ refers to Spearman's rank correlation; a higher value of ρ indicates better performance. The baseline performances are from Huang et al. (2012), Neelakantan et al. (2014), Li and Jurafsky (2015), Tian et al. (2014) and Bartunov et al. (2016). Here Ours + CBOW denotes our model with a CBOW-based energy function and Ours + Skip-gram denotes our model with a Skip-gram-based energy function. The results above the thick line are from models based on context clustering methods and the results below the thick line are from probabilistic models, including ours. The similarity metrics of the context clustering based models are AvgSim and AvgSimC, proposed by Reisinger and Mooney (2010). Tian et al. (2014) propose two metrics, Model M and Model W, which are similar to our HardSim and SoftSim metrics.
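A SoftSim-style score under these definitions can be sketched as follows. Representing each questioned word as a list of (vector, confidence) pairs and weighting each sense pair by the product of confidences are our assumptions about the exact formulation.

```python
import numpy as np

def soft_sim(senses1, senses2):
    # Confidence-weighted average of cosine similarities over all pairs
    # of candidate senses for the two questioned words.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    total, total_w = 0.0, 0.0
    for v1, c1 in senses1:
        for v2, c2 in senses2:
            w = c1 * c2
            total += w * cos(v1, v2)
            total_w += w
    return total / total_w
```

HardSim, by contrast, would simply take the cosine similarity between the two single inferred senses.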
From Table 2, we can observe that our model outperforms the other probabilistic models but is not as good as the best context clustering based model. The context clustering based models are overall better than the probabilistic models on this task; a possible reason is that most context clustering based methods make use of more external knowledge than probabilistic models. However, note that Faruqui et al. (2016) presented several problems with evaluating word vectors on word similarity datasets and argued that the use of word similarity tasks for evaluating word vectors is not sustainable. Bartunov et al. (2016) also suggest that SCWS should be of limited use for evaluating word representation models. Therefore, the results on this task should be taken with caution. We believe that more realistic natural language processing tasks, such as word sense induction, are better suited for evaluating sense embedding models.

Word Sense Induction
In this section, we present an evaluation of our model on the word sense induction (WSI) task. The WSI task aims to discover the different meanings of words used in sentences. Unlike a word sense disambiguation (WSD) system, a WSI system does not link its sense annotations to an existing sense inventory. Instead, it produces its own sense inventory and links the sense annotations to that inventory. Our model can be seen as a WSI system, so we can evaluate it on WSI tasks.
We used the dataset from task 13 of SemEval-2013 as our evaluation set (Jurgens and Klapaftis, 2013). The dataset contains 4664 instances, each an occurrence of an inflected form of one of 50 lemmas. Both single-sense instances and instances with a graded mixture of senses are included in the dataset; in this paper, we only consider the single-sense instances. Jurgens and Klapaftis (2013) propose two fuzzy measures, Fuzzy B-Cubed (FBC) and Fuzzy Normalized Mutual Information (FNMI), for comparing fuzzy sense assignments from WSI systems. The FBC measure summarizes performance per instance, while the FNMI measure is based on sense clusters rather than instances.

Table 3 shows the results of our context-dependent sense embedding models on this dataset. Here HM is the harmonic mean of FBC and FNMI. The result of AI-KU is from Baskaya et al. (2013), MSSG is from Neelakantan et al. (2014), and ICE-online and ICE-kmeans are from Kageback et al. (2015). Our models are denoted in the same way as in the previous section.
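The combined HM score mentioned above is the standard harmonic mean of the two measures:

```python
def harmonic_mean(fbc, fnmi):
    # Harmonic mean combining FBC and FNMI into one summary score;
    # it is dominated by the lower of the two values.
    return 2 * fbc * fnmi / (fbc + fnmi)
```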

Conclusion
In this paper we propose a novel probabilistic model for learning sense embeddings. Unlike previous work, we do not make sense embeddings dependent on word embeddings and hence avoid the problem of inaccurate embeddings of polysemous words. Furthermore, we model the dependency between sense choices of neighboring words, which helps us disambiguate multiple ambiguous words in a sentence. Based on our model, we derive a dynamic programming inference algorithm and an EM-style unsupervised learning algorithm that do not rely on external knowledge from any knowledge base or lexicon, except that we determine the number of senses of polysemous words according to an existing sense inventory. We evaluate our model both qualitatively with a case study and quantitatively on the word similarity and word sense induction tasks. Our model is competitive with previous work on the word similarity task. On the word sense induction task, our model outperforms the state-of-the-art model with a 13% relative gain. For future work, we plan to learn our model with soft EM. We also plan to use shared senses instead of lexemes to improve the generality of our model, to study unsupervised methods for linking the learned senses to existing inventories and automatically determining the numbers of senses, and to evaluate our model on more NLP tasks.