Learning Topic-Sensitive Word Representations

Distributed word representations are widely used for modeling words in NLP tasks. Most of the existing models generate one representation per word and do not consider different meanings of a word. We present two approaches to learn multiple topic-sensitive representations per word using a Hierarchical Dirichlet Process. We observe that by modeling topics and integrating topic distributions for each document we obtain representations that are able to distinguish between different meanings of a given word. Our models yield statistically significant improvements for the lexical substitution task, indicating that commonly used single word representations, even when combined with contextual information, are insufficient for this task.


Introduction
Word representations in the form of dense vectors, or word embeddings, capture semantic and syntactic information (Mikolov et al., 2013a; Pennington et al., 2014) and are widely used in many NLP tasks (Zou et al., 2013; Levy and Goldberg, 2014; Tang et al., 2014; Gharbieh et al., 2016). Most of the existing models generate one representation per word and do not distinguish between different meanings of a word. However, many tasks can benefit from using multiple representations per word to capture polysemy (Reisinger and Mooney, 2010). There have been several attempts to build repositories for word senses (Miller, 1995; Navigli and Ponzetto, 2010), but this is laborious and limited to few languages. Moreover, defining a universal set of word senses is challenging as polysemous words can exist at many levels of granularity (Kilgarriff, 1997; Navigli, 2012).
In this paper, we introduce an approach that uses a nonparametric Bayesian model, the Hierarchical Dirichlet Process (HDP), to learn multiple topic-sensitive representations per word. Yao and van Durme (2011) show that HDP is effective in learning topics, yielding state-of-the-art performance for sense induction. However, they assume that topics and senses are interchangeable, and train individual models per word, making it difficult to scale to large data. Our approach enables us to use HDP to model senses effectively using large unannotated training data. We investigate to what extent distributions over word senses can be approximated by distributions over topics without assuming these concepts to be identical. The contributions of this paper are: (i) We propose three unsupervised, language-independent approaches to approximate senses with topics and learn multiple topic-sensitive embeddings per word. (ii) We show that in the Lexical Substitution ranking task our models outperform two competitive baselines.

Topic-Sensitive Representations
In this section we describe the proposed models. To learn topics from a corpus we use HDP (Teh et al., 2006; Lau et al., 2014). The main advantage of this model compared to non-hierarchical methods like the Chinese Restaurant Process (CRP) is that each document in the corpus is modeled using a mixture model with topics shared between all documents (Teh et al., 2005; Brody and Lapata, 2009). HDP yields two sets of distributions that we use in our methods: distributions over topics for words in the vocabulary, and distributions over topics for documents in the corpus.
Similarly to Neelakantan et al. (2014), we use neighboring words to detect the meaning of the context; however, we also use the two HDP distributions. By doing so, we take advantage of the topic of the document beyond the scope of the neighboring words, which is helpful when the immediate context of the target word is not sufficiently informative. We modify the Skipgram model (Mikolov et al., 2013a) to obtain multiple topic-sensitive representations per word type using topic distributions. In addition, we do not cluster context windows or train separately for each sense of a word. This reduces the sparsity problem and provides a better representation estimation for rare words. We assume that meanings of words can be determined by their contextual information and use the distribution over topics to differentiate between occurrences of a word in different contexts, i.e., documents with different topics. We propose three different approaches (see Figure 1): two methods with hard topic labeling of words and one with soft labeling.

Hard Topic-Labeled Representations
The trained HDP model can be used to hard-label a new corpus with one topic per word through sampling. Our first model variant (HTLE, Figure 1(a)) relies on hard labeling by simply considering each word-topic pair as a separate vocabulary entry. To reduce sparsity on the context side and share the word-level information between similar contexts, we use topic-sensitive representations for target words (input to the network) and standard, i.e., unlabeled, word representations for context words (output). Note that this results in different input and output vocabularies. The training objective is then to maximize the log-likelihood of context words $w_{i+j}$ given the target word-topic pair $w_i^{\tau}$:
$$\mathcal{L}_{HTLE} = \frac{1}{I} \sum_{i=1}^{I} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{i+j} \mid w_i^{\tau})$$
where $I$ is the number of words in the training corpus, $c$ is the context size and $\tau$ is the topic assigned to $w_i$ by HDP sampling. The embedding of a word in context $h(w_i)$ is obtained by simply extracting the row of the input lookup table ($r$) corresponding to the HDP-labeled word-topic pair:
$$h(w_i) = r(w_i^{\tau})$$
A possible shortcoming of the HTLE model is that the representations are trained separately and information is not shared between different topic-sensitive representations of the same word. To address this issue, we introduce a model variant that learns multiple topic-sensitive word representations and generic word representations simultaneously (Figure 1(b)). In this variant (HTLEadd), the target word embedding is obtained by adding the word-topic pair representation ($r'$) to the generic representation of the corresponding word ($r^{0}$):
$$h(w_i) = r^{0}(w_i) + r'(w_i^{\tau})$$
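The two hard-labeled lookups and the construction of training pairs can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the dictionary-based lookup tables, and the representation of the HDP-labeled corpus as a list of (word, topic) tuples are all assumptions made for illustration.

```python
def h_htle(r, word, topic):
    """HTLE: return the input-table row of the word-topic pair."""
    return r[(word, topic)]


def h_htle_add(r0, r1, word, topic):
    """HTLEadd: generic word vector plus the word-topic vector."""
    return [a + b for a, b in zip(r0[word], r1[(word, topic)])]


def training_pairs(labeled_tokens, c):
    """Yield (target word-topic pair, unlabeled context word) pairs.

    labeled_tokens: list of (word, topic) tuples, assumed to come from
    HDP sampling; c is the context size on each side of the target.
    """
    n = len(labeled_tokens)
    for i, (word, topic) in enumerate(labeled_tokens):
        for j in range(max(0, i - c), min(n, i + c + 1)):
            if j != i:
                # Context words stay unlabeled (output vocabulary).
                yield (word, topic), labeled_tokens[j][0]


# Toy usage with hypothetical topic ids.
tokens = [("bat", 3), ("flew", 3), ("away", 1)]
pairs = list(training_pairs(tokens, c=1))
r0 = {"bat": [1.0, 2.0]}
r1 = {("bat", 3): [0.5, -1.0]}
combined = h_htle_add(r0, r1, "bat", 3)
```

Keeping the context side unlabeled means that only the input table grows with the number of topics, which matches the different input and output vocabularies noted above.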

Soft Topic-Labeled Representations
The model variants above rely on the hard labeling resulting from HDP sampling. As a soft alternative to this (STLE), we can directly include the topic distributions estimated by HDP for each document (Figure 1(c)). Specifically, for each update, we use the topic distribution to compute a weighted sum over the word-topic representations ($r''$):
$$h(w_i) = \sum_{k=1}^{T} p(\tau_k \mid d_i)\, r''(w_i^{\tau_k})$$
where $T$ is the total number of topics, $d_i$ the document containing $w_i$, and $p(\tau_k \mid d_i)$ the probability assigned to topic $\tau_k$ by HDP in document $d_i$. The training objective for this model is:
$$\mathcal{L}_{STLE} = \frac{1}{I} \sum_{i=1}^{I} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{i+j} \mid w_i, \tau)$$
where $\tau$ is the topic of document $d_i$ learned by HDP. The STLE model has the advantage of directly applying the distribution over topics in the Skipgram model. In addition, for each instance, we update all topic representations of a given word with non-zero probabilities, which has the potential to reduce the sparsity problem.
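The soft-labeled lookup amounts to a probability-weighted sum of a word's topic vectors. The following is a small sketch under assumed data structures (a dictionary keyed by (word, topic index) and a list of document-level topic probabilities); the names are hypothetical and the code is not taken from the authors' system.

```python
def h_stle(r_topic, word, doc_topic_probs):
    """Weighted sum of the word's topic vectors under p(topic | document).

    r_topic: dict mapping (word, topic index) -> vector (list of floats).
    doc_topic_probs: list where entry k is p(topic k | document).
    """
    dim = len(next(iter(r_topic.values())))
    h = [0.0] * dim
    for k, p in enumerate(doc_topic_probs):
        if p == 0.0:
            # Topics with zero probability contribute nothing and
            # would receive no update during training.
            continue
        vec = r_topic[(word, k)]
        for d in range(dim):
            h[d] += p * vec[d]
    return h


# Toy usage: two topics, two dimensions.
r_topic = {("bat", 0): [1.0, 0.0], ("bat", 1): [0.0, 2.0]}
h = h_stle(r_topic, "bat", [0.25, 0.75])
```

Because every topic with non-zero probability contributes to the sum, a gradient step through this lookup touches all of those topic vectors at once, which is the sparsity argument made above.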

Embeddings for Polysemous Words
The representations obtained from our models are expected to capture the meaning of a word in different topics. We now investigate whether these representations can distinguish between different word senses. Table 1 provides examples of nearest neighbors. For comparison we include our own baseline, i.e., embeddings learned with Skipgram on our corpus, as well as Word2Vec (Mikolov et al., 2013b) and GloVe embeddings (Pennington et al., 2014) pre-trained on large data.
In the first example, the word bat has two different meanings: animal or sports device. One can see that the nearest neighbors of the baseline and pre-trained word representations either center around one primary, i.e., most frequent, meaning of the word, or form a mixture of different meanings. The topic-sensitive representations, on the other hand, correctly distinguish between the two different meanings. A similar pattern is observed for the word jaguar and its two meanings: car or animal. The last example, appeal, illustrates a case where topic-sensitive embeddings do not clearly detect different meanings of the word, despite having some correct words in the lists. Here, the meaning attract does not seem to be captured by any embedding set.
These observations suggest that topic-sensitive representations capture different word senses to some extent. To provide a systematic validation of our approach, we now investigate whether topic-sensitive representations can improve tasks where polysemy is a known issue.

Evaluation
In this section we present the setup for our experiments and empirically evaluate our approach on the context-aware word similarity and lexical substitution tasks.

Experimental setup
All word representations are learned on the English Wikipedia corpus containing 4.8M documents (1B tokens). The topics are learned on a 100K-document subset of this corpus using the HDP implementation of Teh et al. (2006). Once the topics have been learned, we run HDP on the whole corpus to obtain the word-topic labeling (see Section 2.1) and the document-level topic distributions (Section 2.2). We train each model variant with window size $c = 10$ and different embedding sizes (100, 300, 600) initialized randomly. We compare our models to two baselines: Skipgram (SGE) and the best-performing multi-sense embeddings model per word type (MSSG) (Neelakantan et al., 2014). All model variants are trained on the same training data with the same settings, following suggestions by Mikolov et al. (2013a) and Levy et al. (2015). For MSSG we use the best performing similarity measure (avgSimC) as proposed by Neelakantan et al. (2014).

Context-Aware Word Similarity Task
Despite its shortcomings (Faruqui et al., 2016), word similarity remains the most frequently used method of evaluation in the literature. There are multiple test sets available but in almost all of them word pairs are considered out of context. To the best of our knowledge, the only word similarity data set providing word context is SCWS (Huang et al., 2012). To evaluate our models on SCWS, we run HDP on the data treating each word's context as a separate document. We compute the similarity of each word pair as follows:
$$\mathrm{Sim}(w_1, w_2) = \cos(h(w_1), h(w_2))$$
where $h(w_i)$ refers to any of the topic-sensitive representations defined in Section 2. Note that $w_1$ and $w_2$ can refer to the same word.
Table 2 provides the Spearman's correlation scores for different models against the human ranking. We see that with dimensions 100 and 300, two of our models obtain improvements over the baseline. The MSSG model of Neelakantan et al. (2014) performs only slightly better than our HTLE model while requiring considerably more parameters (600 vs. 100 embedding size).
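For readers unfamiliar with the metric, Spearman's correlation is the Pearson correlation of the two rank vectors. The sketch below is a generic pure-Python version with average ranks for ties; it is an illustration only, not the evaluation script used for Table 2, and the function names are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def spearman(model_scores, human_scores):
    """Spearman's rho between model similarities and human ratings."""
    return pearson(average_ranks(model_scores), average_ranks(human_scores))
```

Only the ordering of the model similarities matters here, so any monotone rescaling of the cosine scores leaves the reported correlation unchanged.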

Lexical Substitution Task
This task requires one to identify the best replacements for a word in a sentential context. The presence of many polysemous target words makes this task more suitable for evaluating sense embeddings. Following Melamud et al. (2015) we pool substitutions from different instances and rank them by the number of annotators that selected them for a given context. We use two evaluation sets: LS-SE07 (McCarthy and Navigli, 2007) and LS-CIC (Kremer et al., 2014). Unlike previous work (Szarvas et al., 2013; Kremer et al., 2014; Melamud et al., 2015) we do not use any syntactic information, motivated by the fact that high-quality parsers are not available for most languages. The evaluation is performed by computing the Generalized Average Precision (GAP) score (Kishida, 2005). We run HDP on the evaluation set and compute the similarity between target word $w_t$ and each substitution $w_s$ using two different inference methods in line with how we incorporate topics during training:
Sampled (Smp):
$$\mathrm{Sim}_{TSE}(w_s, w_t) = \cos(h(w_s^{\tau}), h(w_t^{\tau'})) + \frac{\sum_{c} \cos(h(w_s^{\tau}), o(w_c))}{C}$$
Expected (Exp):
$$\mathrm{Sim}_{TSE}(w_s, w_t) = \sum_{\tau, \tau'} p(\tau)\, p(\tau') \cos(h(w_s^{\tau}), h(w_t^{\tau'})) + \frac{\sum_{\tau, c} \cos(h(w_s^{\tau}), o(w_c))\, p(\tau)}{C}$$
where $h(w_s^{\tau})$ and $h(w_t^{\tau'})$ are the representations for substitution word $s$ with topic $\tau$ and target word $t$ with topic $\tau'$ respectively (cf. Section 2), $w_c$ are context words of $w_t$ taken from a sliding window of the same size as the embeddings, $o(w_c)$ is the context (i.e., output) representation of $w_c$, and $C$ is the total number of context words. Note that these two methods are consistent with how we train HTLE and STLE.
The sampled method, similar to HTLE, uses the HDP model to assign topics to word occurrences during testing. The expected method, similar to STLE, uses the HDP model to learn the probability distribution of topics of the context sentence and uses the entire distribution to compute the similarity. For the Skipgram baseline we compute the similarity $\mathrm{Sim}_{SGE+C}(w_s, w_t)$ as follows:
$$\mathrm{Sim}_{SGE+C}(w_s, w_t) = \cos(h(w_s), h(w_t)) + \frac{\sum_{c} \cos(h(w_s), o(w_c))}{C}$$
which uses the similarity between the substitution word and all words in the context, as well as the similarity of target and substitution words.
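The sampled and expected scoring rules can be sketched in pure Python as follows. This is an illustration under assumptions rather than the authors' code: vectors are plain lists, the per-topic embeddings are passed in as lists indexed by topic, and a single topic distribution `p` is used for both the substitution and the target word.

```python
import math


def cos(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sim_sampled(h_s, h_t, context_vecs):
    """Smp: one sampled topic per word, so one vector per word.

    With plain Skipgram vectors as h_s and h_t, the same expression
    gives the SGE+C baseline score.
    """
    ctx = sum(cos(h_s, o) for o in context_vecs)
    return cos(h_s, h_t) + (ctx / len(context_vecs) if context_vecs else 0.0)


def sim_expected(h_s_by_topic, h_t_by_topic, p, context_vecs):
    """Exp: expectation of the sampled score under topic distribution p."""
    topics = range(len(p))
    pair = sum(
        p[a] * p[b] * cos(h_s_by_topic[a], h_t_by_topic[b])
        for a in topics
        for b in topics
    )
    ctx = sum(p[a] * cos(h_s_by_topic[a], o) for a in topics for o in context_vecs)
    return pair + (ctx / len(context_vecs) if context_vecs else 0.0)


# Toy usage: with a one-hot topic distribution both methods agree.
context = [[1.0, 0.0], [0.0, 1.0]]
by_topic = [[1.0, 0.0], [0.0, 1.0]]
smp = sim_sampled(by_topic[0], by_topic[0], context)
exp = sim_expected(by_topic, by_topic, [1.0, 0.0], context)
```

When the topic distribution is peaked on one topic the expected score reduces to the sampled one, so the two methods differ only when HDP is uncertain about the topic of the context.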
Table 3 shows the GAP scores of our models and baselines. One can see that all models using multiple embeddings per word perform better than SGE. Our proposed models outperform both SGE and MSSG in both evaluation sets, with more pronounced improvements for LS-CIC. We further observe that our expected method is more robust and performs better for all embedding sizes.
Table 4 shows the GAP scores broken down by the main word classes: noun, verb, adjective, and adverb. With 100 dimensions our best model (HTLE) yields improvements across all POS tags, with the largest improvements for adverbs and smallest improvements for adjectives. When increasing the dimension size of embeddings the improvements hold up for all POS tags apart from adverbs. This suggests that larger dimension sizes capture semantic similarities for adverbs and context words better than for other part-of-speech categories. Additionally, we observe for both evaluation sets that the improvements are preserved when increasing the embedding size.

Related Work
While the most commonly used approaches learn one embedding per word type (Mikolov et al., 2013a; Pennington et al., 2014), recent studies have focused on learning multiple embeddings per word due to the ambiguous nature of language (Qiu et al., 2016). Huang et al. (2012) cluster word contexts and use the average embedding of each cluster as word sense embeddings, which yields improvements on a word similarity task. Neelakantan et al. (2014) propose two approaches, both based on clustering word contexts: in the first, they fix the number of senses manually, and in the second, they use an ad-hoc greedy procedure that allocates a new representation to a word if existing representations explain the context below a certain threshold. Li and Jurafsky (2015) use a CRP model to distinguish between senses of words and train vectors for senses, where the number of senses is not fixed. They use two heuristic approaches for assigning senses in a context: 'greedy', which assigns the locally optimum sense label to each word, and 'expectation', which computes the expected value for a word in a given context with probabilities for each possible sense.

Conclusions
We have introduced an approach to learn topic-sensitive word representations that exploits the document-level context of words and does not require annotated data or linguistic resources. Our evaluation on the lexical substitution task suggests that topic distributions capture word senses to some extent. Moreover, we obtain statistically significant improvements in the lexical substitution task while not using any syntactic information. The best results are achieved by our hard topic-labeled model, which learns topic-sensitive representations by assigning topics to target words.

Table 2: Spearman's rank correlation performance for the Word Similarity task on SCWS.

Table 3: GAP scores on LS-SE07 and LS-CIC sets. For SGE + C we use the context embeddings to disambiguate the substitutions. Improvements over the best baseline (MSSG) are marked at p < .01 and at p < .05.

Table 4: GAP scores on the candidate ranking task on LS-SE07 for different part-of-speech categories.