Query Expansion with Locally-Trained Word Embeddings

Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus and query specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.


INTRODUCTION
Continuous space embeddings such as word2vec [29] or GloVe [33] project terms in a vocabulary to a dense, lower dimensional space. Recent results in the natural language processing community demonstrate the effectiveness of these methods for analogy and word similarity tasks. In general, these approaches provide global representations of words; each word has a fixed representation, regardless of any discourse context. While a global representation provides some advantages, language use can vary dramatically by topic. For example, ambiguous terms can easily be disambiguated given local information in immediately surrounding words [17,49]. The window-based training of word2vec style algorithms exploits this distributional property.
A global word embedding, even when trained using local windows, risks capturing only coarse representations of those topics dominant in the corpus. While a particular embedding may be appropriate for a specific word within a sentence-length context globally, it may be entirely inappropriate within a specific topic. Gale et al. refer to this as the 'one sense per discourse' property [15]. Previous work by Yarowsky demonstrates that this property can be successfully combined with information from nearby terms for word sense disambiguation [50]. Our work extends this approach to word2vec-style training in the context of word similarity.
For many tasks that require topic-specific linguistic analysis, we argue that topic-specific representations should outperform global representations. Indeed, it is difficult to imagine a natural language processing task that would not benefit from an understanding of the local topical structure. Our work focuses on query expansion, an information retrieval task where we can study different lexical similarity methods with an extrinsic evaluation metric (i.e. retrieval metrics). Recent work has demonstrated that similarity based on global word embeddings can be used to outperform classic pseudo-relevance feedback techniques [40,2].
We propose that embeddings be learned on topically-constrained corpora, instead of large topically-unconstrained corpora. In a retrieval scenario, this amounts to retraining an embedding on documents related to the topic of the query. We present local embeddings which capture the nuances of topic-specific language better than global embeddings. There is substantial evidence that global methods underperform local methods for information retrieval tasks such as query expansion [48], latent semantic analysis [20,37,39], cluster-based retrieval [41,42,47], and term clustering [5]. We demonstrate that the same holds true when using word embeddings for text retrieval.

MOTIVATION
For the purpose of motivating our approach, we will restrict ourselves to word2vec, although other methods behave similarly [26]. These algorithms involve discriminatively training a neural network to predict a word given a small set of context words. More formally, given a target word $w$ and observed context $c$, the instance loss is defined as,
$$\ell(w, c) = \log \sigma(\phi(w) \cdot \psi(c)) + \eta \cdot \mathbb{E}_{\bar{w}}\left[\log \sigma(-\phi(w) \cdot \psi(\bar{w}))\right]$$
where $\phi : V \to \mathbb{R}^k$ projects a term into a $k$-dimensional embedding space, $\psi : V^m \to \mathbb{R}^k$ projects a set of $m$ terms into a $k$-dimensional embedding space, and $\bar{w}$ is a randomly sampled 'negative' context. The parameter $\eta$ controls the sampling of random negative terms. These matrices are estimated over a set of contexts sampled from a large corpus and minimize the expected loss,
$$\mathcal{L}_c = \mathbb{E}_{w,c \sim p_c}\left[\ell(w, c)\right] \tag{1}$$
where $p_c$ is the distribution of word-context pairs in the training corpus and can be estimated from corpus statistics. While using corpus statistics may make sense absent any other information, oftentimes we know that our analysis will be topically constrained. For example, we might be analyzing the 'sports' documents in a collection. The language in this domain is more specialized and the distribution over word-context pairs is unlikely to be similar to $p_c(w, c)$. In fact, prior work in information retrieval suggests that documents on subtopics in a collection have very different unigram distributions compared to the whole corpus [9]. Let $p_t(w, c)$ be the probability of observing a word-context pair conditioned on the topic $t$.
The expected loss under this distribution is [38],
$$\mathcal{L}_t = \mathbb{E}_{w,c \sim p_c}\left[\frac{p_t(w, c)}{p_c(w, c)}\,\ell(w, c)\right] \tag{2}$$
In general, if our corpus consists of sufficiently diverse data (e.g. Wikipedia), the support of $p_t(w, c)$ is much smaller than and contained in that of $p_c(w, c)$. The loss, $\ell$, of a context that occurs more frequently in the topic will be amplified by the importance weight $\omega = \frac{p_t(w, c)}{p_c(w, c)}$. Because topics require specialized language, this is likely to occur; at the same time, these contexts are likely to be underemphasized in training a model according to Equation 1.
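The role of the importance weight can be checked numerically. The sketch below uses made-up distributions and per-pair loss values (every number and the helper names are hypothetical, chosen only for illustration) to show that weighting corpus-sampled pairs by the ratio of topic to corpus probability recovers the expected loss under the topic distribution, and that this can differ sharply from the corpus-level expected loss.

```python
def expected_loss(p, loss):
    """Expected loss sum_x p(x) * loss(x) over word-context pairs x."""
    return sum(prob * loss[pair] for pair, prob in p.items())


def importance_weighted_loss(p_c, p_t, loss):
    """Expected topic loss computed under the corpus distribution p_c.

    Each pair is weighted by omega = p_t(pair) / p_c(pair). This assumes the
    support of p_t is contained in that of p_c.
    """
    total = 0.0
    for pair, pc in p_c.items():
        omega = p_t.get(pair, 0.0) / pc
        total += pc * omega * loss[pair]
    return total


# Hypothetical toy distributions over (word, context) pairs.
p_c = {("cut", "tax"): 0.1, ("cut", "hair"): 0.4, ("cut", "paper"): 0.5}
p_t = {("cut", "tax"): 0.9, ("cut", "hair"): 0.1}
# Hypothetical per-pair losses: the rare corpus pair is modelled worst.
loss = {("cut", "tax"): 2.0, ("cut", "hair"): 0.5, ("cut", "paper"): 0.3}

corpus_loss = expected_loss(p_c, loss)
topic_loss = expected_loss(p_t, loss)
weighted_loss = importance_weighted_loss(p_c, p_t, loss)
print(corpus_loss, topic_loss, weighted_loss)
```

In this toy setting the pair that is rare in the corpus but dominant in the topic carries a weight of 9, so a model that looks acceptable on average over the corpus looks much worse once restricted to the topic.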
In order to quantify this, we took a topic from a TREC ad hoc retrieval collection (see Section 5 for details) and computed the importance weight for each term occurring in the set of on-topic documents. The histogram of weights $\omega$ is presented in Figure 1. While larger probabilities are expected since the size of a topic-constrained vocabulary is smaller, there is a non-trivial number of terms with much larger importance weights. If the loss, $\ell(w)$, of a word2vec embedding is worse for these words with low $p_c(w)$, then we expect these errors to be exacerbated for the topic.
Of course, these highly weighted terms may have a low value for $p_t(w)$ but a very high value relative to the corpus. We can adjust the weights by considering the pointwise Kullback-Leibler divergence for each word $w$,
$$D_w(p_t \| p_c) = p_t(w) \log \frac{p_t(w)}{p_c(w)} \tag{3}$$
Words which have a much higher value of $p_t(w)$ than $p_c(w)$ and have a high absolute value of $p_t(w)$ will have high pointwise KL divergence. Figure 2 shows the divergences for the top 100 most frequent terms in $p_t(w)$. The higher ranked terms (i.e. good query expansion candidates) tend to have much higher probabilities than found in $p_c(w)$. If the loss on those words is large, this may result in poor embeddings for the most important words for the topic. A dramatic change in distribution between the corpus and the topic has implications for performance precisely because of the objective used by word2vec (i.e. Equation 1). The training emphasizes word-context pairs occurring with high frequency in the corpus. We will demonstrate that, even with heuristic downsampling of frequent terms in word2vec, these techniques result in inferior performance for specific topics.
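The contrast between the raw importance weight and the pointwise KL divergence can be illustrated with a small sketch. It assumes the usual pointwise form, the topic probability of a word times the log of the ratio of its topic and corpus probabilities, and estimates both unigram distributions by maximum likelihood from toy token lists. The token counts and function names are hypothetical.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum likelihood unigram distribution from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def importance_weights(p_t, p_c):
    """omega(w) = p_t(w) / p_c(w) for every term seen in the topic."""
    return {w: p_t[w] / p_c[w] for w in p_t if w in p_c}


def pointwise_kl(p_t, p_c):
    """p_t(w) * log(p_t(w) / p_c(w)) for every term seen in the topic."""
    return {w: p_t[w] * math.log(p_t[w] / p_c[w]) for w in p_t if w in p_c}


# Hypothetical corpus and on-topic token streams (the topic is a subset).
corpus_tokens = (["the"] * 50 + ["market"] * 20 + ["game"] * 20
                 + ["dollar"] * 8 + ["peso"] * 2)
topic_tokens = ["the"] * 10 + ["dollar"] * 6 + ["peso"] * 2 + ["market"] * 2

p_c = unigram(corpus_tokens)
p_t = unigram(topic_tokens)
omega = importance_weights(p_t, p_c)
kl = pointwise_kl(p_t, p_c)

top_by_weight = max(omega, key=omega.get)
top_by_kl = max(kl, key=kl.get)
print(top_by_weight, top_by_kl)
```

Here the rarest topical term receives the largest importance weight, while the pointwise KL divergence favours a term that is both over-represented and frequent in the topic, which is the adjustment described above.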
Thus far, we have sketched out why using the corpus distribution for a specific topic may result in undesirable outcomes. However, it is not even clear that $p_t(w|c) = p_c(w|c)$. In fact, we suspect that $p_t(w|c) \neq p_c(w|c)$ because of the 'one sense per discourse' claim [15]. We can qualitatively observe the difference in $p_c(w|c)$ and $p_t(w|c)$ by training two word2vec models: the first on the large, generic Gigaword corpus and the second on a topically-constrained subset of the Gigaword corpus. We present the most similar terms to 'cut' using both a global embedding and a topic-specific embedding in Figure 3. In this case, the topic is 'gasoline tax'. As we can see, the 'tax cut' sense of 'cut' is emphasized in the topic-specific embedding.

LOCAL WORD EMBEDDINGS
The previous section described several reasons why a global embedding may result in overgeneral word embeddings. In order to perform topic-specific training, we need a set of topic-specific documents. In information retrieval scenarios users rarely provide the system with examples of topic-specific documents, instead providing a small set of keywords.
Fortunately, we can use information retrieval techniques to generate a query-specific set of topical documents. Specifically, we adopt a language modeling approach to do so [8]. In this retrieval model, each document is represented as a maximum likelihood language model estimated from document term frequencies. Query language models are estimated similarly, using term frequency in the query. A document score, then, is the Kullback-Leibler divergence between the query and document language models,
$$D(p_q \| p_d) = \sum_{w \in V} p_q(w) \log \frac{p_q(w)}{p_d(w)} \tag{4}$$
Documents whose language models are more similar to the query language model will have a lower KL divergence score.
For consistency with prior work, we will refer to this as the query likelihood score of a document.
The scores in Equation 4 can be passed through a softmax function to derive a multinomial over the entire corpus [25],
$$p(d) = \frac{\exp(-D(p_q \| p_d))}{\sum_{d'} \exp(-D(p_q \| p_{d'}))} \tag{5}$$
Recall in Section 2 that training a word2vec model weights word-context pairs according to the corpus frequency. Our query-based multinomial, $p(d)$, provides a weighting function capturing the documents relevant to this topic. Although an estimation of the topic-specific documents from a query will be imprecise (i.e. some nonrelevant documents will be scored highly), the language use tends to be consistent with that found in the known relevant documents. We can train a local word embedding using an arbitrary optimization method by sampling documents from $p(d)$ instead of uniformly from the corpus. In this work, we use word2vec, although any method that operates on a sample of documents can be used.
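The pipeline of this section (score documents by KL divergence, turn the scores into a multinomial with a softmax, and sample training documents from it) can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in the experiments: the toy documents are invented, the softmax is taken over negated divergences so that lower divergence means higher probability, and the `epsilon` floor for unseen terms is an assumption added to keep the divergence finite.

```python
import math
import random
from collections import Counter


def mle(tokens):
    """Maximum likelihood language model from term frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(p_q, p_d, epsilon=1e-9):
    """KL divergence between query and document models (lower is better).

    epsilon is an assumed floor for terms missing from the document.
    """
    return sum(pq * math.log(pq / max(p_d.get(w, 0.0), epsilon))
               for w, pq in p_q.items())


def document_distribution(scores):
    """Softmax over negated KL scores, shifted by the minimum for stability."""
    best = min(scores.values())
    weights = {d: math.exp(-(s - best)) for d, s in scores.items()}
    z = sum(weights.values())
    return {d: w / z for d, w in weights.items()}


def sample_documents(p_d, n, seed=0):
    """Draw n document ids from p(d) with replacement."""
    rng = random.Random(seed)
    ids = list(p_d)
    return rng.choices(ids, weights=[p_d[d] for d in ids], k=n)


# Hypothetical toy collection and query.
docs = {
    "d1": "gasoline tax cut vote".split(),
    "d2": "hair cut salon".split(),
    "d3": "gasoline price tax".split(),
}
p_q = mle("gasoline tax".split())
scores = {d: kl_score(p_q, mle(tokens)) for d, tokens in docs.items()}
p_d = document_distribution(scores)
sample = sample_documents(p_d, n=1000)
print(p_d)
```

The sampled document ids would then be fed, with repetition, to whichever embedding trainer is in use.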

QUERY EXPANSION WITH WORD EMBEDDINGS
When using language models for retrieval, query expansion involves estimating an alternative to $p_q$. Specifically, when each expansion term is associated with a weight, we normalize these weights to derive the expansion language model, $p_{q^+}$. This language model is then interpolated with the original query model,
$$p_q^1(w) = \lambda\, p_q(w) + (1 - \lambda)\, p_{q^+}(w) \tag{6}$$
This interpolated language model can then be used with Equation 4 to rank documents [1]. We will refer to this as the expanded query score of a document.
Now we turn to using word embeddings for query expansion. Let $U$ be a $|V| \times k$ term embedding matrix. If $q$ is a $|V| \times 1$ column term vector for a query, then the expansion term weights are $UU^{T}q$. We then take the top $k$ terms, normalize their weights, and compute $p_{q^+}(w)$.
We consider the following alternatives for $U$. The first approach is to use a global model trained by sampling documents uniformly. The second approach, which we propose in this paper, is to use a local model trained by sampling documents from $p(d)$.
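The expansion procedure described in this section can be sketched as follows. The embedding matrix, vocabulary, cut-off, and interpolation weight (called `lam` here) are hypothetical values chosen for illustration, and the normalization assumes that the retained weights are positive.

```python
def expansion_weights(U, vocab, query_terms):
    """Compute U U^T q, where q is the term-count vector of the query."""
    k = len(U[0])
    # U^T q is the sum of the embeddings of the query terms.
    q_vec = [0.0] * k
    for term in query_terms:
        row = U[vocab.index(term)]
        for j in range(k):
            q_vec[j] += row[j]
    # Multiplying by U scores every vocabulary term against that sum.
    return {w: sum(a * b for a, b in zip(U[i], q_vec))
            for i, w in enumerate(vocab)}


def expansion_model(weights, top_n):
    """Keep the top_n terms and normalize their weights to sum to one."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    z = sum(v for _, v in top)
    return {w: v / z for w, v in top}


def interpolate(p_q, p_q_plus, lam):
    """lam * p_q(w) + (1 - lam) * p_q_plus(w) over the union of terms."""
    terms = set(p_q) | set(p_q_plus)
    return {w: lam * p_q.get(w, 0.0) + (1 - lam) * p_q_plus.get(w, 0.0)
            for w in terms}


# Hypothetical 2-dimensional embeddings for a 4-term vocabulary.
vocab = ["cut", "tax", "budget", "hair"]
U = [[1.0, 0.2], [0.9, 0.1], [0.8, 0.0], [0.0, 1.0]]

query = ["tax", "cut"]
p_q = {"tax": 0.5, "cut": 0.5}
weights = expansion_weights(U, vocab, query)
p_q_plus = expansion_model(weights, top_n=3)
p_expanded = interpolate(p_q, p_q_plus, lam=0.5)
print(sorted(p_expanded.items(), key=lambda kv: -kv[1]))
```

Swapping a global matrix for a local one changes only `U` and `vocab`; the rest of the expansion step is unchanged.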

Data
To evaluate the different retrieval strategies described in Section 3, we use the following datasets. Two newswire datasets, trec12 and robust, consist of the newswire documents and associated queries from TREC ad hoc retrieval evaluations. The trec12 corpus consists of Tipster disks 1 and 2, and the robust corpus consists of Tipster disks 4 and 5. Our third dataset, web, consists of the ClueWeb 2009 Category B Web corpus. For the Web corpus, we only retain documents with a Waterloo spam rank above 70. We present corpus statistics in Table 1.
We consider several publicly available global embeddings. We use four GloVe embeddings of different dimensionality trained on the union of Wikipedia and Gigaword documents. We use one publicly available word2vec embedding trained on Google News documents. We also trained a global embedding for trec12 and robust using the entire corpus. Instead of training a global embedding on the large web collection, we use a GloVe embedding trained on Common Crawl data.
We train local embeddings with word2vec using one of three retrieval sources. First, we consider documents retrieved from the target corpus of the query (i.e. trec12, robust, or web). We also consider training a local embedding by performing a retrieval on large auxiliary corpora. We use the Gigaword corpus as a large auxiliary news corpus. We hypothesize that retrieving from a larger news corpus will provide substantially more local training data than a target retrieval. We also use a Wikipedia snapshot from December 2014. We hypothesize that retrieving from a large, high fidelity corpus will provide cleaner language than that found in lower fidelity target domains such as the web. Table 1 shows the relative magnitude of these auxiliary corpora compared to the target corpora.
All corpora in Table 1 were stopped using the SMART stopword list and stemmed using the Krovetz algorithm [23]. We used the Indri implementation for indexing and retrieval.

Evaluation
We consider several standard retrieval evaluation metrics, including NDCG@10 and interpolated precision at standard recall points [22,45]. NDCG@10 provides insight into performance specifically at higher ranks. An interpolated precision recall graph describes system performance throughout the entire ranked list.
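For readers unfamiliar with NDCG@10, the sketch below shows one common formulation: an exponential gain of 2^rel - 1, a log2 rank discount, and normalization by the ideal ordering of the judged documents. The text does not state which gain and discount variant the evaluation used, so these choices and the function names are assumptions.

```python
import math


def dcg(rels, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg(ranked_rels, judged_rels, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking.

    ranked_rels: relevance grades of the retrieved documents, in rank order.
    judged_rels: relevance grades of all judged documents for the query.
    """
    ideal = dcg(sorted(judged_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1], [3, 2, 1])
late_hit = ndcg([0, 0, 3], [3])
print(perfect, late_hit)
```

Because the discount grows with rank, the same relevant document contributes half as much at rank 3 as at rank 1 in this formulation, which is why the metric is sensitive to the top of the list.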
All word2vec training used the publicly available word2vec CBOW implementation. When training the local models, we sampled 1000 documents from $p(d)$ with replacement. To compensate for the much smaller corpus size, we ran word2vec training for 80 iterations. Local word2vec models use a fixed embedding dimension of 400, although other choices did not significantly affect our results. Unless otherwise noted, default parameter settings were used.
In our experiments, expanded queries rescore the top 1000 documents from an initial query likelihood retrieval. Previous results have demonstrated that this approach results in performance nearly identical with an expanded retrieval at a much lower cost [11]. Because publicly available embeddings may have tokenization inconsistent with our target corpora, we restricted the vocabulary of candidate expansion terms to those occurring in the initial retrieval. If a candidate term was not found in the vocabulary of the embedding matrix, we searched for the candidate in a stemmed version of the embedding vocabulary. In the event that the candidate term was still not found after this process, we removed it from consideration.
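The vocabulary matching described above can be sketched as a two-step lookup. The `toy_stem` function is a deliberately crude stand-in for a real stemmer such as Krovetz, and keeping the first embedding word seen for each stem is an assumed tie-breaking rule. Neither detail is taken from the experiments.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a plural suffix."""
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word


def build_stemmed_index(embedding_vocab, stem):
    """Map each stem to the first embedding word that produces it."""
    index = {}
    for word in embedding_vocab:
        index.setdefault(stem(word), word)
    return index


def resolve_candidate(term, embedding_vocab, stemmed_index):
    """Return the embedding word to use for a candidate term, or None."""
    if term in embedding_vocab:
        return term
    return stemmed_index.get(term)


embedding_vocab = ["taxes", "cuts", "budget"]
stemmed_index = build_stemmed_index(embedding_vocab, toy_stem)

candidates = ["budget", "tax", "cut", "peso"]
resolved = {t: resolve_candidate(t, embedding_vocab, stemmed_index)
            for t in candidates}
# Candidates that resolve to None are removed from consideration.
kept = {t: w for t, w in resolved.items() if w is not None}
print(kept)
```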

RESULTS
We present results for retrieval experiments in Table 2. We find that embedding-based query expansion outperforms our query likelihood baseline across all conditions. When using the global embedding, the news corpora benefit from the various embeddings in different situations. Interestingly, for trec12, using an embedding trained on the target corpus significantly outperforms all other global embeddings, despite using substantially less data to estimate the model. While this performance may be due to the embedding having a tokenization consistent with the target corpus, it may also come from the fact that the corpus is more representative of the target documents than other embeddings which rely on online news or are mixed with non-news content. To some extent this supports our desire to move training closer to the target distribution.
Across all conditions, local embeddings significantly outperform global embeddings for query expansion. For our two news collections, estimating the local model using a retrieval from the larger Gigaword corpus led to substantial improvements. This effect is almost certainly due to the Gigaword corpus being similar in writing style to the target corpus but, at the same time, providing significantly more relevant content [12]. As a result, the local embedding is trained using a larger variety of topical material than if it were to use a retrieval from the smaller target corpus. An embedding trained with a retrieval from Wikipedia tended to perform worse, most likely because the language is dissimilar from news content. Our web collection, on the other hand, benefitted more from embeddings trained using retrievals from the general Wikipedia corpus. The Gigaword corpus was less
Figure 4 presents interpolated precision-recall curves comparing the baseline, the best global query expansion method, and the best local query expansion method. Interestingly, although global methods achieve strong performance for NDCG@10, these improvements over the baseline are not reflected in our precision-recall curves. Local methods, on the other hand, almost always strictly dominate both the baseline and global expansion across all recall levels.
The results support the hypothesis that local embeddings provide better similarity measures than global embeddings for query expansion. In order to understand why, we first compare the performance differences between local and global embeddings. Figure 2 suggests that we should adopt a local embedding when the local unigram language model deviates from the corpus language model. To test this, we computed the KL divergence between the local unigram distribution, $\sum_d p(w|d)\,p(d)$, and the corpus unigram language model [9]. We hypothesize that, when this value is high, the topic language is different from the corpus language and the global embedding will be inferior to the local embedding. We tested the rank correlation between this KL divergence and the relative performance of the local embedding with respect to the global embedding. These correlations are presented in Table 3. Unfortunately, we find that the correlation is low, although it is positive across collections.
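The analysis above has two ingredients: the divergence of the local unigram mixture from the corpus model, computed per query, and a rank correlation between those divergences and the per-query gains. A minimal sketch of both follows. The document models and probabilities are invented, and the Kendall's τ shown is the simple tau-a variant without tie correction, which may differ from the variant used for Table 3.

```python
import math
from itertools import combinations


def local_unigram(doc_models, p_d):
    """Mixture sum_d p(w|d) p(d) over the retrieved documents."""
    mix = {}
    for d, model in doc_models.items():
        for w, p in model.items():
            mix[w] = mix.get(w, 0.0) + p * p_d[d]
    return mix


def kl_divergence(p, q):
    """KL(p || q); assumes q(w) > 0 wherever p(w) > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical document language models and query-based weights p(d).
doc_models = {
    "d1": {"tax": 0.5, "cut": 0.5},
    "d2": {"tax": 0.5, "vote": 0.5},
}
p_d = {"d1": 0.75, "d2": 0.25}
p_corpus = {"tax": 0.1, "cut": 0.2, "vote": 0.1, "game": 0.6}

p_local = local_unigram(doc_models, p_d)
divergence = kl_divergence(p_local, p_corpus)
print(p_local, divergence)
```

With one divergence and one NDCG@10 improvement per query, `kendall_tau(divergences, improvements)` gives the kind of rank correlation reported per collection.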
We can also qualitatively analyze the differences in the behavior of the embeddings. If we have access to the set of documents labeled relevant to a query, then we can compute the frequency of terms in this set and consider those terms with high frequency (after stopping and stemming) to be good query expansion candidates. We can then visualize where these terms lie in the global and local embeddings. In Figure 5, we present a two-dimensional projection [44] of terms for the query 'ocean remote sensing', with those good candidates highlighted. Our projection includes the top 50 candidates by frequency and a sample of terms occurring in the query likelihood retrieval. We notice that, in the global embedding, the good candidates are spread out amongst poorer candidates. By contrast, the local embedding clusters the candidates in general but also situates them closely around the query. As a result, we suspect that the similar terms extracted from the local embedding are more likely to include these good candidates.

DISCUSSION
The success of local embeddings on this task should alarm natural language processing researchers using global embeddings as a representational tool. For one, the approach of learning from vast amounts of data is only effective if the data is appropriate for the task at hand. And, when provided, much smaller high-quality data can provide much better performance. Beyond this, our results suggest that the approach of estimating global representations, while computationally convenient, may overlook insights possible at query time, or evaluation time in general. A similar local embedding approach can be adopted for any natural language processing task where topical locality is expected and can be estimated. Although we used a query to re-weight the corpus in our experiments, we could just as easily use alternative contextual information (e.g. a sentence, paragraph, or document) in other tasks.
Despite these strong results, we believe that there are still some open questions in this work. First, although local embeddings provide effectiveness gains, they can be quite inefficient compared to global embeddings. We believe that there is opportunity to improve the efficiency by considering offline computation of local embeddings at a coarser level than queries but more specialized than the corpus. If the retrieval algorithm is able to select the appropriate embedding at query time, we can avoid training the local embedding. Second, although our supporting experiments (Table 3, Figure 5) add some insight into our intuition, the results are not strong enough to provide a solid explanation. Further theoretical and empirical analysis is necessary.

RELATED WORK
Topical adaptation of models. The shortcomings of learning a single global vector representation, especially for polysemic words, have been pointed out before [36]. The problem can be addressed by training a global model with multiple vector embeddings per word [35,19] or topic-specific embeddings [27]. The number of senses for each word may be fixed [32], or determined using class labels [43]. However, to the best of our knowledge, this is the first time that training topic-specific word embeddings has been explored.
Several methods exist in the language modeling community for topic-dependent adaptation of language models [6]. These can lead to performance improvements in tasks such as machine translation [51] and speech recognition [31]. Topic-specific data may be gathered in advance, by identifying a corpus of topic-specific documents. It may also be gathered during the discourse, using multiple hypotheses from N-best lists as a source of topic-specific language. Then a topic-specific language model is trained (or the global model is adapted) online using the topic-specific training data. A topic-dependent model may be combined with the global model using linear interpolation [21] or other more sophisticated approaches [14,24]. Similarly to the adaptation work, we use topic-specific documents to train a topic-specific model. In our case the documents come from a first round of retrieval for the user's current query, and the word embedding model is trained based on sentences from the topic-specific document set. Unlike the past work, we do not focus on interpolating the local and global models, although this is a promising area for future work. In the current study we focus on a direct comparison between the local-only and global-only approaches for improving retrieval performance.
Word embeddings for IR. Information retrieval has a long history of learning representations of words that are low-dimensional dense vectors. These approaches can be broadly classified into two families based on whether they are learnt from a term-document matrix or from term co-occurrence data. Using the term-document matrix for embedding leads to several well-studied approaches such as LSA [10], PLSA [18], and LDA [7,46]. The performance of these models varies depending on the task; for example, they are known to perform poorly for retrieval tasks unless combined with lexical features [3]. Term co-occurrence based embeddings, such as word2vec [29,28] and [34], have recently been remarkably popular for many natural language processing and logical reasoning tasks. However, there are relatively few known successful applications of these models in IR. Ganguly et al. [16] used the word similarity in the word2vec embedding space as a way to estimate term transformation probabilities in a language modelling setting for retrieval. More recently, Nalisnick et al. [30] proposed to model document aboutness by computing the similarity between all pairs of query and document terms using dual embedding spaces. Both these approaches estimate the semantic relatedness between two terms as the cosine similarity between them in the embedding space(s). We adopt a similar notion of term relatedness but focus on demonstrating improved retrieval performance using locally trained embeddings.
Local latent semantic analysis. Despite the mathematical appeal of latent semantic analysis, several experiments suggest that its empirical performance may be no better than that of ranking using standard term vectors [10,13,4]. In order to address the coarseness of corpus-level latent semantic analysis, Hull proposed restricting analysis to the documents relevant to a query [20]. This approach significantly improved over corpus-level analysis for routing tasks, a result that has been reproduced in subsequent research [37,39]. Our work can be seen as an extension of these results to more recent techniques such as word2vec.

Figure 1: Importance weights for terms occurring in documents related to 'argentina pegging dollar' relative to frequency in Gigaword.

Figure 2: Pointwise Kullback-Leibler divergence for terms occurring in documents related to 'argentina pegging dollar' relative to frequency in Gigaword.

Figure 3: Terms similar to 'cut' for a word2vec model trained on a general news corpus and another trained only on documents related to 'gasoline tax'.
global       local
cutting      tax
squeeze      deficit
reduce       vote
slash        budget
reduction    reduction
spend        house
lower        bill
halve        plan
soften       spend
freeze       billion

Figure 4: Interpolated precision-recall curves for query likelihood, the best global embedding, and the best local embedding from Table 2.

Figure 5: Global versus local embedding of highly relevant terms. Each point represents a candidate expansion term. Red points have high frequency in the relevant set of documents. White points have low or no frequency in the relevant set of documents. The blue point represents the query. Contours indicate distance from the query.

Table 1: Corpora used for retrieval and local embedding training.

Table 3: Kendall's τ and Spearman's ρ between improvement in NDCG@10 and local KL divergence with the corpus language model. The improvement is measured for the best local embedding over the best global embedding.

Table 2: Retrieval results comparing query expansion based on various global and local embeddings. Bolded numbers indicate the best expansion in that class of embeddings. Wilcoxon signed rank test between bolded numbers indicates statistically significant improvements (p < 0.05) for all collections.