Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically similar. While recent work has proposed to expand the queries or documents by enriching their representations with additional relevant terms to address this challenge, they usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework with a pre-trained language model, which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. For generating sentences, we further stochastically perturb their embeddings to generate more diverse sentences for document expansion. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.


Introduction
Information retrieval (IR) is the task of retrieving the most relevant documents, including scientific ones (Boudin et al., 2020; Noh and Kavuluru, 2020), for a given query. IR systems have received considerable attention as they are not only required to search documents for information, but are also used as a core component in various downstream language understanding tasks such as open-domain question answering (Seo et al., 2019; Qu et al., 2020), fact verification (Thorne et al., 2018), and information extraction (Narasimhan et al., 2016; Das et al., 2020).
As the simplest approach to IR tasks, classical term-based ranking models, such as BM25 (Robertson et al., 1994) and Query Likelihood (QL) models (Zhai and Lafferty, 2017), have been widely used. These term-based ranking models measure the lexical overlap between query and document pairs using a sparse representation of words, to match the relevant documents for the given query. Notwithstanding their simplicity, they achieve decent performance, even compared to recent dense representation models (Lin, 2019; Xiong et al., 2020), which require a large number of paired query-document samples. However, these term-based sparse models are intrinsically vulnerable to the vocabulary mismatch problem, which happens when a query and its relevant document are lexically divergent.
Thus, we should address the limitations of both sparse and dense models: the vocabulary mismatch problem and the need for a large amount of training data, respectively. Along this line, there are methods that expand queries and documents with their relevant terms. They include document expansion methods (Nogueira et al., 2019; Boudin et al., 2020) that introduce additional context-related terms to given documents and query expansion methods (Mao et al., 2020; Claveau, 2020) that augment given queries with additional terms. By doing so, we can explicitly generate lexically richer documents or queries.
Compared with query expansion, document expansion has two strengths. First, a document expansion model can generate much more relevant terms for the given document, since documents are generally much longer than queries. Also, documents can be expanded during indexing time so that the responding process for the user's query is not delayed, in contrast to queries that must be expanded during retrieval time. Thus, document expansion is more appropriate for a real-time system, together with making available more context-related words from the given information (Nogueira et al., 2019).
Figure 1: The overall framework of our Unsupervised Document Expansion with Generation (UDEG), where the example is generated from our framework. Given an original document (green box), our UDEG framework stochastically generates several sentences (orange box) relevant to the given document, and augments the generated sentences to the input document to improve its expressiveness. After every document in the corpus is expanded, documents are indexed in the IR system, and searched in response to the given query.
In this work, we focus on document expansion, and propose to abstractly generate the key information corresponding to the given document in an unsupervised manner, henceforth referred to as Unsupervised Document Expansion with Generation (UDEG). We first generate document-related sentences using a pre-trained language model, and then stack up the newly generated sentences on the original documents to enrich the expressiveness of document representation. Specifically, in order to generate sentences containing particular information for the documents, we use a language model that is already trained for summarizing sentences from a sufficient amount of texts. However, such a scheme generates only one static sentence at a time, so we further propose to stochastically generate multiple relevant sentences for the given document. This helps the proposed UDEG framework to minimize the vocabulary mismatch cases by generating many relevant words, which reflect diverse points of view for the given document. The overall UDEG framework is illustrated in Figure 1. We experimentally validate the proposed UDEG framework on standard benchmark datasets for IR tasks, ANTIQUE (Hashemi et al., 2020) and MS MARCO (Nguyen et al., 2016), with five different evaluation metrics.
The experimental results show that our framework outperforms all baselines on all evaluation metrics by a large margin. Also, a detailed analysis of UDEG shows that its stochastic generation significantly improves the IR performances, and that our UDEG framework does not depend on specific language models for generation.
Our contributions in this work are threefold:
• To mitigate the vocabulary mismatch problem, we present a novel document expansion framework that augments the document with abstractly generated sentences without using paired query-document data for training.
• Under an unsupervised document expansion framework, we generate document-related sentences with a pre-trained language model, and further stochastically perturb the embeddings for more diverse sentences.
• We show that our framework achieves outstanding performances on benchmark datasets for IR tasks with various evaluation metrics.

Related work
Information Retrieval A two-stage pipeline is the most prominent approach for IR. This pipeline first retrieves query-relevant documents with their sparse representations, and then re-ranks them by using neural networks (Mitra and Craswell, 2018; Nogueira et al., 2019, 2020). In this two-stage pipeline, the overall performance is critically dependent on the first retrieval stage, since a failure at the retrieval stage strongly affects the second re-ranking stage. Therefore, this bottleneck on the first stage has to be addressed for performance enhancement (Karpukhin et al., 2020). BM25 and query likelihood (QL) are the most popular ad-hoc retrieval models for the first stage (Nogueira et al., 2019; Boudin et al., 2020). More recently, instead of using sparse models, methods of using dense representations have been proposed (Karpukhin et al., 2020; Xiong et al., 2020; Qu et al., 2020), which can help alleviate the vocabulary mismatch problem through a dense representation space. However, recent work has revealed the limitations on their performance and efficiency (Lin, 2019; Xiong et al., 2020; Luan et al., 2020). Furthermore, these dense representation methods are based on supervised learning, where pairs of queries and related documents are usually required to ensure reasonable performance.
Query / Document Expansion Query and document expansions have been widely used in IR systems. In terms of query expansion, Jaleel et al. (2004) proposed pseudo relevance feedback (RM3), which is revisited in more recent work (Dibia, 2020;Mao et al., 2020) for its strength. There are also methods that expand queries using generation schemes (Mao et al., 2020;Claveau, 2020). However, query expansion suffers from its intrinsic drawbacks, as queries need to be manipulated during the retrieval phase and have relatively less information than documents (Nogueira et al., 2019). Thus, we take the alternative route: expanding documents. Nogueira et al. (2019) and  proposed to expand documents with generated text using a supervised model trained on query-document pairs. In contrast, our framework generates document-related sentences regardless of the existence of the corresponding query. Boudin et al. (2020) proposed to expand documents with sequence-to-sequence models which output keyphrases; however, their models have to be trained from scratch on a specific domain.

Document-relevant Text Generation
In order to enrich the given document efficiently with the document-relevant text, such text should contain the document's key context which can appear in the summarized sentence. Earlier, Erkan and Radev (2004) and Mihalcea and Tarau (2004) proposed unsupervised models of extracting key sentences, which are adopted in various recent work for their robustness (Nikolov and Hahnloser, 2020a;Zhang et al., 2020b;Kazemi et al., 2020). In contrast, an abstractive approach aims at generating summarized sentences containing novel terms that might not exist in the given document (Zhang et al., 2020a;Yang et al., 2020). Nikolov and Hahnloser (2020b) proposed to first extract the key sentences and then paraphrase them with back-translation. Recent work has reported that the improved performance of text summarization approaches is attributed to the pre-trained language models (Zhang et al., 2020a;Xu et al., 2020).
In this work, we aim at abstractly generating a document-related sentence with a pre-trained language model and further propose to diversely generate sentences with stochastic perturbation, not just using a single summarized sentence.

Method
Our goal is to expand the document for IR tasks by generating document-related text, which contains novel but semantically similar terms for the given document, without using query-document pairs. In this section, we first give a formal description of the IR task and of document expansion, and then describe our unsupervised text generation framework.

Preliminaries
We begin with a formal description of the IR task, and then introduce a document expansion scheme.

Information Retrieval
The objective of an IR task is to retrieve the most relevant document d ∈ D for the given query q ∈ Q, where Q and D denote the query and document sets, respectively. Note that the query and document can be represented as either sparse (Robertson et al., 1994; Zhai and Lafferty, 2017) or dense (Lin, 2019; Xiong et al., 2020) vectors, which gives rise to different implementation details. Suppose that we are given a query-document pair (q, d) in the correct query-document set τ, i.e., (q, d) ∈ τ, where τ ⊂ Q × D. Then, the system should retrieve the most relevant document d for the given query q in the correct query-document set τ, denoted as follows:
$$d = \arg\max_{d' \in D} f(q, d'), \quad (q, d) \in \tau, \tag{1}$$
where f : Q × D → R is a score function that measures the similarity of a query-document pair, and is used to retrieve the most relevant document for the given but unseen query at test time.

Document Expansion
In this task, we need to deal with the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically related. To address this problem, we focus on the document expansion scheme, which augments the document with relevant terms to make a richer document. Formally, the goal of document expansion is to generate semantically relevant terms t for the given document d ∈ D, denoted as follows:
$$t = [t_i]_{i=1}^{K} = g(d; \theta), \tag{2}$$
where K is the number of terms associated with each document d and g is the document expansion model parameterized by θ. After generating relevant terms t = [t_i]_{i=1}^{K} for the document, we concatenate them with the original document d to construct the more meaningful document representation $\tilde{d}$, denoted as follows:
$$\tilde{d} = [t \oplus d], \tag{3}$$
where ⊕ is the concatenation operation. By adding relevant terms to the given document with the generation model g, the similarity between the query q and the expanded document $\tilde{d}$ becomes stronger than the similarity between the query q and the original document d, as follows:
$$f(q, \tilde{d}) > f(q, d).$$
In order to maximize the similarity score between q and $\tilde{d}$, we need a model g that generates document-related terms without using labels of query-document pairs τ for training, which we describe in the next subsection.
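The concatenation of generated terms with the original document can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: `tokenize`, `expand_document`, and `overlap_score` are hypothetical names, the score is a simple count of shared terms standing in for f, and the generated sentence is hard-coded (it is one of the generated sentences shown in the case study) rather than produced by a model g.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(".,;:!?()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def expand_document(document: str, generated: list[str]) -> str:
    """Concatenate generated text with the original document (t ⊕ d)."""
    return " ".join(generated + [document])


def overlap_score(query: str, document: str) -> int:
    """Toy stand-in for f(q, d): number of distinct query terms found in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))


document = "Chemistry is a basic because all matter can be broken down into elements."
generated = ["Chemistry is the study of atoms and molecules."]  # hard-coded stand-in for g(d)
query = "study of molecules"

expanded = expand_document(document, generated)
print(overlap_score(query, document), overlap_score(query, expanded))
```

In this toy case the query shares no term with the original document but three terms with the expanded one, which is the effect document expansion aims for under a term-matching score.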

Unsupervised Text Generation for Document Expansion
We now describe our Unsupervised Document Expansion with Generation (UDEG) framework, which generates relevant terms for the given document d without using labels on query-document pairs (q, d) ∈ τ . We first introduce the extractive and abstractive text generation schemes, which are two representative methods for unsupervised text generation, and then propose a stochastic generation scheme for a richer vocabulary.
Extractive Text Generation Extractive text generation selects representative words or sentences from the given document. Formally, an extractive text generation scheme is defined as follows:
$$t_{ext} = [(t_{ext})_i]_{i=1}^{K} = g_{ext}(d; \theta_{ext}), \quad (t_{ext})_i \in d, \tag{4}$$
where g_ext is an extractive text generation model parameterized by θ_ext. After extracting terms t_ext = [(t_ext)_i], they are used to expand the document as in Equation 3 (i.e., $\tilde{d}$ = [t_ext ⊕ d]), which can enrich the representation of the given document by counting important terms multiple times (see Table 1 for examples of extractive generation).
Abstractive Text Generation While the previously described extractive text generation model aims at enriching the given document with key terms extracted from it, the expressiveness of this extractive scheme is highly restricted: as Equation 4 indicates, every extracted term already appears in the original document, so novel but semantically similar terms cannot be generated. To overcome this limitation, one should further consider generating related terms that are not contained in the original document. To this end, we propose an abstractive text generation model to obtain the relevant but novel terms for the given document d.
Formally, novel terms for the original document are denoted as [(t_abs)_l]_{l=1}^{N} ⊄ d, whereas the remaining generated terms already exist in the document; N is the number of newly generated document-related terms. Then, an abstractive generation model is defined as follows:
$$t_{abs} = g_{abs}(d; \theta_{abs}), \tag{5}$$
where g_abs is the abstractive generation model parameterized by θ_abs. We provide concrete examples of abstractive generation in Table 1.
Specific details of unsupervised text generation models, which do not use labels for query-document pairs, are described in § 4.3.
Stochastic Generation While a naïve abstractive generation scheme can generate novel terms that are not included in the original document, a major drawback of this scheme is that it cannot generate a high volume of different terms for the given document. In other words, this scheme is suboptimal since it only generates a single sequence, though the terms within the document can have many synonymous expressions. To overcome this limitation, we stochastically generate terms for the given document by perturbing its embeddings for text generation via Monte Carlo (MC) dropout (Gal and Ghahramani, 2016). Compared to the abstractive generation scheme in Equation 5, which only produces one typical sequence of terms t_abs, we obtain S different sequences T_abs from the stochastic generation scheme, as follows:
$$T_{abs} = [t_{abs}^{(s)}]_{s=1}^{S}, \quad t_{abs}^{(s)} = \tilde{g}_{abs}(d; \theta_{abs}), \tag{6}$$
where $\tilde{g}_{abs}$ denotes g_abs with weights on the model randomly masked even at test time, so that each of the S passes can yield a different sequence.
We provide examples of stochastic generation with S = 3 in Table 1. As shown in Table 1, examples of stochastic generation are more relevant to the document and more diverse.
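The idea of keeping dropout active at test time can be sketched on a toy vector. This is an illustration under stated assumptions, not the UDEG implementation: the function names, the embedding values, and the dropout rate are all made up, and the sketch perturbs a single embedding vector rather than running a summarization model. In a real system, each perturbed pass would be followed by decoding, giving one generated sentence per pass.

```python
import random


def mc_dropout(embedding: list[float], p: float, rng: random.Random) -> list[float]:
    """Inverted dropout on one vector: zero each unit with probability p, rescale the rest."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in embedding]


def stochastic_embeddings(embedding: list[float], num_samples: int,
                          p: float = 0.1, seed: int = 0) -> list[list[float]]:
    """Draw S perturbed copies of the same embedding, one per generation pass."""
    rng = random.Random(seed)
    return [mc_dropout(embedding, p, rng) for _ in range(num_samples)]


embedding = [0.5, -1.0, 2.0, 0.25]  # toy document embedding (made-up values)
samples = stochastic_embeddings(embedding, num_samples=3, p=0.5)
for s, vec in enumerate(samples, start=1):
    print(s, vec)
```

Because a fresh random mask is drawn for every pass, the S passes start from different perturbed inputs, which is what allows a deterministic decoder such as beam search to return different sentences for the same document.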

Experimental Setups
Here, we describe datasets, models, evaluation metrics, and implementation details for experiments.

Datasets
We use two benchmark datasets for IR to evaluate our UDEG framework as follows:
ANTIQUE: This is a dataset with 403,666 documents from Yahoo! Answers, including open-domain non-factoid questions (Hashemi et al., 2020). The test set consists of 200 queries and 6,589 query-document pairs.
MS MARCO: This is a collection of 8,841,823 passages from the Bing search engine (Nguyen et al., 2016). Since the test set is not publicly available, we use the development set containing 6,980 queries and 59,273 query-document pairs. Due to limited computational resources for expanding all 8,841,823 passages, we randomly sample 1,000,000 passages, while using the same development set for queries and query-document pairs.

Retrieval Models
In this subsection, we describe two retrieval models that are widely used for IR systems.
BM25: This is one of the standard ad-hoc retrieval models based on Term Frequency-Inverse Document Frequency (TF-IDF), which scores a document by the terms it shares with the query (Robertson et al., 1994).
QL: This is also one of the standard ad-hoc retrieval models. Specifically, QL returns a ranked list of documents sorted by the probability P(d|q), where q is a query and d is a document (Zhai and Lafferty, 2017). In practice, this ranking is obtained from the likelihood P(q|d) of the query under a language model estimated from d.
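To make the term-matching behaviour of BM25 concrete, the following is a minimal sketch of the scoring function over pre-tokenized documents. It is not Anserini's implementation: the function name is hypothetical, the IDF form is one common non-negative variant, and k1 = 0.9 and b = 0.4 are illustrative values chosen here rather than settings reported in the paper.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 0.9, b: float = 0.4) -> list[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue  # a query term absent from the document contributes nothing
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["chemistry", "is", "the", "study", "of", "matter"],
    ["the", "library", "is", "a", "place", "to", "study"],
    ["matter", "is", "made", "of", "atoms"],
]
scores = bm25_scores(["chemistry", "matter"], docs)
print(scores)
```

Only terms that literally occur in both the query and the document contribute to the score, which is why a relevant document phrased with different words receives a score of zero; this is the vocabulary mismatch that document expansion targets.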

Expansion Models
We compare our UDEG framework against the following baselines:
No Expansion (No Expan.): This is a naïve model of retrieving the original documents without query or document expansion.
RM3: This is a query expansion model that uses a pseudo-relevance feedback scheme (RM3) (Jaleel et al., 2004). Note that this can be simultaneously used with document expansion models.
MP-rank: This is an extractive document expansion model, which extracts keyphrases based on a multipartite graph, where the nodes are keyphrase candidates and an edge connects nodes having different topics (Boudin, 2018).
LexRank: This is an extractive document expansion model that extracts the key sentence with the PageRank algorithm (Page et al., 1998), which constructs vertices as sentences and edges as TF-IDF weights (Erkan and Radev, 2004).
PEGASUS_ext: This is an extractive document expansion model (Zhang et al., 2020a), which extracts sentences using pre-trained knowledge for generating masked sentences on the CNN/DailyMail dataset (Nallapati et al., 2016).
LexRank + paraphrase (Lex. + Para.): This is an abstractive document expansion model, which first extracts key sentences with LexRank, and then paraphrases them with an unsupervised model based on simulated annealing.
UDEG: Our framework of expanding documents with abstractly generated sentences from a pre-trained language model. Diverse sentences are generated with stochastic perturbation by MC dropout.

Metrics
We evaluate the models with five metrics, ranging from precision- to recall-oriented, as follows: Normalized Discounted Cumulative Gain (NDCG@K): Compared to Mean Average Precision (MAP), which uses binary relevance judgments, this metric additionally accounts for the fact that some documents are more relevant than others when scoring the top-K ranked list.
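As a concrete reference for NDCG@K, the following is a minimal sketch of one common formulation. The function names are hypothetical, and two choices are assumptions rather than details from the paper: the gain is the raw relevance grade (some formulations use 2^rel − 1 instead), and the ideal ranking is computed from the full list of judged relevance grades for the query.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain of the first k relevance grades, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked: list[float], judged: list[float], k: int) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(judged, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0


judged = [3, 2, 1, 0]  # graded relevance of all judged documents for one query
print(ndcg_at_k([3, 2, 1, 0], judged, k=3))  # ideal ordering
print(ndcg_at_k([0, 1, 2, 3], judged, k=3))  # reversed ordering
```

Because the discount grows with rank, placing a highly relevant document lower in the list reduces the score even when the same documents are retrieved.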

Implementation Details
All of the retrieval models are implemented using the Anserini open-source IR toolkit (Yang et al., 2018) with the default hyperparameter values. The PEGASUS-large model, already fine-tuned on the XSUM dataset (Narayan et al., 2018), is used as the pre-trained language model in UDEG for abstractive text generation. For decoding, we use beam search with a beam size of 8. Also, we set the number S of stochastic generations for document expansion to 4.

Results and Discussion
In this section, we show the overall performance of our UDEG, and then analyze the results in detail.

Overall Results
Results on the ANTIQUE dataset and sampled MS MARCO dataset are shown in Table 2 and Table 3, respectively. Our UDEG framework significantly outperforms all baselines in all evaluation metrics. Interestingly, the retrieval performance of QL is impressively enhanced when using our framework. Note that the retrieval performance of QL without expansion is much lower than BM25; however, QL shows comparable and even outstanding performance with our expansion framework.

Effectiveness of Abstractive Generation
Compared to the extractive and the paraphrasing baselines, our proposed abstractive framework outperforms them in all metrics. Notably, even though PEGASUS_ext is pre-trained with the same PEGASUS pipeline as the model used in the UDEG framework, the expansion model with the extractive generation scheme is ineffective, since it cannot solve the vocabulary mismatch problem. In contrast, the proposed UDEG framework can mitigate it by generating novel words, which demonstrates the effectiveness of the abstractive generation scheme.

Effectiveness of Query Expansion
When RM3 is applied, the performance is negatively affected in most cases. As Nogueira et al. (2019) reported, we can also interpret the obtained results as evidence that document expansion is more effective than query expansion since a document often contains more signals than a query with its longer length.

Ablation and Discussion
Which attributes contribute how much to the performance improvement? To see this, we further perform an ablation study, as follows.

Robustness on Different Language Models
To validate the robustness of our framework on different language models, we compare the performances of PEGASUS and BART, both of which are trained on the XSUM dataset. As shown in Figure 2, the UDEG framework with PEGASUS shows performance similar to the one with BART, and both consistently outperform the naïve baseline, which expands neither the query nor the document. Thus, the results show that the UDEG framework does not depend on a specific language model, but robustly improves the overall retrieval performance.

Comparison of Stochastic Generation Strategy
We compare two stochastic generation strategies, MC dropout and top-k sampling. Top-k sampling is designed to generate diverse outputs by sampling the next word from the k most likely candidates, instead of deterministically selecting the next word (Fan et al., 2018). As shown in Figure 3, even though both strategies aim at generating diverse sentences stochastically, the MC dropout strategy outperforms the top-k sampling strategy. Where does this performance difference come from? Our hypothesis is that MC dropout produces more diverse terms across sentences than top-k sampling. Specifically, we often obtain the same starting words from top-k sampling, which leads to a number of sentences that share the same starting words. On the other hand, MC dropout randomly perturbs the embeddings at the beginning of generating each sentence, which leads to a diversity of terms even at the starting point.
Table 4: Examples of relevant documents, their generated sentences, and their ranks before and after expansion.
Example 1
Query: How is the chemistry is a basic of science?
Relevant Document: Chemistry is a basic because all matter can be broken down into elements (i.e., hydrogen, oxygen, nitrogen, etc.); without matter, nothing could be studied.
Generated Sentences: 1) Chemistry is the study of atoms and molecules. 2) Chemistry is the study of matter and how it is made. 3) Chemistry is the study of matter. 4) Chemistry is a basic science.
Original Document Rank: 104; Expanded Document Rank: 5
Example 2
Query: How is the library consider as a heart of university?
Relevant Document: Whatever you are studying has to be found somewhere for you to learn it. That's where the library comes into focus.
Generated Sentences: 1) If you're studying at university, you'll need a library. 2) A library is a place where you can find out more about the subject you are studying. 3) If you're studying, you'll be studying. 4) There are many different ways you can study.
To verify this hypothesis, we compare the lexical diversity of the MC dropout and top-k sampling strategies with a varying number of generated sentences. The lexical diversity is calculated by averaging, over documents, the proportion of unique unigrams in the sentences generated for each document. As Figure 4 shows, the lexical diversity of the sentences generated by top-k sampling is consistently lower, and drops more rapidly, than that of the sentences generated by MC dropout.
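The lexical diversity measure described above can be sketched as follows. The function name is hypothetical, and lowercased whitespace tokenization is an assumption, since the paper does not specify the tokenizer.

```python
def lexical_diversity(generated_per_doc: list[list[str]]) -> float:
    """Average, over documents, of unique unigrams divided by total unigrams."""
    ratios = []
    for sentences in generated_per_doc:
        # Pool the unigrams of all sentences generated for this document.
        tokens = [tok for sent in sentences for tok in sent.lower().split()]
        if tokens:
            ratios.append(len(set(tokens)) / len(tokens))
    return sum(ratios) / len(ratios) if ratios else 0.0


repetitive = [["chemistry is a science", "chemistry is a science"]]
varied = [["chemistry is a science", "atoms form all matter"]]
print(lexical_diversity(repetitive), lexical_diversity(varied))
```

Sentences that repeat the same words lower the ratio, so a strategy whose outputs often share the same starting words scores lower as more sentences are generated.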

Varying the Number of Expanded Sentences
To understand how stochastically generated sentences with MC dropout improve the retrieval performance, we evaluate UDEG with a varying number of generated sentences on two retrieval models, BM25 and QL. Figure 3 shows that the performance of both models tends to improve as the number of expanded sentences increases. Interestingly, QL improves substantially as stochastically generated sentences are stacked onto the original document. Meanwhile, the performance of BM25 drops slightly when five sentences are added. These results indicate that setting an appropriate number of generated sentences is important for optimal results, since too much information may degrade the context of the original document.

Case Study
For a qualitative analysis, we conduct a case study to explore the strengths of the UDEG framework. Table 4 shows examples of successfully retrieved expanded-documents with the UDEG framework compared to the original documents without expansion. Note that the original documents are retrieved with lower ranks, but get higher ranks after applying the UDEG framework. We note that the generated sentences contain novel words, while they sometimes contain copied terms. This tendency of copying increases the importance of the keyphrases which contributes to the effective term re-weighting. At the same time, newly generated terms are found to resolve the vocabulary mismatch problem by introducing synonyms or semantically related terms. These findings advocate for the importance of using abstractly generated sentences for document expansion in ad-hoc retrieval systems, which can help term re-weighting and alleviate the vocabulary mismatch problem at the same time.

Conclusion
We presented a novel framework, which we refer to as Unsupervised Document Expansion with Generation (UDEG), that generates diverse terms with stochastic perturbation over pre-trained language models, and efficiently enriches the document representation, without using any query information for training. Remarkably, UDEG employed in a retrieval system shows significant performance improvements on two standard benchmark datasets.
Also, a detailed analysis shows that an abstractive generation framework with stochastic perturbation positively contributes to the retrieval performance. Beyond synonymy, other problems of IR systems such as polysemy could also be addressed using our UDEG framework, which we leave for future work. We believe that the benefits of using diversely generated document-relevant sentences would allow further improvements on any IR system, including those targeting scholarly and scientific information.