Beyond [CLS] through Ranking by Generation

Generative models for Information Retrieval (IR), in which ranking documents is viewed as the task of generating a query from a document's language model, were very successful in various IR tasks in the past. However, with the advent of modern deep neural networks, attention has shifted to discriminative ranking functions that instead model the semantic similarity of documents and queries. Recently, deep generative models such as GPT2 and BART have been shown to be excellent text generators, but their effectiveness as rankers has not been demonstrated yet. In this work, we revisit the generative framework for information retrieval and show that our generative approaches are as effective as state-of-the-art semantic-similarity-based discriminative models on the answer selection task. Additionally, we demonstrate the effectiveness of unlikelihood losses for IR.


Introduction
Most recent approaches for ranking tasks in Information Retrieval (IR), such as passage ranking and retrieval of semantically related questions, have focused primarily on discriminative methods using neural networks that learn a similarity function to compare questions and candidate answers (Severyn and Moschitti, 2015; dos Santos et al., 2015; Tan et al., 2016; Tay et al., 2017, 2018). On the other hand, classical literature on probabilistic models for IR showed that language modeling, a type of simple generative model, can be effective for document ranking (Zhai, 2008; Lafferty and Zhai, 2001; Ponte and Croft, 1998). The key idea consists of first training a unique language model lm_i for each candidate document d_i, then using the likelihood of generating the input query using lm_i, denoted by P(q|lm_i), as the ranking score for document d_i.
Recent advances in neural language models (NLMs) have led to impressive improvements in the quality of automatically generated text (Radford et al., 2019). However, to the best of our knowledge, there is no existing work on exploring the effectiveness of modern generative models such as GPT2 for complex ranking tasks such as answer selection. In this work, we intend to fill this gap by demonstrating that large pretrained generative models can be very effective rankers. Unlike classic LM-based approaches for IR that employ separate LMs for each document, our proposed method uses a single global LM that applies to all documents. The global pretrained generator is fine-tuned on the task of query generation conditioned on document content as the context. Additionally, in order to leverage both positive and negative examples, we propose the use of (1) an unlikelihood loss on negative examples and (2) a ranking loss on the likelihood of positive and negative examples. At inference time, given an input query, our method scores each candidate document using the likelihood of generating the query given the document, as estimated by our fine-tuned global LM.
We focus our experiments on the task of answer selection (a.k.a. passage ranking). In this task, given an input question and a set of candidate passages, the goal is to rank the candidate passages so that passages containing the correct answer appear at the top of the ranked list. A considerable body of work exists on the use of NNs for this task (Feng et al., 2015; Severyn and Moschitti, 2015; Tan et al., 2016; dos Santos et al., 2016; Rao et al., 2016; Wang et al., 2017), where the most recent ones use BERT-based models that perform discrimination based on the special [CLS] token (Nogueira and Cho, 2019; Li et al., 2019; Xu et al., 2019). A contemporaneous work by Nogueira et al. (2020) also proposes a generative approach for the passage ranking task. However, while their approach decides the relevance of a passage by generating a single keyword (e.g. true or false), our method uses the conditional likelihood of generating the question given the passage as a relevance score.

Figure 1: Illustration of the inference step of our ranking by generation approach. Each candidate passage a_k is ranked based on the likelihood of generating the question q conditioned on the passage, p_θ(q|a_k).
We perform extensive experiments using GPT2 (Radford et al., 2019) and BART (Lewis et al., 2019), which are Transformer-based LMs (Vaswani et al., 2017) that were pretrained using large volumes of textual data. The LMs are fine-tuned on four different passage ranking datasets separately: WikipassageQA, WikiQA, InsuranceQA V2, and YahooQA. Our experimental results indicate that our generative approaches are as effective as state-of-the-art discriminative approaches for answer selection.

Background
The goal in language modeling is to learn the probability distribution p(x) over variable-length token sequences x = (x_1, x_2, ..., x_{|x|}), where the tokens come from a fixed-size vocabulary, x_i ∈ V. When training an LM with the causal language modeling objective, which consists of predicting the next token by looking at the past only, we can factor this distribution into the conditional probability of each token given the previous ones (Bengio et al., 2003):

p(x) = ∏_{i=1}^{|x|} p(x_i | x_1, ..., x_{i-1})     (1)

GPT2 (Radford et al., 2019) is an example of a state-of-the-art neural LM trained with the causal language modeling objective. The usual approach to train an LM using a neural network with parameters θ consists of performing maximum likelihood estimation (MLE) by minimizing the negative log-likelihood over a large text corpus:

L_MLE(θ) = − ∑_{i=1}^{|x|} log p_θ(x_i | x_1, ..., x_{i-1})     (2)

Conditional LMs are a simple extension of regular LMs where the generation is conditioned on some additional context c (Keskar et al., 2019):

p(x|c) = ∏_{i=1}^{|x|} p(x_i | x_1, ..., x_{i-1}, c)     (3)
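The causal factorization above can be illustrated with a minimal, self-contained sketch. The bigram table below is invented for illustration and stands in for a neural LM's next-token probabilities:

```python
import math

# Toy next-token distribution p(x_i | x_{i-1}) given as a bigram table.
# These probabilities are made up for illustration; a neural LM would
# produce them with a softmax over the vocabulary at each step.
BIGRAM = {
    ("<bos>", "the"): 0.5, ("<bos>", "a"): 0.5,
    ("the", "cat"): 0.7, ("the", "dog"): 0.3,
    ("a", "cat"): 0.4, ("a", "dog"): 0.6,
}

def sequence_log_prob(tokens):
    """log p(x) = sum_i log p(x_i | x_{i-1}), the causal factorization."""
    log_prob, prev = 0.0, "<bos>"
    for tok in tokens:
        log_prob += math.log(BIGRAM[(prev, tok)])
        prev = tok
    return log_prob

print(sequence_log_prob(["the", "cat"]))  # log(0.5) + log(0.7)
```

Minimizing the negative of this quantity over a corpus is exactly the MLE objective; conditioning simply prepends a context to the sequence.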

Proposed Ranking Approach
Our proposed approach for passage ranking by generation consists of first fine-tuning a pretrained large LM on the task of question generation conditioned on the passage, using the conditional LM approach shown in Eq. 3. In practice, each input for the fine-tuning step is formatted as follows: <bos> passage <boq> question <eoq> where the passage is treated as a prompt, and the log-likelihood used in training comes only from the tokens after the marker <boq>, since we use the passage as a conditioning context. In other words, at training time, we minimize the negative conditional log-likelihood − log P(q|a), where a is a passage relevant to the query q. At inference time, given a query q, our conditional LM scores each candidate passage a_k using the likelihood of generating the question conditioned on the passage, s(a_k) = p_θ(q|a_k). Fig. 1 illustrates the inference step of our proposed approach.
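The inference-time scoring can be sketched as follows. Assuming per-token log-probabilities are already available from the fine-tuned LM (the values and helper below are illustrative, not the paper's implementation), the score sums only the log-probabilities of the question tokens after <boq>:

```python
def conditional_query_score(token_logprobs, boq_index):
    """s(a_k) = log p_theta(q | a_k): sum the log-probabilities of the
    tokens that follow the <boq> marker, ignoring the passage prompt."""
    return sum(token_logprobs[boq_index + 1:])

# Input layout: <bos> p1 p2 <boq> q1 q2  (made-up log-probs)
logps = [-0.1, -2.0, -1.5, -0.2, -0.3, -0.4]
score = conditional_query_score(logps, boq_index=3)
print(score)  # sums only the two question-token log-probs
```

Each candidate passage would be formatted with the same template, scored this way, and the candidates sorted by score.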

Unlikelihood Loss for Ranking
Datasets for training passage rankers normally contain both positive and negative examples. Therefore, it is natural to use both types of examples in order to leverage all the available data. Let D be the set of examples (q, a, y), where y is 1 if the passage a is a positive answer for q, and 0 otherwise. We fine-tune the LM using the following loss function:

L_LUL(θ) = − ∑_{(q,a,y)∈D} [ y ∑_{t=1}^{|q|} log p_θ(q_t | q_{<t}, a) + (1 − y) ∑_{t=1}^{|q|} log(1 − p_θ(q_t | q_{<t}, a)) ]     (4)

The second term in Eq. 4 resembles the unlikelihood training objective of Welleck et al. (2019). However, while we use an unlikelihood objective with the aim of teaching the LM which questions are unlikely given the passage, Welleck et al. (2019) use an unlikelihood objective with the aim of improving text generation. We use the acronym LUL to refer to the loss function in Eq. 4, which performs likelihood and unlikelihood estimation.
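A per-example sketch of this loss, under the assumption that the unlikelihood term is applied token-wise as −log(1 − p) in the style of Welleck et al. (2019); the token probabilities below are made up:

```python
import math

def lul_loss(token_probs, y):
    """Likelihood/unlikelihood loss for one (q, a, y) example.
    token_probs holds p_theta(q_t | q_<t, a) for each question token.
    y = 1: standard negative log-likelihood of the question.
    y = 0: unlikelihood term, penalizing -log(1 - p) per token."""
    if y == 1:
        return -sum(math.log(p) for p in token_probs)
    return -sum(math.log(1.0 - p) for p in token_probs)

pos_loss = lul_loss([0.9, 0.8], 1)  # small: likely question, relevant passage
neg_loss = lul_loss([0.9, 0.8], 0)  # large: model too confident on a negative
print(pos_loss, neg_loss)
```

The asymmetry is the point: the same confident prediction is rewarded on a positive pair and penalized on a negative one.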
We experimented with an additional loss function for fine-tuning the LMs, which consists of imposing a pairwise ranking loss on the likelihood (RLL) of positive and negative examples as follows:

L_RLL(θ) = ∑ max(0, m − log p_θ(q|a+) + log p_θ(q|a−))     (5)

where a+ and a− are a positive and a negative passage for the query q, respectively, and m is the margin. The use of unlikelihood losses to penalize negative examples is a natural choice for fine-tuning generative models. Note that Eq. 4 is an extension of the regular cross-entropy loss to which we just added the unlikelihood term, while Eq. 5 is its ranking-based (hinge loss) version. The unlikelihood term in Eq. 4 can also be seen as a regularizer, which makes the ranking model less overconfident when computing query likelihoods.
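A minimal sketch of the pairwise hinge on query log-likelihoods (the margin value here is an assumption for illustration, not a hyperparameter reported in the paper):

```python
def rll_loss(logp_pos, logp_neg, margin=1.0):
    """Hinge loss on log p_theta(q|a+) vs. log p_theta(q|a-):
    zero once the positive passage out-scores the negative by the margin."""
    return max(0.0, margin - logp_pos + logp_neg)

print(rll_loss(-2.0, -5.0))   # 0.0: already separated by more than the margin
print(rll_loss(-4.0, -4.5))   # 0.5: inside the margin, loss is incurred
```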

Datasets
We use four different publicly available answer selection datasets in our experiments: WikipassageQA (Cohen et al.), WikiQA (Yang et al., 2015), InsuranceQA V2 (Feng et al., 2015), and YahooQA (Tay et al., 2017). Statistics about the datasets are shown in Table 1. The four datasets also provide validation sets, which are similar in size to the respective test sets.

Language Model Setup
We use pretrained GPT2-base (12 layers, 117M parameters), GPT2-large (24 layers, 345M params), BART-base (6-layer encoder and 6-layer decoder, 139M params) and BART-large (12-layer encoder and 12-layer decoder, 406M params) models in our experiments. We adopted the implementation and pretrained models from Wolf et al. (2019). We fine-tune GPT2 and BART on each training dataset separately. We perform a maximum of 10 fine-tuning epochs and adopt early stopping using the validation sets. Most of the hyperparameters used for fine-tuning are the default ones from Wolf et al. (2019).
In the experiments presented below, the subscript MLE corresponds to models fine-tuned using just maximum likelihood estimation (Eq. 2), which means that only positive examples are used. The subscript LUL corresponds to models fine-tuned using maximum likelihood and unlikelihood estimation (Eq. 4), while RLL denotes models fine-tuned using the ranking loss in Eq. 5. For MLE and LUL, we use a mini-batch size of 64 for InsuranceQA and 32 for the other 3 datasets. The number of negative examples per positive example is set to 5 in the case of LUL.
When fine-tuning with the RLL loss (Eq. 5), we use a batch size of 8. During training, when processing a question we randomly sample 15 negative passages from the set of negative passages of the question. However, only the negative passage with the highest score is used to update the model. Early experiments demonstrated that this strategy performs similarly to the usual pairwise approach.
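The negative-sampling strategy above can be sketched as follows; `score_fn` stands in for the model's current likelihood score, and the helper is illustrative rather than the paper's actual code:

```python
import random

def hardest_negative(score_fn, negatives, k=15, seed=0):
    """Sample up to k negatives for a question and return only the
    highest-scoring one, i.e. the negative the current model is most
    likely to confuse with a positive; only that one updates the model."""
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    sampled = rng.sample(negatives, min(k, len(negatives)))
    return max(sampled, key=score_fn)

# Toy example: use string length as a stand-in scoring function.
print(hardest_negative(len, ["aa", "b", "cccc", "ddd"]))  # "cccc"
```

Training on only the hardest sampled negative approximates pairwise training while keeping the per-question cost of the ranking loss low.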

Ranking Results
In Table 2 we present the experimental results for our proposed generative approach and four state-of-the-art discriminative baselines, which are based on BERT (Devlin et al., 2019) and BART. Both BERTSel (Li et al., 2019) and BERT-PR (Xu et al., 2019) fine-tuned BERT-base using a ranking loss on the score computed with the [CLS] token. We trained a BERT-large model using [CLS]-based scoring + ranking loss (row 3). We additionally trained a discriminative version of BART-large (row 4) where the inputs for the encoder and the decoder are the passage and the question, respectively. As is normally done with BART for classification (Lewis et al., 2019), we take the representation generated by the decoder for the last token and use it to create a score by applying a linear layer. As with the discriminative BERT models, we also optimize BART-large using a ranking loss. The performance of the passage ranking models is assessed using the metrics Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Precision at 1 (P@1). Scores are computed with the official trec_eval tool.
In the middle part of Table 2, we compare GPT2-base without any fine-tuning (row 5), and fine-tuned with either MLE (6), LUL (7) or RLL (8). When only the pretrained model is used (no fine-tuning), the results are very poor, which is understandable, given that the pattern of a passage followed by a question might not be very recurrent in the data used to pretrain GPT2. Comparing MLE (row 6) with LUL (row 7), we see that the inclusion of the unlikelihood term (Eq. 4) has a significant positive impact for all datasets but InsuranceQA. We believe the unlikelihood loss does not help on InsuranceQA because this dataset was not human curated and therefore contains a significant number of false-negative examples, which can hurt performance when used to compute the unlikelihood loss. Compared to BERT-base models, GPT2-base LUL is very competitive for most of the datasets except WikiQA, while GPT2-base RLL demonstrates more robust results across the different datasets. In rows 9 and 10 we show results for BART-base, where we see trends similar to GPT2-base with regard to the LUL and RLL losses. BART-base RLL is overall better than BART-base LUL and the GPT2-base models.
In the bottom part of Table 2, we also show results for GPT2-large and BART-large using LUL and RLL (rows 11 to 14). Overall, the larger generative models do a better job than the smaller ones, as expected. Among the generative approaches, BART-large RLL (row 14) is the model that performs the best for most of the datasets. We believe that BART-based generative models outperform GPT2-based models due to (1) the larger number of pretraining tasks used in BART and (2) the use of bidirectional attention on the encoder side (which processes the passage). Comparing BART-large RLL with discriminative BART-large (row 4), we can see that BART-large RLL produces better results for InsuranceQA, while achieving similar performance for YahooQA, WikiQA and WikipassageQA. Overall, our proposed generative approach produces state-of-the-art results on the four tested datasets in all metrics.

Ranking with Passage Likelihood
A different setup that can be used for our approach is to compute the likelihood of the passage given the question, where the score for a candidate passage a_k is given by s(a_k) = p_θ(a_k|q), and the score needs to be normalized by the passage length |a_k|. This setup is inherently more difficult because the passage is normally much longer than the question and might contain many tokens that are not relevant to the question.
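The length normalization in this setup can be sketched as follows (the per-token log-probabilities are illustrative values):

```python
def normalized_passage_score(passage_token_logprobs):
    """s(a_k) = log p_theta(a_k | q) / |a_k|: dividing by the passage
    length keeps longer passages from being penalized merely for
    having more tokens to generate."""
    return sum(passage_token_logprobs) / len(passage_token_logprobs)

short = normalized_passage_score([-0.5, -0.5])   # 2 tokens
long = normalized_passage_score([-0.5] * 10)     # 10 tokens, same per-token fit
print(short, long)  # both -0.5 after normalization
```

Without the division, the 10-token passage would score five times lower despite fitting the question equally well per token.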
In Table 3, we present experimental results where we compare the use of either the passage or the question as the conditional context. As expected, using the likelihood of the passage given the question (p_θ(a_k|q)) as the score results in worse performance for both fine-tuning approaches: LUL and RLL.

Question and Passage Generation
A good side effect of using generative models to perform ranking is that we can use the trained model to generate new questions given a passage, and vice versa (depending on the conditioning context used for fine-tuning). This type of synthetically generated data could be used as additional training data to improve discriminative models such as BERT-PR (Xu et al., 2019). In Tables 4 and 5, we present some examples of questions and passages, respectively, that were generated using our fine-tuned GPT2-large LUL LM. In both cases we use a mixture of top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) to generate the samples. Please note that the passages in Table 4 were extracted from the test set and are not present in the training set. The same applies to the question used in Table 5.
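The combined top-k / nucleus filtering can be sketched over a toy next-token distribution; the k and p values below are illustrative, not the ones used in our experiments:

```python
def top_k_top_p_filter(probs, k=3, p=0.9):
    """Keep the k most probable tokens, then truncate further to the
    smallest high-probability prefix whose mass reaches p (nucleus),
    and renormalize the survivors before sampling."""
    items = sorted(probs.items(), key=lambda kv: -kv[1])[:k]
    kept, mass = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        mass += pr
        if mass >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

dist = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_top_p_filter(dist))  # "d" is filtered out; the rest renormalized
```

Sampling from the filtered distribution trades some diversity for fluency by removing the unreliable low-probability tail.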
In Table 4, we can see that the generated questions are very fluent and, for most questions (except for the ones in italics), the input passage contains the answer to the question. In Table 5, we can observe that the generated passages are quite related to the input question. However, the content is normally not factual and contains inconsistencies and some repetitions.

Conclusion
We have proposed a new generative approach for IR based on large pretrained neural language models, and demonstrated their effectiveness as rankers by providing robust experimental results on four different datasets. Additionally, we demonstrated that unlikelihood-based losses are effective for allowing the use of negative examples in generative information retrieval. We believe that our approach can also be effectively used for text classification problems, where the score of a class label c is computed as the likelihood of generating the class label c given the document d, p(c|d).

Table 1 :
Dataset statistics. #Q stands for the number of questions and #P/Q is the average number of passages per question.

Table 2 :
Experimental results for different passage ranking models and datasets.

Table 3 :
Experimental results on using the passage vs. the question as the conditional context. Results are computed on the WikipassageQA dataset.

Table 5 :
Examples of automatically generated passages using the GPT2-large LUL model fine-tuned on the WikipassageQA dataset with likelihood p_θ(a|q). The question was extracted from the test set.