Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval

This paper applies BERT to ad hoc document retrieval on news articles, which requires addressing two challenges: relevance judgments in existing test collections are typically provided only at the document level, and documents often exceed the length that BERT was designed to handle. Our solution is to aggregate sentence-level evidence to rank documents. Furthermore, we are able to leverage passage-level relevance judgments fortuitously available in other domains to fine-tune BERT models that are able to capture cross-domain notions of relevance, and can be directly used for ranking news articles. Our simple neural ranking models achieve state-of-the-art effectiveness on three standard test collections.


Introduction
The dominant approach to ad hoc document retrieval using neural networks today is to deploy the neural model as a reranker over an initial list of candidate documents retrieved using a standard bag-of-words term-matching technique. Despite the plethora of neural models that have been proposed for document ranking (Mitra and Craswell, 2019), there has recently been some skepticism about whether they have truly advanced the state of the art (Lin, 2018), at least in the absence of large amounts of behavioral log data only available to search engine companies.
In a meta-analysis of over 100 papers that report results on the dataset from the Robust Track at TREC 2004 (Robust04), Yang et al. (2019a) found that most neural approaches do not compare against competitive baselines. To provide two recent examples, McDonald et al. (2018) report best AP scores of 0.272 and 0.290, compared to a simple bag-of-words query expansion baseline that achieves 0.299 (Lin, 2018). Further experiments by Yang et al. (2019a) achieve 0.315 under more rigorous experimental conditions with a neural ranking model, but this is still well below the best-known score of 0.3686 on this dataset (Cormack et al., 2009).
Although Sculley et al. (2018) remind us that the goal of science is not wins, but knowledge, the latter requires first establishing strong baselines that accurately quantify proposed contributions. Comparisons to weak baselines that inflate the merits of an approach are not a new problem in information retrieval (Armstrong et al., 2009), and researchers have in fact observed similar issues in the recommender systems literature as well (Rendle et al., 2019; Dacrema et al., 2019).
Having placed evaluation on more solid footing with respect to well-tuned baselines by building on previous work, this paper examines how we might make neural approaches "work" for document retrieval. One promising recent innovation is models that exploit massive pre-training (Peters et al., 2018; Radford et al., 2018), with BERT (Devlin et al., 2019) as the most popular example today. Researchers have applied BERT to a broad range of NLP tasks with impressive gains: most relevant to our document ranking task, these include BERTserini (Yang et al., 2019b) for question answering and Nogueira and Cho (2019) for passage reranking.
Extending our own previous work (Yang et al., 2019c), the main contribution of this paper is a successful application of BERT to yield large improvements in ad hoc document retrieval. We introduce two simple yet effective innovations: First, we focus on integrating sentence-level evidence for document ranking to address the fact that BERT was not designed for processing long spans of text. Second, we show, quite surprisingly, that it is possible to transfer models of relevance across different domains, which nicely solves the problem of the lack of passage-level relevance annotations. Combining these two innovations allows us to achieve 0.3697 AP on Robust04, which is the highest reported score that we are aware of (neural or otherwise). We establish state-of-the-art effectiveness on two more recent test collections, Core17 and Core18, as well.

Background and Approach
To be clear, our focus is on neural ranking models for ad hoc document retrieval over corpora comprising news articles. Formally, in response to a user query Q, the system's task is to produce a ranking of documents from a corpus that maximizes some ranking metric, in our case average precision (AP). We emphasize that this problem is quite different from web search, where there is no doubt that large amounts of behavioral log data, along with other signals such as the webgraph, have led to large improvements in search quality (Mitra and Craswell, 2019). Instead, we are interested in limited data conditions, i.e., what can be achieved with modest resources outside of web search engine companies such as Google and Microsoft, which have large annotation teams; hence we only consider "classic" TREC newswire test collections.
Beyond our own previous work (Yang et al., 2019c), which to our knowledge is the first reported application of BERT to document retrieval, there have been several other proposed approaches, including MacAvaney et al. (2019) and unrefereed manuscripts (Qiao et al., 2019; Padigela et al., 2019). There are two challenges to applying BERT to document retrieval: First, BERT has mostly been applied to sentence-level tasks and was not designed to handle long spans of input, with a maximum input length of 512 tokens. For reference, documents retrieved by a typical bag-of-words query on Robust04 have a median length of 679 tokens, and 66% of them are longer than 512 tokens. Second, relevance judgments in nearly all newswire test collections are annotations on documents, not on individual sentences or passages. That is, given a query, we only know which documents are relevant, not the spans within those documents. Typically, a document is considered relevant as long as some part of it is relevant, and in fact most of the document may not address the user's needs. Given these two challenges, it is not immediately obvious how to apply BERT to document ranking.

Key Insights
We propose two innovations that address the challenges described above. First, we observe the existence of test collections that fortuitously contain passage-level relevance evidence: the MAchine Reading COmprehension (MS MARCO) dataset (Bajaj et al., 2018), the TREC Complex Answer Retrieval (CAR) dataset (Dietz et al., 2017), and the TREC Microblog (MB) datasets (Lin et al., 2014), described below.
MS MARCO features user queries sampled from Bing's search logs and passages extracted from web documents. Each query is associated with sparse relevance judgments by human editors. TREC CAR uses queries and paragraphs extracted from English Wikipedia: each query is formed by concatenating an article title and a section heading, and passages in that section are considered relevant. This makes CAR, essentially, a synthetic dataset. TREC Microblog datasets draw from the Microblog Tracks at TREC from 2011 to 2014, with topics (i.e., queries) and relevance judgments over tweets. We use the dataset prepared by Rao et al. (2019).
Note, however, that all three datasets are out of domain with respect to our task: MS MARCO passages are extracted from web pages, CAR paragraphs are taken from Wikipedia, and MB uses tweets. These corpora clearly differ from news articles to varying degrees. Furthermore, while MS MARCO and MB capture search tasks (although queries over tweets are qualitatively different), CAR "queries" (Wikipedia headings) do not reflect search queries from real users. Nevertheless, our experiments arrive at the surprising conclusion that these datasets are useful to train neural ranking models for news articles.
Our second innovation involves the aggregation of sentence-level evidence for document ranking. That is, given an initial ranked list of documents, we segment each into sentences and then apply inference over each sentence separately, after which sentence-level scores are aggregated to yield a final score for ranking documents. This approach is, in fact, well motivated: There is a long thread of work in the information retrieval literature, dating back decades, that leverages passage retrieval techniques for document ranking (Hearst and Plaunt, 1993; Callan, 1994; Kaszkiel and Zobel, 1997; Clarke et al., 2000). In addition, recent studies of human searchers (Zhang et al., 2018b,a) revealed that the "best" sentence or paragraph in a document provides a good proxy for document-level relevance.

Model Details
The core of our model is a BERT sentence-level relevance classifier. Following Nogueira and Cho (2019), this is framed as a binary classification task. We form the input to BERT by concatenating the query Q and a sentence S into the sequence [[CLS], Q, [SEP], S, [SEP]] and padding each sequence in a mini-batch to the maximum length in the batch. We feed the final hidden state corresponding to the [CLS] token to a single-layer neural network whose output represents the probability that sentence S is relevant to the query Q.
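To make this setup concrete, below is a minimal sketch of scoring a single (query, sentence) pair with a pre-trained BERT model and a classification head, using the Hugging Face transformers library as an assumption (the paper's actual implementation lives in the Birch codebase). Note that the classification head here is randomly initialized; in the approach described above it would first be fine-tuned on MS MARCO, CAR, and/or MB.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
)
model.eval()

def sentence_relevance(query: str, sentence: str) -> float:
    """Return P(relevant) for a (query, sentence) pair."""
    # Builds [CLS] query [SEP] sentence [SEP], truncating to BERT's 512-token limit.
    inputs = tokenizer(query, sentence, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over {not relevant, relevant}; index 1 is the relevant class.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```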
To determine document relevance, we apply inference over each individual sentence in a candidate document, and then combine the top n sentence scores with the original document score as follows:

$$S_f = \alpha \cdot S_{\mathrm{doc}} + (1 - \alpha) \cdot \sum_{i=1}^{n} w_i \cdot S_i \quad (1)$$

where $S_{\mathrm{doc}}$ is the original document score and $S_i$ is the score of the $i$-th top-scoring sentence according to BERT. In other words, the relevance score of a document comes from the combination of a document-level term-matching score and evidence contributions from the top sentences in the document as determined by the BERT model. The parameters $\alpha$ and the $w_i$'s can be tuned via cross-validation.
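As an illustration, here is a minimal sketch of the score aggregation in Eq. (1); the values of alpha and the per-sentence weights are hypothetical, since in practice they are tuned via cross-validation.

```python
# alpha and w are hypothetical placeholders; the paper tunes them via cross-validation.
def document_score(doc_score, sentence_scores, alpha=0.6, w=(1.0, 0.5, 0.2), n=3):
    """Eq. (1): interpolate the original document score with the top-n sentence scores."""
    top = sorted(sentence_scores, reverse=True)[:n]
    return alpha * doc_score + (1 - alpha) * sum(w_i * s_i for w_i, s_i in zip(w, top))

# Example: a BM25+RM3 document score of 12.3 and BERT scores for four sentences.
print(document_score(12.3, [0.91, 0.75, 0.40, 0.10]))
```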

Experimental Setup
We begin with BERT Large (uncased, 340M parameters) from Devlin et al. (2019), and then fine-tune on the collections described in Section 2.1, individually and in combination. Although tweets are not always well-formed sentences, and MS MARCO and CAR passages, while short, may in fact contain more than one sentence, we treat all texts as if they were sentences for the purposes of fine-tuning BERT. For MS MARCO and CAR, we adopt exactly the procedure of Nogueira and Cho (2019). For MB, we fine-tune on 2011-2014 data, with 75% of the total data reserved for training and the rest for validation. We use the maximum sequence length of 512 tokens in all experiments. We train all models using cross-entropy loss for 5 epochs with a batch size of 16. We use Adam (Kingma and Ba, 2014) with an initial learning rate of $1 \times 10^{-5}$, linear learning rate warmup at a proportion of 0.1, and decay of 0.1. All experiments are conducted on NVIDIA Tesla P40 GPUs with PyTorch v1.2.0.
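A sketch of the corresponding optimizer and scheduler setup is shown below, using PyTorch's Adam and the Hugging Face transformers scheduler utilities as assumptions; interpreting the warmup/decay setting as a linear warmup over the first 10% of steps followed by linear decay is also an assumption, and the dataset size is a placeholder.

```python
import torch
from torch.optim import Adam
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2)

EPOCHS, BATCH_SIZE = 5, 16
num_examples = 100_000                        # placeholder size of the fine-tuning set
total_steps = (num_examples // BATCH_SIZE) * EPOCHS

optimizer = Adam(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over the first 10% of steps
    num_training_steps=total_steps,           # followed by linear decay
)
loss_fn = torch.nn.CrossEntropyLoss()
```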
During inference, we first retrieve an initial ranked list of documents to depth 1000 from the collection using the Anserini toolkit (post-v0.5.1 commit from mid-August 2019, based on Lucene 8.0). Following Lin (2018) and Yang et al. (2019a), we use BM25 with RM3 query expansion (default parameters), which is a strong baseline that has already been shown to beat most existing neural models. We clean the retrieved documents by stripping any HTML/XML tags and splitting each document into its constituent sentences with NLTK. If the length of a sentence plus the meta-tokens exceeds BERT's maximum input length of 512, we further segment the span into fixed-size chunks. All sentences are then fed to the BERT model.
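A minimal sketch of this inference pipeline follows, assuming Pyserini (the Python interface to Anserini) and NLTK; the prebuilt index name, example query, and tag stripping are illustrative rather than the paper's exact code.

```python
import re
import nltk
from nltk.tokenize import sent_tokenize
from pyserini.search.lucene import LuceneSearcher

nltk.download("punkt", quiet=True)

searcher = LuceneSearcher.from_prebuilt_index("robust04")  # illustrative index name
searcher.set_bm25(0.9, 0.4)   # default Anserini BM25 parameters
searcher.set_rm3()            # RM3 query expansion with default parameters

hits = searcher.search("international organized crime", k=1000)
for hit in hits:
    raw = searcher.doc(hit.docid).raw()
    text = re.sub(r"<[^>]+>", " ", raw)       # strip HTML/XML tags
    sentences = sent_tokenize(text)
    # Each (query, sentence) pair is then scored by the BERT classifier,
    # with overly long sentences further segmented into fixed-size chunks.
```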
We conduct end-to-end document ranking experiments on three TREC newswire collections: the Robust Track from 2004 (Robust04) and the Common Core Tracks from 2017 and 2018 (Core17 and Core18). Robust04 comprises 250 topics, with relevance judgments on a collection of 500K documents (TREC Disks 4 and 5). Core17 and Core18 have only 50 topics each; the former uses 1.8M articles from the New York Times Annotated Corpus while the latter uses around 600K articles from the TREC Washington Post Corpus. Note that none of these collections were used to fine-tune the BERT relevance models; the only learned parameters are the weights in Eq (1).
Based on preliminary exploration, we consider up to the top three sentences; considering more does not appear to yield better results. For Robust04, we follow the five-fold cross-validation settings in Lin (2018) and Yang et al. (2019a); for Core17 and Core18 we similarly apply five-fold cross-validation. The parameters $\alpha$ and the $w_i$'s are learned via exhaustive grid search as follows: we fix $w_1 = 1$ and then vary $\alpha, w_2, w_3 \in [0, 1]$ with a step size of 0.1, selecting the parameters that yield the highest average precision (AP). Retrieval results are reported in terms of AP, precision at rank 20 (P@20), and NDCG@20.
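A sketch of this exhaustive grid search is given below; `evaluate_ap` is a hypothetical helper that reranks the training-fold candidate lists with Eq. (1) for a given parameter setting and returns AP.

```python
import itertools

def grid_search(evaluate_ap, step=0.1):
    """Fix w1 = 1 and sweep alpha, w2, w3 over [0, 1], keeping the setting with the best AP."""
    grid = [round(i * step, 1) for i in range(int(1 / step) + 1)]   # 0.0, 0.1, ..., 1.0
    best_ap, best_params = -1.0, None
    for alpha, w2, w3 in itertools.product(grid, repeat=3):
        ap = evaluate_ap(alpha=alpha, w=(1.0, w2, w3))
        if ap > best_ap:
            best_ap, best_params = ap, {"alpha": alpha, "w": (1.0, w2, w3)}
    return best_params, best_ap
```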
Code for replicating all the experiments described in this paper is available as part of our recently developed Birch IR engine.

Results and Discussion
Our main results are shown in Table 1. The top row shows the BM25+RM3 query expansion baseline using default Anserini parameters. The remaining blocks display the ranking effectiveness of our models on Robust04, Core17, and Core18. In parentheses we describe the fine-tuning procedure: for instance, MS MARCO → MB refers to a model that was first fine-tuned on MS MARCO and then on MB. The nS preceding the model name indicates that inference was performed using the top n scoring sentences from each document. Table 1 also includes results of significance testing using paired t-tests, comparing each condition with the BM25+RM3 baseline. We report significance at the p < 0.01 level, with appropriate Bonferroni corrections for multiple hypothesis testing. Statistically significant differences with respect to the baseline are denoted by †.
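For reference, a sketch of the significance test is shown below, assuming SciPy; the per-topic AP lists and the number of comparisons are placeholders supplied by the caller.

```python
from scipy import stats

def significant(baseline_ap, model_ap, num_comparisons, alpha=0.01):
    """Paired t-test on per-topic AP, with a Bonferroni correction for multiple comparisons."""
    _, p_value = stats.ttest_rel(model_ap, baseline_ap)
    return p_value < alpha / num_comparisons
```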
We find that BERT fine-tuned on MB alone significantly outperforms the BM25+RM3 baseline for all three metrics on Robust04. On Core17 and Core18, we observe significant increases in AP as well (and other metrics in some cases). In other words, relevance models learned from tweets successfully transfer over to news articles despite large differences in domain. This surprising result highlights the relevance matching power introduced by the deep semantic information learned by BERT.
Fine-tuning on MS MARCO or CAR alone yields at most minor gains over the baselines across all collections, and in some cases actually hurts effectiveness. Furthermore, the number of sentences considered for final score aggregation does not seem to affect effectiveness. It also does not appear that the synthetic nature of CAR data helps much for relevance modeling on newswire collections. Interestingly, though, if we fine-tune on CAR and then MB (CAR → MB), we obtain better results than fine-tuning on either MS MARCO or CAR alone. In some cases, we slightly improve over fine-tuning on MB alone. One possible explanation could be that CAR has an effect similar to language model pre-training; it alone cannot directly help the downstream document retrieval task, but it provides a better representation that can benefit from MB fine-tuning.
However, we were surprised by the MS MARCO results: since the dataset captures a search task and the web passages are "closer" to our newswire collections than MB in terms of domain, we would have expected relevance transfer to be more effective. Results show, however, that fine-tuning on MS MARCO alone is far less effective than fine-tuning on MB alone.
Looking across all fine-tuning configurations, we see that the top-scoring sentence of each candidate document alone seems to be a good indicator of document relevance, corroborating the findings of Zhang et al. (2018a). Additionally considering the second-ranking sentence yields at most a minor gain, and in some cases, adding a third actually causes effectiveness to drop. This is quite a surprising finding, since it suggests that the document ranking problem, at least as traditionally formulated by information retrieval researchers, can be distilled into relevance prediction primarily at the sentence level.
In the final block of the table, we present our best model, fine-tuned first on MS MARCO and then on MB. We confirm that this approach is able to exploit both datasets, with a score that is higher than fine-tuning on either dataset alone. Let us provide some broader context for these scores: For Robust04, we report the highest AP score that we are aware of (0.3697). Prior to our work, the meta-analysis by Yang et al. (2019a), which analyzed over 100 papers up until early 2019 (see https://github.com/lintool/robust04-analysis), put the best neural model at 0.3124 (Dehghani et al., 2018), setting aside our own previous work (Yang et al., 2019c). Furthermore, our results exceed the previous highest known score of 0.3686, achieved by a non-neural method based on ensembles (Cormack et al., 2009). This high-water mark has stood unchallenged for ten years.
Recently, MacAvaney et al. (2019) reported 0.5381 NDCG@20 on Robust04 by integrating contextualized word representations into existing neural ranking models; unfortunately, they did not report AP results. Our best NDCG@20 on Robust04 (0.5325) approaches their results even though we optimize for AP. Finally, note that since we are only using Robust04 data for learning the document and sentence weights in Eq (1), and not for fine-tuning BERT itself, it is less likely that we are overfitting.
Our best model also achieves a higher AP on Core17 than the best TREC submission that does not make use of past labels or human intervention (umass baselnrm, 0.275 AP) (Allan et al., 2017). Under similar conditions, we beat every TREC submission in Core18 as well (with the best run being uwmrg, 0.276 AP) (Allan et al., 2018). Core17 and Core18 are relatively new and thus have yet to receive much attention from researchers, but to our knowledge, these figures represent the state of the art.

Conclusion
This paper shows how BERT can be adapted in a simple manner to yield large improvements in ad hoc document retrieval on "classic" TREC newswire test collections. Our results demonstrate two surprising findings: first, that relevance models can be transferred quite straightforwardly across domains by BERT, and second, that effective document retrieval requires only "paying attention" to a small number of "top sentences" in each document. Important future work includes more detailed analyses of these transfer effects, as well as a closer look at the contributions of document-level and sentence-level scores. Nevertheless, we believe that both findings pave the way for new directions in document ranking.