Long Document Ranking with Query-Directed Sparse Transformer

The computing cost of transformer self-attention often necessitates breaking long documents into pieces to fit pretrained models in document ranking tasks. In this paper, we design Query-Directed Sparse attention that induces IR-axiomatic structures in transformer self-attention. Our model, QDS-Transformer, enforces the principled properties desired in ranking: local contextualization, hierarchical representation, and query-oriented proximity matching, while it also enjoys efficiency from sparsity. Experiments on one fully supervised and three few-shot TREC document ranking benchmarks demonstrate the consistent and robust advantage of QDS-Transformer over previous approaches, which either retrofit long documents into BERT or use sparse attention without emphasizing IR principles. We further quantify the computing complexity and demonstrate that our sparse attention with the TVM implementation is twice as efficient as fully-connected self-attention. All source code, trained models, and predictions of this work are available at https://github.com/hallogameboy/QDS-Transformer.


Introduction
Pre-trained Transformers such as BERT (Devlin et al., 2019) effectively transfer language understanding to better relevance estimation in many search ranking tasks. Nevertheless, the effectiveness comes at a quadratic O(n^2) computing cost with respect to the text length n, prohibiting direct application to long documents. Prior work adopts quick workarounds such as document truncation or splitting-and-pooling to retrofit the document ranking task to pretrained transformers. While there have been successes with careful architecture design, those band-aid solutions inevitably introduce information loss and create complicated system pipelines.
Intuitively, effective document ranking does not require fully connected self-attention between all query and document terms. The relevance matching between queries and documents often takes place at text segments as opposed to individual tokens (Callan, 1994; Jiang et al., 2019), suggesting that a document term may not need information thousands of words away (Metzler and Croft, 2005; Child et al., 2019), and that not all document terms are useful for calculating relevance to the query (Xiong et al., 2017). The fully connected attention matrix includes many unlikely connections that create efficiency debt in computing, inference time, parameter size, and training convergence.
This paper presents Query-Directed Sparse Transformer (QDS-Transformer) for long document ranking. In contrast to retrofitted solutions, QDS-Transformer fundamentally considers the desirable properties for assessing relevance by focusing on the attention paths that matter. Using sparse local attention (Child et al., 2019), our model removes unnecessary connections between distant document tokens. Using global attention upon sentence boundaries, our model further incorporates the hierarchical structures within documents. Last but not least, we use global attention on all query terms to direct the focus to the relevance matches between query-document term pairs. These three attention patterns in our Query-Directed Sparse attention, as illustrated in Figure 1, permit global dissemination of IR-axiomatic information while keeping computation compact and essential.
In our experiments with the TREC Deep Learning Track (Craswell et al., 2020) and three more few-shot document ranking benchmarks (Zhang et al., 2020), QDS-Transformer consistently improves over the standard retrofitting BERT ranking baselines (e.g., max-pooling on paragraphs) by 5% NDCG. It also shows gains over more recent transformer architectures that induce various sparse structures, including Sparse Transformer, Longformer, and Transformer-XH, as they were not designed to incorporate the essential information required in document ranking. We also thoroughly quantify the efficiency improvement from our query-directed sparsity, showing that with TVM support (Chen et al., 2018), different sparse attention patterns lead to different training and inference speed-ups, and in general QDS-Transformer enjoys a 200%+ speed-up compared to vanilla BERT on long documents.
Our visualization also shows interesting learned attention patterns in QDS-Transformer. Similar to the observation on BERT in the NLP pipeline (Tenney et al., 2019), in lower QDS-Transformer layers the attention focuses more on learning local interactions and document hierarchies, while in higher layers the model focuses more on relevance matching with the query terms. We also show examples where QDS attention may center on the sole sentence that directly answers the query, or may span across several sentences that cover different aspects of the query, depending on the scope of the intent; this brings the advantage of better interpretability based on sparse attention.

Related Work
Neural models have demonstrated significant advances across various ranking tasks (Guo et al., 2019). Early approaches investigated diverse ways to capture relevance between queries and documents (Guo et al., 2016; Xiong et al., 2017; Dai et al., 2018; Hui et al., 2017). More recently, the state of the art in many text ranking tasks has been taken by BERT or other pretrained language models (Devlin et al., 2019; Dai and Callan, 2019; Craswell et al., 2020), when sufficient relevance labels are available for fine-tuning (e.g., on MS MARCO (Bajaj et al., 2016)).
The improved effectiveness comes with the cost of computing efficiency with deep pretrained transformers, especially on long documents. This stimulates studies investigating ways to retrofit long documents to BERT's maximum sequence length limit (512 tokens). A vanilla strategy is to truncate or split the documents: Dai and Callan (2019) applied BERT ranking on each passage segmented from the document independently and explored different ways to combine the passage ranking scores, using the score of the first passage (BERT-FirstP), the best passage (BERT-MaxP) (also studied in Yan et al. (2020)), or the sum of all passage scores (BERT-SumP).
More sophisticated approaches have also been developed to introduce structures to transformer attentions. Transformer-XL employs recurrence on a sequence of text pieces (Dai et al., 2019), Transformer-XH (Zhao et al., 2020) models a group of text sequences by linking them with eXtra Hop attention paths, and Transformer Kernel Long (TKL) (Hofstätter et al., 2020) uses a sliding window over the document terms and matches them with the query terms using matching kernels (Xiong et al., 2017).
On the efficiency front, Kitaev et al. (2020) proposed Reformer, which employs locality-sensitive hashing and reversible residual layers to improve the efficiency of Transformers. Child et al. (2019) introduced sparse transformers to reduce the quadratic complexity to O(L √ L) by applying sparse factorizations to the attention matrix, making the use of self-attention possible for extremely long sequences. Subsequent work (Sukhbaatar et al., 2019; Correia et al., 2019) leverages a similar idea in a more adaptive way. Combining local windowed attention with a task-motivated global attention, Beltagy et al. (2020) presented Longformer with an attention mechanism that scales linearly with sequence length.

BERT Ranker. The standard way to leverage pretrained BERT in document ranking is to concatenate the query q and the document d into one text sequence, feed it into BERT layers, and then score the document based on its relevance to the query with a linear layer on top of the last layer's [CLS] token: f(q, d) = Linear(h^L_[CLS]).
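As a minimal illustration of this [CLS]-based scoring head (a numpy sketch with illustrative, untrained weights, not the released implementation):

```python
import numpy as np

def relevance_score(h_cls, w, b):
    """Linear layer on the last layer's [CLS] representation;
    the scalar output ranks document d for query q."""
    return float(h_cls @ w + b)

dim = 768
rng = np.random.default_rng(0)
h_cls = rng.normal(size=dim)      # stand-in for H^L at the [CLS] position
w, b = rng.normal(size=dim), 0.0  # illustrative weights, learned in fine-tuning
score = relevance_score(h_cls, w, b)
assert np.isfinite(score)
```

In practice w and b are fine-tuned jointly with the BERT layers on (q, d) relevance labels.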
This BERT ranker can be fine-tuned using relevance labels on (q, d) pairs, just like a classification task, and has achieved strong performance in various text ranking benchmarks (Bajaj et al., 2016; Craswell et al., 2020). Transformer Layer.
More specifically, let {t_0, t_1, ..., t_i, ..., t_n} be the tokens in the concatenated q-d sequence, with query tokens t_{1:|q|} ∈ q and document tokens t_{|q|+1:n} ∈ d, considering the special tokens part of q or d. The l-th transformer layer in BERT takes the hidden representations of the previous layer (H^{l-1}), which is the embedding output for l = 1, and produces a new H^l as follows (Vaswani et al., 2017).
It first passes the previous representations through three projections (Eqn. 5):

Q, K, V = W^q H^{l-1}, W^k H^{l-1}, W^v H^{l-1},    (5)

and then calculates the attention matrix between all token pairs using their query-key similarity (Eqn. 4, in single-head formulation), restricted to the token pairs with A_{ij} = 1:

M = softmax(Q K^T / √dim).    (4)

The attention matrix M is then used to fuse all other tokens' value representations V, obtaining an updated representation for each position (Eqn. 2):

Ĥ^l = M V,    (2)

and in the end another feed-forward layer produces the final representation of this layer, H^l (Eqn. 1):

H^l = FFN(Ĥ^l).    (1)

The matrix A is the n × n adjacency matrix in which each entry indicates whether there is an attention path between the corresponding positions: A_{ij} = 1 means t_i queries the value of t_j using the key of t_j. In the standard Transformer and BERT, the attention paths are fully connected, thus A = 1 (the all-ones matrix).

Computation Complexity. In each BERT layer, the feed-forward operations (Eqns. 1 and 5) are applied to each individual token, leading to linear complexity w.r.t. the text length n and quadratic complexity w.r.t. the hidden dimension size dim. The self-attention operations in Eqns. 2 and 4 calculate attention strengths over all token pairs, leading to quadratic complexity w.r.t. the text length but linear complexity w.r.t. the hidden dimension size.
The complexity of one transformer layer in BERT thus includes two components: O(n · dim^2) from the feed-forward operations and O(n^2 · dim) from the self-attention.
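These steps can be sketched in a few lines of numpy (single-head, with the feed-forward step omitted; a minimal illustration rather than the actual implementation):

```python
import numpy as np

def self_attention_layer(H, Wq, Wk, Wv, A):
    """Single-head self-attention (Eqns. 2, 4, 5) with an attention
    adjacency matrix A: A[i, j] = 1 iff token i may attend to token j.
    The feed-forward step (Eqn. 1) is omitted for brevity."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv         # Eqn. 5: three projections
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # Eqn. 4: query-key similarity
    scores = np.where(A == 1, scores, -1e9)  # disallowed paths get ~zero weight
    M = np.exp(scores - scores.max(axis=-1, keepdims=True))
    M = M / M.sum(axis=-1, keepdims=True)    # row-wise softmax
    return M @ V                             # Eqn. 2: fuse value vectors

n, dim = 6, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(n, dim))
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))

# Fully connected attention, A = all-ones, as in standard BERT:
out = self_attention_layer(H, Wq, Wk, Wv, np.ones((n, n)))
assert out.shape == (n, dim)
```

Replacing the all-ones A with a sparse adjacency matrix is exactly the lever QDS-Transformer pulls in the next section.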

QDS-Transformer
Recent research has shown that, with sufficient training and fully-connected self-attention, BERT learns attention patterns that capture meaningful structures in language (Clark et al., 2019) or for specific tasks (Zhao et al., 2020). However, this is not yet the case in long document ranking, where computing becomes the bottleneck. This section first presents how we overcome this bottleneck by injecting IR-specific inductive biases as sparse attention patterns, and then discusses the efficient implementation of sparse attention.

Query-Directed Sparse Attention

Mathematically, inducing sparsity in self-attention means modifying the attention adjacency matrix A to keep only the connections that are meaningful for the task. For document retrieval, we include two groups of informative connections as sparse adjacency matrices: local attention and query-directed global attention.

Local Attention
Intuitively, it is unlikely that a token needs to see another token thousands of positions away to learn its contextual representation, especially in the lower transformer layers, which are more about syntax and less about long-range dependencies (Tenney et al., 2019). We follow this intuition from the Sparse Transformer (Child et al., 2019) and define the following local attention paths: A^local_{ij} = 1 iff |i − j| ≤ w/2. This only allows a token to see another token in each transformer layer if the two are within w/2 positions of each other, with w the window size. The local attention serves as the backbone for many sparse transformer variations, as it provides the basic local contextual information.
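A sketch of this banded adjacency matrix (w is the paper's window-size notation; the helper name is ours):

```python
import numpy as np

def local_attention_mask(n, w):
    """A_local[i, j] = 1 iff |i - j| <= w // 2: each token only sees
    neighbors within half the window size w on either side."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= w // 2).astype(int)

A_local = local_attention_mask(n=8, w=4)
# Each row has at most w + 1 non-zero entries, so the attention cost
# becomes O(n * w) instead of O(n^2).
assert A_local.sum(axis=1).max() == 5
```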

Query-Directed Global Attention
The local attention itself does not fully capture the relevance matches between the query and the document. We introduce several query-directed attention patterns to incorporate inductive biases widely used in document representation and ranking.

Hierarchical Document Structures. A common intuition in document representation is to leverage the hierarchical structures within documents, for example, words, sentences, paragraphs, and sections, and to compose them into hierarchical attention networks (Yang et al., 2016). We use a two-level word-sentence-document hierarchy and inject this hierarchical structure by adding fully connected attention paths to all the sentences. Specifically, we first prepend a special token [SOS] (start-of-sentence) to each sentence in the document, and form the following attention connections: A^sent_{ij} = 1 iff t_i or t_j is an [SOS] token.

Matching with the Query. For retrieval tasks, arguably the most important principle is to capture the semantic matching between queries and documents. Inducing this information is as simple as adding dedicated attention paths on query terms: A^query_{ij} = 1 iff t_j ∈ q. It allows each token to see all query terms so as to learn query-dependent representations.
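Both patterns are instances of symmetric global attention on a set of positions; a small sketch (the [SOS] and query positions below are hypothetical, chosen only for illustration):

```python
import numpy as np

def global_attention_mask(n, global_positions):
    """Symmetric global attention: tokens at `global_positions`
    (e.g. [SOS] markers or query terms) see every token, and every
    token sees them."""
    A = np.zeros((n, n), dtype=int)
    A[global_positions, :] = 1  # global tokens attend everywhere
    A[:, global_positions] = 1  # every token attends to global tokens
    return A

n = 10
A_sent = global_attention_mask(n, [4, 7])      # hypothetical [SOS] positions
A_query = global_attention_mask(n, [0, 1, 2])  # hypothetical query positions
assert A_sent[4].all() and A_sent[:, 7].all()
```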

Summary
The three attention patterns together form the query-directed sparse attention in QDS-Transformer: A^QDS = A^local ∪ A^sent ∪ A^query. We also add global attention between all other tokens and [CLS]. Keeping everything else standard in BERT and using this query-directed sparse attention (A^QDS) in place of the fully-connected self-attention (A), we obtain the QDS-Transformer architecture illustrated in Figure 2. Interestingly, QDS-Transformer also resembles various effective IR axioms developed in past decades. For example, in QDS attention, a query term mainly focuses on the [SOS] tokens through A^sent, while each [SOS] token recaps the proximity matches (Callan, 1994) locally around it through A^local. The local attention in the query part also resembles effective phrase matching (Metzler and Croft, 2005), as the query term representations are contextualized using other query terms through A^local.
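The union of the patterns can be sketched on a toy layout (the token positions below are hypothetical, chosen only for illustration):

```python
import numpy as np

n, w = 12, 4
idx = np.arange(n)
A_local = np.abs(idx[:, None] - idx[None, :]) <= w // 2  # banded local window

def make_global(positions):
    """Symmetric global attention on the given positions."""
    A = np.zeros((n, n), dtype=bool)
    A[positions, :] = True
    A[:, positions] = True
    return A

# Hypothetical layout: [CLS] q q q [SOS] d d d [SOS] d d d
A_cls = make_global([0])
A_query = make_global([1, 2, 3])
A_sent = make_global([4, 8])

# Union of the three patterns (plus [CLS]) gives the QDS attention mask.
A_qds = A_local | A_query | A_sent | A_cls
sparsity = A_qds.sum() / (n * n)
assert 0 < sparsity < 1  # strictly sparser than full attention
```

On realistic lengths (n in the thousands) the non-zero fraction drops far lower than on this toy example, since the global rows and columns stay constant while n grows.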

Efficient Sparsity Implementation
Our query-directed sparse attention reduces the self-attention complexity from O(n^2 · dim) to O(n · dim · (w + |q| + |s|)), where the local window size w and query length |q| are constant with respect to the document length, and the number of sentences |s| is orders of magnitude smaller than n.
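A back-of-the-envelope comparison under the paper's setting (|q| = 32 and |s| = 64 are assumed values for illustration only):

```python
# Rough per-layer attention cost (multiply-accumulates), comparing
# full self-attention with the query-directed sparse pattern:
# n = 2048 tokens, dim = 768, window w = 128, assumed |q| = 32, |s| = 64.
n, dim, w, q, s = 2048, 768, 128, 32, 64

full = n * n * dim              # O(n^2 * dim)
sparse = n * dim * (w + q + s)  # O(n * dim * (w + |q| + |s|))

assert sparse / full == (w + q + s) / n
print(f"sparse attention cost is {full / sparse:.1f}x smaller")  # prints 9.1x
```

The actual wall-clock speed-up is smaller (the feed-forward cost is unchanged, and kernel efficiency varies), which Section "Model Efficiency" quantifies empirically.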
However, implementing this sparsity efficiently on GPU is not straightforward: naively masking the fully-connected attention matrix still incurs the full quadratic computation, so we instead build custom sparse kernels with TVM (Chen et al., 2018).

Baselines. For ad-hoc retrieval, we also consider CO-PACRR (Hui et al., 2018), which employs CNNs without using a pretrained language model (non-PLM). Note that IDST (Yan et al., 2020) is not comparable because it exploits external generators for document expansion. For the few-shot learning task, we additionally compare with SDM, RankSVM, Coor-Ascent, and Conv-KNRM as reported in previous studies (Xiong et al., 2017; Dai et al., 2018). More details of the baselines can be found in Appendix B.
Implementation Details. We implement all methods with PyTorch (Paszke et al., 2019) and the Hugging Face transformer library (Wolf et al., 2019), except for the baselines whose previously reported scores we reuse. For sparse attention, we implement a custom CUDA kernel in PyTorch using TVM (Chen et al., 2018). Models are optimized by the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 10^-5, (β1, β2) = (0.9, 0.999), and a dropout rate of 0.1. The dev set is used for hyperparameter tuning to decide the best model, which is then applied to the test set. We set the maximum length of input sequences to 2,048. The dimension of the dense layer F_dense(·) in relevance estimation is 768, while the local attention window size w is 128. All experiments are conducted on an Nvidia DGX-1 server with 512 GB memory and 8 Tesla V100 GPUs. Each method is limited to one GPU for fair comparison.

Evaluation Results
This section evaluates QDS-Transformer in terms of effectiveness, attention patterns, and efficiency. We also analyze the learned query-directed attention weights and present case studies. Table 2 summarizes the retrieval effectiveness on the TREC-19 DL benchmark. Table 3 shows the few-shot performance on the three TREC datasets. QDS-Transformer consistently outperforms the baseline methods on all datasets in both experimental settings. Note that the higher MAP scores of some methods in TREC-19 DL are because they have better first-stage retrieval and do not use the same reranking setting. QDS-Transformer outperforms the best BERT-based TREC run by 3.25% in NDCG@10 and is more effective than the concurrent sliding-window approach, TKL. Moreover, QDS-Transformer outperforms RoBERTa (MaxP), the standard retrofitted method for BERT, by 6% in NDCG@10 while also being a unified framework.

Retrieval Effectiveness
Compared with Sparse Transformer and Longformer-QA, QDS-Transformer provides more than 5% improvement on nearly all datasets. The best baseline is Transformer-XH, which creates structural sparsity by breaking a document into segments and introduces effective eXtra-hop attention to jointly model the relevance of those segments. While these methods show competitive effectiveness, especially with our TVM implementation, QDS-Transformer is consistently more accurate through its query-directed sparse attention patterns in all evaluation settings.

Effectiveness of Attention Patterns
This experiment studies the contribution of the query-directed sparse attention patterns to QDS-Transformer's effectiveness. Table 4 shows the ablation results of the three attention patterns on the TREC-19 DL benchmark: local attention only (A^local, Sparse Transformer), hierarchical attention on sentences only (A^sent, QDS-Transformer (S)), and query-oriented attention only (A^query, QDS-Transformer (Q)). All three sparse attention patterns contribute. As expected, query-oriented attention is the most effective at capturing the relevance match between query and document. Note that RoBERTa (MaxP) and Transformer-XH also attend to queries, but their attention is more localized, as the document is broken into separate text pieces and the query is concatenated with each of them. In comparison, QDS-Transformer mimics proximity matches and captures the global hierarchical structure of the document using dedicated attention from query terms to sentences.

Figure 3 depicts the change in retrieval effectiveness when varying the local attention window size. Both NDCG@10 and MAP@10 grow at a steady pace starting from a window size of 32 and peak at 128, but no additional gain is observed with bigger window sizes. The information from a term 512 tokens away does not provide many signals for relevance matching and is safely pruned in QDS-Transformer. Note that the dip at window size 1024 is because our model is initialized from RoBERTa, which is pretrained on only 512 tokens.

Model Efficiency
This experiment benchmarks the efficiency of different sparse attention patterns. Their training and inference time (ms per query-document pair, or MSpP) is shown in Table 5.
RoBERTa on 2,048 tokens is prohibitive; we could only measure its time with random parameters, as we were not able to actually train it. Retrofitting was thus a natural choice to leverage pretrained models.

Sparsity helps. Sparse-Transformer (128) is much faster than MaxP. Interestingly, although its attention matrix has only 4.56% non-zero entries, it is only about five times faster than full attention and merely on par with the retrofitted solutions; this is due to the unavoidable cost of the feed-forward operations. This effect is also reflected in the efficiency of QDS-Transformer with different local window sizes.

Different sparsity patterns dramatically influence the optimization in TVM. Intuitively, patterns with more regular shapes are easier for TVM to optimize than more customized connections. For example, the skipping pattern along sentence boundaries in QDS-Transformer (S) seems more forgiving than the query-oriented attention (Q). Comparing efficiency with and without our TVM implementation, the diagonal sparse shape in Sparse-Transformer is much better optimized.
How to better utilize the advantages of sparsity and structural inductive biases is perhaps a necessary future research direction in an era where models with fewer than one billion parameters are no longer considered large (Brown et al., 2020). Making progress in this direction may require closer collaboration between experts in application, modeling, and infrastructure.

Table 6: Sentences with the highest attention weight from [CLS] in the last layer, per attention head.

Q1: 1037798 (who is robert gray), docid: D3533931
• Heads 1, 2, 4, 6, 9, 10, 11, 12: Robert Gray (title)
• Heads 3, 5, 7, 8: Robert Gray, (born May 10, 1755, Tiverton, R.I. died summer 1806, at sea near eastern U.S. coast), captain of the first U.S. ship to circumnavigate the globe and explorer of the Columbia River.

Q2: 1110199 (what is wifi vs bluetooth), docid: D1325409
• Head 01: Bluetooth's low power consumption makes it useful where power is limited.
• Head 02: Wi-Fi appliances are often plugged into wall outlets to operate.
• Head 07: The extremely low power requirements of the latest Bluetooth 4.0 standard allow wireless connectivity to be added to devices powered only by watch batteries.
• Head 09: A Wi-Fi enabled network relies on a hub.
• Head 10: The advantages of using bluetooth from existing technology.
• Head 11: Wi-Fi is more suited to data-intensive activities such as streaming high-definition movies, while Bluetooth is better suited to tasks such as transferring keyboard strokes to a computer.
• Head 12: The greater power of a Wi-Fi network also means it can move data more quickly than a Bluetooth network.

Learned Attention Weights
This experiment analyzes the learned attention weights in QDS-Transformer, using the approach developed by Clark et al. (2019). Figure 4 illustrates the average maximum attention weights of the three attention patterns used in our model. Interestingly, the model tends to implicitly conduct hierarchical attention learning (Yang et al., 2016): lower layers focus on learning structures and pay more attention to [SOS] tokens, while higher layers emphasize relevance by attending more to query tokens. Attention on both types of tokens is consistently stronger than on the [CLS] token. The model is capturing the inductive biases emphasized by our sparse attention structures. Figure 5 shows the average entropy of the attention weight distributions. Intuitively, lower-layer attention tends to have high entropy, and thus a very broad view over many words, to create contextualized representations. The entropy of query and [SOS] tokens is in general lower, as they focus on capturing information needs and document structures. The entropy of all three token types rises again in the last layer, implying that they may try to aggregate a representation for the whole input.

Case Study on Learned Attention Weights

Table 6 shows a case study of the sentences with the highest attention weight from [CLS] in the last layer for two example queries. For the factoid query Q1, all heads center on precise sentences that can directly answer the query. For Q2, which is on the exploratory side, different attention heads exhibit diverse patterns, focusing on partial evidence that collectively provides a broader understanding. Table 7 depicts another case study, on the learned attention weights from query tokens to sentences. We inspect the third transformer layer, where sentences obtain more attention as shown in Figure 4, to highlight the sentences most significant to each query token. The results show that query-directed attention can capture sentences with different topics matched to individual query tokens, thereby comprehending sophisticated document structure.

These findings suggest that QDS-Transformer has interesting potential to be applied not only to retrieval but also to question answering in NLP, providing a generic and effective framework while also being interpretable through its sparse structural attention connectivity. We further provide an additional case study in Appendix C.

Conclusions
QDS-Transformer improves the efficiency and effectiveness of pretrained transformers in long document ranking using sparse attention structures. The sparsity is designed to capture the principal properties (IR axioms) that are crucial for relevance modeling: local contextualization, document structures, and query-focused matching. In four TREC document ranking tasks with varied settings, QDS-Transformer consistently outperforms competitive baselines that retrofit documents to BERT or use sparse attention not designed for document ranking.
Our experiments demonstrate the promising future of jointly optimizing structural domain knowledge and the efficiency of sparsity, though its current form is still at an early stage. Our analyses also indicate the potential of better interpretability from sparse structures and of more unified models for IR and QA.

Few-shot Document Ranking. All experimental settings for few-shot learning are consistent with the "MS MARCO Human Labels" setting in previous studies (Zhang et al., 2020). Each method first trains a neural ranker on MARCO training labels, which are identical to those in the TREC DL track. The latent representations of the trained models are then used as features for a Coor-Ascent ranker on the low-label datasets, using five-fold cross-validation (Dai and Callan, 2019; Dai et al., 2018) to rerank the top-100 SDM retrieved results (Metzler and Croft, 2007). The standard metrics NDCG@20 and ERR@20 are used to compare the approaches. Results are reported as the average over the 5 test folds, where the remaining 4 folds in each round are used as training and dev queries.
Hyperparameter Settings and Search. We adopt the pretrained model for sparse attention (Beltagy et al., 2020) and fix all hidden dimensions to 768 and the number of transformer layers to 12. BERT-based models use RoBERTa (Liu et al., 2019) as the pretrained model. For hyperparameter tuning, we search the local attention window size w in {32, 64, 128, 256, 512, 1024} on the dev set and determine w = 128. Models are optimized by the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 10^-5, (β1, β2) = (0.9, 0.999), and a dropout rate of 0.1. Under these hyperparameter settings, Table 8 summarizes the parameter counts of our implemented methods, based on model.parameters() in PyTorch.

A.3 Evaluation Scripts
All evaluation measures are computed by the official scripts. For ad-hoc retrieval, we use trec_eval, the standard tool in the TREC community for evaluating ad-hoc retrieval runs; this is also the official setting of the TREC-19 deep learning track. For few-shot document ranking, we use the graded relevance assessment script (gdeval) to measure NDCG and ERR. This setting is consistent with previous studies (Zhang et al., 2020; Dai and Callan, 2019).

Table 9: Case study of the query 833860, showing the query tokens with the highest attention weights in the 10th transformer layer among all heads from the [SOS] tokens of sentences in the document D2944963.

• food: You spear small cubes of bread onto long-stemmed forks and dip them into the hot cheese (taking care not to lose the bread in the fondue).
• food: Jamie Oliver has this easy cheese fondue recipe, and this five-star recipe has good reviews.
• popular: [sentence not preserved in the extracted text]

B Baseline Methods
In this section, we introduce each baseline method. TREC Best Runs.
• bm25tuned prf (Yang and Lin, 2019) fine-tunes the BM25 parameters with pseudo relevance feedback, and is the best BM25-based method among the official runs.
• srchvrs run1 is marked as the best traditional ranking method among the official runs (Craswell et al., 2020).
• TUW19-d3-re (Hofstätter et al., 2019), the best official run that does not use a pretrained language model (non-PLM), utilizes a transformer to encode both the query and the document, measuring interactions between terms to score relevance.
• bm25 expmarcomb (Akkalyoncu Yilmaz et al., 2019) combines sentence-level and document-level relevance scores with a pretrained BERT model.

Classical IR Methods.
• SDM (Metzler and Croft, 2005) is a sequential dependence model that ranks documents based on the theory of probabilistic graphical models. We obtain the ranking results of SDM from previous studies (Dai and Callan, 2019). SDM is not only a baseline method but also provides the candidate documents for reranking in the few-shot learning task.
• Coor-Ascent (Metzler and Croft, 2007) is a linear feature-based ranking model. It is also used as the learning-to-rank model in few-shot learning, trained on the representations produced by the compared methods.

Neural IR Methods.
• CO-PACRR (Hui et al., 2018) utilizes CNNs to model query-document similarity matrices and produces a score using a max-pooling layer.

C Additional Study on Attention Weights
In addition to the attention from the classification token [CLS] and from query tokens shown in Section 6.5, here we analyze the attention from sentences. Table 9 shows the query tokens with the highest attention weights in the 10th transformer layer among all heads from the [SOS] tokens of sentences. Note that the 10th transformer layer assigns higher importance to query tokens, as shown in Figure 4. The results show that QDS-Transformer is capable of directing sentences to the tokens with matched topics, thereby understanding sophisticated document structure with different topics.