Modularized Transformer-based Ranking Framework

Recent innovations in Transformer-based ranking models have advanced the state-of-the-art in information retrieval. However, these Transformers are computationally expensive, and their opaque hidden states make it hard to understand the ranking process. In this work, we modularize the Transformer ranker into separate modules for text representation and interaction. We show how this design enables substantially faster ranking using offline pre-computed representations and lightweight online interactions. The modular design is also easier to interpret and sheds light on the ranking process in Transformer rankers.


Introduction
Neural rankers based on Transformer architectures (Vaswani et al., 2017) fine-tuned from BERT (Devlin et al., 2019) achieve current state-of-the-art (SOTA) ranking effectiveness (Craswell et al., 2019). The power of the Transformer comes from self-attention, the process by which all possible pairs of input tokens interact to understand their connections and contextualize their representations. Self-attention provides detailed, token-level information for matching, which is critical to the effectiveness of Transformer-based rankers (Wu et al., 2019).
When used for ranking, a Transformer ranker takes in the concatenation of a query and document, applies a series of self-attention operations, and outputs from its last layer a relevance prediction. The entire ranker runs like a black box, and its hidden states have no explicit meaning. This is a clear departure from earlier neural ranking models, which keep separate text representation and distance (interaction) functions. Transformer rankers are slow, and the black-box design makes it hard to interpret their behavior.
We hypothesize that a Transformer-based ranker simultaneously performs text representation and query-document interaction as it processes the concatenated pair. Guided by this hypothesis, we decouple representation and interaction with a MOdularized REranking System (MORES). MORES consists of three Transformer modules: the Document Representation Module, the Query Representation Module, and the Interaction Module. The two Representation Modules run independently of each other. The Document Representation Module uses self-attention to embed each document token conditioned on all document tokens. The Query Representation Module embeds each query token conditioned on all query tokens. The Interaction Module performs attention from query representations to document representations to generate match signals and aggregates them through self-attention over query tokens to make a relevance prediction.
By disentangling the Transformer into modules for representation and interaction, MORES can take advantage of the indexing process: while the interaction must be done online, document representations can be computed offline. We further propose two strategies to pre-compute document representations that can be used by the Interaction Module for ranking.
Our experiments on a large supervised ranking dataset demonstrate the effectiveness and efficiency of MORES. It is as effective as a state-of-the-art BERT ranker and can be up to 120x faster at ranking. A domain adaptation experiment shows that the modular design does not affect the model's transfer capability, so MORES can be used in low-resource settings with simple adaptation techniques. By adapting individual modules, we discovered differences between representations and interaction in adaptation. The modular design also makes MORES more interpretable, as shown by our attention analysis, providing new understanding of black-box Transformer rankers.

Related Work
Neural ranking models for IR proposed in previous studies can be generally classified into two groups (Guo et al., 2016): representation-based models, and interaction-based models.
Representation-based models learn latent vectors (embeddings) of queries and documents and use a simple scoring function (e.g., cosine) to measure the relevance between them. Such methods date back to LSI (Deerwester et al., 1990) and classical siamese networks (Bromley et al., 1993). More recent research considered using modern deep learning techniques to learn the representations. Examples include DSSM (Huang et al., 2013), C-DSSM (Shen et al., 2014), etc. Representation-based models are efficient during evaluation because the document representations are independent of the query and can therefore be pre-computed. However, compressing a document into a single low-dimensional vector loses specific term matching signals (Guo et al., 2016). As a result, previous representation-based ranking models mostly failed to outperform interaction-based ones.
Interaction-based models, on the other hand, use a neural network to model the word-level interactions between the query and the document. Examples include DRMM (Guo et al., 2016) and K-NRM (Xiong et al., 2017). Recently, Transformers (Vaswani et al., 2017), especially BERT-based (Devlin et al., 2019) Transformers, have been widely used in information retrieval ranking tasks (Dai and Callan, 2019; Qiao et al., 2019). BERT-based rankers concatenate query and document into a single string and apply self-attention that spans the query and the document in every layer. Rankers using pre-trained Transformers such as BERT have become the current state-of-the-art (Craswell et al., 2019). However, the performance gains come at the computational cost of inferring the many token-level interaction signals at evaluation time, which scales quadratically with the input length. It is an open question whether we can combine the advantages of representation-based and interaction-based approaches.
Little research has studied this direction prior to this work.
There are several research directions aiming to reduce the computational cost of Transformer models. One line of research seeks to compress the big Transformer into smaller ones using model pruning (Voita et al., 2019) or knowledge distillation (Hinton et al., 2015). Another line of research aims to develop new Transformer-like units that have lower complexity than the original Transformer. For example, Child et al. (2019) introduce sparse factorizations of the attention matrix that efficiently compute subsets of it. The focus of this work is an efficient framework to combine Transformers for ranking; all aforementioned techniques can be applied to the individual Transformers within our framework, and are therefore orthogonal to this paper.

Proposed Method
In this section, we introduce the Modularized Reranking System (MORES), explain how MORES can speed up retrieval, and describe how to effectively train and initialize it.

The MORES Framework
A typical Transformer ranker takes in the concatenation of a query qry and a document doc as input. At each layer, the Transformer generates a new contextualized embedding for each token based on its attention to all tokens in the concatenated text. This formulation poses two challenges. First, in terms of speed, the attention consumes time quadratic in the input length. As shown in Table 1, for a query of q tokens and a document of d tokens, the Transformer would require assessing (d + q)^2 pairs of tokens. Second, as query and document attention is entangled from the first layer, it is challenging to interpret the model. MORES aims to address both problems by disentangling the Transformer ranker into document representation, query representation, and interaction, each with a dedicated Transformer, as shown in Figure 1. The document representation is query-agnostic and can be computed off-line. The interaction uses query-to-document attention, which further reduces online complexity. This separation also assigns roles to each module, making the model more transparent and interpretable.
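A back-of-the-envelope sketch of the per-layer attention pair counts may make the complexity difference concrete. The token lengths below are illustrative, not taken from the paper:

```python
# Hypothetical comparison of attention pair counts per layer.
# A full-attention ranker attends over the concatenated query-document
# sequence; MORES only needs query-to-document cross-attention and
# query self-attention online (document self-attention runs offline).

def full_attention_pairs(q: int, d: int) -> int:
    """Token pairs assessed per layer over the concatenated sequence."""
    return (d + q) ** 2

def mores_online_pairs(q: int, d: int) -> int:
    """Online pairs per Interaction Block: q*d cross-attention pairs
    plus q*q query self-attention pairs; the d*d term is precomputed."""
    return q * d + q * q

q_len, d_len = 16, 512
print(full_attention_pairs(q_len, d_len))   # 278784 pairs per layer
print(mores_online_pairs(q_len, d_len))     # 8448 pairs per block
```

For this illustrative query/document length pair, the online attention work per block drops by over 30x, consistent with the paper's observation that document self-attention dominates for long documents.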

The two Representation Modules use Transformer encoders (Vaswani et al., 2017) to embed documents and queries respectively and independently. In particular, for documents,

H_0^doc = lookup(doc)  (1)
H_m^doc = Encoder(H_{m-1}^doc), m = 1, ..., M  (2)

and for queries,

H_0^qry = lookup(qry)  (3)
H_i^qry = Encoder(H_{i-1}^qry), i = 1, ..., N  (4)

where lookup represents word (WordPiece, following BERT) and position embeddings, and Encoder represents a Transformer encoder layer. The query and document Representation Modules can use different numbers of layers; M and N denote the number of layers for document and query representations, respectively. The hidden states from the last layers are used as the Representation Modules' output. Formally, for a document of length d, a query of length q, and model dimension n, let matrix D = H_M^doc ∈ R^{d×n} be the output of the Document Representation Module and Q = H_N^qry ∈ R^{q×n} be the output of the Query Representation Module.
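The key property above is that the two encoders never see each other's input. A minimal shape-level sketch, with random stand-ins for the embedding lookup and encoder layers (not real BERT weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 768          # model dimension (BERT-base sized)
M, N = 12, 10    # illustrative layer counts for document / query encoders

def encoder_layer(h: np.ndarray) -> np.ndarray:
    # Stand-in for a Transformer encoder layer: any shape-preserving map.
    return h + 0.01 * rng.standard_normal(h.shape)

def represent(num_tokens: int, num_layers: int) -> np.ndarray:
    h = rng.standard_normal((num_tokens, n))   # stand-in for lookup(...)
    for _ in range(num_layers):                # H_m = Encoder(H_{m-1})
        h = encoder_layer(h)
    return h                                   # last layer's hidden states

# The two calls share nothing: D can be built offline, Q at query time.
D = represent(num_tokens=512, num_layers=M)    # D in R^{d x n}
Q = represent(num_tokens=16, num_layers=N)     # Q in R^{q x n}
```

Because `represent` for the document takes no query argument at all, its output can be computed once at indexing time and cached, which is the basis of the reuse strategies in Section 3.2.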
The Interaction Module uses the Representation Modules' outputs, Q and D, to make a relevance judgement. The module consists of a stack of Interaction Blocks (IB), a novel attentive block that performs query-to-document cross-attention followed by query self-attention, as shown in Figure 1. Here, we write cross-attention from X to Y as Attend(X, Y), self-attention over X as Attend(X, X), and layer norm as LN. Let

A_k = LN(H_{k-1} + Attend(H_{k-1}, D))  (5)
B_k = LN(A_k + Attend(A_k, A_k))  (6)

with H_0 = Q. Equation 5 models interactions from query tokens to document tokens: each query token attends to the document embeddings in D to produce relevance signals. Then, Equation 6 collects and exchanges signals among query tokens by having the query tokens attend to each other. The output of the k-th Interaction Block is then computed with a feed-forward network (FFN) on the query token embeddings with a residual connection,

H_k = LN(B_k + FFN(B_k))  (7)

We employ multiple Interaction Blocks to iteratively repeat this process and refine the hidden query token representations, modeling multiple rounds of interaction and producing a series of hidden states, while keeping the document representation D unchanged. The Interaction Block (IB) is a core component of MORES. As shown in Table 1, its attention avoids the heavy full attention over the concatenated query-document sequence, i.e. the (d + q)^2 term, saving online computation.
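The cross-attention, self-attention, FFN pattern of an Interaction Block can be sketched in a few lines. This is a single-head, unbatched sketch with random stand-in weights and a tanh stand-in for the FFN activation, not the paper's trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attend(X, Y, Wq, Wk, Wv):
    """Single-head Attend(X, Y): queries come from X, keys/values from Y."""
    q, k, v = X @ Wq, Y @ Wk, Y @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def interaction_block(H, D, params):
    """One IB: cross-attention to D (Eq. 5), query self-attention (Eq. 6),
    then an FFN (Eq. 7), each with residual connection and layer norm."""
    A = layer_norm(H + attend(H, D, *params["cross"]))
    B = layer_norm(A + attend(A, A, *params["self"]))
    return layer_norm(B + np.tanh(B @ params["W1"]) @ params["W2"])

rng = np.random.default_rng(0)
n, q_len, d_len = 64, 16, 128
mk = lambda: [0.05 * rng.standard_normal((n, n)) for _ in range(3)]
params = {"cross": mk(), "self": mk(),
          "W1": 0.05 * rng.standard_normal((n, 4 * n)),
          "W2": 0.05 * rng.standard_normal((4 * n, n))}

Q = rng.standard_normal((q_len, n))
D = rng.standard_normal((d_len, n))
H1 = interaction_block(Q, D, params)   # query-side hidden states
```

Note that only the query-side states H flow from block to block; D is read-only, which is exactly what makes the document-side computation cacheable.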
To induce relevance, we project the [CLS] token's embedding in the last (K-th) IB's output to a scalar score with a learned linear projection.

Pre-Compute and Reuse Representation
MORES's modular design allows us to pre-compute and reuse representations. The Query Representation Module runs once when receiving a new query; the representation is then repeatedly used to rank the candidate documents. More importantly, the document representations can be built offline. We detail two representation reuse strategies.
The first strategy (Reuse-S1) pre-computes the document representations D offline and stores them, so the Document Representation Module never runs at query time.
The Projected Document Representation Reuse Strategy (Reuse-S2) further moves document-related computation performed in the Interaction Module offline. In an IB, the cross-attention operation first projects the document representation D with key and value linear projections (Vaswani et al., 2017),

D_key = D W_k,  D_val = D W_v

where W_k, W_v are the projection matrices. For each IB, Reuse-S2 pre-computes and stores these projected representations. Using Reuse-S2, the Interaction Module no longer needs to compute the document projections at online evaluation time. Reuse-S2 takes more storage: for each IB, both key and value projections of D are stored, meaning that an Interaction Module with l IBs will store 2l projected versions of D.
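The Reuse-S2 precomputation amounts to caching D W_k and D W_v per block. A small sketch with hypothetical random projection matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_len = 64, 128
D = rng.standard_normal((d_len, n))   # Reuse-S1: document reps, built offline

# Per-IB key/value projection matrices (learned in the real model;
# random stand-ins here), for a 2x IB Interaction Module.
ibs = [{"Wk": rng.standard_normal((n, n)), "Wv": rng.standard_normal((n, n))}
       for _ in range(2)]

# Reuse-S2: pre-compute D @ Wk and D @ Wv for every IB and store them,
# so no document-side projection runs at query time.
cache = [(D @ ib["Wk"], D @ ib["Wv"]) for ib in ibs]

# Storage cost: two projected copies of D per IB (2l copies for l IBs).
stored_floats = sum(k.size + v.size for k, v in cache)
```

At query time the Interaction Module looks up `cache[i]` instead of multiplying by `Wk`/`Wv`, trading the extra storage for the removal of the n^2·d online term analyzed below.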
With this extra pre-computation, Reuse-S2 trades storage for further speed-up. Table 1 analyzes the online time complexity of MORES and compares it to that of a standard BERT ranker. We note that MORES can move all document-only computation offline. Reuse-S1 avoids the document self-attention term d^2, which is often the most expensive part due to long document length. Reuse-S2 further removes from online computation the document transformation term n^2·d, which is linear in document length and quadratic in model dimension.

MORES Training and Initialization
MORES needs to learn three Transformers: two Representation Modules and one Interaction Module. The three Transformer modules are coupled during training and decoupled when used. To train MORES, we connect the three Transformers and enforce module coupling with end-to-end training using the pointwise loss function (Dai and Callan, 2019). When training is finished, we store the three Transformer modules separately and apply each module at the desired offline/online time.
We would like to use pre-trained LM weights to ease optimization and improve generalization. However, no existing pre-trained LM involves cross-attention interaction that could be used to initialize the Interaction Module. To avoid expensive pre-training, we introduce BERT weight assisted initialization. We use one copy of BERT weights to initialize the Document Representation Module. We split another copy of BERT weights between the Query Representation and Interaction Modules. For MORES with l IBs, the first 12 - l layers of the BERT weights initialize the Query Representation Module, and the remaining l layers' weights initialize the Interaction Module. This initialization scheme ensures that the Query Representation Module and the IBs use consecutive layers from BERT. As a result, upon initialization, the output of the Query Representation Module and the input of the first IB will live in the same space. In addition, for IBs, the query-to-document attention is initialized with the same BERT attention weights as the query self-attention. In practice, we found this initialization of the query-to-document attention weights important; random initialization leads to substantially worse performance. Details can be found in subsection 4.2.
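The weight split described above can be sketched as plain list slicing. The layer names below are hypothetical placeholders standing in for actual BERT-base layer weights:

```python
# Sketch of BERT-weight-assisted initialization for MORES.
# Strings stand in for per-layer weight tensors of a 12-layer BERT-base.
NUM_BERT_LAYERS = 12

def split_init(num_ibs: int):
    bert_copy_1 = [f"bert_layer_{i}" for i in range(NUM_BERT_LAYERS)]
    bert_copy_2 = [f"bert_layer_{i}" for i in range(NUM_BERT_LAYERS)]

    doc_module = bert_copy_1                                # full copy
    qry_module = bert_copy_2[: NUM_BERT_LAYERS - num_ibs]   # first 12 - l layers
    # Remaining l (consecutive) layers initialize the IBs; each IB's
    # cross-attention is initialized from the same layer's self-attention.
    ib_module = [{"self_attn": w, "cross_attn": w}
                 for w in bert_copy_2[NUM_BERT_LAYERS - num_ibs:]]
    return doc_module, qry_module, ib_module

doc, qry, ibs = split_init(num_ibs=2)
# qry holds layers 0..9; the two IBs take layers 10 and 11.
```

Because the query module and the IBs take consecutive layers of the same BERT copy, the query module's output distribution matches what the first IB's weights expect at initialization.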

Effectiveness and Efficiency in Supervised Ranking
The first experiment compares the effectiveness and efficiency of MORES to a state-of-the-art BERT ranker for supervised ranking.

Setup
We use the MS MARCO passage ranking collection (MS MARCO) (Nguyen et al., 2016) and evaluate on two query sets with distinct characteristics: Dev Queries have a single relevant document with a binary relevance label. Following Nguyen et al. (2016), we used MRR@10 to evaluate the ranking accuracy on this query set. TREC2019 DL Queries is the evaluation set used in the TREC 2019 Deep Learning Track. Its queries have multiple relevant documents with graded relevance.
Following Craswell et al. (2019), we used MRR, NDCG@10, and MAP@1000 as evaluation metrics. All methods were evaluated in a reranking task to re-rank the top 1000 documents of the MS MARCO official BM25 retrieval results.
We test MORES effectiveness with a varying number of Interaction Blocks (IB) to study the effects of varying the complexity of query-document interaction. Models using 1 layer of IB (1x IB) up to 4 layers of IB (4x IB) are tested.
We compare MORES with the BERT ranker, a state-of-the-art ranker fine-tuned from BERT that processes concatenated query-document pairs. Both rankers are trained with the MS MARCO training set, which consists of single-relevance queries. We train MORES on a 2M subset of MS MARCO's training set. We use stochastic gradient descent with a batch size of 128, and the AdamW optimizer with a learning rate of 3e-5, a warm-up of 1000 steps, and a linear learning rate scheduler for all MORES variants. Our baseline BERT model is trained with a similar setup to match previously reported performance; our BERT ranker re-implementation in fact performs better than previously reported. The BERT ranker and all MORES models are implemented with PyTorch (Paszke et al., 2019) based on the huggingface implementation of Transformers.
We aim to test that MORES' accuracy is equivalent to the original BERT ranker (while achieving higher efficiency). To establish equivalence, statistical significance testing was performed with a non-inferiority test, commonly used in the medical field to test that two treatments have similar effectiveness (Jayasinghe et al., 2015). In this test, rather than testing to reject the null hypothesis H0: μ_BERT = μ_MORES, we test to reject H0: μ_BERT - μ_MORES > δ for some small margin δ. By rejecting H0 we accept the alternative hypothesis, which is that any reduction of performance in MORES compared to the original BERT ranker is inconsequential. We set the margin δ to 2% and 5% of the mean of the BERT ranker.
Table 2 reports the accuracy of MORES and the baseline BERT-based ranker. The experiments show that MORES with 1x IB can achieve 95% of BERT performance. MORES with 2x IB can achieve performance comparable to the BERT ranker within a 2% margin. Three IBs do not improve accuracy and four hurt accuracy. We believe this is due to increased optimization difficulties, which outweigh the improved model capacity. Recall that for MORES, each IB has one set of cross-attention weights that is not initialized with real pre-trained weights. Performance results are consistent across the two query sets, showing that MORES can identify strong relevant documents (Dev Queries) and can also generalize to ranking multiple, weaker relevant documents (TREC2019 DL Queries).
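The non-inferiority test can be sketched as a one-sided paired test against the shifted null H0: μ_BERT - μ_MORES > δ. The paper does not specify the exact test statistic, so the sketch below assumes a paired z-test with a normal approximation (reasonable for thousands of queries); the per-query metric vectors are hypothetical:

```python
import math

def noninferiority_p(baseline, candidate, margin_frac=0.02):
    """One-sided paired z-test of H0: mean(baseline) - mean(candidate) > delta,
    with delta = margin_frac * mean(baseline). A small p rejects H0, i.e.
    the candidate is non-inferior within the margin. Normal approximation;
    assumes a large number of queries."""
    n = len(baseline)
    delta = margin_frac * (sum(baseline) / n)
    diffs = [b - c for b, c in zip(baseline, candidate)]
    mean_d = sum(diffs) / n
    var_d = sum((x - mean_d) ** 2 for x in diffs) / (n - 1)
    se = math.sqrt(var_d / n)
    z = (mean_d - delta) / se
    # Left-tail probability Phi(z): small when the observed mean difference
    # sits well below the margin delta.
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Hypothetical per-query MRR@10 values: candidate ~ baseline -> H0 rejected.
base = [0.4, 0.6] * 50
print(noninferiority_p(base, [0.41, 0.59] * 50) < 0.05)   # True: non-inferior
print(noninferiority_p(base, [0.31, 0.49] * 50) < 0.05)   # False: clearly worse
```

Setting `margin_frac` to 0.02 or 0.05 reproduces the 2% and 5% margins used in the paper's significance marks.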

Ranking Effectiveness
The results show that MORES can achieve ranking accuracy competitive with state-of-the-art ranking models, and suggest that the entangled and computationally expensive full-attention Transformer can be replaced by MORES's lightweight, modularized design. We also investigate IB initialization and compare MORES 2x IB initialized by our proposed method (copying BERT's self-attention weights into the IB cross-attention) with random initialization of the cross-attention weights. Table 3 shows that random initialization leads to a substantial drop in performance, likely due to difficulty in optimization.

Ranking Efficiency
Section 3.2 introduces two representation reuse strategies for MORES with different time vs. space trade-offs. This experiment measures MORES' real-time processing speed with these two strategies and compares it with measurements for the BERT ranker. We test MORES 1x IB and MORES 2x IB; additional IB layers incur more computation but do not improve effectiveness, and are hence not considered. We record the average time for ranking one query with 1000 candidate documents on an 8-core CPU and a single GPU (details in Appendix A.1). We measured ranking speed with documents of length 128 and 512 and a fixed query length of 16. Tables 4 (a) and (b) show the speed tests for the two reuse strategies, respectively. We also include the per-document data storage size.
We observe a substantial speedup in MORES compared to the BERT ranker, and the gain is consistent across CPUs and GPUs. The original BERT ranker took hundreds of seconds (several minutes) to generate results for one query on a CPU machine, which is impractical for real-time use. Using Reuse-S1, MORES with 1x IB was 40x faster than the BERT ranker on shorter documents (d = 128); the more accurate 2x IB model also achieved a 20x speedup. The difference is more profound on longer documents: as document length increases, a larger portion of the compute in the BERT ranker is devoted to self-attention over the document sequence, while MORES pre-computes the document representations offline. Reuse-S2, the projected document reuse strategy, further enlarges the gain in speed, leading to up to a 170x speedup using 1x IB and a 120x speedup using 2x IB.
Table 5: Domain adaptation on ClueWeb09-B. adapt-interaction and adapt-representation use MORES 2x IB. * and † indicate non-inferiority (Section 4.1) with p < 0.05 to the BERT ranker using a 5% or 2% margin, respectively.
Recall that Reuse-S2 pre-computes the document projections used in MORES' Interaction Module, which have n^2·d time complexity, where n is the model hidden dimension (see the complexity analysis in Table 1). In practice, n is often large; our experiments used n = 768. Reuse-S2 avoids the expensive n^2·d term at evaluation time. Note that Reuse-S2 does not affect accuracy; it trades space to save more time.

Adaptation of MORES and Modules
The second experiment uses a domain-adaptation setting to investigate whether the modular design of MORES affects its ability to transfer to new domains. Domain adaptation is done by taking a model trained on MS MARCO and fine-tuning it on relevance labels from the target dataset. Due to the small query sets in ClueWeb09-B and Robust04, we use 5-fold cross-validation for fine-tuning and testing. The data split, initial ranking, and document pre-processing follow Dai and Callan (2019). The domain-adaptation fine-tuning procedure uses a batch size of 32 and a learning rate of 5e-6, with other training settings the same as in supervised ranking training.

Full Model Adaptation
The top 5 rows of Table 5 and Table 6 examine the effectiveness of adapting the full model of MORES. The adapted MORES models behave similarly as on MS MARCO: using two to three layers of Interaction Blocks (IB) achieves very close to BERT ranker performance on both datasets for both types of queries while using a single layer of IB is less effective. Importantly, our results show that the modular design of MORES does not hurt domain transfer, indicating that new domains and low resource domains can also use MORES through simple adaptation.

Individual Module Adaptation
With separate representation and interaction components in MORES, we are interested in how each is affected by adaptation. We test two extra adaptation settings on MORES 2x IB: fine-tuning only the Interaction Module on the target domain (adapt-interaction), or only the Representation Modules (adapt-representation). Results are shown in the bottom two rows of Table 5 and Table 6 for the two datasets. We observe that adapting only the Interaction Module to the target domain is less effective than adapting the full model (MORES 2x IB), suggesting that changing the behaviour of interaction is not enough to accommodate language changes across domains. On the other hand, freezing the Interaction Module and fine-tuning only the Representation Modules (adapt-representation) produces performance on par with full-model adaptation. This result shows that domain-specific representations are the more necessary ingredient, while interaction patterns are more general and not totally dependent on representations.

Analysis
The modular design of MORES allows Representation and Interaction to be inspected separately, providing better interpretability than a black-box Transformer ranker. Figure 2 examines attention within MORES for a hard-to-understand query "what is paranoid sc", where "sc" is ambiguous, along with a relevant document "Paranoid schizophrenia is a psychotic disorder. In-depth information on symptoms....". In the Document Representation Module (Figure 2a), we can see that "disorder" uses "psychotic" and "schizophrenia" for contextualization, making itself more specific. In the Query Representation Module (Figure 2b), because the query is short and lacks context, "sc" produces broad but less meaningful attention. The query token "sc" is further contextualized in the Interaction Module (Figure 2c) using information from the document side: "sc" broadly attends to the document tokens in the first IB to disambiguate itself. With the extra context, "sc" is able to correctly attend to "schizophrenia" in the second IB to produce relevance signals (Figure 2d).
This example explains why MORES 1x IB performs worse than MORES with multiple IBs: ambiguous queries need to gather context from the document in the first IB before making relevance estimates in the second. More importantly, the example indicates that query-to-document attention makes two distinct contributions: understanding query tokens with the extra context from the document, and matching query tokens to document tokens, with the former less noticed in the past. We believe MORES can be a useful tool for better interpreting and understanding SOTA black-box neural rankers.

Conclusion
State-of-the-art neural rankers based on the Transformer architecture consider all token pairs in a concatenated query and document sequence. Though effective, they are slow and challenging to interpret. This paper proposes MORES, a modular Transformer ranking framework that decouples ranking into Document Representation, Query Representation, and Interaction. MORES is effective while being efficient and interpretable.
Experiments on a large supervised ranking task show that MORES is as effective as a state-of-the-art BERT ranker. With our proposed document representation pre-computation and reuse methods, MORES can achieve a 120x speedup in online ranking while retaining accuracy. Domain adaptation experiments show that MORES' modular design does not hurt transfer ability, indicating that MORES can be adapted to low-resource domains with simple techniques.
Decoupling representation and interaction provides new understanding of Transformer rankers. The complex full query-document attention in state-of-the-art Transformer rankers can be factored into independent document and query representation, and shallow, lightweight interaction. We further discovered two types of interaction: further query understanding based on the document, and matching query tokens to document tokens for relevance. Moreover, we found that the interaction in ranking is less domain-specific, while the representations need more domain adaptation. These findings provide opportunities for future work towards more efficient and interpretable neural IR.