The Cascade Transformer: an Application for Efficient Answer Sentence Selection

Large transformer-based language models have been shown to be very effective in many classification tasks. However, their computational complexity prevents their use in applications requiring the classification of a large set of candidates. While previous works have investigated approaches to reduce model size, relatively little attention has been paid to techniques to improve batch throughput during inference. In this paper, we introduce the Cascade Transformer, a simple yet effective technique to adapt transformer-based models into a cascade of rankers. Each ranker is used to prune a subset of candidates in a batch, thus dramatically increasing throughput at inference time. Partial encodings from the transformer model are shared among rerankers, providing further speed-up. When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy, as measured on two English Question Answering datasets.


Introduction
Recent research has shown that transformer-based neural networks can greatly advance the state of the art over many natural language processing tasks. Efforts such as BERT, RoBERTa (Liu et al., 2019c), XLNet (Dai et al., 2019), and others have led to major advancements in several NLP subfields. These models are able to approximate syntactic and semantic relations between words and their compounds by pre-training on copious amounts of unlabeled data (Clark et al., 2019; Jawahar et al., 2019). Then, they can easily be applied to different tasks by just fine-tuning them on training data from the target domain/task (Liu et al., 2019a; Peters et al., 2019). The impressive effectiveness of transformer-based neural networks can be partially attributed to their large number of parameters (ranging from 110 million for "base" models to over 8 billion (Shoeybi et al., 2019)); however, this also makes them rather expensive in terms of computation time and resources. Being aware of this problem, the research community has been developing techniques to prune unnecessary network parameters (Lan et al., 2019; Sanh et al., 2019) or optimize the transformer architecture (Zhang et al., 2018; Xiao et al., 2019).
In this paper, we propose a completely different approach for increasing the efficiency of transformer models, which is orthogonal to previous work, and thus can be applied in addition to any of the methods described above. Its main idea is that a large class of NLP problems requires choosing one correct candidate among many. For some applications, this often entails running the model over hundreds or thousands of instances. However, it is well-known that, in many cases, some candidates can be more easily excluded from the optimal solution (Land and Doig, 1960), i.e., they may require less computation. In the case of hierarchical transformer models, this property can be exploited by using a subset of model layers to score a significant portion of candidates, i.e., those that can be more easily excluded from search. Additionally, the hierarchical structure of transformer models intuitively enables the re-use of the computation of lower blocks to feed the upper blocks.
Following the intuition above, this work aims at studying how transformer models can be cascaded to efficiently find the max scoring elements among a large set of candidates. More specifically, the contributions of this paper are as follows. First, we build a sequence of rerankers $SR_N = \{R_1, R_2, \dots, R_N\}$ of different complexity, which process the candidates in a pipeline. Each reranker at position $i$ takes the set of candidates selected by the $(i-1)$-th reranker and provides the top $k_i$ candidates to the reranker at position $i+1$. By requiring that $k_i < k_{i-1}$ $\forall i = 1, \dots, N-1$, this approach allows us to save computation time from the more expensive rerankers by progressively reducing the number of candidates at each step. We build $R_i$ using transformer networks of 4, 6, 8, 10, and 12 blocks from RoBERTa pre-trained models.
Second, we introduce a further optimization on $SR_N$ to increase its efficiency, based on the observation that the models $R_i$ in $SR_N$ process their input independently. In contrast, we propose the Cascade Transformer (CT), a sequence of rerankers built on top of a single transformer model. Rerankers $R_1, \dots, R_N$ are obtained by adding small feedforward classification networks at different transformer block positions; therefore, the partial encodings of the transformer blocks are used both as input to reranker $R_i$ and as input to subsequent transformer encoding blocks. This allows us to efficiently re-use partial results consumed by $R_i$ for rankers $R_{i+1}, \dots, R_N$.
To enable this approach, the parameters of all rerankers must be compatible. Thus, we trained CT in a multi-task learning fashion, alternating the optimization for different $i$, i.e., the layers of $R_i$ are affected by the back-propagation of its loss as well as by the loss of $R_j$, with $j \le i$.
Finally, as a test case for CT, we target Answer Sentence Selection (AS2), a well-known task in the domain of Question Answering (QA). Given a question and a set of sentence candidates (e.g., retrieved by a search engine), this task consists in selecting sentences that correctly answer the question. We tested our approach on two different datasets: (i) ASNQ, recently made available by Garg et al. (2020); and (ii) a benchmark dataset built from a set of anonymized questions asked to Amazon Alexa. Our code, ASNQ split, and models trained on ASNQ are publicly available.$^1$ Our experimental results show that: (i) The selection of different $k_i$ for $SR_N$ determines different trade-off points between efficiency and accuracy. For example, it is possible to reduce the overall computation by 10% with just 1.9% decrease in accuracy. (ii) Most importantly, the CT approach largely improves over SR, reducing the cost by 37% with almost no loss in accuracy. (iii) The rerankers trained through our cascade approach achieve equivalent or better performance than transformer models trained independently. Finally, (iv) our results suggest that CT can be used with other NLP tasks that require candidate ranking, e.g., parsing, summarization, and many other structured prediction tasks.
$^1$ https://github.com/alexa/wqa-cascade-transformers

Related Work
In this section, we first summarize related work for sequential reranking of passages and documents, then we focus on the latest methods for AS2, and finally, we discuss the latest techniques for reducing transformer complexity.
Reranking in QA and IR The approach introduced in this paper is inspired by our previous work (Matsubara et al., 2020); there, we used a fast AS2 neural model to select a subset of instances to be input to a transformer model. This reduced the computation time of the latter up to four times, preserving most accuracy.
Before our paper, the main work on sequential rankers originated from document retrieval research. For example, Wang et al. (2011) formulated and developed a cascade ranking model that improved both top-k ranked effectiveness and retrieval efficiency. Dang et al. (2013) proposed two-stage approaches using a limited set of textual features and a final model trained using a larger set of query- and document-dependent features. Other work focused on quickly identifying a set of good candidate documents that should be passed to the second and further cascades. Gallagher et al. (2019) presented a new general framework for learning an end-to-end cascade of rankers using back-propagation. Asadi and Lin (2013) studied effectiveness/efficiency trade-offs with three candidate generation approaches. While these methods are aligned with our approach, they target document retrieval, which is a very different setting. Further, they only used linear models or simple neural models. Agarwal et al. (2012) focused on AS2, but just applied linear models.
Answer Sentence Selection (AS2) In the last few years, several approaches have been proposed for AS2. For example, Severyn and Moschitti (2015) applied CNN to create question and answer representations, while others proposed interweighted alignment networks (Shen et al., 2017;Tran et al., 2018;Tay et al., 2018). The use of compare and aggregate architectures has also been extensively evaluated (Wang and Jiang, 2016;Bian et al., 2017;Yoon et al., 2019). This family of approaches uses a shallow attention mechanism over the question and answer sentence embeddings. Finally, Tayyar Madabushi et al. (2018) exploited fine-grained question classification to further improve answer selection.
Transformer models have been fine-tuned on several tasks that are closely related to AS2. For example, they were used for machine reading (Yang et al., 2019a; Wang et al., 2019), ad-hoc document retrieval (Yang et al., 2019b; MacAvaney et al., 2019), and semantic understanding tasks to obtain significant improvement over previous neural methods. Recently, Garg et al. (2020) applied transformer models, obtaining an impressive boost of the state of the art for AS2 tasks.
Reducing Transformer Complexity The high computational cost of transformer models prevents their use in many real-world applications. Some proposed solutions rely on leveraging knowledge distillation in the pre-training step, e.g., (Sanh et al., 2019), or use parameter reduction techniques (Lan et al., 2019) to reduce inference cost. However, the effectiveness of these approaches varies depending on the target task they have been applied to. Others have investigated methods to reduce inference latency by modifying how self-attention operates, either during encoding (Guo et al., 2019b) or decoding (Xiao et al., 2019; Zhang et al., 2018). Overall, all these solutions are mostly orthogonal to our approach, as they change the architecture of transformer cells rather than efficiently re-using intermediate results.
With respect to the model architecture, our approach is similar to probing models (Adi et al., 2017; Liu et al., 2019a; Hupkes et al., 2018), as we train classification layers based on partial encoding of the input sequence. However, (i) our intermediate classifiers are an integral part of the model, rather than being trained on frozen partial encodings, and (ii) we use these classifiers not to inspect model properties, but rather to improve inference throughput.
Our approach also shares some similarities with student-teacher (ST) approaches for self-training (Yarowsky, 1995; McClosky et al., 2006). Under this setting, a model is used both as a "teacher" (which makes predictions on unlabeled data to obtain automatic labels) and as a "student" (which learns both from gold standard and automatic labels). In recent years, many variants of ST have been proposed, including treating teacher predictions as soft labels (Hinton et al., 2015), masking part of the label (Clark et al., 2018), or using multiple modules for the teacher (Zhou and Li, 2005; Ruder and Plank, 2018). Unlike classic ST approaches, we do not aim at improving the teacher models or creating efficient students; instead, we trained models to be used as sequential ranking components. This may be seen as a generalization of the ST approach, where the student needs to learn a simpler task than the teacher. However, our approach is significantly different from the traditional ST setting, which our preliminary investigation showed to be not very effective.

Preliminaries and Task Definition
We first formalize the problem of selecting the most likely element in a set as a reranking problem; then, we define sequential reranking (SR); finally, we contextualize the AS2 task in this framework.

Max Element Selection
In general, a large class of NLP (and other) problems can be formulated as a max element selection task: given a query $q$ and a set of candidates $A = \{a_1, \dots, a_n\}$, select an $a_j$ that is an optimal element for $q$. We can model the task as a selector function $\pi : Q \times P(A) \to A$, defined as $\pi(q, A) = a_j$, where $P(A)$ is the powerset of $A$, $j = \operatorname{argmax}_i\, p(q, a_i)$, and $p(q, a_i)$ is the probability that $a_i$ is the required element. $p(q, a_i)$ can be estimated using a neural network model. In the case of transformers, said model can be optimized using a point-wise loss, i.e., we only use the target candidate to generate the selection probability. Pairwise or list-wise approaches can still be used (Bian et al., 2017), but (i) they would not change the findings of our study, and (ii) point-wise methods have been shown to achieve competitive performance in the case of transformer models.

Search with Sequential Reranking (SR)
Assuming that no heuristics are available to preselect a subset of most-likely candidates, max element selection requires evaluating each sample using a relevance estimator. Instead of a single estimator, it is often more efficient to use a sequence of rerankers to progressively reduce the number of candidates.
We define a reranker as a function $R : Q \times P(A) \to P(A)$, which takes a subset $\Sigma \subseteq A$ and returns a set of elements, $R(q, \Sigma) = \{a_{i_1}, \dots, a_{i_k}\} \subset \Sigma$ of size $k$, with the highest probability to be relevant to the query. That is, $p(q, a) \ge p(q, b)$ for all $a \in R(q, \Sigma)$ and all $b \in \Sigma \setminus R(q, \Sigma)$. Given a sequence of rerankers sorted in terms of computational efficiency, $(R_1, R_2, \dots, R_N)$, we assume that the ranking accuracy, $\mathcal{A}$ (e.g., in terms of MAP and MRR), increases in reverse order of the efficiency, i.e., $\mathcal{A}(R_j) > \mathcal{A}(R_i)$ iff $j > i$. Then, we define a Sequential Reranker of order $N$ as the composition of $N$ rerankers: $SR_N(A) = R_N \circ R_{N-1} \circ \dots \circ R_1(A)$. Each $R_i$ is associated with a different $k_i = |R_i(\cdot)|$, the number of elements the reranker returns. Depending on the values of $k_i$, SR models with different trade-offs between accuracy and efficiency can be obtained.
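The definitions above can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: the scorers `cheap` and `costly` are hypothetical toy functions standing in for estimators of $p(q, a)$, and the values of $k_i$ are chosen only for the example.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, candidate) to an estimate of p(q, a).
Scorer = Callable[[str, str], float]


def rerank(scorer: Scorer, query: str, candidates: Sequence[str], k: int) -> List[str]:
    """R(q, Sigma): keep the k candidates with the highest estimated p(q, a)."""
    ranked = sorted(candidates, key=lambda a: scorer(query, a), reverse=True)
    return ranked[:k]


def sequential_rerank(stages: Sequence[Tuple[Scorer, int]], query: str,
                      candidates: Sequence[str]) -> List[str]:
    """SR_N = R_N o ... o R_1: apply rerankers from cheapest to most expensive."""
    survivors = list(candidates)
    for scorer, k in stages:
        survivors = rerank(scorer, query, survivors, k)
    return survivors


def select_max(stages: Sequence[Tuple[Scorer, int]], query: str,
               candidates: Sequence[str]) -> str:
    """pi(q, A): the top candidate returned by the last reranker."""
    return sequential_rerank(stages, query, candidates)[0]


# Hypothetical toy scorers (assumptions for illustration only).
def cheap(q: str, a: str) -> float:
    """Number of distinct words shared by question and candidate."""
    return float(len(set(q.lower().split()) & set(a.lower().split())))


def costly(q: str, a: str) -> float:
    """Word overlap normalised by candidate length."""
    return cheap(q, a) / len(a.lower().split())


candidates = [
    "the sky is blue",
    "grass is green",
    "the sky is blue because of scattering in the air",
    "bananas",
]
# R_1 keeps k_1 = 2 candidates, R_2 keeps k_2 = 1.
best = select_max([(cheap, 2), (costly, 1)], "why is the sky blue", candidates)
```

Only the two candidates kept by the cheap scorer are ever seen by the costly one, which is where the saving of a sequential reranker comes from.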

AS2 Definition
The definition of AS2 directly follows from the definition of element selection of Section 3.1, where the query is a natural language question and the elements are answer sentence candidates retrieved with any approach, e.g., using a search engine.

SR with transformers
In this section, we explain how to exploit the hierarchical architecture of a traditional transformer model to build an SR model. First, we briefly recap how traditional transformer models (we refer to them as "monolithic") are used for sequence classification (Section 4.1), and how to derive a set of sequential rerankers from a pre-trained transformer model (Section 4.2). Then, we introduce our Cascade Transformer (CT) model, an SR model that efficiently uses partial encodings of its input to build a set of sequential rerankers $R_i$ (Section 4.3). Finally, we explain how such a model is trained and used for inference in Sections 4.3.1 and 4.3.2, respectively.

Monolithic Transformer Models
We first briefly describe the use of transformer models for sequence classification. We call them monolithic as, for all input samples, the computation flows from the first until the last of their layers.
Let $T = \{E; L_1, L_2, \dots, L_n\}$ be a standard stacked transformer model (Vaswani et al., 2017), where $E$ is the embedding layer, and $L_i$ are the transformer layers generating contextualized representations for an input sequence; $n$ is typically referred to as the depth of the encoder, i.e., the number of layers. Typical values for $n$ range from 12 to 24, although more recent works have experimented with up to 72 layers (Shoeybi et al., 2019).
[Figure 1 (caption, partially recovered): In this example, drop rate $\alpha = 0.4$ causes sample $X_3$ to be removed by partial classifier $C_{\rho(i)}$.]
$T$ can be pre-trained on large amounts of unlabeled text using a masked (Liu et al., 2019c) or autoregressive (Yang et al., 2019c) language modeling objective. Pre-trained language models are fine-tuned for the target tasks using additional layers and data, e.g., a fully connected layer is typically stacked on top of $T$ to obtain a sentence classifier. Formally, given a sequence of input symbols$^5$, $X = \{x_0, x_1, \dots, x_m\}$, an encoding $H = \{h_0, h_1, \dots, h_m\}$ is first obtained by recursively applying the layers $L_i$ to the input: $H_0 = E(X)$, $H_i = L_i(H_{i-1})$ $\forall i = 1, \dots, n$, where $H = H_n$. Then, the encoding of the first symbol of the input sequence$^6$ is fed into a sequence of dense feedforward layers $D$ to obtain a final output score, i.e., $y = D(h_0)$. $D$ is fine-tuned together with the entire model on a task-specific dataset (a set of question and candidate answer pairs, in our case).
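The recursion described above can be written down in a few lines. The sketch below is purely illustrative: `toy_embed`, `toy_layer`, and the identity head are hypothetical stand-ins for $E$, $L_i$, and $D$, and operate on one-dimensional vectors rather than real transformer states.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoding = List[Vector]                 # one vector per input symbol
Layer = Callable[[Encoding], Encoding]


def encode(embed: Callable[[Sequence[str]], Encoding], layers: Sequence[Layer],
           tokens: Sequence[str]) -> Encoding:
    """H_0 = E(X); H_i = L_i(H_{i-1}) for i = 1..n; returns H = H_n."""
    hidden = embed(tokens)
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def classify(embed: Callable[[Sequence[str]], Encoding], layers: Sequence[Layer],
             head: Callable[[Vector], float], tokens: Sequence[str]) -> float:
    """y = D(h_0): score the sequence from the encoding of its first symbol."""
    return head(encode(embed, layers, tokens)[0])


# Hypothetical toy components (assumptions, not a real transformer).
def toy_embed(tokens: Sequence[str]) -> Encoding:
    return [[float(len(t))] for t in tokens]


def toy_layer(hidden: Encoding) -> Encoding:
    """Mix every position with the sequence mean, a crude stand-in for attention."""
    mean = sum(h[0] for h in hidden) / len(hidden)
    return [[0.5 * h[0] + 0.5 * mean] for h in hidden]


# A "monolithic" model always runs all n = 12 layers before scoring.
score = classify(toy_embed, [toy_layer] * 12, lambda h: h[0],
                 ["<s>", "why", "is", "</s>"])
```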

Transformer-based Sequential Reranker (SR) Models
Monolithic transformers can be easily modified or combined to build a sequence of rerankers as described in Section 3.2. In our case, we adapt an existing monolithic $T$ to obtain a sequence of $N$ rerankers $R_i$. Each $R_i$ consists of encoders from $T$ up to layer $\rho(i)$, followed by a classification layer $D_i$, i.e., $R_i = \{E; L_1, \dots, L_{\rho(i)}, D_i\}$. For a sequence of input symbols $X$, all rerankers in the sequence are designed to predict $p(q, a)$, which we indicate as $R_i(X) = y_{\rho(i)}$. All rerankers in $SR_N$ are trained independently on the target data.
In our experiments, we obtained the best performance by setting $N = 5$ and using the following formula to determine the architecture of each reranker $R_i$: $\rho(i) = 4 + 2 \cdot (i - 1)$ $\forall i \in \{1, \dots, 5\}$. In other words, we assemble sequential reranker $SR_5$ using five rerankers built with transformer models of 4, 6, 8, 10 and 12 layers, respectively. This choice is due to the fact that our experimental results seem to indicate that the information in layers 1 to 3 is not structured enough to achieve satisfactory classification performance for our task. This observation is in line with recent works on the effectiveness of partial encoders for semantic tasks similar to AS2 (Peters et al., 2019).

Cascade Transformer (CT) Models
During inference, monolithic transformer models evaluate a sequence $X$ through the entire computation graph to obtain the classification scores $Y$. This means that when using $SR_N$, examples are processed multiple times by similar layers for different $R_i$, e.g., for $i = 1$, all $R_i$ compute the same operations of the first $\rho(i)$ transformer layers; for $i = 2$, $N - 1$ rerankers compute the same $\rho(i) - \rho(i - 1)$ layers, and so on. A more computationally-efficient approach is to share all the common transformer blocks between the different rerankers in $SR_N$.
$^5$ [...] order for the model to distinguish between the two, a special token such as "[SEP]" or "</s>" is used. Some models also use a second embedding layer to represent which sequence each symbol comes from.
$^6$ Before being processed by a transformer model, sequences are typically prefixed by a start symbol, such as "[CLS]" or "<s>". This allows transformer models to accumulate knowledge about the entire sequence at this position without compromising token-specific representations.
We speed up this computation by using one transformer encoder to implement all required $R_i$. This can be easily obtained by adding a classification layer $C_{\rho(i)}$ after each $\rho(i)$ layers (see Figure 1). Consequently, given a sample $X$, each classifier $C_{\rho(i)}$ produces a score $y_{\rho(i)}$ using only a partial encoding. To build a CT model, we use each $C_{\rho(i)}$ to build reranker $R_i$, and select the top $k_i$ candidates to score with the subsequent reranker $R_{i+1}$. We use the same setting choices of $N$ and $\rho(i)$ described in Section 4.2.
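To make the saving from shared blocks concrete, the short sketch below counts transformer-layer evaluations for a single candidate that survives every stage. It is a back-of-the-envelope illustration that ignores the cost of embeddings and classification heads.

```python
def rho(i: int) -> int:
    """Depth of the i-th reranker: rho(i) = 4 + 2 * (i - 1)."""
    return 4 + 2 * (i - 1)


N = 5
depths = [rho(i) for i in range(1, N + 1)]

# SR_N: every R_i re-encodes the candidate from the embedding layer.
sr_layer_evaluations = sum(depths)

# CT: one shared encoder, so each layer is evaluated exactly once.
ct_layer_evaluations = depths[-1]

print(depths, sr_layer_evaluations, ct_layer_evaluations)
```

For a candidate that reaches the last reranker, independent rerankers evaluate 40 layers in total, against 12 for a shared encoder; candidates dropped at the first stage cost 4 layers in both settings.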
Finally, we observed the best performance when all encodings in $H_{\rho(i)}$ are used as input to partial classifier $C_{\rho(i)}$, rather than just the partial encoding of the classification token $h_{\rho(i),0}$. Therefore, we use their average to obtain the score $y_{\rho(i)} = C_{\rho(i)}\left(\frac{1}{m} \sum_{l=1,\dots,m} h_{\rho(i),l}\right)$. In line with Kovaleva et al. (2019), we hypothesize that, at lower encoding layers, long dependencies might not be properly accounted for in $h_{\rho(i),0}$. However, in our experiments, we found no benefits in further parametrizing this operation, e.g., by either using more complex networks or weighting the average operation.
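The averaging step can be sketched as follows. The sketch follows the index range of the formula as written ($l = 1, \dots, m$), so it skips position 0; whether the classification token should also be included is an assumption left open here, and `classifier` is a hypothetical stand-in for $C_{\rho(i)}$.

```python
from typing import Callable, List

Vector = List[float]


def mean_pool(encodings: List[Vector]) -> Vector:
    """Average a list of per-symbol vectors into a single vector."""
    m = len(encodings)
    dim = len(encodings[0])
    return [sum(vec[d] for vec in encodings) / m for d in range(dim)]


def partial_score(classifier: Callable[[Vector], float],
                  partial_encoding: List[Vector]) -> float:
    """y = C((1/m) * sum_{l=1..m} h_l), pooling positions 1..m of H_{rho(i)}."""
    return classifier(mean_pool(partial_encoding[1:]))


# Toy usage: three positions, two dimensions, and a classifier that sums its input.
example = partial_score(lambda v: sum(v), [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]])
```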

Training CT
The training of the proposed model is conducted in a multi-task fashion. For every mini-batch, we randomly sample one of the rankers $R_i$ (including the final output ranker), calculate its loss against the target labels, and back-propagate its loss throughout the entire model down to the embedding layers. We experimented with several more complex sampling strategies, including a round-robin selection process and a parametrized bias towards early rankers for the first few epochs, but we ultimately found that uniform sampling works best. We also empirically determined that, for all classifiers $C_{\rho(i)}$, backpropagating the loss to the input embeddings, as opposed to stopping it at layer $\rho(i-1)$, is crucial to ensure convergence. A possible explanation is that enabling each classifier to influence the input representation during backpropagation ensures that later rerankers are more robust against variance in partial encodings induced by early classifiers. We experimentally found that if the gradient does not flow throughout the different blocks, the development set performance for later classifiers drops when early classifiers start converging.
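The sampling scheme described above can be sketched without any deep learning framework. The snippet below only shows which classifier is selected for each mini-batch and which layers its loss reaches; the forward and backward passes themselves are omitted, and all function names are hypothetical.

```python
import random
from typing import List

EXIT_LAYERS = [4, 6, 8, 10, 12]   # rho(i) for i = 1..5


def sample_exit(rng: random.Random) -> int:
    """Pick, uniformly at random, the classifier trained on this mini-batch."""
    return rng.choice(EXIT_LAYERS)


def layers_receiving_gradient(exit_layer: int) -> List[int]:
    """The loss of the sampled classifier flows through layers exit_layer..1
    (and on to the embeddings), never through layers above the sampled exit."""
    return list(range(1, exit_layer + 1))


def training_schedule(num_batches: int, seed: int = 0) -> List[int]:
    """Exit layer chosen for each mini-batch of a (hypothetical) training run."""
    rng = random.Random(seed)
    return [sample_exit(rng) for _ in range(num_batches)]


schedule = training_schedule(1000)
```

Under this scheme the lowest layers receive a gradient on every mini-batch, while the top two layers are updated only when the final classifier is sampled.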

Inference
Recall that we are interested in speeding up inference for classification tasks such as answer selection, where hundreds of candidates are associated with each question. Therefore, we can assume without loss of generality that each batch of samples $B = \{X_1, \dots, X_b\}$ contains candidate answers for the same question. We use our partial classifiers to throw away a fraction $\alpha$ of candidates, to increase throughput. That is, we discard $k_i = \lfloor \alpha \cdot k_{i-1} \rfloor$ candidates, where $\lfloor \cdot \rfloor$ rounds $\alpha \cdot k_{i-1}$ down to the closest integer.
For instance, let $\alpha = 0.3$ and batch size $b = 128$; further, recall that, in our experiments, a CT consists of 5 cascade rerankers. Then, after layer 4, the size of the batch gets reduced to 90 ($\lfloor 0.3 \cdot 128 \rfloor = 38$ candidates are discarded by the first classifier). After the second classifier (layer 6), $\lfloor 0.3 \cdot 90 \rfloor = 27$ examples are further removed, for an effective batch size of 63. By layer 12, only 31 samples are left, i.e., the number of instances scored by the final classifier is reduced by more than 4 times.
Our approach has the effect of improving the throughput of a transformer model by reducing the average batch size during inference: the throughput of any neural model is capped by the maximum number of examples it can process in parallel (i.e., the size of each batch), and said number is usually bounded by the amount of memory available to the model (e.g., RAM on GPU). The monolithic models have a constant batch size at inference; however, because the batch size for a cascade model varies while processing a batch, we can size our network with respect to its average batch size, thus increasing the number of samples we initially have in a batch. In the example above, suppose that the hardware requirement dictates a maximum batch size of 84 for the monolithic model. As the average batch size for the cascading model is $(4 \cdot 128 + 2 \cdot 90 + 2 \cdot 63 + 2 \cdot 44 + 2 \cdot 28)/12 = 80.2 < 84$, we can process a batch of 128 instances without violating memory constraints, increasing throughput by 52%.
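The arithmetic of this example can be checked with a short sketch. It assumes that every partial classifier discards exactly $\lfloor \alpha \cdot k \rfloor$ of the $k$ candidates it receives and that the final classifier discards none; under this strict reading the last two batch sizes come out as 45 and 32, slightly above the figures quoted in the text, which suggests a slightly different rounding convention was used there.

```python
from fractions import Fraction
from math import floor
from typing import List

EXIT_LAYERS = [4, 6, 8, 10, 12]


def batch_sizes(initial: int, alpha: str) -> List[int]:
    """Batch size seen by each block of layers when every partial classifier
    (all exits but the last) discards floor(alpha * size) candidates."""
    rate = Fraction(alpha)            # exact arithmetic, e.g. Fraction("0.3")
    sizes = [initial]
    for _ in EXIT_LAYERS[:-1]:
        sizes.append(sizes[-1] - floor(rate * sizes[-1]))
    return sizes


def average_batch_size(sizes: List[int]) -> float:
    """Average over the 12 layers, weighting each size by the layers run at it."""
    spans = [EXIT_LAYERS[0]] + [b - a for a, b in zip(EXIT_LAYERS, EXIT_LAYERS[1:])]
    return sum(n * s for n, s in zip(spans, sizes)) / EXIT_LAYERS[-1]


strict = batch_sizes(128, "0.3")
quoted = average_batch_size([128, 90, 63, 44, 28])   # sizes used in the text
gain = 128 / 84 - 1                                  # initial batch vs. monolithic cap
```

With the batch sizes quoted in the text the average is about 80.2, and moving from a batch of 84 to a batch of 128 corresponds to a gain of about 52%.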
We remark that using a fixed $\alpha$ is crucial to obtain the performance gains we described: if we were to employ a score-based thresholding approach (that is, discard all candidates with score below a given threshold), we could not determine the size of batches throughout the cascade, thus making it impossible to efficiently scale our system. On the other hand, we note that nothing in our implementation prevents potentially correct candidates from being dropped when using CT. However, as we will show in Section 5, an opportune choice of a threshold and good accuracy of early classifiers ensure high probability of having at least one positive example in the candidate set for the last classifier of the cascade.
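The inference procedure of this section can be summarized with the following sketch. It is a simplified illustration rather than our released implementation: encodings are plain lists of floats, `layers` and `heads` are hypothetical callables standing in for transformer blocks and partial classifiers, and pruning uses the fixed drop rate $\alpha$ so that batch sizes are known in advance.

```python
from math import floor
from typing import Callable, Dict, List, Sequence, Tuple

State = List[float]                     # toy stand-in for a partial encoding
Layer = Callable[[State], State]
Head = Callable[[State], float]


def cascade_inference(states: Sequence[State], layers: Sequence[Layer],
                      heads: Dict[int, Head], alpha: float) -> List[Tuple[int, float]]:
    """Run a batch through one shared encoder, pruning after each partial classifier.

    heads maps an exit depth rho(i) to its classifier; the deepest head scores
    the survivors and prunes nothing. Returns (candidate index, final score)
    pairs sorted by decreasing score.
    """
    alive = list(range(len(states)))    # indices of candidates still in the batch
    current = [list(s) for s in states]
    last_exit = max(heads)
    scores: List[float] = []
    for depth, layer in enumerate(layers, start=1):
        current = [layer(h) for h in current]      # computed once, reused by later exits
        if depth not in heads:
            continue
        scores = [heads[depth](h) for h in current]
        if depth == last_exit:
            break
        # Fixed-rate drop (not a score threshold); round() guards against float noise.
        keep = len(alive) - floor(round(alpha * len(alive), 9))
        order = sorted(range(len(alive)), key=lambda j: scores[j], reverse=True)[:keep]
        order.sort()                                # preserve batch order
        alive = [alive[j] for j in order]
        current = [current[j] for j in order]
        scores = [scores[j] for j in order]
    return sorted(zip(alive, scores), key=lambda pair: pair[1], reverse=True)


# Toy usage: 8 candidates, 12 layers that add 1.0, heads that read the state.
toy_layers = [lambda h: [h[0] + 1.0]] * 12
toy_heads = {d: (lambda h: h[0]) for d in (4, 6, 8, 10, 12)}
result = cascade_inference([[float(i)] for i in range(8)], toy_layers, toy_heads, 0.5)
```

In the toy run with $\alpha = 0.5$, the batch shrinks from 8 to 4, 2, and 1 candidates, and only the surviving candidate is scored by the final classifier.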

Experiments
We present three sets of experiments designed to evaluate CT. In the first (Section 5.3), we show that our proposed approach without any selection produces comparable or superior results with respect to the state of the art of AS2, thanks to its stability properties; in the second (Section 5.4), we compare our Cascade Transformer with a vanilla transformer, as well as a sequence of transformer models trained independently; finally, in the third (Section 5.5), we explore the tuning of the drop ratio, α.

Datasets
TRECQA & WikiQA Traditional benchmarks used for AS2, such as TRECQA (Wang et al., 2007) and WikiQA (Yang et al., 2015), typically contain a limited number of candidates for each question. Therefore, while they are very useful to compare accuracy of AS2 systems with the state of the art, they do not enable testing large scale passage reranking, i.e., inference on hundreds or thousands of answer candidates. Therefore, we evaluated our approach (Sec. 4.3) on two datasets: ASNQ, which is publicly available, and our GPD dataset. We still leverage TRECQA and WikiQA to show that our cascade system has comparable performance to state-of-the-art transformer models when no filtering is applied.
ASNQ The Answer Sentence Natural Questions dataset (Garg et al., 2020) is a large collection (23M samples) of question-answer pairs, which is two orders of magnitude larger than most public AS2 datasets. It was obtained by extracting sentence candidates from the Google Natural Question (NQ) benchmark (Kwiatkowski et al., 2019). Samples in NQ consist of tuples $\langle$question, answer$_{long}$, answer$_{short}$, label$\rangle$, where answer$_{long}$ contains multiple sentences, answer$_{short}$ is a fragment of a sentence, and label is a binary value indicating whether answer$_{long}$ is correct. The positive samples were obtained by extracting sentences from answer$_{long}$ that contain answer$_{short}$; all other sentences are labeled as negative. The original release of ASNQ$^7$ only contains train and development splits; we further split the dev. set to have both dev. and test sets.
GPD The General Purpose Dataset is part of our efforts to study large scale web QA and evaluate performance of AS2 systems. We built GPD using a search engine to retrieve up to 100 candidate documents for a set of given questions. Then, we extracted all candidate sentences from such documents, and ranked them using a vanilla transformer model, such as the one described in Sec. 4.1. Finally, the top 100 ranked sentences were manually annotated as correct or incorrect answers.
We measure the accuracy of our approach on ASNQ and GPD using four metrics: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Precision at 1 of ranked candidates (P@1), and Normalized Discounted Cumulative Gain at 10 of retrieved candidates (nDCG@10). While the first two metrics capture the overall system performance, the latter two are better suited to evaluate systems with many candidates, as they focus more on Precision. For WikiQA and TRECQA, we use MAP and MRR.
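For reference, the four metrics can be computed per question as in the sketch below, where `labels` holds the binary relevance of the candidates in ranked order. These are the standard textbook definitions; details such as the treatment of questions without any positive candidate are assumptions here and may differ from the evaluation scripts actually used. MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all questions.

```python
from math import log2
from typing import Sequence


def precision_at_1(labels: Sequence[int]) -> float:
    """1.0 if the top-ranked candidate is correct, else 0.0."""
    return float(labels[0]) if labels else 0.0


def reciprocal_rank(labels: Sequence[int]) -> float:
    """1 / rank of the first correct candidate (0.0 if there is none)."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def average_precision(labels: Sequence[int]) -> float:
    """Mean of precision@rank taken at the rank of every correct candidate."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_k(labels: Sequence[int], k: int = 10) -> float:
    """DCG of the top k candidates divided by the DCG of an ideal ordering."""
    dcg = sum(l / log2(r + 1) for r, l in enumerate(labels[:k], start=1))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(l / log2(r + 1) for r, l in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


ranked_labels = [0, 1, 0, 1]   # toy ranking: correct answers at ranks 2 and 4
```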

Models and Training
Our models are fine-tuned starting from a pretrained RoBERTa encoder (Liu et al., 2019c). We chose this transformer model over others due to its strong performance on answer selection tasks (Garg et al., 2020). Specifically, we use the BASE variant (768-dimensional embeddings, 12 layers, 12 heads, and 3072 hidden units), as it is more appropriate for efficient classification. When applicable, we fine-tune our models using the two-step "transfer and adapt" (TANDA) technique introduced by Garg et al. (2020). As mentioned in Section 4.3, we optimize our model in a multi-task setting; that is, for each mini-batch, we randomly sample one of the output layers of the CT classifiers to backpropagate its loss to all layers below.
$^7$ https://github.com/alexa/wqa_tanda
[Table 2 (fragment; column headers and most rows were not recovered): (Wang and Jiang, 2016): 74.3, 75.4, -, -. CA2 (Yoon et al., 2019): 83.4, 84.8, 87.5, 94.0. TANDA$_{\text{BASE}}$ (Garg et al., 2020): values not recovered. Caption: With the exception of a 4-layer transformer, both the partial and final classifiers from CT achieve comparable or better performance than state-of-the-art models.]
While we evaluated different sampling techniques, we found that a simple uniform distribution is sufficient and allows the model to converge quickly.
Our models are optimized using Adam (Kingma and Ba, 2014) with a triangular learning rate (Smith, 2017), a 4,000-update ramp-up, and a peak learning rate $l_r$ = 1e-6. Batch size was set to up to 2,000 tokens per mini-batch for CT models. For the partial and final classifiers, we use 3-layer feedforward modules with 768 hidden units and tanh activation function. Like the original BERT implementation, we use a dropout value of 0.1 on all dense and attention layers. We implemented our system using MxNet 1.5 (Chen et al., 2015) and GluonNLP 0.8.1 (Guo et al., 2019a).
Table 3: Comparison of Cascade Transformers with other models on the ASNQ and GPD datasets. "Monolithic transformer" refers to a single transformer model trained independently; "sequential ranker" (SR) is a sequence of monolithic transformer models of size 4, 6, . . . , 12 trained independently; and "Cascade Transformer" (CT) is the approach we propose. This can train models that equal or outperform the state of the art when no drop is applied (i.e., $\alpha = 0.0$); with drop, they obtain the same performance with 37% to 51% fewer operations.
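The learning-rate schedule mentioned in the training setup can be sketched as a single triangle: a linear ramp-up to the peak followed by a linear decay. The total number of updates is not stated in this section, so `total_steps` is a hypothetical parameter, and decaying all the way to zero is likewise an assumption.

```python
def triangular_lr(step: int, total_steps: int, warmup_steps: int = 4000,
                  peak_lr: float = 1e-6) -> float:
    """Linear ramp-up to peak_lr over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical run of 20,000 updates: sample the schedule at a few points.
samples = [triangular_lr(s, total_steps=20000) for s in (0, 2000, 4000, 12000, 20000)]
```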

Stability Results of Cascade Training
In order to better assess how our training strategy for CT models compares with a monolithic transformer, we evaluated the performance of our system on two well-known AS2 datasets, WikiQA and TRECQA. The results of these experiments are presented in Table 2. Note that, in this case, we are not applying any drop to our cascade classifier, as it is not necessary on these datasets: all sentences fit comfortably in one mini-batch (see dataset statistics in Table 1), so we would not observe any advantage in pruning candidates. Instead, we focus on evaluating how our training strategy affects the performance of partial and final classifiers of a CT model. Our experiment shows that classifiers in a CT model achieve competitive performance with respect to the state of the art: our 12-layer transformer model trained in cascade outperforms TANDA$_{\text{BASE}}$ by 0.8 and 0.9 absolute points in MAP (0.9 and 0.7 in MRR). The 10-, 8-, and 6-layer models are equally comparable, differing at most by 2.3 absolute MAP points on WikiQA, and outscoring TANDA by up to 11.2 absolute MAP points on TRECQA. However, we observed meaningful differences between the performance of the 4-layer cascade model and its monolithic counterpart. We hypothesize that this is due to the fact that lower layers are not typically well suited for classification when used as part of a larger model (Peters et al., 2019); this observation is reinforced by the fact that the 4-layer TANDA model shown in Table 2 takes four times the number of iterations of any other model to converge to a local optimum.
Overall, these experiments show that our training strategy is not only effective for CT models, but can also produce smaller transformer models with good accuracy without separate fine-tuning.

Results on Effectiveness of Cascading
The main results for our CT approach are presented in Table 3: we compared it with (i) a state-of-the-art monolithic transformer (TANDA$_{\text{BASE}}$), (ii) smaller, monolithic transformer models with 4-10 layers, and (iii) a sequential ranker (SR) consisting of 5 monolithic transformer models with 4, 6, 8, 10 and 12 layers trained independently. For CT, we report performance of each classifier individually (layers 4 up to 12, which is equivalent to a full transformer model). We test SR and CT with drop ratios of 30%, 40%, and 50%. Finally, for each model, we report the relative cost per batch compared to a base transformer model with 12 layers.
Overall, we observed that our cascade models are competitive with monolithic transformers on both ASNQ and GPD datasets. In particular, when no selection is applied ($\alpha = 0.0$), a 12-layer cascade model performs as well as or better than TANDA$_{\text{BASE}}$: on ASNQ, we improve P@1 by 2.1% (53.2 vs 52.1), and MAP by 1.2% (66.3 vs 65.5); on GPD, we achieve the same P@1 (67.5), and a slightly lower MAP (57.8 vs 58.0). This indicates that, despite the multitasking setup, our method is competitive with the state of the art.
A drop rate $\alpha > 0.0$ produces, at most, a small degradation in accuracy, while significantly reducing the number of operations per batch (−37%). In particular, when $\alpha = 0.3$, we achieve less than 2% drop in P@1 on GPD, when compared to TANDA$_{\text{BASE}}$; on ASNQ, we slightly improve over it (52.9 vs 52.1). We observe a more pronounced drop in performance for MAP; this is to be expected, as intermediate classification layers are designed to drop a significant number of candidates.
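A rough way to relate the drop rate to the relative cost per batch is to count layer-by-candidate evaluations. The sketch below does this under the assumptions of a batch of 128 candidates and strict floor rounding, and it ignores classifier heads and embeddings, so it only approximates the cost figures reported in Table 3.

```python
from fractions import Fraction
from math import floor

EXIT_LAYERS = [4, 6, 8, 10, 12]


def relative_cost(alpha: str, batch: int = 128) -> float:
    """Layer-by-candidate evaluations of a cascade divided by those of a
    12-layer monolithic model run on the same batch."""
    rate, size, prev, work = Fraction(alpha), batch, 0, 0
    for i, depth in enumerate(EXIT_LAYERS):
        work += (depth - prev) * size       # layers prev+1..depth at the current size
        prev = depth
        if i < len(EXIT_LAYERS) - 1:        # the final classifier drops nothing
            size -= floor(rate * size)
    return work / (EXIT_LAYERS[-1] * batch)


savings = {a: round(100 * (1 - relative_cost(a))) for a in ("0.3", "0.4", "0.5")}
```

Under these assumptions, the estimated saving is about 37% for $\alpha = 0.3$, 44% for $\alpha = 0.4$, and 51% for $\alpha = 0.5$.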
For larger values of $\alpha$, such as 0.5, we note that we achieve significantly better performance than monolithic transformers of similar computational cost. For example, CT achieves an 11.2% improvement in P@1 over a 6-layer TANDA model (62.4 vs 56.1) on GPD; a similar improvement is obtained on ASNQ (+11.0%, 52.4 vs 47.2). Finally, our model is also competitive with respect to a sequential transformer with equivalent drop rates, while being between 1.9 and 2.4 times more efficient. This is because an SR model made of independent TANDA models cannot re-use encodings generated by smaller models as CT does.

Results on Tuning of Drop Ratio α
Finally, we examined how different values for drop ratio $\alpha$ affect the performance of CT models. In particular, we performed an exhaustive grid-search on a CT model trained on the GPD dataset for drop ratio values $\{\alpha_{p_1}, \alpha_{p_2}, \alpha_{p_3}, \alpha_{p_4}\}$, with $\alpha_{p_k} \in \{0.1, 0.2, \dots, 0.6\}$. The performance is reported in Figure 2 with respect to the relative computational cost per batch of a configuration when compared with a TANDA$_{\text{BASE}}$ model.
Overall, we found that CT models are robust with respect to the choice of $\{\alpha_{p_k}\}_{k=1}^{4}$. We observe moderate degradation for higher drop ratio values (e.g., P@1 varies from 85.6 to 80.0). Further, as expected, performance increases for models with higher computational cost per batch, although it tapers off for CT models with relative cost ≥ 70%. On the other hand, the grid search results do not seem to suggest an effective strategy to pick optimal values for $\{\alpha_{p_k}\}_{k=1}^{4}$, and, in our experiments, we ended up choosing the same values for all drop rates. In the future, we would like to learn such values while training the cascade model itself.

Conclusions and Future Work
This work introduces CT, a variant of the traditional transformer model designed to improve inference throughput. Compared to a traditional monolithic stacked transformer model, our approach leverages classifiers placed at different encoding stages to prune candidates in a batch and improve model throughput. Our experiments show that a CT model not only achieves comparable performance to a traditional transformer model while reducing computational cost per batch by over 37%, but also that our training strategy is stable and jointly produces smaller transformer models that are suitable for classification when higher throughput and lower latency goals must be met. In future work, we plan to explore techniques to automatically learn where to place intermediate classifiers, and what drop ratio to use for each one of them.
[Figure 2 (caption, partially recovered): The three runs reported in Table 3 correspond to ($\alpha = 0.3$), ($\alpha = 0.4$), and ($\alpha = 0.5$).]