Mediators in Determining what Processing BERT Performs First

Probing neural models for the ability to perform downstream tasks using their activation patterns is often used to localize what parts of the network specialize in performing what tasks. However, little work has addressed potential mediating factors in such comparisons. As a test-case mediating factor, we consider the prediction's context length, namely the length of the span whose processing is minimally required to perform the prediction. We show that failing to control for context length may lead to contradictory conclusions as to the localization patterns of the network, depending on the distribution of the probing dataset. Indeed, when probing BERT with seven tasks, we find that it is possible to obtain 196 different rankings between them by manipulating the distribution of context lengths in the probing dataset. We conclude by presenting best practices for conducting such comparisons in the future.


Introduction
The strong performance of end-to-end models and the difficulty in understanding their inner workings have led to extensive research aimed at interpreting their behavior (Li et al., 2016; Karpathy et al., 2015). This has led researchers to investigate the behavioral traits of networks in general (Hacohen et al., 2020) and of representative architectures in particular (Schlichtkrull et al., 2020). Within NLP, Transformer-based pretrained embeddings are the basis for many tasks, which underscores the importance of interpreting their behavior (Belinkov et al., 2020), and especially the behavior of BERT (Devlin et al., 2019; Rogers et al., 2020), perhaps the most widely used of Transformer-based models.
In this work, we analyze the common approach of probing (§2), used to localize where "knowledge" of particular tasks is encoded; localization is often carried out in terms of the layers most responsible for the task at hand (cf. Tenney et al., 2019b). Various works (Tenney et al., 2019a; Peters et al., 2018; Blevins et al., 2018) showed that some tasks are processed in lower layers than others.
We examine the extent to which potential mediating factors may account for observed trends and show that varying some mediating factors (see §2) may diminish, or even reverse, the conclusions drawn by Tenney et al. (T19; 2019a). Specifically, despite reaffirming T19's experimental findings, we contest T19's interpretation of the results, namely that the processing carried out by BERT parallels the classical NLP pipeline. Indeed, T19 concludes that lexical tasks (POS tagging) are performed by the lower layers, followed by syntactic tasks, whereas more semantic tasks are performed later on. This analysis rests on the assumption that the nature of the task (lexical, syntactic, or semantic) is the driving force that determines what layer performs what analysis. We show that other factors should be weighed in as well. Specifically, we show that manipulating the distribution of examples in the probing dataset can lead to a variety of different conclusions as to which tasks are performed first.
We argue that potential mediators must be considered when comparing tasks, and focus on one such mediator: the context length, which we define as the number of tokens whose processing is minimally required to perform the prediction. We operationalize this notion by defining it as the maximal distance between any two tokens for which a label is predicted. This amounts to the span length in tasks that involve a single span (e.g., NER), and to the dependency length in tasks that address the relation between two spans (see §2). Our motivation for considering context length as a mediator is grounded in previous work that presented the difficulty posed by long-distance dependencies in various NLP tasks (Xu et al., 2009; Sennrich, 2017), and particularly in previous work that indicated the difficulty Transformers have in generalizing across different dependency lengths (Choshen and Abend, 2019).
We show that in some of the cases where one task seems to be better predicted by a higher layer than another task, controlling for context length may reverse that order. Indeed, we show that 196 different rankings between the seven tasks explored in T19 may be obtained with a suitable distribution over the probing datasets, namely 196 different ways to rank the tasks according to their expected layer. Moreover, our results show that when context length is not taken into account, one task (e.g., dependency parsing) may seem to be processed at a higher layer than another (e.g., NER), when its expected layer (see §2) is, in fact, lower for all ranges of context lengths (§3.1.1).

Background
We begin by laying out the terminology and methodology we will use in the paper.
Edge Probing. Edge probing is the method of training a classifier for a given task on different parts of the network (without fine-tuning). Success in classification is interpreted as evidence that the required features for classification are somehow encoded in the examined part and are sufficiently easy to extract. In our experiments, we follow T19 and probe BERT with Named Entity Recognition (NER), a constituent-based task (classifying non-terminals; Non-term.), Semantic Role Labeling (SRL), Co-reference (Co-ref.), Semantic Proto-Roles (SPR; Reisinger et al., 2015), Relation Classification (RC), and Stanford Dependency Parsing (Dep.; de Marneffe et al., 2006).
Localization by Expected Layer. The expected layer metric (which we will henceforth refer to as E_layer) of T19 assesses which layer in BERT is most needed for prediction: a probing classifier P^(l) is trained on the lowest l layers. Then, a differential score Δ^(l) is computed, which indicates the performance gain when taking into account one additional layer:

Δ^(l) = Score(P^(l)) − Score(P^(l−1))   (1)

Once all Δ^(l) for l = 1, ..., L are computed, we may compute E_layer as the expectation of the layer index, weighted by the normalized differential scores:

E_layer = Σ_{l=1}^{L} l · Δ^(l) / Σ_{l=1}^{L} Δ^(l)   (2)

Therefore, unlike standard edge probing, which is performed on each layer individually, computing E_layer takes into account all layers up to a given l.
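As a concrete illustration, the expected-layer computation can be sketched as follows (a minimal sketch; the list of cumulative probe scores is a hypothetical input, not values from the paper):

```python
def expected_layer(scores):
    """Expected-layer metric of T19 (sketch).

    scores[l] is the performance of a probe trained on layers 0..l
    (hypothetical values); scores[0] is the layer-0 (embedding) probe.
    """
    # Differential scores: gain from adding each successive layer.
    deltas = [scores[l] - scores[l - 1] for l in range(1, len(scores))]
    # Expectation of the layer index under the normalized deltas.
    total = sum(deltas)
    return sum(l * d for l, d in enumerate(deltas, start=1)) / total
```

If probe performance improves mostly at the higher layers, the expectation shifts upward accordingly; if lower layers already suffice, it stays low.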
Mediation Analysis. Each of the explored tasks classifies one or two input sub-spans. In both cases, we define the context length to be the distance between the earliest and latest span index. Namely, for tasks with two spans (e.g., SPR), span_1 = [i_1, j_1] and span_2 = [i_2, j_2], where span_1 appears before span_2, the context length is j_2 − i_1, whereas for tasks with just one span (e.g., NER), span = [i, j], the context length is simply the span length j − i. In order to examine the effect of context length on E_layer, we model it as a mediating factor, namely as an intermediate variable that (partly) explains the relationship between two other variables (in this work, a task and its E_layer). See Figure 1.
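This operationalization can be sketched as follows (spans are hypothetical (start, end) token-index pairs; the function name is ours):

```python
def context_length(span1, span2=None):
    """Context length of an example (sketch of the definition above).

    For a single span (e.g., NER), this is the span length j - i; for two
    spans (e.g., SPR), it is the distance from the start of the earlier
    span to the end of the later one, j2 - i1.
    """
    if span2 is None:
        i, j = span1
        return j - i
    # Order the spans so the earlier one comes first, then measure
    # from its start to the later span's end.
    first, second = sorted([span1, span2])
    return second[1] - first[0]
```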
We bin each task's test set into non-overlapping bins, according to their context length ranges. We use the notation 'i-j' to denote the bin of context lengths in the range [i, j]. For example, the second bin would be '3-5', denoting context lengths 3, 4, and 5. In addition, given a specific task, two possible approaches exist for examining the mediation effect of context length on the task's E_layer. The first bins all the task's data into sub-sets in advance, and then trains over each subset separately. Alternatively, the second trains over the whole dataset, binning only during the test phase. We follow the latter approach, as it is more computationally efficient.
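The binning scheme can be sketched as follows (the default width of 3 and the open-ended '9+' overflow bin are illustrative choices, mirroring the ranges used in our experiments):

```python
def bin_label(length, width=3, overflow=9):
    """Map a context length to its 'i-j' bin label; lengths of at least
    `overflow` are lumped into a single open-ended bin (e.g., '9+')."""
    if length >= overflow:
        return f"{overflow}+"
    lo = (length // width) * width
    return f"{lo}-{lo + width - 1}"
```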
Figure 1: The relationship we stipulate between the task (T), the context length (C), and E_layer. We use two random variables: T is the task, which can be any of the seven tasks we observe, and C is the context length.
Interestingly, in §3.1.1, we encounter a special edge case, where the aggregated average (i.e., E layer ) of one task is higher than another, whereas in each sub-set (by a given context length) it is lower. This may occur when the weight of the sub-sets differs between the two aggregations.
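This aggregation reversal can be illustrated with a toy numeric sketch (all per-bin values and weights below are invented for illustration, not measured):

```python
# Hypothetical per-bin expected layers: task B is higher in *every* bin.
e_layer = {
    "A": {"0-2": 2.0, "9+": 5.0},
    "B": {"0-2": 3.0, "9+": 6.0},
}
# Each task's own context-length distribution over the bins.
weights = {
    "A": {"0-2": 0.1, "9+": 0.9},  # A's examples are mostly long
    "B": {"0-2": 0.9, "9+": 0.1},  # B's examples are mostly short
}

def aggregate(task):
    """Weighted average of per-bin E_layer under the task's own distribution."""
    return sum(weights[task][b] * e_layer[task][b] for b in e_layer[task])
```

Here aggregate("A") = 4.7 exceeds aggregate("B") = 3.3, even though B's per-bin value is higher than A's within every bin; the differing bin weights drive the reversal.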

Experiments
We hypothesize that the context length is a mediating factor in the E layer of a task. In order to test this hypothesis, we run the following experiments, aiming at isolating the context length.
We use the SPR1 dataset (Reisinger et al., 2015) to probe SPR, the English Web Treebank for the Dep. task (Silveira et al., 2014), SemEval 2010 Task 8 for the RC task (Hendrickx et al., 2009), and the OntoNotes 5.0 dataset (Weischedel et al., 2013) for the other tasks. Configurations follow the defaults in the Jiant toolkit implementation (Wang et al., 2019). In addition, we work with the BERT-base model.

The Effect on E layer
First, we wish to confirm that context length indeed affects E_layer, and that the task is not its sole determinant. Given a task and a threshold thr, we compile a dataset for the task containing the subset of examples with context lengths shorter than thr, and use it to compute E_layer. We do this for all tasks and for every integer threshold between 0 and a maximal threshold, which is selected separately for each task to ensure that at least 2000 instances remain in the last bin.
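The per-task threshold selection can be sketched as follows (the function name is ours; the paper requires 2000 instances, while the test below uses a smaller count for brevity):

```python
def maximal_threshold(lengths, min_instances=2000):
    """Largest threshold such that at least `min_instances` examples have a
    context length of at least that threshold, keeping the last bin populated."""
    thr = 0
    while sum(1 for x in lengths if x >= thr + 1) >= min_instances:
        thr += 1
    return thr
```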
We find that context length plays an important role in the difference between the expected layers (Figure 2). Most notably, the E_layer of the Co-ref., SRL, Dep., and RC tasks increases when increasing the threshold.
Next, we divide the data into smaller bins of non-overlapping context length ranges, in order to control for the influence of the context lengths on the expected layers of the tasks. We compute E layer for sub-sets of similar lengths. In choosing the size of each such range, we try to balance between informativeness (narrower ranges) and reliability (having enough examples in each range, so as to reduce noise). We find that the narrowest range width that retains at least 1% of the examples in each bin is 3. We thus divide the dataset for each task into context length ranges of width 3, until the maximal threshold is reached. Higher context lengths are lumped into an additional bin.
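The width-selection criterion above can be sketched as follows (a hypothetical helper; it returns the narrowest width for which every bin below the maximal threshold retains at least the given fraction of examples):

```python
def narrowest_width(lengths, max_threshold, min_frac=0.01):
    """Smallest bin width such that each bin [lo, lo + width) below
    `max_threshold` holds at least `min_frac` of all examples."""
    n = len(lengths)
    for width in range(1, max_threshold + 1):
        bins = [sum(1 for x in lengths if lo <= x < lo + width)
                for lo in range(0, max_threshold, width)]
        if all(count >= min_frac * n for count in bins):
            return width
    return max_threshold
```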

Manipulating the Context Length Distribution: An Extreme Case
We begin by examining two specific tasks, Dep. and NER, and their E_layer for each context length range. We then consider, for simplicity, a case where all the context lengths of Dep. are of length 9+, while those of NER are in the range of 3-5 (Figure 3). We see that when controlling for context length, Dep. is computed in a lower layer than NER, regardless of the range. However, depending on the distribution of context lengths in the probing dataset, the outcome may be completely different, with Dep. being processed in higher layers (for a similar example with a different task-pair, see §A.1).
These results indicate that the findings of T19 do not necessarily mean that BERT is performing a pipeline of computations (as is commonly asserted, see, e.g., T19 and Blevins et al. (2018)), and that mediating factors need to be taken into account when interpreting E_layer.
Figure 3: E_layer of NER and Dep. for different context length ranges (4 left blue and yellow pairs), and their E_layer when all instances of NER are of context length l ∈ [3, 5] and all those of Dep. are of context length l ≥ 9 (rightmost green and red pair). While for every context length range, NER's E_layer is higher than that of Dep., for some context length distributions that order may be reversed.

Imposing Similar Length Distributions
In the previous section, we observed that one task's E_layer can be either higher or lower than another's, depending on the distribution of context lengths in the probing dataset. We next ask whether such a "paradox" arises in experiments when imposing the same context length distribution on both tasks.
Following Pearl (2001), we employ mediation analysis, and specifically concentrate on the Natural Direct Effect (NDE), which is the difference between two of the observed dependent variables (in our case E_layer) when fixing the mediator. In our case, the NDE is the difference between the E_layer of two tasks, while forcing the same context length distribution on both. For convenience, we force the distribution of one of the examined tasks (for more details, see §A.2), but any distribution is applicable. In general, the equation for computing the NDE of tasks t_1 and t_2, with the context length distribution of t_1 imposed on both, is:

NDE(t_1, t_2) = Σ_c P(C = c | T = t_1) · [E_layer(T = t_2, C = c) − E_layer(T = t_1, C = c)]   (3)

where T is a random variable of the tasks, and C is a random variable of the context length.
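Under this formulation, the NDE can be sketched as follows (per-bin E_layer values and context-length distributions are assumed to be given as dictionaries; all names are ours):

```python
def nde(e_layer_by_bin, dist, t1, t2):
    """Natural Direct Effect of switching from task t1 to task t2 while
    holding the context-length distribution fixed to that of t1.

    e_layer_by_bin[task][c]: E_layer of `task` restricted to bin c
    dist[task][c]:           fraction of `task`'s examples in bin c
    """
    return sum(dist[t1][c] * (e_layer_by_bin[t2][c] - e_layer_by_bin[t1][c])
               for c in dist[t1])
```

Applying it twice per pair, once under each task's distribution, yields the two mediated differences that can then be compared to the unmediated one.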
We apply NDE twice for every pair of tasks (once for each task's context length distribution). We then compare the results to the difference between the tasks' expected layers where each task keeps its original context length distribution (unmediated). Results (Figure 4) show that the difference could be more than 50 times larger (change of 1.24 in absolute value) or decrease by 86% (0.73 in absolute value). In some cases the order of the two tasks is reversed, namely, the task that is lower with one distribution becomes higher with another. This shows that even among our examined set of seven tasks, the effect of potential mediators cannot be ignored. For more results, see §A.3.

Controlling for Context Length
After observing that the distribution of context length in the probing dataset may affect the relative order of the expected layers, we propose a more detailed and accurate method to compare the expected layers, which does not rely on a specific length distribution. We do so by plotting the controlled effect, namely E layer for each range separately.
Our results (Figure 5) allow us to compute the range of possible expected layers for a task that may result from taking any context length distribution (Figure 6). The figure shows the wide range of possible relative behaviors of E_layer for task-pairs: from notable to negligible differences in expected layers (e.g., SRL and Co-ref.), to pairs whose ordering of expected layers may be reversed (i.e., overlapping ranges, such as with SPR and RC). In fact, by taking into account every possible combination of context length distributions for the tasks, we get as many as 196 possible rankings of the seven tasks according to their E_layer. One such possible order is, for example, Non-term. < Dep. < SRL < RC < NER < Co-ref. < SPR. We elaborate on this in §A.4.
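This range computation can be sketched as follows: since E_layer under any distribution is a convex combination of the per-bin values, the attainable range runs from the smallest to the largest per-bin E_layer, and two tasks' order can reverse exactly when their ranges overlap (the per-bin values in the test are hypothetical):

```python
def attainable_range(per_bin):
    """Min and max E_layer achievable by varying the context-length
    distribution: any mixture of per-bin values lies between the
    smallest and largest of them."""
    vals = per_bin.values()
    return min(vals), max(vals)

def order_can_reverse(per_bin_a, per_bin_b):
    """True if some distribution puts task A's E_layer above task B's
    and some other distribution puts it below (overlapping ranges)."""
    lo_a, hi_a = attainable_range(per_bin_a)
    lo_b, hi_b = attainable_range(per_bin_b)
    return lo_a < hi_b and lo_b < hi_a
```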
To recap, we find that the difference in E layer between some tasks may considerably change and their order may reverse, depending on the context length. This finding lends further support to our claim that mediators should be taken into account.

Conclusion
We showed that when performing edge probing to identify what layers are responsible for addressing what tasks, it is imperative to take into account potential mediators, as they may be responsible for much of the observed effect. Specifically, we showed that context length has a significant impact on a task's E layer . Our analysis shows the wide range of relative orderings of the expected layers for different tasks when assuming different context length distributions; from extreme edge cases, like the one we observed in §3.1.1, to more common, but potentially misleading ones, where the difference between expected layers may dramatically increase or decrease depending on the context length distribution. Most importantly, it shows that by manipulating the context length distribution, we may get a wide range of outcomes.
Our work suggests that mediating factors should be taken into account when basing analysis on the E layer . On a broader note, alternative hypotheses should be considered, before limiting oneself to a single interpretation.
Future work will consider the effect of other mediating factors. The two methods we used, NDE and controlled effect, can be used to examine the impact of other mediating factors and should be adopted as part of the field's basic analysis toolkit (cf. Feder et al., 2020;Vig et al., 2020). NDE should be used when several effects are examined simultaneously, as it facilitates the assessment of their effect on the tasks' complexity. It is also advisable to use NDE when a more practical examination is required, i.e., when distributions of the mediators are given empirically, as it is easier to derive the mediating factors' impact using this method. In contrast, the controlled effect method should be used when examining the effects of two variables (e.g., tasks and mediating factors) or when comparing several tasks with one mediating effect.

A.1 Additional Example of the Extreme Case
We show another example of a task-pair that, under certain distributions of context lengths, exhibits similar behavior to that observed in the edge case described in §3.1.1 (Figure 7).
Figure 7: E_layer of SRL and Non-term. for different context length ranges (4 left blue and yellow pairs), and their E_layer when all instances of SRL are of context length l ∈ [0, 2] and all those of Non-term. are of context length l ≥ 9 (rightmost green and red pair). While for every context length range, SRL's E_layer is higher than that of Non-term., for some context length distributions that order may be reversed.

A.2 Context Length Distribution
Much of our work deals with possible context length distributions, normalizing them, and accounting for them. We provide here the actual distributions, which are the underlying property driving the observed effects. We report the percentage of examples in each context length range for each task (Figure 8).

A.3 NDE vs. Unmediated Difference for All Task-Pairs
For every task-pair, we compare the unmediated E_layer difference with the pair's NDE. Figure 9 presents this comparison for each task-pair, with the distribution of one of the pair's tasks being applied in the NDE calculations.

A.4 Extreme E layer Differences
Based on Figure 6, we compute the extreme E_layer differences of each task-pair. Namely, for each such pair, we juxtapose the difference between the maximal possible E_layer of the first task and the minimal E_layer of the second one with the opposite case (the difference between the minimal possible E_layer of the first task and the maximal E_layer of the second one). Our results can be seen in Figure 10.
Figure 9: Difference between unmediated E_layer and NDE for every task-pair. The employed context length distributions (as part of the NDE calculations) are, from left to right, of NER, SRL, Dep., Non-term., SRL, Co-ref., Dep., Non-term., SRL, Non-term., SPR, SRL, SPR, SPR, Non-term., SRL, RC, NER, Non-term., Dep. and SRL.
Figure 10: Difference between the minimal possible expected layer of the left task and the maximal possible expected layer of the right task (blue; see legend), and vice versa (yellow; see legend), for every task-pair.