Information-Theoretic Probing with Minimum Description Length


Abstract
To measure how well pretrained representations encode some linguistic property, it is common to use accuracy of a probe, i.e. a classifier trained to predict the property from the representations. Despite widespread adoption of probes, differences in their accuracy fail to adequately reflect differences in representations. For example, they do not substantially favour pretrained representations over randomly initialized ones. Analogously, their accuracy can be similar when probing for genuine linguistic labels and probing for random synthetic tasks. To see reasonable differences in accuracy with respect to these random baselines, previous work had to constrain either the amount of probe training data or its model size. Instead, we propose an alternative to the standard probes, information-theoretic probing with minimum description length (MDL). With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data. Therefore, the measure of interest changes from probe accuracy to the description length of labels given representations. In addition to probe quality, the description length evaluates 'the amount of effort' needed to achieve the quality. This amount of effort characterizes either (i) size of a probing model, or (ii) the amount of data needed to achieve the high quality. We consider two methods for estimating MDL which can be easily implemented on top of the standard probing pipelines: variational coding and online coding. We show that these methods agree in results and are more informative and stable than the standard probes. 1

Introduction
To estimate to what extent representations (e.g., ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019)) capture a linguistic property, most previous work uses 'probing tasks' (aka 'probes' and 'diagnostic classifiers'); see Belinkov and Glass (2019) for a comprehensive review. These classifiers are trained to predict a linguistic property from 'frozen' representations, and accuracy of the classifier is used to measure how well these representations encode the property.
1 We release code at https://github.com/lena-voita/description-length-probing.
Despite widespread adoption of such probes, they fail to adequately reflect differences in representations. This is clearly seen when using them to compare pretrained representations with randomly initialized ones (Zhang and Bowman, 2018). Analogously, their accuracy can be similar when probing for genuine linguistic labels and probing for tags randomly associated with word types ('control tasks', Hewitt and Liang (2019)). To see differences in the accuracy with respect to these random baselines, previous work had to reduce the amount of probe training data (Zhang and Bowman, 2018) or use smaller models for probes (Hewitt and Liang, 2019).
As an alternative to the standard probing, we take an information-theoretic view at the task of measuring relations between representations and labels. Any regularity in representations with respect to labels can be exploited both to make predictions and to compress these labels, i.e., reduce length of the code needed to transmit them. Formally, we recast learning a model of data (i.e., training a probing classifier) as training it to transmit the data (i.e., labels) in as few bits as possible. This naturally leads to a change of measure: instead of evaluating probe accuracy, we evaluate minimum description length (MDL) of labels given representations, i.e. the minimum number of bits needed to transmit the labels knowing the representations. Note that since labels are transmitted using a model, the model has to be transmitted as well (directly or indirectly). Thus, the overall codelength is a combination of the quality of fit of the model (compressed data length) with the cost of transmitting the model itself.
Intuitively, codelength characterizes not only the final quality of a probe, but also the 'amount of effort' needed to achieve this quality (Figure 1). If representations have some clear structure with respect to labels, the relation between the representations and the labels can be understood with less effort; for example, (i) the 'rule' predicting the label (i.e., the probing model) can be simple, and/or (ii) the amount of data needed to reveal this structure can be small. This is exactly how our vague (so far) notion of 'the amount of effort' is translated into codelength. We explain this more formally when describing the two methods for evaluating MDL we use: variational coding and online coding; they differ in the way they incorporate model cost: directly or indirectly.
Variational code explicitly incorporates cost of transmitting the model (probe weights) in addition to the cost of transmitting the labels; this joint cost is exactly the loss function of a variational learning algorithm (Honkela and Valpola, 2004). As we will see in the experiments, close probe accuracies often come at a very different model cost: the 'rule' (the probing model) explaining regularity in the data can be either simple (i.e., easy to communicate) or complicated (i.e., hard to communicate) depending on the strength of this regularity.
Online code provides a way to transmit data without directly transmitting the model. Intuitively, it measures the ability to learn from different amounts of data. In this setting, the data is transmitted in a sequence of portions; at each step, the data transmitted so far is used to understand the regularity in this data and compress the following portion. If the regularity in the data is strong, it can be revealed using a small subset of the data, i.e., early in the transmission process, and can be exploited to efficiently transmit the rest of the dataset. The online code is related to the area under the learning curve, which plots quality as a function of the number of training examples.
If we now recall that, to get reasonable differences with random baselines, previous work manually tuned (i) model size and/or (ii) the amount of data, we will see that these were indirect ways of accounting for the 'amount of effort' component of (i) variational and (ii) online codes, respectively. Interestingly, since variational and online codes are different methods to estimate the same quantity (and, as we will show, they agree in the results), we can conclude that the ability of a probe to achieve good quality using a small amount of data and its ability to achieve good quality using a small probe architecture reflect the same property: strength of the regularity in the data. In contrast to previous work, MDL incorporates this naturally in a theoretically justified way. Moreover, our experiments show that, differently from accuracy, conclusions made by MDL probes are not affected by an underlying probe setting, thus no manual search for settings is required.
We illustrate the effectiveness of MDL for different kinds of random baselines. For example, when considering control tasks (Hewitt and Liang, 2019), while probes have similar accuracies, these accuracies are achieved with a small probe model for the linguistic task and a large model for the random baseline (control task); these architectures are obtained as a byproduct of MDL optimization and not by manual search.
Our contributions are as follows:
• we propose information-theoretic probing which measures MDL of labels given representations;
• we show that MDL naturally characterizes not only probe quality, but also 'the amount of effort' needed to achieve it;
• we explain how to easily measure MDL on top of standard probe-training pipelines;
• we show that results of MDL probing are more informative and stable than those of standard probes.

Information-Theoretic Viewpoint
Let $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ be a dataset, where $x_{1:n} = (x_1, x_2, \dots, x_n)$ are representations from a model and $y_{1:n} = (y_1, y_2, \dots, y_n)$ are labels for some linguistic task (we assume that $y_i \in \{1, 2, \dots, K\}$, i.e. we consider classification tasks). As in standard probing tasks, we want to measure to what extent $x_{1:n}$ encode $y_{1:n}$. Differently from standard probes, we propose to look at this question from the information-theoretic perspective and define the goal of a probe as learning to effectively transmit the data.
Setting. Following the standard information theory notation, let us imagine that Alice has all $(x_i, y_i)$ pairs in $D$, Bob has just the $x_i$'s from $D$, and that Alice wants to communicate the $y_i$'s to Bob. The task is to encode the labels $y_{1:n}$ knowing the inputs $x_{1:n}$ in an optimal way, i.e. with minimal codelength (in bits) needed to transmit $y_{1:n}$.
Transmission: Data and Model. Alice can transmit the labels using some probabilistic model of data p(y|x) (e.g., it can be a trained probing classifier). Since Bob does not know the precise trained model that Alice is using, some explicit or implicit transmission of the model itself is also required. In Section 2.1, we explain how to transmit data using a model p. In Section 2.2, we show direct and indirect ways of transmitting the model.
Interpretation: quality and 'amount of effort'. In Section 2.3, we show that total codelength characterizes both probe quality and the 'amount of effort' needed to achieve it. We draw connections between different interpretations of this 'amount of effort' part of the code and manual search for probe settings done in previous work.

Transmission of Data Using a Model
Suppose that Alice and Bob have agreed in advance on a model $p$, and both know the inputs $x_{1:n}$. Then there exists a code to transmit the labels $y_{1:n}$ losslessly with codelength

$$L_p(y_{1:n} \mid x_{1:n}) = -\sum_{i=1}^{n} \log_2 p(y_i \mid x_i). \tag{1}$$

This is the Shannon-Huffman code, which gives an optimal bound on the codelength if the data are independent and come from a conditional probability distribution $p(y \mid x)$.
Learning is compression. The bound (1) is exactly the categorical cross-entropy loss evaluated on the model $p$. This shows that the task of compressing labels $y_{1:n}$ is equivalent to learning a model $p(y \mid x)$: quality of a learned model $p(y \mid x)$ is the codelength needed to transmit the data.
Compression is usually compared against uniform encoding which does not require any learning from data. It assumes $p(y \mid x) = p_{\text{unif}}(y \mid x) = \frac{1}{K}$, and yields codelength $L_{\text{unif}}(y_{1:n} \mid x_{1:n}) = n \log_2 K$ bits. Another trivial encoding ignores input $x$ and relies on class priors $p(y)$, resulting in codelength $H(y)$.
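As a minimal sketch of the two quantities above, the following NumPy snippet computes the data codelength of bound (1) and the uniform codelength $n \log_2 K$. The function names and the toy probabilities are illustrative assumptions, and labels are 0-indexed here for array indexing (the text uses $1, \dots, K$).

```python
import numpy as np

def data_codelength_bits(probs, labels):
    """Codelength of labels under a model: -sum_i log2 p(y_i | x_i), in bits.

    probs  : (n, K) array, row i holds the model distribution p(. | x_i)
    labels : (n,) integer array of 0-indexed gold labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]  # p(y_i | x_i) for each i
    return float(-np.log2(picked).sum())

def uniform_codelength_bits(n, num_classes):
    """Codelength of the uniform code: n * log2(K) bits."""
    return float(n * np.log2(num_classes))

# Toy example (assumed numbers): 4 examples, K = 3 classes.
toy_probs = np.array([
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
    [0.50, 0.25, 0.25],
])
toy_labels = np.array([0, 1, 2, 1])

model_bits = data_codelength_bits(toy_probs, toy_labels)
uniform_bits = uniform_codelength_bits(len(toy_labels), 3)
print(f"model code: {model_bits:.2f} bits, uniform code: {uniform_bits:.2f} bits")
```

The data codelength is the summed cross-entropy in base 2, so a standard probe's summed training loss in nats can be converted by dividing by $\ln 2$.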
Relation to Mutual Information. If the inputs and the outputs come from a true joint distribution $q(x, y)$, then, for any transmission method with codelength $L$, it holds $\mathbb{E}_q[L(y \mid x)] \ge H(y \mid x)$ (Grunwald, 2004). Therefore, the gain in codelength over the trivial codelength $H(y)$ is

$$H(y) - \mathbb{E}_q[L(y \mid x)] \le H(y) - H(y \mid x) = I(y; x).$$

In other words, the compression is limited by the mutual information (MI) between inputs (i.e. pretrained representations) and outputs (i.e. labels).
Note that total codelength includes model codelength in addition to the data code. This means that while high MI is necessary for effective compression, a good representation is the one which also yields simple models predicting y from x, as we formalize in the next section.

Transmission of the Model (Explicit or Implicit)
We consider two compression methods that can be used with deep learning models (probing classifiers):
• variational code: an instance of two-part codes, where a model is transmitted explicitly and then used to encode the data;
• online code: a way to encode both model and data without directly transmitting the model.

Variational Code
We assume that Alice and Bob have agreed on a model class $\mathcal{H} = \{p_\theta \mid \theta \in \Theta\}$. With two-part codes, for any model $p_{\theta^*}$, Alice first transmits its parameters $\theta^*$ and then encodes the data while relying on the model. The description length decomposes accordingly:

$$L^{\text{2-part}}_{\theta^*}(y_{1:n} \mid x_{1:n}) = L_{\text{param}}(\theta^*) + L_{p_{\theta^*}}(y_{1:n} \mid x_{1:n}) = L_{\text{param}}(\theta^*) - \sum_{i=1}^{n} \log_2 p_{\theta^*}(y_i \mid x_i). \tag{2}$$

To compute the description length of the parameters $L_{\text{param}}(\theta^*)$, we can further assume that Alice and Bob have agreed on a prior distribution over the parameters $\alpha(\theta)$. Now, we can rewrite the total description length as

$$-\log_2\left(\alpha(\theta^*)\,\epsilon^m\right) - \sum_{i=1}^{n} \log_2 p_{\theta^*}(y_i \mid x_i),$$

where $m$ is the number of parameters and $\epsilon$ is a prearranged precision for each parameter. With deep learning models, such straightforward codes for parameters are highly inefficient. Instead, in the variational approach, weights are treated as random variables, and the description length is given by the expectation

$$L^{\text{var}}_{\beta}(y_{1:n} \mid x_{1:n}) = -\mathbb{E}_{\theta \sim \beta}\left[\log_2 \alpha(\theta) - \log_2 \beta(\theta) + \sum_{i=1}^{n} \log_2 p_\theta(y_i \mid x_i)\right] = \mathrm{KL}(\beta \,\|\, \alpha) - \mathbb{E}_{\theta \sim \beta} \sum_{i=1}^{n} \log_2 p_\theta(y_i \mid x_i), \tag{3}$$

where $\beta(\theta)$ is a distribution encoding uncertainty about the parameter values. The distribution $\beta(\theta)$ is chosen by minimizing the codelength given in Expression (3). The formal justification for the description length relies on the bits-back argument (Hinton and van Camp, 1993; Honkela and Valpola, 2004; MacKay, 2003). However, the underlying intuition is straightforward: parameters we are uncertain about can be transmitted at a lower cost as the uncertainty can be used to determine the required precision. The entropy term in Equation (3), $H(\beta) = -\mathbb{E}_{\theta \sim \beta} \log_2 \beta(\theta)$, quantifies this discount.
The negated codelength $-L^{\text{var}}_{\beta}(y_{1:n} \mid x_{1:n})$ is known as the evidence lower bound (ELBO) and used as the objective in variational inference. The distribution $\beta(\theta)$ approximates the intractable posterior distribution $p(\theta \mid x_{1:n}, y_{1:n})$. Consequently, any variational method can in principle be used to estimate the codelength.
In our experiments, we use the network compression method of Louizos et al. (2017). Similarly to variational dropout (Molchanov et al., 2017), it uses sparsity-inducing priors on the parameters, pruning neurons from the probing classifier as a byproduct of optimizing the ELBO. As a result we can assess the probe complexity both using its description length $\mathrm{KL}(\beta \,\|\, \alpha)$ and by inspecting the discovered architecture.
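To make the structure of the variational codelength concrete, here is a hedged sketch that evaluates a model term KL(β‖α) plus an expected data term for a linear softmax probe. It is not the Bayesian compression method of Louizos et al. (2017): as a simplifying assumption it swaps in a fully factorized Gaussian posterior β and a standard Gaussian prior α, and estimates the expectation by Monte Carlo sampling. All function names are hypothetical.

```python
import numpy as np

def kl_gaussian_bits(mu, sigma, prior_sigma=1.0):
    """KL(beta || alpha) in bits, summed over parameters.

    beta = N(mu, sigma^2) per parameter, alpha = N(0, prior_sigma^2).
    (Assumed Gaussian family; the paper uses sparsity-inducing priors.)
    """
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    var_ratio = (sigma / prior_sigma) ** 2
    kl_nats = 0.5 * (var_ratio + (mu / prior_sigma) ** 2 - 1.0 - np.log(var_ratio))
    return float(kl_nats.sum() / np.log(2.0))

def log2_softmax(logits):
    """Row-wise log-softmax in base 2, computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return log_probs / np.log(2.0)

def variational_codelength_bits(mu, sigma, x, y, num_samples=16, seed=0):
    """Return (total, model, data) codelengths in bits for a linear softmax probe.

    mu, sigma : (d, K) posterior means and standard deviations of the weights
    x         : (n, d) representations, y : (n,) 0-indexed labels
    """
    rng = np.random.default_rng(seed)
    data_bits = 0.0
    for _ in range(num_samples):
        weights = mu + sigma * rng.standard_normal(mu.shape)  # theta ~ beta
        log_probs = log2_softmax(x @ weights)
        data_bits -= log_probs[np.arange(len(y)), y].sum()
    data_bits = float(data_bits / num_samples)  # Monte Carlo estimate of the data term
    model_bits = kl_gaussian_bits(mu, sigma)     # cost of transmitting the model
    return model_bits + data_bits, model_bits, data_bits

# Toy usage with assumed shapes: d = 2 features, K = 3 classes.
toy_x = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_y = np.array([0, 2])
total, model, data = variational_codelength_bits(
    np.zeros((2, 3)), np.ones((2, 3)), toy_x, toy_y
)
print(f"total={total:.2f} bits (model={model:.2f}, data={data:.2f})")
```

In practice the posterior parameters would be trained by minimizing this total, which is the negated ELBO expressed in bits.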

Online (or Prequential) Code
The online (or prequential) code (Rissanen, 1984) is a way to encode both the model and the labels without directly encoding the model weights. In the online setting, Alice and Bob agree on the form of the model $p_\theta(y \mid x)$ with learnable parameters $\theta$, its initial random seeds, and its learning algorithm. They also choose timesteps $1 = t_0 < t_1 < \dots < t_S = n$ and encode data by blocks. Alice starts by communicating $y_{1:t_1}$ with a uniform code, then both Alice and Bob learn a model $p_{\theta_1}(y \mid x)$ that predicts $y$ from $x$ using data $\{(x_i, y_i)\}_{i=1}^{t_1}$, and Alice uses that model to communicate the next data block $y_{t_1+1:t_2}$. Then both Alice and Bob learn a model $p_{\theta_2}(y \mid x)$ from the larger block $\{(x_i, y_i)\}_{i=1}^{t_2}$ and use it to encode $y_{t_2+1:t_3}$. This process continues until the entire dataset has been transmitted. The resulting online codelength is

$$L^{\text{online}}(y_{1:n} \mid x_{1:n}) = t_1 \log_2 K - \sum_{i=1}^{S-1} \log_2 p_{\theta_i}(y_{t_i+1:t_{i+1}} \mid x_{t_i+1:t_{i+1}}). \tag{4}$$

In this sequential evaluation, a model that performs well with a limited number of training examples will be rewarded by having a shorter codelength (Alice will require fewer bits to transmit the subsequent $y_{t_i+1:t_{i+1}}$ to Bob). The online code is related to the area under the learning curve, which plots quality (in case of probes, accuracy) as a function of the number of training examples. We will illustrate this in Section 3.2.
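The block-wise procedure above can be sketched as follows. This is an illustration under stated assumptions, not the probe used in the paper: the learner is a hypothetical add-one-smoothed count model over discrete input ids (standing in for a probe retrained on each prefix), and the data and timesteps are synthetic. The first block costs $t_1 \log_2 K$ bits; each later block is encoded with the model trained on everything transmitted so far.

```python
import numpy as np

def fit_count_model(x, y, num_types, num_classes):
    """Toy learner (assumption): p(y | x) from add-one-smoothed co-occurrence counts."""
    counts = np.ones((num_types, num_classes))
    np.add.at(counts, (x, y), 1.0)  # accumulate counts, handling repeated (x, y) pairs
    return counts / counts.sum(axis=1, keepdims=True)

def online_codelength_bits(x, y, timesteps, num_types, num_classes):
    """Online (prequential) codelength in bits.

    timesteps = [t_1, ..., t_S] with t_S = n; arrays are 0-indexed, so the
    first block is x[:t_1] and block i is x[t_i:t_{i+1}].
    """
    assert timesteps[-1] == len(y), "last timestep must cover the whole dataset"
    total = timesteps[0] * np.log2(num_classes)  # first block: uniform code
    for start, end in zip(timesteps[:-1], timesteps[1:]):
        probs = fit_count_model(x[:start], y[:start], num_types, num_classes)
        total -= np.log2(probs[x[start:end], y[start:end]]).sum()
    return float(total)

# Synthetic data: 8 input types, 4 classes, 512 examples.
rng = np.random.default_rng(0)
n, num_types, num_classes = 512, 8, 4
x = rng.integers(0, num_types, size=n)
y_regular = x % num_classes                    # labels fully determined by x
y_random = rng.integers(0, num_classes, size=n)  # labels unrelated to x
timesteps = [8, 16, 32, 64, 128, 256, 512]

bits_regular = online_codelength_bits(x, y_regular, timesteps, num_types, num_classes)
bits_random = online_codelength_bits(x, y_random, timesteps, num_types, num_classes)
print(f"regular labels: {bits_regular:.1f} bits, random labels: {bits_random:.1f} bits")
```

With labels that follow a strong regularity, the regularity is picked up from the first small blocks and the remaining blocks are cheap to transmit, so the codelength is much shorter than for labels unrelated to the inputs.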

Interpretations of Codelength
Connection to previous work. To get larger differences in scores compared to random baselines, previous work tried to (i) reduce size of a probing model and (ii) reduce the amount of a probe training data. Now we can see that these were indirect ways to account for the 'amount of effort' component of (i) variational and (ii) online codes, respectively.
Online code and model size. While the online code does not incorporate model cost explicitly, we can still evaluate model cost by interpreting the difference between the online codelength and the cross-entropy of the model trained on all data as the cost of the model. The latter is the codelength of the data if one knows the model parameters; the former (online codelength) is the codelength if one does not know them. In Section 3.2 we will show that trends for model cost evaluated for the online code are similar to those for the variational code. It means that in terms of a code, the ability of a probe to achieve good quality using a small amount of data or using a small probe architecture reflects the same property: the strength of the regularity in the data.
Which code to choose? In terms of implementation, the online code uses a standard probe along with its training setting: it trains the probe on increasing subsets of the dataset. Using the variational code requires changing (i) a probing model to a Bayesian model and (ii) the loss function to the corresponding variational loss (3) (i.e. adding the model KL term to the standard data cross-entropy). As we will show later, these methods agree in results. Therefore, the choice of method depends on one's preferences: the variational code can be used to inspect the induced probe architecture, but the online code is easier to implement.
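The split described above amounts to a single subtraction. A minimal sketch, with a hypothetical helper name and made-up numbers used purely for illustration:

```python
def split_online_code(online_bits, full_data_bits):
    """Split an online codelength into data and model components (in bits).

    online_bits    : online codelength of the labels
    full_data_bits : cross-entropy (in bits) of the probe trained on all data,
                     i.e. the codelength if the model parameters were known
    """
    model_bits = online_bits - full_data_bits  # implicit cost of the model
    return {"data": full_data_bits, "model": model_bits}

# Illustrative numbers only (not results from the paper).
components = split_online_code(online_bits=170.0, full_data_bits=40.0)
print(components)
```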

Description Length and Control Tasks
Hewitt and Liang (2019) noted that probe accuracy itself does not necessarily reveal if the representations encode the linguistic annotation or if the probe 'itself' learned to predict this annotation. They introduced control tasks which associate word types with random outputs, and each word token is assigned its type's output, regardless of context. By construction, such tasks can only be learned by the probe itself. They argue that selectivity, i.e. difference between linguistic task accuracy and control task accuracy, reveals how much the linguistic probe relies on the regularities encoded in the representations. They propose to tune probe hyperparameters so as to maximize selectivity. In contrast, we will show that MDL probes do not require such tuning.

Experimental Setting
In all experiments, we use the data and follow the setting of Hewitt and Liang (2019); we build on top of their code and release our extended version to reproduce the experiments.
In the main text, we use a probe with default hyperparameters which was a starting point in Hewitt and Liang (2019) and was shown to have low selectivity. In the appendix, we provide results for 10 different settings and show that, in contrast to accuracy, codelength is stable across settings.
Task: part of speech. Control tasks were designed for two tasks: part-of-speech (PoS) tagging and dependency edge prediction. In this work, we focus only on the PoS tagging task, the task of assigning tags, such as noun, verb, and adjective, to individual word tokens. For the control task, for each word type, a PoS tag is independently sampled from the empirical distribution of the tags in the linguistic data.
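The control-task construction described above can be sketched in a few lines. This is an illustrative reimplementation under our own assumptions (function name, seed handling, and toy sentence are hypothetical), not the code of Hewitt and Liang (2019): each word type receives one tag sampled from the empirical tag distribution, and every token of that type inherits it regardless of context.

```python
import random
from collections import Counter

def build_control_labels(tokens, tags, seed=0):
    """Assign each word type a random tag drawn from the empirical tag distribution.

    tokens : list of word tokens
    tags   : list of gold tags, same length as tokens
    Returns (control_tags, type_to_tag).
    """
    rng = random.Random(seed)
    tag_counts = Counter(tags)
    tag_values = sorted(tag_counts)                   # fixed order for reproducibility
    weights = [tag_counts[tag] for tag in tag_values]  # empirical frequencies
    type_to_tag = {
        word: rng.choices(tag_values, weights=weights, k=1)[0]
        for word in sorted(set(tokens))
    }
    return [type_to_tag[word] for word in tokens], type_to_tag

# Toy example (assumed data).
toy_tokens = ["the", "dog", "saw", "the", "cat", "the", "dog"]
toy_tags = ["DT", "NN", "VBD", "DT", "NN", "DT", "NN"]
control_tags, type_to_tag = build_control_labels(toy_tokens, toy_tags)
print(list(zip(toy_tokens, control_tags)))
```

Because the control label depends only on the word type, a probe can solve the task only by memorizing the type-to-label mapping.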
Data. The pretrained model is the 5.5 billion-word pre-trained ELMo (Peters et al., 2018). The data comes from Penn Treebank (Marcus et al., 1993) with the traditional parsing training/development/testing splits without extra preprocessing. Table 1 shows dataset statistics.
Probes. The probe is MLP-2 of Hewitt and Liang (2019) with the default hyperparameters. Namely, it is a multi-layer perceptron with two hidden layers defined as: $y_i \sim \mathrm{softmax}(W_3\,\mathrm{ReLU}(W_2\,\mathrm{ReLU}(W_1 h_i)))$; hidden layer size h is 1000 and no dropout is used. Additionally, in the appendix, we provide results for both MLP-2 and MLP-1 for several h values: 1000, 500, 250, 100, 50.
For the variational code, we replace dense layers with the Bayesian compression layers from Louizos et al. (2017); the loss function changes to Eq. (3).
Optimizer. All of our probing models are trained with Adam (Kingma and Ba, 2015) with learning rate 0.001. With standard probes, we follow the original paper (Hewitt and Liang, 2019) and anneal the learning rate by a factor of 0.5 once the epoch does not lead to a new minimum loss on the development set; we stop training when 4 such epochs occur in a row. With variational probes, we do not anneal learning rate and train probes for 200 epochs; long training is recommended to enable pruning (Louizos et al., 2017).
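The annealing and stopping rule for standard probes can be simulated without any training code. The sketch below is an assumption-laden illustration (the function name and the sequence of development losses are made up): the learning rate is halved after every epoch that fails to reach a new minimum development loss, and training stops after 4 such epochs in a row.

```python
def run_schedule(dev_losses, initial_lr=0.001, decay=0.5, patience=4):
    """Simulate learning-rate annealing with early stopping.

    dev_losses : development-set loss after each epoch
    Returns a list of (epoch, learning_rate_after_epoch) for the epochs that ran.
    """
    best_loss = float("inf")
    learning_rate = initial_lr
    bad_epochs_in_row = 0
    history = []
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_loss = loss            # new minimum: reset the counter
            bad_epochs_in_row = 0
        else:
            learning_rate *= decay      # no improvement: anneal the learning rate
            bad_epochs_in_row += 1
        history.append((epoch, learning_rate))
        if bad_epochs_in_row == patience:
            break                       # 4 non-improving epochs in a row
    return history

# Made-up development losses for illustration.
toy_dev_losses = [1.0, 0.8, 0.9, 0.7, 0.75, 0.74, 0.73, 0.72, 0.1]
schedule = run_schedule(toy_dev_losses)
for epoch, lr in schedule:
    print(f"epoch {epoch}: lr = {lr:.7f}")
```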

Experimental Results
Results are shown in Table 2.
Table 2: Experimental results; shown in pairs: linguistic task / control task. Codelength is measured in kbits (variational codelength is given in equation (3), online codelength in equation (4)). Accuracy is shown for the standard probe as in Hewitt and Liang (2019); for the variational probe, scores are similar (see Table 3).
Different compression methods, similar results. First, we see that both compression methods show similar trends in codelength. For the linguistic task, the best layer is the first one. For the control task, codes become larger as we move up from the embedding layer; this is expected since the control task measures the ability to memorize word type. Note that codelengths for control tasks are substantially larger than for the linguistic task (at least twice as large). This again illustrates that description length is preferable to probe accuracy: in contrast to accuracy, codelength is able to distinguish these tasks without any search for settings.
LAYER 0: MDL is correct, accuracy is not. Even more surprisingly, codelength identifies the control task even when accuracy indicates the opposite: for LAYER 0, accuracy for the control task is higher, but the code is twice as long as for the linguistic task. This is because codelength characterizes how hard it is to achieve this accuracy: for the control task, accuracy is higher, but the cost of achieving this score is very high. We will illustrate this later in this section.
For the linguistic task, note that codelength for the embedding layer is approximately twice as large as that for the first layer. Later in Section 4 we will see the same trends for several other tasks, and will show that even contextualized representations obtained with a randomly initialized model are a lot better than with the embedding layer alone.
Model: small for linguistic, large for control. Figure 2(a) shows data and model components of the variational code. For control tasks, model size is several times larger than for the linguistic task. This is something that probe accuracy alone is not able to reflect: representations have structure with respect to the linguistic labels and this structure can be 'explained' with a small model. The same representations do not have structure with respect to random labels, therefore these labels can be predicted only using a larger model. Using the interpretation from Section 2.3 to split the online code into data and model codelength, we get Figure 2(b). The trends are similar to the ones with the variational code; but with the online code, the model component shows how easy it is to learn from a small amount of data: if the representations have structure with respect to some labels, this structure can be revealed with a few training examples.
Architecture: sparse for linguistic, dense for control. The method for the variational code we use, Bayesian compression of Louizos et al. (2017), lets us assess the induced probe complexity not only by using its description length (as we did above), but also by looking at the induced architecture (Table 3). Probes learned for linguistic tasks are much smaller than those for control tasks, with only 33-75 neurons at the second and third layers. This relates to previous work (Hewitt and Liang, 2019). The authors considered several predefined probe architectures and picked one of them based on a manually defined criterion. In contrast, the variational code gives probe architecture as a byproduct of training and does not need human guidance.

Stability and Reliability of MDL Probes
Here we discuss stability of MDL results across compression methods, underlying probing classifier setting and random seeds.
The two compression methods agree in results.
Note that the observed agreement in codelengths from different methods (Table 2) is rather surprising: this contrasts with Blier and Ollivier (2018), who experimented with images (MNIST, CIFAR-10) and argued that the variational code yields very poor compression bounds compared to the online code. We can speculate that their results may be due to the particular variational approach they use. The agreement between different codes is desirable and suggests that the results are sensible and reliable.
Figure 3: Results for 10 probe settings: accuracy is wrong for 8 out of 10 settings, MDL is always correct (for accuracy higher is better, for codelength lower is better).
Hyperparameters: change results for accuracy, do not for MDL. While here we will discuss in detail results for the default settings, in the appendix we provide results for 10 different settings; for LAYER 0, results are given in Figure 3. We see that accuracy can change greatly with the settings. For example, difference in accuracy for linguistic and control tasks varies a lot; for LAYER 0 there are settings with contradictory results: accuracy can be higher either for the linguistic or for the control task depending on the settings (Figure 3). In striking contrast to accuracy, MDL results are stable across settings, thus MDL does not require search for probe settings.
Random seed: affects accuracy but not MDL. We evaluated results from Table 2 for random seeds from 0 to 4; for the linguistic task, results are shown in Figure 2(d). We see that using accuracy can lead to different rankings of layers depending on a random seed, making it hard to draw conclusions about their relative qualities. For example, accuracy for LAYER 1 and LAYER 2 are 97.48 and 97.31 for seed 1, but 97.38 and 97.48 for seed 0. On the contrary, the MDL results are stable and the scores given to different layers are well separated. Note that for this 'real' task, where the true ranking of layers 1 and 2 is not known in advance, tuning a probe setting by maximizing difference with the synthetic control task (as done by Hewitt and Liang (2019)) does not help: in the tuned setting, scores for these layers remain very close (e.g., 97.3 and 97.0 (Hewitt and Liang, 2019)).

Description Length and Random Models
Now, from random labels for word types, we come to another type of random baselines: randomly initialized models. Probes using these representations show surprisingly strong performance for both token (Zhang and Bowman, 2018) and sentence (Wieting and Kiela, 2019) representations. This again confirms that accuracy alone does not reflect what a representation encodes. With MDL probes, we will see that codelength shows large difference between trained and randomly initialized representations.
In this part, we also experiment with ELMo and compare it with a version of the ELMo model in which all weights above the lexical layer (LAYER 0) are replaced with random orthonormal matrices (but the embedding layer, LAYER 0, is retained from trained ELMo). We conduct a line of experiments using a suite of edge probing tasks (Tenney et al., 2019). In these tasks, a probing model (Figure 4) can access only representations within given spans, such as a predicate-argument pair, and must predict properties, such as semantic roles.

Experimental Setting
Tasks and datasets. We focus on several core NLP tasks: PoS tagging, syntactic constituent and dependency labeling, named entity recognition, semantic role labeling, coreference resolution, and relation classification. Examples for each task are shown in Table 4, dataset statistics are in Table 5. See extra details in Tenney et al. (2019).
Figure 4: Probing model architecture for an edge probing task. The example is for semantic role labeling; for PoS, NER and constituents, only a single span is used.
Probes and optimization. Probing architecture is illustrated in Figure 4. It takes a list of contextual vectors $[e_0, e_1, \dots, e_n]$ and integer spans $s_1 = [i_1, j_1)$ and (optionally) $s_2 = [i_2, j_2)$ as inputs, and uses a projection layer followed by the self-attention pooling operator of Lee et al. (2017) to compute fixed-length span representations. The span representations are concatenated and fed into a two-layer MLP followed by a softmax output layer. As in the original paper, we use the standard cross-entropy loss, hidden layer size of 256 and dropout of 0.3. For further details on training, we refer the reader to the original paper by Tenney et al. (2019). 7 For the variational code, the layers are replaced with the Bayesian compression layers of Louizos et al. (2017); the loss function changes to (3) and no dropout is used. Similar to the experiments in the previous section, we do not anneal the learning rate and train for at least 200 epochs to enable pruning.
7 The differences with the original implementation by Tenney et al. (2019) are: softmax with the cross-entropy loss instead of sigmoid with binary cross-entropy, using the loss instead of F1 in the early stopping criterion.
Table 6: Experimental results; shown in pairs: trained model / randomly initialized model. Codelength is measured in kbits (variational codelength is given in equation (3), online codelength in equation (4)), compression is measured with respect to the corresponding uniform code.
We build our experiments on top of the original code by Tenney et al. (2019) and release our extended version.
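To illustrate how a variable-length span is turned into a fixed-length vector, here is a minimal NumPy sketch of attention pooling over a half-open span $[i, j)$. It is a simplification under our own assumptions: attention scores come from a single hypothetical weight vector applied to each token vector, and the projection layer is omitted; the exact parameterization in Lee et al. (2017) and Tenney et al. (2019) may differ.

```python
import numpy as np

def attention_pool(vectors, span, score_weights):
    """Pool the token vectors inside a half-open span [start, end) into one vector.

    vectors       : (n, d) contextual vectors e_0 .. e_{n-1}
    span          : (start, end) with start < end
    score_weights : (d,) vector producing one attention score per token (assumed form)
    """
    start, end = span
    span_vectors = vectors[start:end]
    scores = span_vectors @ score_weights
    scores = scores - scores.max()                 # stabilize the softmax
    attention = np.exp(scores) / np.exp(scores).sum()
    return attention @ span_vectors                 # weighted sum, shape (d,)

# Toy usage with assumed sizes: 5 tokens, 4 dimensions.
rng = np.random.default_rng(0)
toy_vectors = rng.standard_normal((5, 4))
toy_weights = rng.standard_normal(4)
span_one = attention_pool(toy_vectors, (1, 3), toy_weights)
span_two = attention_pool(toy_vectors, (3, 5), toy_weights)
pair_representation = np.concatenate([span_one, span_two])  # input to the MLP
print(pair_representation.shape)
```

For single-span tasks such as PoS tagging or NER, only the first pooled vector would be passed to the MLP.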

Experimental Results
Results are shown in Table 6.
LAYER 0 vs contextual. As we have already seen in the previous section, codelength shows a drastic difference between the embedding layer (LAYER 0) and contextualized representations: codelengths differ by about a factor of two for most of the tasks. Both compression methods show that even for the randomly initialized model, contextualized representations are better than lexical representations. This is because context-agnostic embeddings do not contain enough information about the task, i.e., MI between labels and context-agnostic representations is smaller than between labels and contextualized representations. Since compression of the labels given the model (i.e., the data component of the code) is limited by the MI between the representations and the labels (Section 2.1), the data component of the codelength is much bigger for the embedding layer than for contextualized representations.
Trained vs random. As expected, codelengths for the randomly initialized model are larger than for the trained one. This is more prominent when not just looking at the bare scores, but comparing compression against context-agnostic representations. For all tasks, compression bounds for the randomly initialized model are closer to those of context-agnostic LAYER 0 than to those of representations from the trained model. This shows that the gain from using context for the randomly initialized model is at most half of that for the trained model.
Note also that randomly initialized layers do not evolve: for all tasks, MDL for layers of the randomly initialized model is the same. Moreover, Table 7 shows that not only total codelength but data and model components of the code for random model layers are also identical. For the trained model, this is not the case: LAYER 2 is worse than LAYER 1 for all tasks. This is one more illustration of the general process explained in Voita et al. (2019a): the way representations evolve between layers is defined by the training objective. For the randomly initialized model, since no training objective has been optimized, no evolution happens.

Related work
Probing classifiers are the most common approach for associating neural network representations with linguistic properties (see Belinkov and Glass (2019) for a survey). Among the works highlighting limitations of standard probes (not mentioned earlier) is the work by Saphra and Lopez (2019), who show that diagnostic classifiers are not suitable for understanding learning dynamics.
In addition to task performance, learning curves have also been used before by Yogatama et al. (2019) to evaluate how quickly a model learns a new task, and by Talmor et al. (2019) to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
An information-theoretic view on analysis of NLP models has been previously attempted in Voita et al. (2019a) when explaining how representations in the Transformer evolve between layers under different training objectives.

Conclusions
We propose information-theoretic probing which measures minimum description length (MDL) of labels given representations. We show that MDL naturally characterizes not only probe quality, but also 'the amount of effort' needed to achieve it (or, intuitively, strength of the regularity in representations with respect to the labels); this is done in a theoretically justified way without manual search for settings. We explain how to easily measure MDL on top of standard probe-training pipelines. We show that results of MDL probing are more informative and stable compared to the standard probes.

Table 8: Experimental results; shown in pairs: linguistic task / control task. Codelength is measured in kbits (variational codelength is given in equation (3), online codelength in equation (4)). h is the probe hidden layer size.

A.2 Random seeds: control task
Results are shown in Figure 5.