Pareto Probing: Trading-Off Accuracy and Complexity

The question of how to probe contextual word representations for linguistic structure in a way that is both principled and useful has seen significant attention recently in the NLP literature. In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume. To measure complexity, we present a number of parametric and non-parametric metrics. Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations -- e.g., why should the non-contextual fastText representations encode more morpho-syntactic information than the contextual BERT representations? These results suggest that common, simplistic probing tasks, such as part-of-speech labeling and dependency arc labeling, are inadequate to evaluate the linguistic structure encoded in contextual word representations. This leads us to propose full dependency parsing as a probing task. In support of our suggestion that harder probing tasks are necessary, our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.


Introduction
Neural networks are a pillar of modern NLP systems. However, their inner workings are poorly understood; indeed, for this reason, they are often referred to as black-box systems (Psichogios and Ungar, 1992; Orphanos et al., 1999; Cauer et al., 2000). This lack of understanding, coupled with the rising adoption of neural NLP systems in both industry and academia, has fomented a rapidly growing literature devoted et al., 2019). One popular method for studying the linguistic content of neural networks is probing, which we define in this work as training a supervised classifier (known as a probe) on top of pretrained models' frozen representations (Alain and Bengio, 2017). By analyzing the classifier's performance, one can assess how much 'knowledge' the representations contain about language. Much work in probing advocates for the need for simple probes (Hewitt and Manning, 2019; Hall Maudslay et al., 2020). Indeed, Alain and Bengio (2017) justify their operationalization of complexity as the restriction of the probe to linear models (as opposed to deep neural networks) by writing: "The task of a deep neural network classifier is to come up with a representation for the final layer that can be easily fed to a linear classifier (i.e. the most elementary form of useful classifier)." Most saliently, Hewitt and Liang (2019) attempt to operationalize complexity in terms of control tasks, which constrain a probe's capacity for memorization. 1 Voita and Titov (2020) follow in this vein with an information-theoretic estimate of complexity: a model's minimum description length.
In opposition to the complexity of a probe is its accuracy, i.e., its ability to perform the target probing task. From an information-theoretic perspective, Pimentel et al. (2020) argue for the use of more complex probes, since they better estimate the amount of mutual information between a representation and the target linguistic property. From a different perspective, Saphra and Lopez (2019) also criticize the indiscriminate use of simple probes, because most neural representations are not estimated with the explicit aim of making information linearly separable; thus, it is unlikely that they will naturally do so, and foolish, perhaps, to expect them to.
This paper proposes to directly acknowledge the existence of a trade-off between complexity and accuracy when considering the development of probes. We argue, in part based on experimental evidence, that naïvely selecting a family of probes either for its complexity or its performance leads to degenerate edge cases; see Fig. 1. We conclude that the nuanced trade-off between accuracy and complexity in probing should thus be treated as a bi-objective optimization problem: One objective encourages low complexity and another encourages high accuracy. We then propose a novel evaluation paradigm for probes. We advocate for Pareto optimal probes, i.e., probes for which no competing probe is both simpler and more accurate. The set of such optimal probes can then be taken in aggregate to form a Pareto frontier, which allows for broader analysis and easier comparison between representations.
We run a battery of probing experiments for part-of-speech labeling and dependency-arc labeling, using both parametric and non-parametric complexity metrics. Our experiments show that if we desire simple probes, then we are forced to conclude that one-hot encoding representations and randomly generated ones almost always encode more linguistic structure than those representations derived from BERT, which is a nonsensical result. On the other hand, seeking the most accurate probes is equivalent to performing NLP task-based research (e.g. part-of-speech tagging) in the classic way. We contend our Pareto curve-based measurements strike a reasonable balance.
1 Hewitt and Liang (2019) define selectivity as the difference between a model's accuracy on a task and its accuracy on a control version of that task. The control versions of the tasks are built by randomly shuffling labels across word types and measure a probe's capacity for memorization. Our non-parametric measures of complexity differ from control tasks; we describe these differences in § 5.
To wrap up our paper, we level a criticism at the probing tasks themselves; we argue that "toyish" probing tasks are not very useful for revealing how much more linguistic information BERT manages to capture than standard baseline representations. With this in mind, we advocate for more challenging probing tasks, e.g., dependency parsing instead of its toyish cousin dependency arc labeling. We find that using actual NLP tasks as probing tasks reveals much more about the advantages BERT provides over non-contextual representations.

Performance and Complexity
We argue in favor of treating probing for linguistic structure in neural representations as a bi-objective optimization problem. On the one hand, we must optimize our probe for high accuracy on our chosen probing task: If we do not directly train the probe to accurately extract the linguistic features from the representation, how else can we determine whether they are implicitly encoded? On the other hand, the received wisdom in the probing community is that probes should be simple (Alain and Bengio, 2017; Hewitt and Manning, 2019): If the probe is an overly complex model, we might ascribe high accuracy on the probing task to the probe itself, meaning the probe has "learned the task" to a large extent. In this section, we argue that a probing framework that does not explicitly take into account the accuracy-complexity trade-off may be easily gamed. Indeed, we demonstrate below how to game accuracy and complexity in turn.

The Nature of Probing Tasks
Most probing tasks are relatively "toy" in nature (Hupkes et al., 2018). 2 For instance, two of the most common probing tasks are part-of-speech labeling (POSL; Hewitt and Liang, 2019; Belinkov et al., 2017) and dependency arc labeling (DAL; Tenney et al., 2019a,b; Voita and Titov, 2020). Both tasks are treated as multi-way classification problems. POSL requires a model to assign a part-of-speech tag to a word in context without modeling the entire sequence of part-of-speech tags. Likewise, DAL requires a model to assign a dependency-arc label to an arc independently of the larger dependency tree. These word-oriented probing approaches force models to rely on information about context indirectly encoded in the feature vectors generated by the probed model. Importantly, both are simplified versions of their structured prediction cousins (part-of-speech tagging and dependency parsing), which require the modeling of entire sentences. Accuracy on POSL and DAL is then considered indicative of the probed representations' "knowledge" of the linguistic structure encoded in the probing task. Limiting explicit access to context therefore allows an analysis constrained to how context is implicitly encoded in a particular representation. Furthermore, because POSL and DAL do not require complex structured prediction models, their simplicity is seen as a virtue from the standpoint of disfavoring complexity (discussed further in § 2.3).

Optimizing for Performance
We first argue that it is problematic to judge a probe either only by its performance on the probing task or only by its complexity. Pimentel et al. (2020) showed that, under a weak assumption, any contextualized representation contains as much information about a linguistic task as the original sentence. They write: "under our operationalization, the endeavour of finding syntax in contextualized embeddings sentences is nonsensical. This is because, under Assumption 1, we know the answer a priori." We agree that under their operationalization probing is nonsensical: purely optimizing for performance does not tell us anything about the representations, but only about the sentence itself.
Researchers, of course, have realized that choosing the most accurate probe is not wise for analysis; see Hewitt and Manning (2019) and the references therein for a good articulation of this point. To compensate for this tension, researchers have imposed explicit restrictions on the complexity of the probe, resulting in wider differences between contextual and non-contextual representations. Indeed, this is the logic behind the study of Hewitt and Liang (2019), who argue that selective probes should be chosen to judge whether the target linguistic property is well encoded in the representations. Relatedly, other researchers have focused on linear classifiers as probes with the explicit reasoning that linear models are simpler than non-linear ones (Alain and Bengio, 2017; Hewitt and Manning, 2019; Hall Maudslay et al., 2020).

Reducing a Probe's Complexity
In § 2.2, we argued that solely optimizing for accuracy does not lead to a reasonable probing framework. Less commonly discussed, however, is that we also cannot directly optimize for simplicity. Let us consider the POSL probing task and the case where we are using a linear model as our probabilistic probe:
$$p(t \mid h) = \mathrm{softmax}(W h)_t \qquad (1)$$
where $t \in T$ is the target, e.g. a universal part-of-speech tag (Petrov et al., 2012), $h \in \mathbb{R}^d$ is a contextual embedding and $W \in \mathbb{R}^{|T| \times d}$ is a linear projection matrix. A natural measure of probe complexity in this framework is the rank of the projection matrix: $\mathrm{rank}(W)$. Indeed, this complexity metric was considered in one of the experiments in Hewitt and Manning (2019) to show that BERT representations strictly dominate ELMo representations for all ranks in their analyzed task. That experiment, though, left out some important baselines, the simplest of which is the encoding of words as one-hot representations. We take inspiration from those experiments and expand upon them (but rely instead on the nuclear norm as a convex relaxation of the matrix rank; see § 4) to produce the more complete plots in Figs. 2 and 3. These results are quite stunning; they show that, if we only cared about representations that simple probes could extract linguistic properties from, then a one-hot encoding of the word types is the best choice.
It is easy to see why the one-hot encoding does so well. For many of the toy probing tasks, the identity of the word is the single most important factor. It seems natural to expect that a low-complexity probe will be unable to exploit much more than a word's identity, so a one-hot embedding is really the best you can do, since the word's identity is trivially encoded. Our point here is that both accuracy and complexity matter and neither can be sensibly optimized without the other.

An Invitation to Pareto Probing
We now advocate for a probing evaluation metric that combines both accuracy and complexity. We argued in § 2 that probe accuracy and complexity exist in a trade-off. Because of this trade-off, we should search for models that are Pareto optimal. A probe is considered Pareto optimal (with respect to a family of probes) if there is no competing probe where both the accuracy is higher and the complexity is lower on the task. The set of Pareto optimal points may be called the Pareto frontier and is generally connected, as is shown in Fig. 2. As can also be seen in Fig. 2, we can compare different representations according to their Pareto frontiers. The set of representations that appear on the Pareto frontier should be sufficient: the other representations are Pareto dominated, since one can improve in one aspect (complexity or accuracy) without sacrificing the other. We call the set of representations which are on the frontier Pareto dominant.
We can also analyze each representation's frontier individually. This notion leads us to a very natural metric for evaluating probes: Pareto hypervolume (PH; Auger et al., 2012). 3 One important technical caveat involving evaluating the hypervolume is that it is undefined when the metric of model complexity for the experiment is unbounded. Thus, it is necessary to restrict model complexity to a bounded interval so that the PH is always finite.
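A minimal sketch of how a Pareto frontier and its hypervolume could be computed from a set of trained probes is given below. It assumes each probe is summarized as a (complexity, accuracy) pair, that complexity is clipped to a bounded interval [0, max_complexity], and that the reference point is (max_complexity, 0); the function names and these conventions are illustrative assumptions, not a description of the exact procedure used for the reported numbers.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (complexity, accuracy); hypothetical summary of one probe


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the probes that no other probe beats on both axes.

    Lower complexity is better, higher accuracy is better.
    """
    # Sort by complexity (ascending); break ties by accuracy (descending).
    ordered = sorted(set(points), key=lambda p: (p[0], -p[1]))
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    for complexity, accuracy in ordered:
        # A point is kept only if it is more accurate than every simpler point.
        if accuracy > best_accuracy:
            frontier.append((complexity, accuracy))
            best_accuracy = accuracy
    return frontier


def pareto_hypervolume(points: List[Point], max_complexity: float) -> float:
    """Area dominated by the frontier, measured against (max_complexity, 0)."""
    frontier = [p for p in pareto_frontier(points) if p[0] <= max_complexity]
    area = 0.0
    for i, (complexity, accuracy) in enumerate(frontier):
        # Each frontier point dominates a strip up to the next frontier point.
        next_complexity = frontier[i + 1][0] if i + 1 < len(frontier) else max_complexity
        area += (next_complexity - complexity) * accuracy
    return area


probes = [(1.0, 0.5), (2.0, 0.4), (3.0, 0.9), (3.0, 0.7)]
frontier = pareto_frontier(probes)
volume = pareto_hypervolume(probes, max_complexity=4.0)
```

Under these assumptions, a representation whose probes reach high accuracy at low complexity obtains a larger area, which is what makes the hypervolume usable as a single summary number per representation.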

Parametric Metrics of Complexity
We consider two types of probe complexity metrics. We term the first parametric complexity, which we discuss in this section. The second type is non-parametric complexity, which we discuss in § 5. For the parametric one, we first require a family of probes, e.g. the family of linear probes, i.e., all those that take the form of eq. (1), without restriction on the representation's dimension d.

Parametric Complexity for Linear Probes
In the case of linear probes, we explore two metrics of parametric complexity: the nuclear norm and rank. The nuclear norm, which in a way measures the "size" of the matrix, is defined as
$$\|W\|_* = \sum_{i=1}^{\min(|T|, d)} \sigma_i(W)$$
where $\sigma_i(W)$ is the $i$-th singular value of $W$.
This yields the following objective for $\lambda \geq 0$, where the sum runs over the $N$ training pairs $(t^{(n)}, h^{(n)})$:
$$-\sum_{n=1}^{N} \log p(t^{(n)} \mid h^{(n)}) + \lambda \|W\|_*$$
Training a probe to minimize this objective is equivalent to trading off its performance (high likelihood on the training data) for a lower complexity (nuclear norm of $W$). This trade-off can be controlled through the hyper-parameter $\lambda$.
As a parametric complexity metric, we also consider the rank of the matrix. One definition of a matrix's rank is the number of non-zero singular values $\sigma_i(W)$. The rank can easily be restricted to a maximum value $r \in \mathbb{N}_+$ by factorizing the matrix as $W = W_l^\top W_r$, where $W_l \in \mathbb{R}^{r \times |T|}$ and $W_r \in \mathbb{R}^{r \times d}$. The nuclear norm is the tightest convex relaxation of the rank (Recht et al., 2010). 4 While low-rank regularization is assumed to produce models that generalize better (Hinton and Van Camp, 1993; Langenberg et al., 2019), contrary to the classic bias-variance tradeoff, Goldblum et al. (2020) found that biasing towards small nuclear norms instead hurts generalization. Furthermore, our probe family consists of linear transformations, which are fed a relatively small number of features and trained with large training sets. As such, we are in an underfitting situation and any regularization should indeed hurt test performance.
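The two parametric complexity metrics can be illustrated with a short NumPy sketch. It assumes the probe is a plain softmax over W h, that tags are integer indices, and that the rank constraint is imposed by a two-matrix factorization whose product has shape |T| x d; the function names, array shapes and toy sizes are hypothetical choices made only for this illustration.

```python
import numpy as np


def nuclear_norm(W: np.ndarray) -> float:
    """Sum of the singular values of W."""
    return float(np.linalg.svd(W, compute_uv=False).sum())


def numerical_rank(W: np.ndarray, tol: float = 1e-8) -> int:
    """Number of singular values above a small tolerance."""
    return int((np.linalg.svd(W, compute_uv=False) > tol).sum())


def probe_loss(W: np.ndarray, H: np.ndarray, tags: np.ndarray, lam: float) -> float:
    """Negative log-likelihood of a linear softmax probe plus lam * nuclear norm.

    H has shape (n, d), W has shape (num_tags, d), tags holds n integer labels.
    """
    logits = H @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(tags)), tags].sum()
    return float(nll + lam * nuclear_norm(W))


# Rank-constrained probe: the product of an (r x num_tags) and an (r x d)
# factor can never have rank above r.
rng = np.random.default_rng(0)
num_tags, dim, r = 17, 64, 3
W_l = rng.normal(size=(r, num_tags))
W_r = rng.normal(size=(r, dim))
W = W_l.T @ W_r  # shape (num_tags, dim)
```

In a training loop, `lam` would be swept over a range of values to trace out probes of different nuclear norms, while `r` would be swept for the rank-constrained family.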

Relation to Minimum Description Length
A recent proposal by Voita and Titov (2020) suggests that minimum description length (MDL; Rissanen, 1978) is a useful approach to the problem of balancing performance and complexity. The idea behind MDL is analogous to that of Bayesian evidence: We have a family of probabilistic models and a prior over those models. The likelihood term tells us how well we have coded the data and the prior term tells us the length of the model's code. If we define our distribution over matrices as
$$p(W) \propto \exp\left(-\frac{\lambda}{2} \|W\|_*^2\right) \qquad (5)$$
we recover our nuclear norm complexity term as the log of the prior. The distribution defined in eq. (5) is mathematically equivalent to the matrix normal distribution. To show this, we note that $-\frac{\lambda}{2} \|W\|_*^2 = -\frac{1}{2} \mathrm{tr}\left[W^\top (\lambda^{-1} I)^{-1} W\right]$ and present the definition of the matrix normal (Gupta and Nagar, 2018, Chapter 2) as
$$p(W \mid M, U, V) = \frac{\exp\left(-\frac{1}{2} \mathrm{tr}\left[V^{-1} (W - M)^\top U^{-1} (W - M)\right]\right)}{(2\pi)^{kd/2} \, |V|^{k/2} \, |U|^{d/2}}$$
where $k = |T|$ and $d$ are the sizes of matrix $W$, $M$ is the matrix's expected value, and $V$ and $U$ are analogous to the covariance matrices of typical Gaussian distributions. By setting the mean to the zero matrix, $V$ to $I$ (the identity matrix) and $U$ to $\lambda^{-1} I$, we recover eq. (5). Naturally, there are many extensions within the MDL framework, e.g. variational coding MDL (Blier and Ollivier, 2018). In the case of non-linear models, Bayesian neural networks (Neal, 2012) are a natural choice. However, a fundamental problem will always remain: the results are dependent on the choice of prior. Indeed, in the simple case of linear probes, we can always "hack" the prior to favor certain probes over others that may not correspond to our intuitions of model complexity. For this reason, we also analyze a set of non-parametric metrics of complexity that do not require the probe user to pre-specify a prior over models.

Non-Parametric Metrics of Complexity
The parametric metrics of model complexity in § 4 have an explicit constraint that the models must belong to the same parametric family. Specifically, they require that we are able to define a penalty (generally dependent on the parameters) that enforces how complex each model should be. In this section, we move away from parametric notions of model complexity to non-parametric metrics.
We opt to work with a notion of non-parametric complexity based on the ease with which a model can memorize training data. These non-parametric measures are rarely explicitly discussed as complexity metrics, although they are intuitive for that purpose, and have become common recently: Zhang et al. (2017) originated this trend by shuffling outputs of image data so the images were no longer predictive of the labels, using this result to illustrate the effective memorization capacity of modern neural networks. The first of our two experiments to obtain non-parametric complexity measures is similar to theirs. We train our probe on a dataset with shuffled labels and measure its accuracy on this training set. We will refer to this complexity metric as the label-shuffled scenario. 5 Neural networks can take advantage of structured input (e.g. real images as opposed to noisy ones) to easily memorize their labels (Zhang et al., 2017). These structured inputs may be easier to represent internally regardless of the outputs, given current theories that early stages of training are committed to memorizing inputs (Arpit et al., 2017). As such, we may also want to analyze a probe's capacity to memorize unstructured input. In the case of language, we can easily remove structure by shuffling the word sequences themselves, creating random Zipfian-distributed noise, which is harder for neural networks to exploit (Liu et al., 2018). By providing probes with unstructured input, we measure a more domain-independent sense of complexity than the ability to map structured inputs to random labels, because the model cannot rely on syntactic patterns when memorizing shuffled training data. We will refer to this second scenario, wherein both labels and inputs are shuffled, as fully shuffled.
The distinction between memorization of real data and memorization of unstructured data is crucial, as experimenters choose the class of probes being learned. A comparison between label-shuffled and fully shuffled compression exposes the degree to which the class of probes employs a bias towards the true input structure in its compression. Similarly, comparisons between different classes of probes can test the same assumed bias.
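The two shuffled scenarios can be sketched as follows. The sketch assumes a corpus stored as sentences of (word, label) pairs, that labels are permuted across all tokens of the training set, and that the fully shuffled variant additionally permutes words across the corpus before any representations are computed; the function names and the exact granularity of the shuffling are illustrative assumptions.

```python
import random
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (word, label) pairs; hypothetical data layout


def label_shuffled(corpus: List[Sentence], seed: int = 0) -> List[Sentence]:
    """Permute labels across all tokens, keeping the sentences intact."""
    rng = random.Random(seed)
    labels = [label for sentence in corpus for _, label in sentence]
    rng.shuffle(labels)
    shuffled = iter(labels)
    return [[(word, next(shuffled)) for word, _ in sentence] for sentence in corpus]


def fully_shuffled(corpus: List[Sentence], seed: int = 0) -> List[Sentence]:
    """Permute the words across the corpus as well, then permute the labels.

    Sentence lengths and unigram frequencies are preserved, but word order
    no longer carries any syntactic structure.
    """
    rng = random.Random(seed)
    words = [word for sentence in corpus for word, _ in sentence]
    rng.shuffle(words)
    shuffled = iter(words)
    reworded = [[(next(shuffled), label) for _, label in sentence] for sentence in corpus]
    return label_shuffled(reworded, seed=seed + 1)


corpus = [[("the", "DET"), ("dog", "NOUN")], [("runs", "VERB")]]
label_only = label_shuffled(corpus)
both = fully_shuffled(corpus)
```

A probe's training accuracy on either shuffled corpus then serves as its complexity score: the better it fits labels that carry no signal, the larger its memorization capacity.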
We highlight that, while our non-parametric complexity metrics permit arbitrary classes of probes to be included in a probe hypothesis space, the selection criteria of possible probes may still reflect the assumed structure of the data, affecting compression and generalization. For example, linear probes reflect an assumption that the information lies in a Euclidean (sub)space; however, this assumption may not be true: Reif  spiral manifold in BERT, while the syntactic distances described by Hewitt and Manning (2019) are Pythagorean in nature. One advantage of these methods is the ability to compare between probe classes, which offers a test of the geometric assumptions behind model selection. 6 Another advantage is that, unlike regularization-based parametric methods, they require no modification of the training procedure and can therefore run much faster.

POSL and DAL Experiments
We present our experimental findings on the previously discussed part-of-speech labeling (POSL) and dependency arc labeling (DAL) probing tasks using both our parametric complexity metrics (§ 4) and the non-parametric ones (§ 5). When investigating POSL, we take the target space T to be the set of universal part-of-speech tags for a specific language. We then train a classifier to predict these POS tags from word representations obtained from the analyzed model (e.g., BERT). Similarly, for DAL, the target space T is defined as the set of dependency arc labels in the language, but we predict these labels from pairs of representations: the two words composing the arc.
6 Another non-parametric method, online coding MDL (Voita and Titov, 2020), can likewise be compared across arbitrary model classes, because its complexity metric is based on probabilities produced and not probe parameters.
We analyze the contextual representations from BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020) and RoBERTa (Liu et al., 2019), noting that ALBERT and RoBERTa are trained on English alone, so we only evaluate their performance on that language. 7 For each of these models, we feed it a sentence and average the output word piece (Wu et al., 2016) representations for each word, as tokenized in the treebank. We further analyze fastText's non-contextual representations (Bojanowski et al., 2017) as well as one-hot and random representations such as those considered by Pimentel et al. (2020). One-hot and random representations map each word type in the training data to a vector we sample from a standard normal distribution (zero mean and unit variance). New representations are sampled on the spot (untrained) for any out-of-vocabulary words. All representations are kept fixed during training, except for one-hot, which are learned with the other network parameters.

Linear Probes with Norm Constraints
For each language-representation-task triple, we train 100 linear probes, 50 optimizing eq.   Fig. 3 show the nuclear norm experiments in other languages. 9 As discussed in § 2.3, optimizing for complexity alone leads to trivial results: in all these languages, one-hot representations would result in the best accuracy when using the nuclear norm complexity metric. We show that, counter-intuitively, fastText and one-hot representations Pareto-dominate BERT on the POSL task in Basque, Finnish and Turkish, producing higher accuracies with probes of any complexity (as defined by their nuclear norms). Thus, from the POSL experiments we cannot conclude BERT has any more syntactic information. In English, the one-hot and ALBERT representations form the Pareto-dominant set; the former in the simple scenario and the latter in the complex scenario.

MLPs and Memorization
When using our non-parametric complexity metrics, we again train a number of classifiers for each language-representation-task triple. The classifiers chosen for this analysis were multilayer perceptrons (MLP) with ReLU non-linearities. We trained 50 MLPs for each language-representation-task triple, sampling the number of layers uniformly from [0, 5], the dropout from [0.0, 0.5], and the hidden size log-uniformly from [2^5, 2^10]. Note that zero layers is a linear probe. Each of these architectures was trained both on the standard training set and on this set's label-shuffled and fully shuffled alternatives. Fig. 4 presents POSL and DAL multilingual results under the non-parametric complexity metric; the right half of Fig. 2 presents English results. 10 The most striking characteristic of the Pareto frontiers is how simple architectures (i.e. with relatively low memorization capacity) achieve as high an accuracy as the more complex ones. This is not surprising, though, when we compare this finding to the parametric ones; there we see linear probes are already almost as good as MLPs on these tasks. We take this as support for our intuition that toyish probing tasks are not very interesting or informative. We discuss this point in the next section.
9 Since rank-constrained results showed a similar trend to the nuclear norm ones, results for other languages were moved into the appendices. The interested reader will also find zoomed-in versions (in the y-axis) of these plots there, as well as Pareto hypervolume tables.
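The hyperparameter sampling described above can be sketched in a few lines. The sketch assumes that the dropout rate is also drawn uniformly, that the log-uniform hidden size is rounded to the nearest integer, and that a single seeded generator is used; these details and the function name are assumptions made for illustration.

```python
import random
from typing import Dict, Union


def sample_mlp_config(rng: random.Random) -> Dict[str, Union[int, float]]:
    """Draw one MLP probe configuration from the search space."""
    return {
        # Integer in {0, ..., 5}; zero layers corresponds to a linear probe.
        "num_layers": rng.randint(0, 5),
        # Assumed uniform over [0.0, 0.5].
        "dropout": rng.uniform(0.0, 0.5),
        # Log-uniform over [2^5, 2^10]: sample the exponent uniformly.
        "hidden_size": int(round(2 ** rng.uniform(5, 10))),
    }


rng = random.Random(0)
configs = [sample_mlp_config(rng) for _ in range(50)]
```

Each sampled configuration would then be trained three times (standard, label-shuffled and fully shuffled training sets), giving one accuracy and two memorization-based complexity scores per architecture.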

The False Promise of Toy Probing Tasks
In § 2.1, we reviewed arguments that researchers have put forth to justify toy tasks, while the argument for toy tasks from a standpoint of model complexity is addressed in § 2.3. Nevertheless, BERT, ELMo and other pre-trained representations rose to fame based on their ability to boost neural models to human-level scores on large, non-trivial tasks, e.g. natural language inference (Liu et al., 2019) and question answering (Lan et al., 2020), with different performance patterns being observed on the toyish probing tasks. As reported by Pimentel et al. (2020), BERT embeddings do not yield substantial improvements over non-contextual-embedding baselines, e.g. fastText, on toyish probing tasks. We reproduce similar experiments, albeit with our methodology, in § 6.2. In the case of POSL, we observe that fastText's embeddings achieve higher accuracy in many cases. In the case of DAL, however, we do observe that BERT leads to relatively small improvements over fastText across a typologically diverse set of languages. This result is not surprising because DAL is a more complex task than POSL: When one probes on simple tasks, models pretrained on more data do not help much. Furthermore, a quick visual analysis of the results in § 6.1 reveals that one can achieve relatively high accuracy on both POSL and DAL with a linear probe. This is confirmed by § 6.2, which shows that simple probes, i.e. probes with less capacity for memorization, result in as high accuracy as complex ones. In fact, we run an extra experiment, shown in Tab. 1, which shows that a trivial dictionary lookup strategy (details are presented in App. B) already achieves relatively high accuracies in POSL in all languages.
We interpret this to mean that current probing tasks are uninteresting, hiding from us the amount of syntactic information contextual representations actually encode. Furthermore, the simplicity of such toyish tasks artificially makes type-level embeddings (e.g. fastText) seem nearly as good as contextual ones.

Dependency Parsing
Following the previous argument, we believe harder probing tasks should be used. We take the lead by looking at dependency parsing, which depends on the whole sentence's context, and is much harder than toyish tasks like POSL and DAL. We train a simplified version of Dozat and Manning's (2017) biaffine parser, removing its power to process context by discarding its LSTM, as we describe in detail in App. A. This parser gives us the probability of a head for each word in a sentence, which allows us to recover the whole dependency tree. We then evaluate these trees using unlabeled attachment score (UAS). For our label-shuffled experiments, we permute the heads per sentence, creating non-tree dependencies. Figs. 1 and 5 present label-shuffled results for this task. These figures are much more interesting than the POSL and DAL ones, showing the expected trade-off between accuracy and complexity. Tab. 2 makes the amount of syntax encoded in contextual representations much clearer when compared to fastText. This is especially true if we compare these results to the Pareto hypervolumes of the POSL and DAL tasks (presented in Tab. 3 in the appendix). We take this experiment to conclude two things: (i) harder tasks are necessary to study neural representations; (ii) contextual representations encode much more knowledge about syntax (as expected) than do non-contextual ones.
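The evaluation and the label-shuffled control for parsing can be sketched as follows. The sketch assumes each sentence is represented by a list of head indices (one per word, with 0 denoting the root), that UAS is micro-averaged over all words, and that no tokens such as punctuation are excluded; the function names and these conventions are illustrative assumptions.

```python
import random
from typing import List


def permute_heads(heads: List[int], rng: random.Random) -> List[int]:
    """Shuffle the head indices within one sentence.

    The result keeps the same multiset of heads but generally no longer
    forms a valid dependency tree.
    """
    shuffled = list(heads)
    rng.shuffle(shuffled)
    return shuffled


def uas(gold: List[List[int]], predicted: List[List[int]]) -> float:
    """Unlabeled attachment score: fraction of words with the correct head."""
    correct = 0
    total = 0
    for gold_heads, predicted_heads in zip(gold, predicted):
        correct += sum(g == p for g, p in zip(gold_heads, predicted_heads))
        total += len(gold_heads)
    return correct / total if total else 0.0


gold = [[2, 0, 2], [0, 1]]
predicted = [[2, 0, 1], [0, 1]]
score = uas(gold, predicted)
control = [permute_heads(heads, random.Random(0)) for heads in gold]
```

Training accuracy (here, UAS) on the permuted heads plays the same role as label-shuffled accuracy in POSL and DAL: it measures how well the parser can memorize attachments that carry no syntactic signal.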

A Closer Look at Model Complexity
This work represents a new entry into a growing literature on taking the capabilities of probes into account when analyzing a model (Hewitt and Liang, 2019; Voita and Titov, 2020; Whitney et al., 2020). The fundamental point we wish to espouse in this paper is that evaluating a probe for linguistic structure is fundamentally asking a question about a trade-off between accuracy and complexity. However, we wish to highlight that evaluating a probe's complexity is a very open problem. Indeed, the larger question of model complexity has been treated for over 50 years in a number of disciplines. In statistics, model complexity is researched in the model selection literature, e.g. the classical techniques of Bayesian information criterion (Schwarz, 1978) and Akaike information criterion (Akaike, 1974). In computer science, learning theorists have introduced the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis, 1971), Pollard's pseudo-dimension (Pollard, 1984), and Rademacher complexity (Bartlett and Mendelson, 2002). Algorithmic information theorists provide Kolmogorov complexity (Kolmogorov, 1963), which is closely related to MDL and encodes the size of the model.
A concrete discussion of complexity requires several distinctions regarding these measures. The first is the object of analysis of the complexity measure, which can be either a model family, i.e., the whole set of functions realizable by a choice of architecture and hyperparameters, or a learned model, which takes into account a specific set of learned parameters. The second is the aspect being analyzed by the measure, which could be, for example, the capacity of the function class (i.e., whether it is possible to set model weights so the architecture represents a specific target function: its hard constraints) or its bias (i.e., the soft constraints that influence whether the training process is likely to guide a model towards this target function).
Importantly, the measure of complexity the scientist employs will impact their scientific findings about how much linguistic structure they read into a neural network's hidden states. For instance, in some of our experiments we regularize the probe directly with a relaxation of a nuclear norm constraint, thus imposing a bias without modifying the total capacity of the model. Meanwhile, our rank-based method controls the capacity of the probe directly, enforcing a model family's complexity. Finally, our non-parametric methods, and selectivity, estimate the capacity of a model family (under a specific hyperparameter choice) by approximating a "hard" function in which labels are randomly assigned. Differences in accuracy between these three complexity measures indicate a complex relationship between implicit regularization, input language structure, and model capacity. In comparison to our computable complexity measures, popular hypothetical notions also vary in how they explore the data's domain space to analyze a model family: Rademacher complexity uses true observations as inputs, but VC dimension considers adversarially selected data, being defined according to the "worst" possible sample allowed in the input domain.
As new techniques for considering the complexity of models emerge, it is critical to develop in parallel tools for reasoning about what aspect and object of "complexity" is really being measured. When one introduces a metric as modeling complexity, it can be explicitly situated within such a taxonomy; these considerations should be made explicit in the presentation. A sufficiently developed theory of probing will reveal not only the information contained in a representation, but the underlying geometry of the representation space, by comparing the performance of different model families. Such developments are left to future work.

Conclusion
In this paper, we argued for a new approach to probing, treating it as a bi-objective optimization problem. It has no single optimal solution, but can be analyzed, under the lens of Pareto efficiency, to arrive at a set of optimal solutions. These Pareto optimal solutions make explicit the trade-off between accuracy and complexity, also permitting a deeper analysis of the probed representations.
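As a minimal sketch of this view, the snippet below extracts the Pareto optimal points from a set of (complexity, accuracy) pairs and computes the area they dominate. The function names, the reference point (`max_complexity`, `min_accuracy`), and the toy numbers are illustrative assumptions; the exact normalization of the hypervolume may differ from the one used in our experiments.

```python
def pareto_front(points):
    """Return the (complexity, accuracy) points not dominated by any other.

    A point dominates another if it is no more complex and no less accurate,
    and strictly better on at least one of the two objectives.
    """
    front = []
    best_acc = float("-inf")
    # Sort by complexity ascending; on ties, higher accuracy comes first.
    for c, a in sorted(set(points), key=lambda p: (p[0], -p[1])):
        if a > best_acc:
            front.append((c, a))
            best_acc = a
    return front

def hypervolume(front, max_complexity, min_accuracy=0.0):
    """Area dominated by a sorted front, up to an assumed reference point."""
    area = 0.0
    for i, (c, a) in enumerate(front):
        next_c = front[i + 1][0] if i + 1 < len(front) else max_complexity
        area += (a - min_accuracy) * (next_c - c)
    return area

# Toy probe results: (complexity, accuracy). Values are made up.
results = [(1, 0.5), (2, 0.4), (3, 0.8), (3, 0.7), (5, 0.8)]
front = pareto_front(results)
print(front)
print(hypervolume(front, max_complexity=5))
```

Here the probe at complexity 2 is dominated by the simpler and more accurate probe at complexity 1, so it does not appear on the front.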
The second part of our paper argues that we need to select harder tasks for the purpose of probing representations for syntactic knowledge. For tasks such as POSL or DAL, which require only shallow notions of syntax, non-contextual representations can do almost as well as contextual ones: pretraining on large amounts of data, or encoding contextual knowledge in the representations, does not help much for these tasks. We then run a battery of experiments on the harder task of dependency parsing; these show that contextual representations indeed provide much more usable syntactic knowledge than non-contextual ones.

Appendices

A Dependency Parser
Our dependency parser is inspired by Dozat and Manning's (2017) biaffine parser. We simplify it, however, e.g., by not giving it an LSTM, to reduce its complexity and restrict its access to context. While looking at whole sentences [h 0 , . . . , h |s| ], we train two MLPs with the same architecture as in § 6.2: one is used to process heads of dependencies, while the other is used for tails. We pass each individual token representation through both.
We further define a biaffine transformation, through which we pass both representations.
Finally, the output of this biaffine projection is then used as logits, which we normalize to get the probabilities of all possible heads for a specific word j.
Such a probability distribution allows us to recover the whole dependency tree.
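The pipeline above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the exact model: the one-hidden-layer ReLU MLP, the particular biaffine form (a bilinear term plus a head-only bias, in the style of Dozat and Manning, 2017), the parameter names `U` and `u`, and the greedy decoding step are all assumptions made for the example.

```python
import numpy as np

def mlp(H, W1, b1, W2, b2):
    """One-hidden-layer MLP applied to every token (row) of H; ReLU is assumed."""
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2

def head_probabilities(H, head_params, tail_params, U, u):
    """Return probs[j, i]: probability that token i is the head of token j."""
    heads = mlp(H, *head_params)  # (n, d) head-side representations
    tails = mlp(H, *tail_params)  # (n, d) tail-side representations
    # Assumed biaffine form: tails[j] @ U @ heads[i] + heads[i] @ u.
    logits = tails @ U @ heads.T + heads @ u
    # Softmax over candidate heads (row-wise), shifted for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 5, 16, 8, 4  # toy sizes, chosen arbitrarily

def init_mlp():
    return (rng.standard_normal((d_in, d_hid)), np.zeros(d_hid),
            rng.standard_normal((d_hid, d_out)), np.zeros(d_out))

H = rng.standard_normal((n, d_in))  # stand-in for frozen token representations
probs = head_probabilities(H, init_mlp(), init_mlp(),
                           rng.standard_normal((d_out, d_out)),
                           rng.standard_normal(d_out))
greedy_heads = probs.argmax(axis=1)  # most probable head for each token
```

Note that picking the most probable head independently for each token, as in `greedy_heads`, does not guarantee a well-formed tree; a maximum spanning tree decoder over the same scores would.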

B Lookup Model
In this section, we present the design of a very simple lookup model for the POSL and DAL tasks. We present the detailed implementation of both models below. Their general idea is to look up an instance's most frequent label in the training set, falling back to overall label frequencies when the instance is not found.
POSL In this task, the lookup model has two behaviors: (i) for a word which appears in the training set, it guesses its most common label; (ii) for an out-of-vocabulary word, it guesses the most frequent label overall in the training set.

DAL In the DAL task, the lookup model has four behaviors: (i) for an arc which appears in the training set, it guesses its most common label; (ii) for an unknown arc, it guesses the most frequent label in the training set for the arc's tail word; (iii) if the tail of the arc is an out-of-vocabulary word, it guesses the most frequent label in the training set for the head word; (iv) finally, if the head is also an out-of-vocabulary word, it guesses the most frequent arc label overall in the training set.
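The two fallback chains can be sketched with plain frequency tables. The class and method names are hypothetical, and two details are assumptions: a tail (or head) counts as out of vocabulary when it was never seen in that role in training, and ties are broken by whichever label was seen first.

```python
from collections import Counter, defaultdict

def most_common(counter):
    """Most frequent key of a non-empty Counter."""
    return counter.most_common(1)[0][0]

class PosLookup:
    def __init__(self, pairs):
        """`pairs` is an iterable of (word, tag) training instances."""
        self.by_word = defaultdict(Counter)
        self.overall = Counter()
        for word, tag in pairs:
            self.by_word[word][tag] += 1
            self.overall[tag] += 1

    def predict(self, word):
        if word in self.by_word:              # (i) known word
            return most_common(self.by_word[word])
        return most_common(self.overall)      # (ii) out-of-vocabulary word

class ArcLookup:
    def __init__(self, triples):
        """`triples` is an iterable of (head, tail, label) training arcs."""
        self.by_arc = defaultdict(Counter)
        self.by_tail = defaultdict(Counter)
        self.by_head = defaultdict(Counter)
        self.overall = Counter()
        for head, tail, label in triples:
            self.by_arc[(head, tail)][label] += 1
            self.by_tail[tail][label] += 1
            self.by_head[head][label] += 1
            self.overall[label] += 1

    def predict(self, head, tail):
        # Try (i) the full arc, (ii) the tail word, (iii) the head word.
        for table, key in ((self.by_arc, (head, tail)),
                           (self.by_tail, tail),
                           (self.by_head, head)):
            if key in table:
                return most_common(table[key])
        return most_common(self.overall)      # (iv) nothing known

pos = PosLookup([("the", "DET"), ("dog", "NOUN"), ("run", "VERB"),
                 ("run", "NOUN"), ("run", "VERB"), ("cat", "NOUN")])
arcs = ArcLookup([("eats", "dog", "nsubj"), ("eats", "bone", "obj"),
                  ("dog", "the", "det"), ("bone", "a", "det"),
                  ("sees", "dog", "obj"), ("sees", "dog", "obj")])
print(pos.predict("run"), pos.predict("zebra"))
print(arcs.predict("barks", "dog"), arcs.predict("foo", "bar"))
```

The toy sentences are invented for the example; the models need only label counts, so no representation of the input is involved.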

C Detailed results
In this section, we present further results which did not fit into the main text. We initially present results in dependency parsing, using the Nuclear Norm as our parametric complexity metric. Figs. 7 and 8 show that, again, the one-hot representation produces the best results in highly constrained scenarios (i.e., with very simple probes). Furthermore, comparing these results with § 7.2, we see that linear probes cannot do as well as MLPs in this task, suggesting it is indeed harder. Fig. 6 presents fully shuffled results for dependency parsing in English.
Comparing it to § 7.2, we see the probes' capacity to memorize is much smaller on unstructured input. Tab. 3 presents the Pareto hypervolume for all the analyzed models, in all languages, for POSL and DAL. Analyzing this table, we again see that contextual representations do not improve over the non-contextual ones on these tasks by much, even producing worse results in some.
Finally, Fig. 9 presents results using the Rank parametric complexity metric, while Fig. 10 presents zoomed-in (in the y-axis) results for the Nuclear Norm parametric complexity metric.

Figure 8: Pareto curves on dependency parsing with the Nuclear Norm complexity metric in a diverse set of languages. The x-axis corresponds to the complexity, while the y-axis measures the probes' performance on the task. Since the nuclear norm is unbounded, we capped it at 700 in the parsing task. Probing the representations: BERT, fastText, one-hot, and random.

Figure 10: Zoomed-in (in the y-axis) Pareto curves on POSL and DAL with the Nuclear Norm complexity metric. The x-axis corresponds to the complexity, while the y-axis measures the probes' performance on the task. In this plot we do not cap the nuclear norm, showing its full range. Probing the representations: BERT, fastText, one-hot, and random.