A Tale of a Probe and a Parser

Measuring what linguistic information is encoded in neural models of language has become popular in NLP. Researchers approach this enterprise by training "probes": supervised models designed to extract linguistic structure from another model's output. One such probe is the structural probe (Hewitt and Manning, 2019), designed to quantify the extent to which syntactic information is encoded in contextualised word representations. The structural probe has a novel design, unattested in the parsing literature, the precise benefit of which is not immediately obvious. To explore whether syntactic probes would do better to make use of existing techniques, we compare the structural probe to a more traditional parser with an identical lightweight parameterisation. The parser outperforms the structural probe on UUAS in seven of nine analysed languages, often by a substantial amount (e.g. by 11.1 points in English). Under a second, less common metric, however, the trend is reversed: the structural probe outperforms the parser. This raises the question: which metric should we prefer?


Introduction
Recently, unsupervised sentence encoders such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have become popular within NLP. These pre-trained models boast impressive performance when used in many language-related tasks, but this gain has come at the cost of interpretability. A natural question to ask, then, is whether these models encode the traditional linguistic structures one might expect, such as part-of-speech tags or dependency trees. To this end, researchers have invested in the design of diagnostic tools commonly referred to as probes (Alain and Bengio, 2017; Conneau et al., 2018; Hupkes et al., 2018; Poliak et al., 2018; Marvin and Linzen, 2018; Niven and Kao, 2019). Probes are supervised models designed to extract a target linguistic structure from the output representation learned by another model.
Based on the authors' reading of the probing literature, there is little consensus on where to draw the line between probes and models for performing a target task (e.g. a part-of-speech tagger versus a probe for identifying parts of speech). The main distinction appears to be one of researcher intent: probes are, in essence, a visualisation method (Hupkes et al., 2018). Their goal is not to best the state of the art, but rather to indicate whether certain information is readily available in a model. Probes should not "dig" for information; they should just expose what is already present. Indeed, a sufficiently expressive probe with enough training data could learn any task (Hewitt and Liang, 2019), but this tells us nothing about a representation, so it is beside the point. For this reason, probes are made "simple" (Liu et al., 2019), which usually means they are minimally parameterised.
Syntactic probes, then, are designed to measure the extent to which a target model encodes syntax. A popular example is the structural probe (Hewitt and Manning, 2019), used to compare the syntax that is decodable from different contextualised word embeddings. Rather than adopting methodology from the parsing literature, this probe utilises a novel approach for syntax extraction. However, the precise motivation for this novel approach is not immediately clear, since it has nothing to do with model complexity, and appears orthogonal to the goal of a probe. Probes are designed to help researchers understand what information exists in a model, and unfamiliar ways of measuring this information may obscure whether we are actually gaining an insight about the representation we wish to examine, or about the tool of measurement itself.
Figure 1: Example of an undirected dependency tree for the sentence "My displeasure in everything displeases me". We observe that the syntactic distance between displeases and everything is 2 (the red path).
Using the structural probe as a case study, we explore whether there is merit in designing models specifically for the purpose of probing: that is, whether we should distinguish between the fundamental design of probes and models for performing an equivalent task, as opposed to just comparing their simplicity. We pit the structural probe against a simple parser that has the exact same lightweight parameterisation, but instead employs a standard loss function for parsing. Experimenting on multilingual BERT (Devlin et al., 2019), we find that in seven of nine typologically diverse languages studied (Arabic, Basque, Czech, English, Finnish, Japanese, Korean, Tamil, and Turkish), the parser boosts UUAS dramatically; for example, we observe an 11.1-point improvement in English.
In addition to using UUAS, Hewitt and Manning (2019) also introduce a new metric: the correlation of pairwise distance predictions with the gold standard. We find that the structural probe outperforms the more traditional parser substantially in terms of this new metric, but it is unclear why this metric matters more than UUAS. In our discussion, we contend that, unless a convincing argument to the contrary is provided, traditional metrics are preferable. Justifying metric choice is of central importance for probing, lest we muddy the waters with a preponderance of ill-understood metrics.

Syntactic Probing Using Distance
Here we introduce syntactic distance, which we will later train a probe to approximate.

Syntactic Distance
The syntactic distance between two words in a sentence is, informally, the number of steps between them in an undirected parse tree. Let $w = w_1 \cdots w_n$ be a sentence of length $n$. A parse tree $t$ belonging to the sentence $w$ is an undirected spanning tree of $n$ vertices (with a separate root as an $(n+1)$th vertex), each representing a word in the sentence $w$. The syntactic distance between two words $w_i$ and $w_j$, denoted $\Delta_t(w_i, w_j)$, is defined as the length of the shortest path from $w_i$ to $w_j$ in the tree $t$, where each edge has weight 1. Note that $\Delta_t(\cdot, \cdot)$ is a distance in the technical sense of the word: it is non-negative, symmetric, and satisfies the triangle inequality.
Tree Extraction. Converting from syntactic distance to a syntactic tree representation (or vice versa) is trivial and deterministic:
Proposition 1. There is a bijection between syntactic distance and undirected spanning trees.
Proof. Suppose we have the syntactic distances $\Delta_t(w_i, w_j)$ for an unknown, undirected spanning tree $t$. We may uniquely recover that tree by constructing a graph with an edge between $w_i$ and $w_j$ iff $\Delta_t(w_i, w_j) = 1$. (This analysis also holds if we have access to only the ordering of the distances between all $|w|^2$ pairs of words, rather than the perfect distance calculations; in that case, the minimum spanning tree could be computed, e.g. with Prim's algorithm.) On the other hand, if we have an undirected spanning tree $t$ and wish to recover the syntactic distances, we only need to compute the shortest path between each pair of words, e.g. with Floyd-Warshall, to yield $\Delta_t(\cdot, \cdot)$ uniquely.
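The round trip in Proposition 1 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the separate root vertex is omitted, and the edge list for the Figure 1 sentence is an assumed analysis chosen to be consistent with the stated distance of 2 between displeases and everything.

```python
from itertools import product

def tree_to_distances(n, edges):
    """All-pairs path lengths in an unweighted tree, via Floyd-Warshall."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for i, j in edges:
        dist[i][j] = dist[j][i] = 1
    # product() varies its first element slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def distances_to_tree(dist):
    """Recover the tree: keep exactly the pairs at syntactic distance 1."""
    n = len(dist)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] == 1}

words = ["My", "displeasure", "in", "everything", "displeases", "me"]
# Assumed undirected edges (i < j); not taken from the paper's figure data.
edges = {(0, 1), (1, 3), (2, 3), (1, 4), (4, 5)}

dist = tree_to_distances(len(words), edges)
recovered = distances_to_tree(dist)

print(dist[4][3])          # distance between "displeases" and "everything": 2
print(recovered == edges)  # True: the mapping is invertible
```

Floyd-Warshall is cubic in sentence length; for a tree, a breadth-first search from each word would give the same distances more cheaply.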

Probe, Meet Parser
In this section, we introduce a popular syntactic probe and a more traditional parser.

The Structural Probe
Hewitt and Manning (2019) introduce a novel method for approximating the syntactic distance $\Delta_t(\cdot, \cdot)$ between any two words in a sentence. They christen their method the structural probe, since it is intended to uncover latent syntactic structure in contextual embeddings. To do this, they define a parameterised distance function whose parameters are to be learned from data. For a word $w_i$, let $h_i \in \mathbb{R}^d$ denote its contextual embedding, where $d$ is the dimensionality of the embeddings from the model we wish to probe, such as BERT. Hewitt and Manning (2019) define the parameterised distance function
$$d_B(h_i, h_j) = \sqrt{(h_i - h_j)^\top B^\top B \,(h_i - h_j)} \tag{1}$$
where $B \in \mathbb{R}^{r \times d}$ is to be learned from data, and $r$ is a user-defined hyperparameter. The matrix $B^\top B$ is positive semi-definite and has rank at most $r$ (footnote 3). The goal of the structural probe, then, is to find $B$ such that the distance function $d_B(\cdot, \cdot)$ best approximates $\Delta_t(\cdot, \cdot)$. If we organise our training data into pairs, each consisting of a gold tree $t$ and its corresponding sentence $w$, we can then define the local loss function as
$$\ell(B \mid \langle t, w \rangle) = \sum_{i=1}^{|w|} \sum_{j=i+1}^{|w|} \left| \Delta_t(w_i, w_j) - d_B(h_i, h_j) \right| \tag{2}$$
which is then averaged over the entire training set of $N$ pairs to give the global objective
$$\mathcal{L}(B) = \sum_{k=1}^{N} \frac{1}{|w^{(k)}|^2} \, \ell(B \mid \langle t^{(k)}, w^{(k)} \rangle) \tag{3}$$
Dividing the contribution of each local loss by the square of the length of its sentence (the $|w^{(k)}|^2$ factor in the denominator) ensures that each sentence makes an equal contribution to the overall objective, to avoid a bias towards the effect of longer sentences. This global loss can be minimised computationally using stochastic gradient descent (footnote 4).
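As a rough sketch of how the probe's distance and loss fit together, the snippet below computes $d_B$ as the Euclidean norm of $B(h_i - h_j)$ (which is what a positive semi-definite $B^\top B$ implies) and a local loss that sums absolute differences to the gold distances over word pairs. The absolute-difference form, the function names, and the toy numbers are assumptions made for illustration; no gradient step is shown.

```python
import math

def probe_distance(B, h_i, h_j):
    """d_B(h_i, h_j) = ||B (h_i - h_j)||_2 for an r x d matrix B (list of rows)."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    projected = [sum(b * d for b, d in zip(row, diff)) for row in B]
    return math.sqrt(sum(p * p for p in projected))

def probe_local_loss(B, embeddings, gold_dist):
    """Sum over word pairs i < j of |gold distance - predicted distance|."""
    n = len(embeddings)
    return sum(
        abs(gold_dist[i][j] - probe_distance(B, embeddings[i], embeddings[j]))
        for i in range(n)
        for j in range(i + 1, n)
    )

def probe_global_loss(B, data):
    """Sum of local losses, each divided by the squared sentence length."""
    return sum(probe_local_loss(B, H, D) / len(H) ** 2 for H, D in data)

# Toy setting (assumed values): d = 3, r = 2, so B discards the third dimension.
B = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
H = [[0.0, 0.0, 5.0], [1.0, 0.0, -2.0], [1.0, 1.0, 0.0]]
gold = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]  # chain tree 0 - 1 - 2

loss = probe_local_loss(B, H, gold)
print(loss)  # 2 - sqrt(2): only the (0, 2) pair is mispredicted
```

The example also shows why the rank $r$ matters: whatever the third embedding dimension contains, this particular $B$ cannot use it.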

A Structured Perceptron Parser
Given that probe simplicity seemingly refers to parameterisation rather than the design of the loss function, we infer that swapping the loss function should not be understood as increasing model complexity. With that in mind, here we describe an alternative to the structural probe which learns parameters for the same function $d_B$: a structured perceptron dependency parser, originally introduced in McDonald et al. (2005). This parser's loss function works not by predicting every pairwise distance, but instead by predicting the tree based on the current estimation of the distances between each pair of words, then comparing the total weight of that tree to the total weight of the gold tree (based on the current distance predictions). The local perceptron loss is defined as
$$\ell(B \mid \langle t, w \rangle) = \sum_{(i,j) \in t} d_B(h_i, h_j) \;-\; \min_{t' \in \mathcal{T}(w)} \sum_{(i',j') \in t'} d_B(h_{i'}, h_{j'}) \tag{4}$$
where $\mathcal{T}(w)$ denotes the set of undirected spanning trees over $w$. When the predicted minimum spanning tree $t'$ perfectly matches the gold tree $t$, each edge will cancel and this loss will equal zero. Otherwise, it will be positive, since the sum of the predicted distances for the edges in the gold tree will necessarily exceed the sum in the minimum spanning tree. The local losses are summed into a global objective:
$$\mathcal{L}(B) = \sum_{k=1}^{N} \ell(B \mid \langle t^{(k)}, w^{(k)} \rangle) \tag{5}$$
This quantity can also be minimised, again, with a stochastic gradient method.
Though both the structural probe and the structured perceptron parser may seem equivalent under Prop. 1, there is a subtle but important difference. To minimise the loss in eq. (2), the structural probe needs to encode (in $d_B$) the rank-ordering of the distances between each pair of words within a sentence. This is not necessarily the case for the structured perceptron. It could minimise the loss in eq. (4) by just encoding each pair of words as "near" or "far", and Prim's algorithm will do the rest.
Footnote 3: To see this, let $x \in \mathbb{R}^d$ be a vector. Then, we have that $x^\top B^\top B x = (Bx)^\top (Bx) = \|Bx\|_2^2 \geq 0$.
Footnote 4: Hewitt and Manning found that replacing $d_B(\cdot, \cdot)$ in eq. (2) with $d_B(\cdot, \cdot)^2$ yielded better empirical results, so we do the same. For a discussion of this, refer to App. A.1 in Hewitt and Manning. Coenen et al. (2019) later offer a theoretical motivation, based on embedding trees in Euclidean space.

Experimental Setup
Embeddings and Data. We experiment on the contextual embeddings in the final hidden layer of the pre-trained multilingual release of BERT (Devlin et al., 2019), and train the models on the Universal Dependencies (Nivre et al., 2016) treebanks (v2.4). This allows our analysis to be multilingual. More specifically, we consider eight typologically diverse languages (Arabic, Basque, Czech, Finnish, Japanese, Korean, Tamil, and Turkish), plus English.
Decoding the Predicted Trees. Having trained a model to find a $d_B(\cdot, \cdot)$ that approximates $\Delta_t(\cdot, \cdot)$, it is trivial to decode test sentences into trees (see Prop. 1). For an unseen sentence $w = w_1 \cdots w_n$, we compute the $n \times n$ pairwise distance matrix $D$:
$$D_{uv} = d_B(h_u, h_v) \tag{6}$$
We can then compute the predicted tree $t'$ from $D$ using Prim's algorithm, which returns the minimum spanning tree from the predicted distances.
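A minimal sketch of the decoding step follows, assuming $D$ is a dense symmetric matrix given as nested lists. The same routine yields the perceptron loss described earlier: the gold tree's weight under $D$ minus the weight of the minimum spanning tree. Function names and the toy matrix are hypothetical.

```python
def prim_mst(D):
    """Minimum spanning tree of a dense symmetric matrix, as a set of (i, j), i < j."""
    n = len(D)
    # best[v] = (cost, u): cheapest known edge joining v to the growing tree.
    best = {v: (D[0][v], 0) for v in range(1, n)}
    edges = set()
    while best:
        v = min(best, key=lambda x: best[x][0])
        _, u = best.pop(v)
        edges.add((min(u, v), max(u, v)))
        for w in best:  # relax edges out of the newly added vertex
            if D[v][w] < best[w][0]:
                best[w] = (D[v][w], v)
    return edges

def tree_weight(D, edges):
    return sum(D[i][j] for i, j in edges)

def perceptron_loss(D, gold_edges):
    """Gold tree weight minus MST weight; zero when the MST is the gold tree."""
    return tree_weight(D, gold_edges) - tree_weight(D, prim_mst(D))

gold_edges = {(0, 1), (1, 2), (2, 3)}

# A "near"/"far" matrix: 1 for gold edges, 5 for every other pair.
near_far = [[0, 1, 5, 5],
            [1, 0, 1, 5],
            [5, 1, 0, 1],
            [5, 5, 1, 0]]
print(prim_mst(near_far) == gold_edges)       # True
print(perceptron_loss(near_far, gold_edges))  # 0

# Same matrix, but words 0 and 2 are wrongly predicted to be very close.
wrong = [row[:] for row in near_far]
wrong[0][2] = wrong[2][0] = 0.5
print(perceptron_loss(wrong, gold_edges))     # 0.5
```

The first matrix carries no information about rank order beyond "near" versus "far", yet decoding still recovers the gold tree, which is the asymmetry between the two losses noted above.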

Experiments
To compare the performance of the models, we use both metrics from Hewitt and Manning (2019), plus a new variant of the second.
UUAS. The undirected unlabeled attachment score (UUAS) is a standard metric in the parsing literature, which reports the percentage of correctly identified edges in the predicted tree.
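For a single sentence, UUAS reduces to a set intersection over undirected edges. The helper below is a hypothetical sketch; how scores are aggregated over a corpus, and whether any tokens are excluded, is not specified here and is left out.

```python
def uuas(predicted_edges, gold_edges):
    """Percentage of gold edges present in the prediction, ignoring direction."""
    pred = {frozenset(e) for e in predicted_edges}
    gold = {frozenset(e) for e in gold_edges}
    return 100.0 * len(pred & gold) / len(gold)

gold_tree = [(0, 1), (1, 2), (2, 3)]
predicted_tree = [(1, 0), (0, 2), (2, 3)]  # (1, 0) matches (0, 1); (0, 2) is wrong

score = uuas(predicted_tree, gold_tree)
print(round(score, 1))  # 66.7
```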
DSpr. The second metric is the Spearman rank-order correlation between the predicted distances, which are output from $d_B$, and the gold-standard distances (computable from the gold tree using the Floyd-Warshall algorithm). Hewitt and Manning term this metric distance Spearman (DSpr). While UUAS measures whether the model captures edges in the tree, DSpr considers pairwise distances between all vertices in the tree, even those which are not connected in a single step.
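To make the contrast with UUAS concrete, here is a small stdlib-only sketch of Spearman correlation (Pearson correlation of average ranks) applied to the upper triangle of two distance matrices. It treats one sentence as a single list of pairs; the way correlations are grouped and averaged over a corpus is an evaluation detail not reproduced here, and all names and numbers are illustrative.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def upper_triangle(M):
    n = len(M)
    return [M[i][j] for i in range(n) for j in range(i + 1, n)]

# Gold distances for the chain tree 0 - 1 - 2 - 3.
gold = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
# A "near"/"far" prediction: its minimum spanning tree is the gold tree.
near_far = [[0, 1, 5, 5], [1, 0, 1, 5], [5, 1, 0, 1], [5, 5, 1, 0]]

rho = spearman(upper_triangle(near_far), upper_triangle(gold))
print(rho < 1.0)  # True: a perfect tree can still score below 1 on DSpr
```

In this toy case the predicted matrix would reach 100% UUAS, yet its rank correlation with the gold distances falls short of 1 because it does not order the "far" pairs.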
$\mathrm{DSpr}_{P+FW}$. As a final experiment, we run DSpr again, but first pass each pairwise distance matrix $D$ through Prim's algorithm (to recover the predicted tree) and then through Floyd-Warshall (to recover a new distance matrix, with distances calculated based on the predicted tree). This post-processing would convert a "near"-"far" matrix encoding to a precise rank-order one. This should positively affect the results, in particular for the parser, since that is trained to predict trees which result from the pairwise distance matrix, not the pairwise distance matrix itself.

Results
This section presents results for the structural probe and structured perceptron parser.
UUAS Results. Figure 2a presents UUAS results for both models. The parser is the highest performing model on seven of the nine languages. In many of these the difference is substantial: in English, for instance, the parser outperforms the structural probe by 11.1 UUAS points.
DSpr Results. The DSpr results (Figure 2b) show the opposite trend: the structural probe outperforms the parser on all languages. The parser performs particularly poorly on Japanese and Arabic, which is surprising, given that these had the second and third largest sets of training data for BERT respectively (refer to Table 1 in the appendices). We speculate that this may be because in the treebanks used, Japanese and Arabic have a longer average sentence length than other languages.
$\mathrm{DSpr}_{P+FW}$ Results. Following the post-processing step, the difference in DSpr (shown in Figure 3) is far less stark than previously suggested: the mean difference between the two across all nine languages is just 0.0006 (in favour of the parser). Notice in particular the improvement for both Arabic and Japanese. Where previously (in the vanilla DSpr) the structured perceptron vastly underperformed, the post-processing step closes the gap almost entirely. Though Prop. 1 implies that we do not need to consider the full pairwise output of $d_B$ to account for global properties of the tree, this is not totally borne out in our empirical findings, since we do not see the same trend in $\mathrm{DSpr}_{P+FW}$ as we do in UUAS. If we recover the gold tree, we will have a perfect correlation with the true syntactic distance, but we do not always recover the gold tree (the UUAS is less than 100%), and therefore the errors the parser makes are pronounced.

Discussion: Probe v. Parser
Although we agree that probes should be somehow more constrained in their complexity than models designed to perform well on tasks, we see no reason why being a "probe" should necessitate fundamentally different design choices. It seems clear from our results that how one designs a probe has a notable effect on the conclusions one might draw about a representation. Our parser was trained to recover trees (so it is more attuned to UUAS), whilst the structural probe was trained to recover pairwise distances (so it is more attuned to DSpr). Viewed this way, our results are not surprising in the least. The fundamental question for probe designers, then, is which metric best captures a linguistic structure believed to be a property of a given representation (in this case, syntactic dependency). We suggest that probing research should focus more explicitly on this question: on the development and justification of probing metrics. Once a metric is established and well motivated, a lightweight probe can be developed to determine whether that structure is present in a model.
If proposing a new metric, however, the burden of proof lies with the researcher to articulate and demonstrate why it is worthwhile. Moreover, this process of exploring which details a new metric is sensitive to (and comparing with existing metrics) ought not to be conflated with an analysis of a particular model (e.g. BERT). It should be clear whether the results enable us to draw conclusions about a model, or about a means of analysing one.
For syntactic probing, there is certainly no a priori reason why one should prefer DSpr to UUAS. If anything, we tentatively recommend UUAS, pending further investigation. The $\mathrm{DSpr}_{P+FW}$ results show no clear difference between the models, whereas UUAS exhibits a clear trend in favour of the parser, suggesting that it may be easier to recover pairwise distances from a good estimate of the tree than vice versa. UUAS also has the advantage that it is well described in the literature (and, in turn, well understood by the research community).
According to UUAS, existing methods were able to identify more syntax in BERT than the structural probe. In this context, though, we use these results not to give kudos to BERT, but to argue that the perceptron-based parser is a better tool for syntactic probing. Excluding differences in parameterisation, the line between what constitutes a probe and what constitutes a model designed for a particular task is awfully thin, and when it comes to syntactic probing, a powerful probe seems to look a lot like a traditional parser.

Conclusion
We advocate for the position that, beyond some notion of model complexity, there should be no inherent difference between the design of a probe and a model designed for a corresponding task. We analysed the structural probe (Hewitt and Manning, 2019), and showed that a simple parser with an identical lightweight parameterisation was able to identify more syntax in BERT in seven of nine compared languages under UUAS. However, the structural probe outperformed the parser on a novel metric proposed in Hewitt and Manning (2019), bringing to attention a broader question: how should one choose metrics for probing? In our discussion, we argued that if one is to propose a new metric, they should clearly justify its usage.