A Non-Linear Structural Probe

Probes are models devised to investigate the encoding of knowledge—e.g. syntactic structure—in contextual representations. Probes are often designed for simplicity, which has led to restrictions on probe design that may not allow for the full exploitation of the structure of encoded information; one such restriction is linearity. We examine the case of a structural probe (Hewitt and Manning, 2019), which aims to investigate the encoding of syntactic structure in contextual representations through learning only linear transformations. By observing that the structural probe learns a metric, we are able to kernelize it and develop a novel non-linear variant with an identical number of parameters. We test on 6 languages and find that the radial-basis function (RBF) kernel, in conjunction with regularization, achieves a statistically significant improvement over the baseline in all languages—implying that at least part of the syntactic knowledge is encoded non-linearly. We conclude by discussing how the RBF kernel resembles BERT’s self-attention layers and speculate that this resemblance leads to the RBF-based probe’s stronger performance.


Introduction
Probing has been widely used in an effort to better understand what linguistic knowledge may be encoded in contextual word representations such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018). These probes tend to be designed with simplicity in mind and with the intent of revealing what linguistic structure is encoded in an embedding, rather than simply learning to perform an NLP task (Hewitt and Liang, 2019; Zhang and Bowman, 2018; Voita and Titov, 2020). This preference for simplicity has often led researchers to place restrictions on probe designs that may not allow them to fully exploit the structure in which information is encoded (Saphra and Lopez, 2019; Pimentel et al., 2020b,a). In particular, it has led many researchers to advocate the use of linear probes over non-linear ones (Alain and Bengio, 2017).
This paper treats and expands upon the structural probe of Hewitt and Manning (2019), who crafted a custom probe with the aim of investigating the encoding of sentence syntax in contextual representations. They treat probing for syntax as a distance learning problem: they learn a linear transformation that warps the space such that two words that are syntactically close to one another (in terms of distance in a dependency tree) should have contextual representations whose Euclidean distance is small. This linear approach performs well, but the restriction to learning only linear transformations seems arbitrary. Why should it be the case that this information would be encoded linearly within the representations?
In this paper, we recast Hewitt and Manning (2019)'s structural probing framework as a general metric learning problem. This reduction allows us to take advantage of a wide variety of non-linear extensions, based on kernelization, proposed in the metric learning literature (Kulis, 2013). These extensions lead to probes with the same number of parameters but with increased expressivity.
By exploiting a kernelized extension, we are able to directly test whether a structural probe capable of learning non-linear transformations improves performance. Empirically, we do find that non-linearity helps: a structural probe based on a radial-basis function (RBF) kernel significantly improves performance over a linear structural probe in all 6 languages tested. We then perform an analysis of BERT's attention, arguing that it roughly approximates an RBF kernel. As such, it is not surprising that the syntactic information in BERT representations is more accessible with this specific non-linear transformation. We conclude that kernelization is a useful tool for analyzing contextual representations, enabling us to run controlled experiments and investigate the structure in which information is encoded.

The Structural Probe
Hewitt and Manning (2019) introduce the structural probe, a novel model designed to probe for syntax in contextual word representations. We review their formulation here and build upon it in §4. A sentence w lives in a space V*, defined here as the Kleene closure of a (potentially open) vocabulary V. The syntactic distance Δ_ij between any two words in a sentence w is the number of steps needed to go from one word to the other while walking in the sentence's syntactic tree. More formally, if we have a dependency tree t (a tree on n + 1 nodes) of a sentence w of length n, we define Δ_ij as the length of the shortest path in t between w_i and w_j; this may be computed, for example, with the Floyd–Warshall algorithm. Contextual representations of a sentence w are a sequence of vectors h_i ∈ R^{d_1} that encode some linguistic knowledge about the sequence. In the case of BERT, we have

    h_1, ..., h_n = BERT(w)    (1)

Here, the goal of probing is to evaluate whether the contextual representations capture the syntax in a sentence. In the case of the structural probe, the goal is to see whether the syntactic distance between any two words can be approximated by a learned, linear distance function:

    d_B(h_i, h_j)^2 = (B(h_i − h_j))^T (B(h_i − h_j))    (2)

where B ∈ R^{d_2 × d_1} is a linear projection matrix. That is to say, they seek a linear transformation such that the transformed contextual representations relate to one another roughly as their corresponding words do in the dependency tree. To learn this probe, Hewitt and Manning minimize the following per-sentence objective with respect to B through stochastic gradient descent:

    min_B (1/n^2) Σ_{i,j} |Δ_ij − d_B(h_i, h_j)^2|    (3)

This simply minimizes the difference between the syntactic distances obtained from the dependency tree and the distances between the vector pairs under the learned transformation. From the pairwise distances predicted by the probe, Prim's (1957) algorithm can be used to recover the one-best undirected dependency tree.
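The probe's distance and per-sentence objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; `B`, `H` (the n × d_1 matrix of contextual representations), and `delta` (the n × n matrix of tree distances) are hypothetical inputs.

```python
import numpy as np

def probe_sq_distance(B, h_i, h_j):
    # Squared distance between two word vectors under the linear probe (eq. 2).
    diff = B @ (h_i - h_j)
    return diff @ diff

def sentence_loss(B, H, delta):
    # Per-sentence objective (eq. 3): mean absolute error between tree
    # distances delta[i, j] and the probe's squared distances.
    n = H.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            loss += abs(delta[i, j] - probe_sq_distance(B, H[i], H[j]))
    return loss / n**2
```

In practice one would minimize `sentence_loss` over a treebank with stochastic gradient descent, as described above.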

Kernelized Metric Learning
The restriction to a linear transformation may hinder us from uncovering some of the syntactic structure encoded in the contextual representations. Indeed, there is no reason a priori to expect that BERT encodes its knowledge in a fashion that is specifically accessible to a linear model. However, if we were to introduce non-linearity by using a neural probe, for example, we would have to pit a model with very few parameters (the linear model) against one with very many (the neural network); this comparison is not fair and also goes against the spirit of designing simple probes. To preclude the need for a neural probe, we instead turn to a kernelized probe.
The key insight is that the structural probe reduces the problem of probing for linguistic structure to that of metric learning (Kulis, 2013). This can be clearly seen in eq. (3), where the probe learns a distance metric between two representations in such a way that it matches the syntactic one. Recognizing this relationship allows us to take advantage of established techniques from the metric learning literature to improve the performance of the probe without increasing its complexity, e.g. through kernelization.

The "Kernel Trick" for Distances
Many algorithms in machine learning, e.g. support vector machines and k-means, can be kernelized (Schölkopf and Smola, 2002), thus allowing linear models to be adapted into non-linear ones. Expanding on a classic result (Schoenberg, 1938), Schölkopf (2001) shows that any positive semi-definite (PSD) kernel can be used to construct a distance in a Hilbert space H. Formally, the result states that for any PSD kernel κ : X × X → R_≥0, there exists a feature map φ : X → H such that

    ||φ(x) − φ(y)||²_H = κ(x, x) − 2κ(x, y) + κ(y, y)    (4)

This generalizes eq. (2) to yield a new, non-linear distance metric. It means that we can achieve the effect of using some non-linear feature map φ without ever having to specify it: we need only specify a kernel function and perform our calculations with this kernelized distance metric. Importantly, as opposed to deep neural probes, this learnable metric has an identical number of parameters to the original. 1
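The identity above gives a distance in feature space without ever materializing φ. A minimal sketch (the function and kernel names are ours, for illustration):

```python
def kernel_sq_distance(kappa, x, y):
    # Squared distance in the kernel's implicit feature space (eq. 4):
    # ||phi(x) - phi(y)||^2 = kappa(x, x) - 2*kappa(x, y) + kappa(y, y)
    return kappa(x, x) - 2.0 * kappa(x, y) + kappa(y, y)

def linear(u, v):
    # With a linear kernel, the trick recovers the ordinary
    # squared Euclidean distance.
    return sum(a * b for a, b in zip(u, v))
```

For example, `kernel_sq_distance(linear, x, y)` equals ||x − y||² exactly; swapping in a non-linear kernel changes the geometry while leaving the probe's parameter count untouched.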

Common Kernels
In this section we introduce the kernels to be used. These kernels were chosen as they represent a comprehensive selection of the kernels commonly used in the metric learning literature (Kulis, 2013). The original work of Hewitt and Manning (2019) makes use of the linear kernel:

    κ_lin(h_i, h_j) = (Bh_i)^T (Bh_j)    (5)

The first non-linear kernel we consider is the polynomial kernel, defined as

    κ_poly(h_i, h_j) = ((Bh_i)^T (Bh_j) + c)^d    (6)

where d ∈ Z_+ and c ∈ R_≥0. A polynomial kernel of degree d allows for d-order interactions between the terms. When working with BERT, this means that we may construct d-order conjunctions of the dimensions of the contextual representations input into the probe. Next, we consider the radial-basis function (RBF) kernel. This kernel is also called the Gaussian kernel and is defined as

    κ_rbf(h_i, h_j) = exp(−||B(h_i − h_j)||² / (2σ²))    (7)

This kernel has an alternative interpretation as a similarity measure between the two vectors, attaining its maximum value of 1 when h_i = h_j. In contrast to the polynomial kernel, the Gaussian kernel implies a feature map in an infinite-dimensional Hilbert space. When the RBF kernel is used in our probe, we may rewrite eq. (2) as follows:

    d_rbf(h_i, h_j)² = 2 − 2 exp(−||B(h_i − h_j)||² / (2σ²))    (8)

which is similar to the original linear case in eq. (2), but with a scaling term −1/(2σ²) and a non-linearity exp(·). Finally, we consider the sigmoid kernel, which is defined as 2

    κ_sig(h_i, h_j) = tanh(a(Bh_i)^T (Bh_j) + b)    (9)

1 Selectivity control tasks work at the word type level, as opposed to the sentence level, and so do not straightforwardly apply to this syntax tree reconstruction task.
2 Lin and Lin (2003) observe that it is difficult to effectively tune a and b in the sigmoid kernel. They also note that although this kernel is not in fact PSD, it is PSD when a and b are both positive, which we enforce in this work.
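The four kernels can be written compactly; in each case the inputs are the projected vectors u = Bh_i and v = Bh_j. This is an illustrative sketch, with the hyperparameter defaults (`c`, `d`, `sigma`, `a`, `b`) chosen arbitrarily rather than taken from the experiments.

```python
import numpy as np

def linear_kernel(u, v):
    # eq. (5): plain dot product of the projected vectors.
    return u @ v

def poly_kernel(u, v, c=1.0, d=2):
    # eq. (6): degree-d interactions between dimensions.
    return (u @ v + c) ** d

def rbf_kernel(u, v, sigma=1.0):
    # eq. (7): Gaussian similarity, equal to 1 when u == v.
    diff = u - v
    return np.exp(-(diff @ diff) / (2 * sigma**2))

def sigmoid_kernel(u, v, a=0.5, b=0.5):
    # eq. (9): a, b > 0 so that the kernel is PSD (see footnote 2).
    return np.tanh(a * (u @ v) + b)
```

Plugging any of these into the kernelized distance of eq. (4) yields the corresponding non-linear probe.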

Regularized Metric Learning
We also take advantage of two common regularization techniques employed in the metric learning literature to further improve the learned transformations; both act on the matrix A = B^T B and are added to the objective specified in eq. (3). The Frobenius norm regularizer takes the form

    R_F(A) = ||A||²_F = Σ_{i,j} A²_{ij}    (10)

This is the matrix analogue of the squared L2 regularizer. Minimizing the Frobenius norm of the learned matrix has the effect of keeping the values in the matrix small. It has been a popular choice for regularization in metric learning, with adaptations to a variety of problems (Schultz and Joachims, 2004; Kwok and Tsang, 2003). We also consider the trace norm regularizer, which is of the form

    R_tr(A) = tr(A)    (11)

Since A = B^T B is positive semi-definite, its trace norm coincides with its trace. The trace norm regularizer is the matrix analogue of the L1 regularizer and encourages the matrix A to be low rank. As Jain et al. (2010) point out, using a low-rank transformation in conjunction with a kernel corresponds to a supervised kernel dimensionality reduction method.
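Both penalties are simple functions of B; a small sketch (the function names and the regularization weight `lam` are ours, for illustration):

```python
import numpy as np

def frobenius_reg(B, lam):
    # eq. (10): squared Frobenius norm of A = B^T B, scaled by lam.
    A = B.T @ B
    return lam * np.sum(A**2)

def trace_reg(B, lam):
    # eq. (11): trace of A = B^T B; equal to the trace norm since A is PSD.
    return lam * np.trace(B.T @ B)
```

Either term would simply be added to the per-sentence loss of eq. (3) before taking gradients.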

Experiments
We present the results of our comparison between a re-implementation of Hewitt and Manning's (2019) linear structural probe and the non-linear kernelized probes in Table 1. The two evaluation metrics shown are the unlabeled undirected attachment score (UUAS) and the Spearman rank-order correlation (DSpr) between predicted and gold-standard pairwise distances. UUAS is a standard parsing metric expressing the percentage of correct attachments in the dependency tree, while DSpr measures how accurately the probe predicts the overall ordering of distances between words. We can see that the use of an RBF kernel results in a statistically significant improvement in performance, as measured by UUAS, in all 6 of the languages tested. For some languages this improvement is quite substantial, with Tamil seeing an improvement of 8.44 UUAS points from the baseline probe to the RBF-kernel probe.
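The UUAS computation can be made concrete: recover a tree from the predicted pairwise distances with Prim's algorithm, then score the recovered edges against the gold tree. This is a standalone sketch of the metric, not the evaluation code used in the experiments.

```python
def prim_mst(dist):
    # Prim's algorithm: recover the one-best undirected tree from an
    # n x n matrix of predicted pairwise distances.
    n = len(dist)
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.add(frozenset((i, j)))
        in_tree.add(j)
    return edges

def uuas(pred_dist, gold_edges):
    # Unlabeled undirected attachment score: fraction of gold tree edges
    # recovered by the minimum spanning tree over predicted distances.
    pred_edges = prim_mst(pred_dist)
    gold = {frozenset(e) for e in gold_edges}
    return len(pred_edges & gold) / len(gold)
```

DSpr, by contrast, correlates the full set of predicted and gold pairwise distances, so it is sensitive to distant word pairs that UUAS ignores.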

The RBF Kernel and Self-Attention
The RBF kernel produces improvements across all analyzed languages. This suggests that syntactic structure is indeed encoded non-linearly in BERT. As such, analyzing this specific kernel may yield insights into what this structure is. Indeed, none of the other kernels systematically improve over the linear baseline, implying this is not just an effect of the non-linearity introduced through the use of a kernel; the specific structure of the RBF kernel must be responsible. In this section, we argue that the reason the RBF kernel serves as such a boon to probing is that it resembles BERT's attention mechanism; recall that BERT's attention score between words i and j is defined as

    Att(h_i, h_j) ∝ exp((Kh_i)^T (Qh_j) / √d_2)    (12)

where K and Q are linear transformations and d_2 is the dimension of the space into which the vectors are projected.
K projects vector h_i into a key vector, while Q projects h_j into a query one. When the key and query vectors are similar (i.e. have a high dot product), the value of this equation is large and word j attends to word i. This bears a striking resemblance to the Gaussian kernel. Indeed, if we assume the linearly transformed representations have unit norm, i.e.

    ||Kh_i||_2 = ||Qh_j||_2 = 1    (13)

then ||Kh_i − Qh_j||² = 2 − 2(Kh_i)^T(Qh_j), and, taking σ² = √d_2, the attention score can be rewritten as

    exp((Kh_i)^T (Qh_j) / √d_2) = e^{1/σ²} exp(−||Kh_i − Qh_j||² / (2σ²))    (14)

The similarity between eqs. (12) and (14) suggests the attention mechanism in BERT is, up to a multiplicative factor, roughly equivalent to an RBF kernel; as such, it is not surprising that the RBF kernel produces the strongest results.
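The equivalence above is easy to check numerically. The sketch below uses random unit vectors and an illustrative dimension `d2`; these are stand-ins, not values taken from BERT.

```python
import numpy as np

rng = np.random.default_rng(0)
d2 = 16

# Unit-norm key and query vectors, standing in for Kh_i and Qh_j (eq. 13).
k = rng.normal(size=d2)
k /= np.linalg.norm(k)
q = rng.normal(size=d2)
q /= np.linalg.norm(q)

sigma2 = np.sqrt(d2)                               # sigma^2 = sqrt(d_2)
attn = np.exp(k @ q / np.sqrt(d2))                 # attention score, eq. (12)
rbf = np.exp(-np.sum((k - q)**2) / (2 * sigma2))   # RBF kernel value
ratio = attn / rbf                                 # constant: e^{1/sigma^2}
```

The ratio is independent of the particular vectors drawn, which is exactly the "up to a multiplicative factor" claim.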
The resemblance between these equations, taken together with the significant improvements in capturing syntactic distance, suggests that this encoded information indeed lives in an RBF-like space in BERT. Such information can then be used in its self-attention mechanism, allowing BERT to attend to syntactically close words when solving the cloze language modeling task. Attending to syntactically close words is also supported by recent linguistic research, since words sharing syntactic dependencies have higher mutual information on average (Futrell et al., 2019).
The representations we analyze, though, are taken from BERT's final layer; as such, they are not trained to be used in any self-attention layer, so why should such a resemblance be relevant? BERT's architecture is based on the Transformer (Vaswani et al., 2017) and uses skip connections around each self-attention layer. Such skip connections create an incentive for residual learning, i.e. learning only residual differences in each layer while propagating the bulk of the information (He et al., 2016). As such, BERT's final hidden representations should live roughly on the same manifold as its internal ones.
It is interesting to note that the RBF kernel achieves the best performance in terms of UUAS in all languages, but only twice achieves the best performance in terms of DSpr. This may be because, as examination of eq. (8) shows, the distance returned by the RBF kernel never exceeds 2, whereas syntactic distances in the tree do. Further, the gradient of the RBF kernel contains an exponential term which goes to zero as the distance increases (while an examination of the unkernelized loss function reveals the opposite behavior). This means the probe is less sensitive to the distances between syntactically distant words and focuses more on words with small distances. This may partially explain its better performance on UUAS and comparatively worse performance as measured by correlation (which counts pairwise differences between all words, not just those directly attached in the tree). Furthermore, our probe's focus on nearby words resembles the general attentional bias towards syntactically close words (Voita et al., 2019).
The direct resemblance between self-attention mechanisms and our proposed probe metric poses a new way of understanding results from more complex probes. While Reif et al. (2019) understood the Euclidean-squared distance of Hewitt and Manning as an isometric tree embedding, their geometric interpretation did not factor in the rest of BERT's architecture. Such simplified contextless probes cannot tell us how linguistic properties are processed by a sequence of learned modules (Saphra and Lopez, 2019). However, we consider representations in the context of the model which is expected to employ them. From this perspective, simpler metrics may be rough approximations to our RBF kernel space, which is actually capable of measuring linguistic properties that can be easily extracted by an attention-based architecture.

Conclusion
We find that the linear structural probe (Hewitt and Manning, 2019) used to investigate the encoding of syntactic structure in contextual representations can be improved through kernelization, yielding a non-linear model. This kernelization does not introduce additional parameters and thus does not increase the complexity of the probe, at least if one treats the number of parameters as a good proxy for model complexity. At the same time, the RBF kernel improves probe performance in all languages under consideration. This suggests that syntactic information may be encoded non-linearly in the representations produced by BERT. We hypothesize that this is true due to the similarity of the RBF kernel and BERT's self-attention layers.


References
Hsuan-Tien Lin and Chih-Jen Lin. 2003. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Neural Computation, 3:1-32.