Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probing

Analysing whether neural language models encode linguistic information has become popular in NLP. One method of doing so, which is frequently cited to support the claim that models like BERT encode syntax, is called probing; probes are small supervised models trained to extract linguistic information from another model's output. If a probe is able to predict a particular structure, it is argued that the model whose output it is trained on must have implicitly learnt to encode it. However, generalising about a model's linguistic knowledge of a specific phenomenon based on what a probe is able to learn may be problematic: in this work, we show that semantic cues in training data mean that syntactic probes do not properly isolate syntax. We generate a new corpus of semantically nonsensical but syntactically well-formed Jabberwocky sentences, which we use to evaluate two probes trained on normal data. We train the probes on several popular language models (BERT, GPT-2, and RoBERTa), and find that in all settings they perform worse when evaluated on these data, for one probe by an average of 15.4 UUAS points absolute. Although in most cases they still outperform the baselines, their lead is reduced substantially, e.g. by 53% in the case of BERT for one probe. This raises the question: what empirical scores constitute knowing syntax?


'Twas Brillig, and the Slithy Toves
Recently, unsupervised language models like BERT (Devlin et al., 2019) have become popular within natural language processing (NLP). These pre-trained sentence encoders, known affectionately as BERToids (Rogers et al., 2020), have pushed forward the state of the art in many NLP tasks. Given their impressive performance, a natural question to ask is whether models like these implicitly learn to encode linguistic structures, such as part-of-speech tags or dependency trees.
There are two strains of research that investigate this question. On one hand, stimuli-analysis compares the relative probabilities a language model assigns to words which could fill a gap in a cloze-style task. This allows the experimenter to test whether neural models do well at capturing specific linguistic phenomena, such as subject-verb agreement (Linzen et al., 2016; Gulordava et al., 2018) or negative-polarity item licensing (Marvin and Linzen, 2018; Warstadt et al., 2019). Another strain of research directly analyses the neural network's representations; this is called probing. Probes are supervised models which attempt to predict a target linguistic structure using a model's representation as input (e.g. Alain and Bengio, 2017; Conneau et al., 2018; Hupkes and Zuidema, 2018); if the probe is able to perform the task well, then it is argued that the model has learnt to implicitly encode that structure in its representation.
Work from this inchoate probing literature is frequently cited to support the claim that models like BERT encode a large amount of syntactic knowledge. For instance, consider the two excerpts below, which demonstrate how a couple of syntactic probing papers have been interpreted:
"[The training objectives of BERT/GPT-2/XLNet] have shown great abilities to capture dependency between words and syntactic structures (Jawahar et al., 2019)" (Tian et al., 2020)
"Further work has found impressive degrees of syntactic structure in Transformer encodings (Hewitt and Manning, 2019)" (Soulos et al., 2020)
Our position in this paper is simple: we argue that the literature on syntactic probing is methodologically flawed, owing to a conflation of syntax with semantics. We contend that no existing probing work has rigorously tested whether BERT encodes syntax, and a fortiori this literature should not be used to support this claim.
To investigate whether syntactic probes actually probe syntax (or instead rely on semantics), we train two probes (§4) on the output representations produced by three pre-trained encoders on normal sentences: BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and RoBERTa. We then evaluate these probes on a novel corpus of syntactically well-formed sentences made up of pseudowords (§3), and find that their performance drops substantially in this setting: on one probe, the average BERToid UUAS is reduced by 15.4 points, and on the other the relative advantage that BERT exhibits over a baseline drops by 53%. This suggests that the probes are leveraging statistical patterns in distributional semantics to aid them in the search for syntax. According to one of the probes, GPT-2 falls behind a simple baseline, but in some cases the lead remains substantial, e.g. 20.4 UUAS points in the case of BERT. We use these results not to draw conclusions about any BERToid's syntactic knowledge, but instead to urge caution when drawing conclusions from probing results. In our discussion, we contend that evaluating BERToids' syntactic knowledge requires more nuanced experimentation than simply training a syntactic probe as if it were a parser (Hall Maudslay et al., 2020), and call for the separation of syntax and semantics in future probing work.

Syntax and Semantics
When investigating whether a particular model encodes syntax, those who have opted for stimuli-analysis have been careful to isolate syntactic phenomena from semantics (Marvin and Linzen, 2018; Gulordava et al., 2018; Goldberg, 2019), but the same cannot be said of most syntactic probing work, which conflates the two. To see how the two can be separated, consider the famous utterance of Chomsky (1957):
(1) Colourless green ideas sleep furiously
whose dependency parse is given in Figure 1. Chomsky's point is that (1) is semantically nonsensical, but syntactically well formed.
Syntactic probes are typically evaluated on real-world data, not on Chomsky-style sentences of (1)'s ilk. The same is true for parsers, but from a machine-learning point of view this is not problematic, since the goal of a statistical parser is to parse well the data that one may encounter in the real world. The probing literature, however, is inherently making an epistemological claim: whether BERT knows syntax. Indeed, we already know that BERT significantly improves the performance of statistical parsing models on real-world data (Zhou and Zhao, 2019); there is no reason to develop specialist probes to reinforce that claim. As probing considers a scientific question, it follows that the probing literature needs to consider syntax from a linguistic point of view and, thus, it requires a linguistic definition of syntax. At least in the generative tradition, it is taken as definitional that grammaticality, i.e. syntactic well-formedness, is distinct from the meaning of the sentence. It is this distinction that the nascent syntactic probing literature has overlooked.

Generating Jabberwocky Sentences
To tease apart syntax and semantics when evaluating probes, we construct a new evaluation corpus of syntactically valid English Jabberwocky sentences, so called after Carroll (1871) who wrote verse consisting in large part of pseudowords (see App. A). In written language, a pseudoword is a sequence of letters which looks like a valid word in a particular language (usually determined by acceptability judgments), but which carries with it no lexical meaning.
For our Jabberwocky corpus, we make use of the ARC Nonword Database, which contains 358,534 monosyllabic English pseudowords (Rastle et al., 2002). We use a subset of these which were filtered out then manually validated for high plausibility by Kharkwal (2014). We conjugate each of these words using hand-written rules, assuming they obey the standard English morphology and graphotactics. This results in 1361 word types, or a total of 2377 varieties when we annotate these regular forms with several possible fine-grained part-of-speech realisations.
[Figure 2: An unlabeled undirected parse from the EWT treebank, with Jabberwocky substitutions in red. Example sentence: "I povicated your briticists very much", where the pseudowords replace "enjoyed" and "presentations".]
To build sentences, we take the test portion of the English EWT Universal Dependency (UD; Nivre et al., 2016) treebank and substitute words (randomly) with our pseudowords whenever we have one available with matching fine-grained part-of-speech annotation (see footnote 4). Our method closely resembles that of Kasai and Frank (2019), except that they use it to analyse parsers rather than syntactic probes. An example of one of our Jabberwocky sentences is shown in Figure 2, along with its unlabeled undirected parse (used by the probes), which is taken from the vanilla sentence's annotation in the treebank.
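The substitution step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `jabberwockify`, the token triples, and the pseudowords in `LEXICON` are all hypothetical, and the sketch assumes a lexicon keyed by UD part-of-speech tag and feature string.

```python
import random

# Hypothetical pseudoword lexicon keyed by (UPOS tag, UD feature string).
# The pseudowords here are invented for illustration only.
LEXICON = {
    ("NOUN", "Number=Sing"): ["blick"],
    ("NOUN", "Number=Plur"): ["blicks"],
}


def jabberwockify(tokens, lexicon, rng):
    """Swap each token for a random pseudoword with matching annotation.

    `tokens` is a list of (form, upos, feats) triples. Tokens with no
    matching pseudoword are left unchanged, so the parse stays valid.
    """
    output = []
    for form, upos, feats in tokens:
        candidates = lexicon.get((upos, feats))
        output.append(rng.choice(candidates) if candidates else form)
    return output


sentence = [
    ("I", "PRON", "Case=Nom"),
    ("like", "VERB", "Mood=Ind|Tense=Pres|VerbForm=Fin"),
    ("dogs", "NOUN", "Number=Plur"),
]
substituted = jabberwockify(sentence, LEXICON, random.Random(0))
```

Because only word forms change, the gold parse of the original sentence can be reused unchanged for the substituted one.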

Two Syntactic Probes
A syntactic probe is a supervised model trained to predict the syntactic structure of a sentence using representations produced by another model. The main distinction between syntactic probes and dependency parsers is one of researcher intent: probes are not meant to best the state of the art, but are a visualisation method (Hupkes and Zuidema, 2018). As such, probes are typically minimally parameterised so they do not "dig" for information (but see Pimentel et al., 2020). If a syntactic probe performs well using a model's representations, it is argued that that model implicitly encodes syntax.
Footnote 4: More specifically, for nouns we treat elements annotated (in UD notation) with Number=Sing or Number=Plur; for verbs we treat VerbForm=Inf, VerbForm=Fin | Mood=Ind | Number=Sing | Person=3 | Tense=Pres, VerbForm=Fin | Mood=Ind | Tense=Pres, or VerbForm=Part | Tense=Pres; for adjectives and adverbs we treat Degree=Cmp or Degree=Sup, along with unmarked. These cases cover all regular forms in the EWT treebank.
Here we briefly introduce two syntactic probes, each designed to learn the syntactic distance between a pair of words in a sentence, which is the number of steps between them in an undirected parse tree (example in Figure 2). Hewitt and Manning (2019) first introduced syntactic distance, and proposed the structural probe as a means of identifying it; it takes a pair of embeddings and learns to predict the syntactic distance between them. An alternative to the structural probe which learns parameters for the same function is a structured perceptron dependency parser, originally introduced in McDonald et al. (2005), and first applied to probing in Hall Maudslay et al. (2020). Here we call this the perceptron probe. Rather than learning syntactic distance directly, the perceptron probe instead learns to predict syntactic distances such that the minimum spanning tree that results from a sentence's predictions matches the gold standard parse tree. The difference between these probes is subtle, but they optimise for different metrics; this is reflected in our evaluation in §5.
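To make the target quantity concrete, the sketch below computes syntactic distance as the number of edges on the path between two words in an undirected tree, using breadth-first search. The edge list `EDGES` is an assumed parse of the Figure 2 sentence written out for illustration; it is not copied from the treebank.

```python
from collections import deque


def syntactic_distances(n, edges):
    """Return an n x n matrix of path lengths in an undirected tree."""
    adjacency = [[] for _ in range(n)]
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)

    distances = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source}
        queue = deque([(source, 0)])
        while queue:  # breadth-first search from `source`
            node, depth = queue.popleft()
            distances[source][node] = depth
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return distances


# Word positions: I=0 povicated=1 your=2 briticists=3 very=4 much=5
# Assumed (illustrative) undirected parse of the sentence.
EDGES = [(0, 1), (1, 3), (2, 3), (1, 5), (4, 5)]
DIST = syntactic_distances(6, EDGES)
```

Under this assumed parse, "I" and "your" are three steps apart (I, povicated, briticists, your), which is the kind of value both probes are trained to recover from pairs of embeddings.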

Hast Thou [Parsed] the Jabberwock?
We train the probes on normal UDs, then evaluate them on Jabberwocky sentences; if the probes are really learning to extract syntax, they should perform just as well in the Jabberwocky setting.

Experimental Setup
Models to Probe. We probe three popular Transformer (Vaswani et al., 2017) models: BERT, GPT-2, and RoBERTa. For all three we use the 'large' version. We train probes on the representations at multiple layers, and choose whichever layers result in the best performance on the development set. For each Transformer model, we also train probes on the layer 0 embeddings; we can treat these layer 0 embeddings as baselines since they are uncontextualised, with knowledge only of a single word and where it sits in a sentence, but no knowledge of the other words. As an additional baseline representation to probe, we use FastText embeddings (Bojanowski et al., 2017) appended with BERT position embeddings (Fast+Pos). We emphasise that none of these baselines can be said to encode anything about syntax (in a linguistic sense), since they are uncontextualised. Training details of these models and baselines can be found in App. B.
Additional Simple Baselines. In addition to the baseline representations which we probe, we compute two even simpler baselines, which ignore the lexical items completely. The first simply connects each word to the word next to it in a sentence (Path). The second returns, for a given sentence length, the tree which contains the edges occurring most frequently in the training data (Majority), which is computed as follows. First, we subdivide the training data into bins based on sentence length. For each sentence length $n$, we create an undirected graph $G_n$ with $n$ nodes, each corresponding to a different position in the sentence. The edges are weighted according to the number of times they occur in the training data bin which contains sentences of length $n$. The 'majority tree' of sentence length $n$ is then computed by calculating the maximum spanning tree over $G_n$, which can be done by negating the edge weights, then running Prim's algorithm. For $n > 40$, we use the Path baseline's predictions, owing to data sparsity.
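The two lexicon-free baselines can be sketched as below. This is an illustrative reconstruction from the description above rather than the released code: the function names are hypothetical, edges never seen at a given length are assumed to have weight 0, and lengths absent from the training data are assumed to fall back to Path just as lengths above 40 do.

```python
from collections import Counter, defaultdict


def path_tree(n):
    """Path baseline: connect each word to its right-hand neighbour."""
    return {(i, i + 1) for i in range(n - 1)}


def max_spanning_tree(n, weights):
    """Prim's algorithm on negated weights, i.e. a maximum spanning tree."""
    def cost(i, j):
        return -weights.get((min(i, j), max(i, j)), 0)

    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        # Cheapest (most frequent) edge leaving the current tree.
        i, j = min(
            ((i, j) for i in sorted(in_tree) for j in range(n) if j not in in_tree),
            key=lambda edge: cost(*edge),
        )
        edges.add((min(i, j), max(i, j)))
        in_tree.add(j)
    return edges


def majority_trees(training_trees, max_len=40):
    """Bin trees by sentence length, count edges, take one tree per bin."""
    counts = defaultdict(Counter)
    for n, edges in training_trees:
        for i, j in edges:
            counts[n][(min(i, j), max(i, j))] += 1
    return {n: max_spanning_tree(n, c) for n, c in counts.items() if n <= max_len}


def majority_baseline(n, trees):
    """Majority tree if one exists for this length, else the Path tree."""
    return trees.get(n, path_tree(n))


TRAIN = [(3, [(0, 1), (1, 2)]), (3, [(0, 1), (0, 2)]), (3, [(0, 1), (0, 2)])]
TREES = majority_trees(TRAIN)
```

The quadratic scan inside the loop keeps the sketch short; with sentences capped at 40 words this is cheap, and a heap-based Prim would only matter for much longer inputs.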
Metrics. As mentioned in §4, the probes we experiment with each optimise for subtly different aspects of syntax; we evaluate them on different metrics which reflect this. We evaluate the structural probe on DSpr, introduced in Hewitt and Manning (2019); it is the Spearman correlation between the actual and predicted syntactic distances between each pair of words. We evaluate the perceptron probe using the unlabeled undirected attachment score (UUAS), which is the percentage of correctly identified edges. These different metrics reflect differences in the probe designs, which are elaborated in Hall Maudslay et al. (2020).
Figure 3 shows the performance of the probes we trained, when they are evaluated on normal test data (plain) versus our specially constructed Jabberwocky data (hatched). Recall that the test sets have identical sentence-parse structures, and differ only insofar as words in the Jabberwocky test set have been swapped for pseudowords. For each BERToid, the lower portion of its bars (in white) shows the performance of its layer 0 embeddings, which are uncontextualised and thus function as additional baselines.
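The two metrics named above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `uuas` scores a single sentence, and `spearman` is a plain rank correlation with tied values sharing their mean rank. How the correlations are aggregated over words and sentences to give the reported DSpr figure is not specified in this section and is not reproduced here.

```python
def uuas(gold_edges, predicted_edges):
    """Percentage of gold undirected edges that were predicted."""
    gold = {tuple(sorted(edge)) for edge in gold_edges}
    predicted = {tuple(sorted(edge)) for edge in predicted_edges}
    return 100.0 * len(gold & predicted) / len(gold)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            result[k] = mean_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return covariance / (var_x * var_y) ** 0.5
```

For example, `uuas([(0, 1), (1, 2), (2, 3)], [(1, 0), (1, 3), (2, 3)])` counts two of the three gold edges as recovered, since edge direction is ignored.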

Results
All the probes trained on the BERToids perform worse on the Jabberwocky data than on normal data, indicating that the probes rely in part on semantic information to make syntactic predictions. This is most pronounced with the perceptron probe: in this setting, the three BERToids' scores dropped by an average of 15.4 UUAS points. Although they all still outperform the baselines under UUAS, their advantage is less pronounced; in some cases it remains high, e.g. for BERT the lead is 20.4 points over the Fast+Pos baseline. With the structural probe, BERT's lead over the simple Majority baseline is reduced from 0.078 to 0.037 DSpr, and RoBERTa's from 0.074 to 0.017, reductions of 53% and 77% respectively. GPT-2 falls behind the baselines, and performs worse than even the simple Path predictions (0.580 compared to 0.584).
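The relative reductions quoted for the structural probe follow directly from the DSpr leads over the Majority baseline given above; the arithmetic is written out here only as a check.

```latex
\[
\frac{0.078 - 0.037}{0.078} = \frac{0.041}{0.078} \approx 0.53,
\qquad
\frac{0.074 - 0.017}{0.074} = \frac{0.057}{0.074} \approx 0.77 .
\]
```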

Discussion
Is BERT still the syntactic wunderkind we had all assumed? Or do these reductions mean that these models can no longer be said to encode syntax? We do not use our results to make either claim. The reductions we have seen here may reflect a weakness of the syntactic probes rather than a weakness of the models themselves, per se. In order to properly give the BERToids their due, one ought to train the probes on data which controls for semantic cues (e.g. more Jabberwocky data) in addition to evaluating them on it. Here, we wish only to show that existing probes leverage semantic cues to make their syntactic predictions; since they do not properly isolate syntax, they should not be cited to support claims about syntax.
The high performance of the baselines (which inherently contain no syntax) is reason enough to be cautious about claims of these models' syntactic abilities. In general, single-number metrics like these can be misleading: many correctly labeled easy dependencies may well obscure the mistakes being made on comparatively few hard ones, which may well be far more revealing (see, for instance, Briscoe and Carroll, 2006).
Even if these syntactic probes achieved near perfect results on Jabberwocky data, beating the baselines by some margin, that alone would not be enough to conclude that the models encoded a deep understanding of syntax. Dependency grammarians generally parse sentences into directed graphs with labels; these probes by comparison only identify undirected unlabeled parse trees (compare Figures 1 and 2 for the difference). This much-simplified version of syntax has a vastly reduced space of possible syntactic structures. Consider a sentence with e.g. $n = 5$ words, for which there are only 125 possible unlabeled undirected parse trees (by Cayley's formula, $n^{n-2}$). As the high performance of the Majority baseline indicates, these are not uniformly distributed (some parse trees are more likely than others); a probe might well use these statistical confounds to advance its syntactic predictions. Although they remain present, biases like these are less easily exploitable in the labeled and directed case, where there are just over one billion possible parse trees to choose from. Syntax is an incredibly rich phenomenon, far more so than when it is reduced to syntactic distance.
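The size of the two search spaces can be checked with a short calculation. The undirected count is Cayley's formula as stated above. The directed, labeled count is one plausible way to arrive at a figure just over one billion, and rests on assumptions that are not spelled out in this section: any of the $n$ words may be the root, each of the $n - 1$ remaining edges takes one of 37 relation labels (the size of the UD v2 universal relation inventory), and the root attachment itself is not counted as a labeled edge.

```python
def undirected_unlabeled_trees(n):
    """Cayley's formula: number of trees on n labeled nodes."""
    return n ** (n - 2)


def directed_labeled_trees(n, num_labels):
    """Rooted trees (n root choices) with a label on each of n - 1 edges.

    Assumption: the count of relation labels is passed in by the caller;
    37 is used below as the UD v2 inventory size.
    """
    return n * undirected_unlabeled_trees(n) * num_labels ** (n - 1)


UNDIRECTED = undirected_unlabeled_trees(5)      # 5 ** 3
DIRECTED = directed_labeled_trees(5, 37)        # 625 * 37 ** 4
```

Under these assumptions the five-word sentence has 125 undirected unlabeled trees but 1,171,350,625 directed labeled ones, a gap of roughly seven orders of magnitude.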

O Frabjous Day! Callooh! Callay!
In this work, we trained two syntactic probes on a variety of BERToids, then evaluated them using Jabberwocky sentences, and showed that performance dropped substantially in this setting. This suggests that previous results from the probing literature may have overestimated BERT's syntactic abilities. However, in this context, we do not use the results to make any claims about BERT; we contend that to make such a claim one ought to train the probes on Jabberwocky sentences, which would require more pseudowords than we had available. Instead, we advocate for the separation of syntax and semantics in probing. Future work could explore the development of artificial treebanks for use specifically in training syntactic probes, which minimise any confounding statistical biases in the data. We make our Jabberwocky evaluation data and code publicly available at https://github.com/rowanhm/jabberwocky-probing.

A The Jabberwocky
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought-
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

B Probe Training Details
For the Fast+Pos baseline, we use the base model of BERT, whose position embeddings have 768 dimensions, and the pretrained FastText embeddings trained on the Common Crawl (2M word variety with subword information). Combining the position embeddings with the 300-dimensional FastText embeddings yields embeddings with 1068 dimensions for this baseline. By comparison, the 'large' versions of the BERToids we probe each consist of 24 layers, and produce embeddings which have 1024 dimensions.
Each BERToid we probe uses a different tokenisation scheme. We need tokens which align with the tokens in the UD trees. In the case when one of the schemes does not split a word which is split in the UD trees, we merge nodes in the trees so they align. In the case where one of the schemes splits a word which was not split in the UD trees, we use the first token. If the alignment is not easily fixed, we remove the sentence from the treebank. Table 1 shows the data split we are left with after sentences have been removed from the EWT UD treebank.

Table 1: Data split of the EWT UD treebank after sentence removal.
Dataset   # Sentences
Train     9444
Dev       1400
Test      1398

To find optimum hyperparameters, we perform a random search with 10 trials per model. When training, we use a batch size of 64 sentences, and as the optimiser we use Adam (Kingma and Ba, 2015). We consider three hyperparameters: the learning rate, the rank of the probe, and Dropout (Srivastava et al., 2014), over the ranges $[5 \times 10^{-5}, 5 \times 10^{-3}]$, $[1, d]$, and $[0.1, 0.8]$ respectively, where $d$ is the dimensionality of the input representation. Along with the Fast+Pos baseline, we also perform the search on BERT, RoBERTa and GPT-2 at every fourth layer (so a total of 7 varieties each), and choose the best layer based on loss on the development set. For each trial, we train for a maximum of 20 epochs, and use early stopping if the loss does not decrease for 15 consecutive steps.
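The search procedure above can be sketched as follows. Several details are assumptions rather than facts from the text: the learning rate is sampled log-uniformly (the sampling distribution is not stated), the rank is sampled as a uniform integer, and `train_and_eval` is a hypothetical callback standing in for training a probe and returning its development loss.

```python
import math
import random

# Every fourth layer of a 24-layer model, including the layer 0 embeddings.
LAYERS = list(range(0, 25, 4))


def sample_config(d, rng):
    """Draw one hyperparameter setting from the stated ranges."""
    log_low, log_high = math.log10(5e-5), math.log10(5e-3)
    return {
        # Assumption: log-uniform sampling of the learning rate.
        "learning_rate": 10 ** rng.uniform(log_low, log_high),
        "rank": rng.randint(1, d),          # inclusive on both ends
        "dropout": rng.uniform(0.1, 0.8),
    }


def random_search(train_and_eval, d, trials=10, seed=0):
    """Return (best_dev_loss, best_config) over `trials` random draws."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        config = sample_config(d, rng)
        dev_loss = train_and_eval(config)   # hypothetical training callback
        if best is None or dev_loss < best[0]:
            best = (dev_loss, config)
    return best
```

In this sketch the search would be run once per layer in `LAYERS` with `d = 1024` for the BERToids, and once with `d = 1068` for Fast+Pos, keeping whichever setting gives the lowest development loss.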