A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English

Transformer-based language models achieve high performance on various tasks, but we still lack an understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge through sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon that requires contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks targeting fine-grained linguistic knowledge, however, reveals pronounced model-specific weaknesses, especially regarding semantic knowledge, strongly impacting the models' performance. Our results highlight the importance of (a) model comparison in evaluation tasks and (b) building claims about model performance and the linguistic knowledge models capture on more than purely probing-based evaluations.


Introduction
Endeavors to better understand transformer-based masked language models (MLMs), such as BERT, have been ever growing since the introduction of the transformer architecture in 2017 (cf. Rogers et al. (2020) for an overview). While the BERTology movement has enhanced our knowledge of the reasons behind BERT's performance in various ways, plenty still remains unanswered. Less well studied and challenging are linguistic phenomena where, besides contextual information, identification of an antecedent is needed, such as relative clauses (RCs). Previous work analyzing BERT's comprehension of function words, for example, has shown that relativizers and prepositions are quite challenging for BERT; similarly, RCs have been found to be difficult for BERT in CoLA acceptability tasks. In this paper, we focus on RCs in American English to further our understanding of the grammatical and semantic knowledge captured by pre-trained MLMs, evaluating three models: BERT, RoBERTa, and ALBERT. For our analysis, we train probing classifiers, consider each model's performance on diagnostic cases, and test predictions in a masked language modeling task on selected semantic and grammatical constraints of RCs.
RCs are clausal post-modifiers specifying a preceding noun phrase (the antecedent) and are introduced by a relativizer (e.g., which). Extensive corpus research (Biber et al., 1999) has found that the overall most common relativizers are that, which, and who. The relativizer occupies the subject or object position within the RC (see examples (1-a) and (1-b)). In subject RCs, the relativizer is obligatory (Huddleston and Pullum, 2002, p. 1055), while in object position omission is licensed (e.g., zero in example (1-b)).
(1) a. Children who eat vegetables are likely to be healthy. (subject relativizer; the relativizer is obligatory)
b. This is the dress [that/which/zero] I brought yesterday. (object relativizer; omission is possible)

Relativizer choice depends on an interplay of different factors. 1 Among these factors, the animacy constraint (Quirk, 1957) is near-categorical: for animate head nouns, the relativizer who (see example (1-a)) is strongly prioritized (especially over which) (D'Arcy and Tagliamonte, 2010).
Our aims are (1) to better understand whether sentence representations of pre-trained MLMs capture grammaticality in the context of RCs, (2) to test the generalization abilities and weaknesses of probing classifiers with complex diagnostic cases, and (3) to test prediction of antecedents and relativizers in a masked task, also considering linguistic constraints. From a linguistic perspective, we ask whether MLMs correctly predict (a) grammatically plausible relativizers given certain types of antecedents (animate, inanimate) and, vice versa, grammatically plausible antecedents given certain relativizers (who vs. which/that), and (b) semantically plausible antecedents given certain relativizers, considering the degree of specificity of predicted antecedents in comparison to target antecedents (e.g., boys as a more specific option than children in example (1-a)). Moreover, we are interested in how these findings agree with probing results and investigate model-specific behavior, evaluating and comparing three recent pre-trained MLMs: BERT, RoBERTa, and ALBERT. To our knowledge, this is the first attempt to compare and analyze the performance of different transformer-based MLMs in such detail, investigating grammatical and semantic knowledge beyond probing.
Our main contributions are the following: (1) the creation of a naturalistic dataset for probing, (2) a detailed model comparison of three recent pre-trained MLMs, and (3) a fine-grained linguistic analysis of grammatical and semantic knowledge. Overall, we find that all three MLMs show good performance on the probing task. Further evaluation, however, reveals model-specific issues with wrong agreement (where RoBERTa is strongest) and with distance between antecedent and relativizer as well as between relativizer and RC verb (on which BERT and ALBERT are better). Considering linguistic knowledge, all models perform better on grammatical than on semantic knowledge. Of the relativizers, which is hardest to predict. Considering model-specific differences, BERT outperforms the others in predicting the actual targets, while RoBERTa best captures grammatical and semantic knowledge. ALBERT performs worst overall.
Background

Models

BERT (Devlin et al., 2019) is a transformer-based (Vaswani et al., 2017) bidirectional network trained on masked language modeling and next-sentence prediction. The extent to which BERT captures linguistic knowledge has been widely studied in previous work (see §2.2). RoBERTa (Liu et al., 2019b) differs from BERT in four important aspects: it is trained on more data, drops the next-sentence-prediction objective, achieves lower perplexity on the training data, and uses a larger vocabulary. Given RoBERTa's superior performance over BERT on the GLUE benchmark (Wang et al., 2018), RoBERTa has replaced BERT as the model of choice in several recent studies investigating MLMs (Pruksachatkun et al., 2020; Talmor et al., 2020). Nevertheless, RoBERTa's linguistic properties and how they differ from BERT's remain relatively unexplored. ALBERT (Lan et al., 2020) is a recently proposed alternative to BERT and RoBERTa. It uses weight sharing across all hidden layers (effectively applying the same non-linear transformation at every layer) and factorizes the embedding matrix into two separate matrices, resulting in significantly fewer model parameters compared to BERT and RoBERTa. ALBERT was shown to outperform BERT and RoBERTa when fine-tuned on several English NLP downstream tasks (Lan et al., 2020). However, to our knowledge, no previous work has systematically evaluated the linguistic knowledge of ALBERT. For all models, we consider the base variants: BERT-base-cased, RoBERTa-base, and ALBERT-base-v1, with 110M, 125M, and 12M parameters, respectively.
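As a quick point of reference, the following minimal sketch (assuming the HuggingFace transformers library, which we also use in our experiments) loads the three base variants and reports their parameter counts:

```python
from transformers import AutoModel

for name in ["bert-base-cased", "roberta-base", "albert-base-v1"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    # ALBERT's cross-layer weight sharing and factorized embedding matrix make
    # it roughly an order of magnitude smaller than BERT and RoBERTa.
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```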

Related Work
Related work has investigated the grammaticality of unidirectional language models using minimal pair evaluation (Marvin and Linzen, 2018; Wilcox et al., 2019; Warstadt et al., 2020; Hu et al., 2020a,b). Among work on MLM prediction-based evaluation, only Goldberg (2019) has so far evaluated MLM predictions in the context of grammaticality. Our study adds to this line of work, focusing on RCs and combining MLM prediction-based evaluation with probing. Work on evaluating sentence embeddings has considered prediction of sentence length, word content, and word order (Adi et al., 2016); Conneau et al. (2018) investigated an even broader range of linguistic properties. Extensive work has been done on probing pre-trained MLMs, especially BERT, for syntactic and semantic knowledge (see, e.g., Liu et al. (2019a), and Rogers et al. (2020) for a more comprehensive overview).
Most similar to our work are studies comparing evaluation methods, including probing and MLM evaluation, to assess how models encode linguistic features. We contribute to this strand of research by building datasets from naturalistic (rather than artificially generated) data and by comparing three transformer models: BERT, RoBERTa, and ALBERT. Our focus is on RCs as sentence types that are challenging for pre-trained language models, as shown, e.g., by work on function words (finding relativizers quite challenging for BERT) and by Warstadt and Bowman (2019) (finding RCs difficult for BERT in the CoLA probing tasks).

Probing MLMs Representations for Grammatical Knowledge of RCs
We train supervised probing classifiers (here: acceptability classifiers) to assess the linguistic knowledge contained in a model's representations, focusing on grammaticality. A model's awareness of grammaticality should be reflected in the hidden representations it produces for a given sentence. Hence, by training a classifier on top of these representations, we should be able to discriminate between grammatical and ungrammatical sentences based on their sentence embeddings. 2 Moreover, besides testing whether the representations capture knowledge of grammaticality at all, we also examine whether this knowledge becomes more or less separable across the representations produced by different layers of a model.

Dataset Construction
To probe pre-trained MLMs' performance on sentences containing RCs, we construct a controlled set of 48,060 sentences and their acceptability labels using an automated procedure. Our probing dataset is a subset of naturally occurring sentences extracted from the fiction portion of the COCA corpus (Davies, 2015). First, we extract all sentences containing exactly one pronoun from the set {who, whom, whose, which, that}. We then parse the sentences using the SpaCy dependency parser and keep only those sentences where the pronoun constitutes a relativizer (identified by the tag RELCL). From the parse tree, we also determine whether the relativizer fills the subject or the object position in the RC. The automatic selection procedure is illustrated in Fig. 2 in §A.1.
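The core of this selection step can be sketched as follows. This is a simplified illustration assuming spaCy with the en_core_web_sm model (the released code is authoritative for the exact pipeline; cases such as whose are omitted here):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
RELATIVIZERS = {"who", "whom", "whose", "which", "that"}

def extract_rc(sentence):
    """Return (antecedent, relativizer, position) if the sentence contains
    exactly one candidate pronoun heading into a relative clause, else None."""
    doc = nlp(sentence)
    candidates = [t for t in doc if t.lower_ in RELATIVIZERS]
    if len(candidates) != 1:
        return None  # keep only sentences with exactly one candidate pronoun
    tok = candidates[0]
    # The RC main verb governs the relativizer; its incoming edge carries the
    # label "relcl", and its head is the antecedent.
    rc_verb = tok.head
    if rc_verb.dep_ != "relcl":
        return None  # pronoun is not a relativizer (e.g., complementizer "that")
    position = "subject" if tok.dep_ == "nsubj" else "object"
    return rc_verb.head.text, tok.text, position

print(extract_rc("This is the dress which I bought yesterday."))
# -> ('dress', 'which', 'object')  (output is parser-dependent)
```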
Starting from the grammatical sentences in the corpus, we manipulate the data to obtain a set of unacceptable counterparts. To this end, we populate three Boolean meta-data variables for each grammatical sample using a set of hand-crafted linguistic rules: ANIMATE, RESTRICTIVE, and SUBJRC. Based on the values of these three meta-data variables, a set of modifications is applied to convert a grammatical sentence into an ungrammatical one. The procedure for creating the dataset is explained in detail in §A.1. Our final dataset consists of 42.7k and 5.3k samples for training and evaluation, respectively. Both splits are balanced, i.e., the accuracy of the majority baseline is 50%. Table 1 presents a set of minimal pair examples generated using our procedure; a simplified sketch of the modification step is shown below.
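The following is a minimal, hypothetical sketch of how a single modification turns a grammatical sentence into its ungrammatical counterpart (the full set of paradigm-specific modifications is given in Table 6 in §A.1):

```python
import re

def apply_modification(sentence, modification):
    """Apply one modification to create the ungrammatical half of a minimal pair."""
    if modification == "which -> who":
        # violates the animacy constraint for inanimate antecedents
        return re.sub(r"\bwhich\b", "who", sentence, count=1)
    if modification == "who -> which":
        # violates the animacy constraint for animate antecedents
        return re.sub(r"\bwho\b", "which", sentence, count=1)
    if modification == "relativizer omission":
        # ungrammatical in subject RCs, where the relativizer is obligatory
        return re.sub(r"\b(who|which|that)\b ", "", sentence, count=1)
    raise ValueError(f"unknown modification: {modification}")

print(apply_modification("Children who eat vegetables are likely to be healthy.",
                         "relativizer omission"))
# -> "Children eat vegetables are likely to be healthy." (ungrammatical)
```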

Experimental Setup
We train logistic regression acceptability classifiers on sentence embeddings obtained from the hidden layers of a pre-trained model.
We compute sentence embeddings of BERT, RoBERTa, and ALBERT and two non-contextualized baselines: GloVe embeddings (Pennington et al., 2014) trained on English Wikipedia and the Gigaword corpus, and fastText embeddings (Bojanowski et al., 2016) trained on English Wikipedia and News data (Mikolov et al., 2018). As an additional baseline, we use a rule-based classifier that simply classifies sentences containing a relativizer (who, which, that) as grammatical and all other sentences as ungrammatical. Input sentences are pre-processed by adding two special tokens, [CLS] and [SEP], at the beginning and end of each input sentence, respectively. 3 To construct sentence embeddings, we apply two different pooling strategies: CLS- and mean-pooling. 4 We obtain sentence embeddings from all hidden layers of the MLM models, including the non-contextualized embedding layer (layer 0). We treat accuracy on the acceptability classification task as a proxy for the linguistic knowledge encoded in a model's sentence embeddings, which we evaluate on a held-out test set. We use the huggingface transformers (Wolf et al., 2019) and flair (Akbik et al., 2018) libraries as well as scikit-learn (Pedregosa et al., 2011) for obtaining embeddings and training the logistic regression classifiers. Code to reproduce our probing dataset and results is available online: https://github.com/uds-lsv/rc-probing.
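The probing pipeline can be summarized by the following condensed sketch (function and variable names are ours; the released code at the URL above is the reference implementation):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

def embed(sentences, layer):
    """Mean-pool the token representations of a given hidden layer
    (layer 0 = non-contextualized embedding layer)."""
    with torch.no_grad():
        enc = tokenizer(sentences, padding=True, return_tensors="pt")
        hidden = model(**enc).hidden_states[layer]   # (batch, seq_len, dim)
        mask = enc["attention_mask"].unsqueeze(-1)   # exclude padding tokens
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.numpy()

def probe(train_sents, y_train, test_sents, y_test, layer):
    """Train a logistic regression acceptability classifier on one layer."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_sents, layer), y_train)
    return clf.score(embed(test_sents, layer), y_test)  # probing accuracy
```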

Probing Results and Discussion
Fig. 1 shows the layer-wise probing accuracy for all models when using mean-pooling. A side-by-side comparison with CLS-pooling is shown in Fig. 4 in §A.2. Mean-pooling leads to significantly higher probing accuracies for all models, suggesting a sub-optimal encoding of sentence-level information in the CLS token representation; hence, we stick to mean-pooling in the following. We find that the rule-based classifier is a surprisingly strong baseline, outperforming both the GloVe and fastText baselines. Moreover, Fig. 1 shows that all three transformer-based models improve significantly over the baselines at almost all layers. The only exception is layer 0 of ALBERT, which performs similarly to the fastText baseline and worse than the rule-based classifier; we attribute this finding to ALBERT's embedding factorization. At lower (contextualized) layers (1-5), probing accuracies of BERT, RoBERTa, and ALBERT are almost identical. At higher layers, both BERT and RoBERTa improve over ALBERT, whose probing accuracy remains roughly constant. We note that ALBERT has significantly fewer parameters than BERT and RoBERTa (12M vs. 110M and 125M), which might explain the lower probing accuracy. We provide a more detailed discussion, investigating the role of the number of parameters, in §A.2.1. Overall, the fact that probing accuracy is above 80% for almost all layers suggests a reasonable encoding of sentence-level linguistic knowledge relevant for grammaticality classification in all pre-trained models. Notably, linear separability with respect to grammaticality emerges very early in the sentence embeddings of all three models.
To get a better understanding of the accuracy achieved by each of the models, we select the best classifier according to the results in Fig. 1 and report test accuracy grouped by modification in Table 2. For comparison, we additionally report accuracies of the non-contextualized baselines (layer 0) of each model in parentheses. The results show that while contextualization leads to higher probing accuracy overall, it is especially important for the which → who and which → that samples. From a linguistic viewpoint this is surprising: replacing which by who, for example, clearly makes a sentence ungrammatical and is typically easy for humans to detect. When looking at the training data, however, this observation is not surprising at all. The vast majority of sentences containing who as a relativizer (15k samples) belong to the no modification group, and all of these sentences are grammatical (cf. Table 8). The which → who group, on the other hand, contains only 2.5k samples, all of them ungrammatical. Hence, our results in Table 2 reveal that the non-contextualized baselines might learn a simple heuristic, classifying all sentences that contain who as a relativizer as grammatical. The results of the rule-based baseline give further evidence for this interpretation, showing that a simple classifier that bases its predictions only on the presence of a relativizer achieves surprisingly high accuracy (comparable to the non-contextualized baselines) on our dataset. Interestingly, BERT and RoBERTa seem especially susceptible to this shortcut learning, as shown by the results of their non-contextualized baselines on the relativizer omission and which → who modifications.

Table 2: Test accuracy (in %) grouped by modification type (cf. Table 8 for statistics). For BERT, RoBERTa, and ALBERT, we select the best model according to the probing results shown in Fig. 1. Numbers in parentheses show the accuracy of the non-contextualized baseline (layer 0) for each model.

Diagnostics
To investigate the generalization abilities and model-specific behavior of the best probing classifiers, we evaluate them on a diagnostics dataset containing sentences with the following properties: (1) adjacent antecedent and relativizer (see example (2-a), grammatical), (2) longer distance between antecedent and relativizer (see example (2-b), grammatical), (3) longer distance between relativizer and RC verb (see example (2-c), grammatical), (4) wrong agreement between adjacent antecedent and relativizer (see example (2-d), ungrammatical), and (5) intervening agreement attractors leading to wrong agreement (see example (2-e), ungrammatical). To create the dataset, we manually select four sentences and manipulate them according to each of the above-mentioned cases. Three sentences have nominal antecedents and one has a clausal antecedent. For the grammatical manipulations (cases 1-3), we additionally distinguish between restrictive and non-restrictive variants of each sentence. Overall, we test a total of 32 sentences 5 and evaluate the models' confidence based on the un-normalized log probabilities (logits), where values > 0 result in classifying a sentence as grammatical. In the manipulations for case 2, besides considering nominal vs. clausal antecedents, we vary the length of the intervening phrase (3-7 intervening words). 6
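Before turning to the individual cases, the confidence read-out can be sketched as follows, assuming the scikit-learn logistic regression probing classifiers described above (the function name is ours):

```python
def grammaticality_confidence(clf, sentence_embedding):
    """Return the signed logit of the probing classifier for one sentence.
    Positive values are classified as grammatical, negative as ungrammatical;
    the magnitude serves as the classifier's confidence."""
    logit = clf.decision_function([sentence_embedding])[0]
    return logit, "grammatical" if logit > 0 else "ungrammatical"
```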
(2) a. We just heard a debate which was about the differences in wage rates [...]

Case 1. For restrictive and non-restrictive variants with nominal antecedents, BERT and ALBERT correctly predict the sentences to be grammatical with relatively high confidence (restrictive: 2.7, non-restrictive: 1.6). For the clausal antecedent, they fail in the restrictive variant (−0.39 and −1.18, respectively), while they can deal with non-restrictive RCs (0.89 and 0.57), the comma possibly being a strong indicator of non-restrictiveness. RoBERTa always wrongly predicts ungrammaticality.
Case 2. With intervening phrases between antecedent and relativizer (see example (2-b)), BERT and ALBERT again deal well in both restrictive and non-restrictive RCs for nominal antecedents, but fail for restrictive RCs in the clausal case. RoBERTa again fails to predict the grammatical class. Considering the length of the intervening phrase, for restrictive RCs, BERT deals with intervening phrases with high confidence (2.35), provided they are relatively short (3-4 words). For longer distance, BERT is less confident (−0.69) and chooses the (wrong) ungrammatical class. ALBERT acts similarly, but fails with higher confidence (−1.15). RoBERTa always fails without a clear pattern. For the non-restrictive RCs, both BERT and ALBERT correctly predict the grammatical class, even though distance affects their confidence (the longer the less confident: 2.52 for shorter, 0.54 for longer phrases). RoBERTa again predicts the ungrammatical class for all sentences, but is especially confident for greater distance between antecedent and relativizer (−0.55 for shorter, −1.04 for longer phrases).
Case 3. With restrictive RCs, RoBERTa has problems with intervening phrases between relativizer and RC verb, again always predicting the ungrammatical class. BERT predicts this case correctly, but fails on sentences with a clausal antecedent, though with lower confidence than ALBERT (−0.84 vs. −1.54). In non-restrictive RCs, BERT and ALBERT obtain perfect accuracy for both nominal and clausal antecedents, being quite confident in their predictions, while RoBERTa still fails, sometimes with very high confidence. 7

Case 4. With wrong agreement and adjacent nominal antecedent and relativizer, RoBERTa is quite confident of the ungrammaticality of the sentences (−1.46), while BERT and ALBERT are confident in the clausal antecedent case (−1.38 and −1.75, respectively), but much less confident or even wrong with nominal antecedents.

Case 5. With intervening agreement attractors and wrong agreement, BERT gets confused by the attractors (see example (2-e), where DeGeneres is taken as the antecedent instead of debate) and is only confident in the clausal antecedent case (−1.38). RoBERTa, instead, is very confident in recognizing the wrong agreement (−1.24), while ALBERT also gets confused, though less so and with much less confidence than BERT (0.08).
In summary, RoBERTa is quite confident in wrong agreement cases. BERT and ALBERT deal much better than RoBERTa with longer distances between antecedent and relativizer. Also, BERT and ALBERT learn to recognize non-restrictive RCs quite well and can deal with phrases between relativizer and RC verb. Thus, even though the models achieve very high probing accuracy overall (see § 3.3), evaluating on more complex cases reveals that each model seems to rely on different kinds of information, strongly affecting the generalization abilities of the probing classifiers. While we are aware of the diagnostics set's very limited extent, which hinders generalizable conclusions, the controlled diagnostics evaluation gives an indication of possible differences underlying prediction choices across models.

Motivation
For a more comprehensive picture of the differences between models, we perform a masked language modeling evaluation (Goldberg, 2019), looking at the models' predictions of relativizers as well as antecedents. Besides grammaticality, we also test whether the models capture semantic knowledge.

Analyzing Grammatical and Semantic Knowledge
We extract sentences containing restrictive 8 object or subject RCs with one of the three relativizers from the registers magazines and academic 9 of the COCA corpus (Davies, 2015). All sentences are formatted as described in § 3.2, e.g.
[CLS] The woman [MASK] studies linguistics. [SEP]. The size of each RC-type set depends on the frequency of available sentences in the COCA corpus and therefore varies between 20 and 50 sentences per type.
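A minimal sketch of this setup using the fill-mask pipeline of the transformers library follows (note that RoBERTa uses <mask> instead of [MASK]; the printed scores are illustrative, not our reported results):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

# The tokenizer adds [CLS] and [SEP] automatically.
for pred in fill("The woman [MASK] studies linguistics.", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
# Expected behavior: relativizers such as "who" should rank at the top.
```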

Relativizer Prediction
We test all three models by masking the relativizer, considering the following metrics for evaluation: (1) mean precision at 1 (MP@1) (model's precision of target prediction at first position), (2) mean target rank (MTR), and (3) normalized mean entropy (NME) (uncertainty of the model's prediction). For MP@1, all three models generally perform well across RCs apart from which (Table 4a). BERT and RoBERTa show similar performance, while ALBERT diverges slightly, with fairly weak predictions of who in object RCs and comparatively accurate predictions of which in both object and subject RCs. MTR is negatively correlated with MP@1 (the larger the divergence from 1, the lower the precision of the prediction). NME reflects the two other measures: BERT and RoBERTa are quite confident about their predominantly successful predictions, while ALBERT shows a higher uncertainty about overall much weaker predictions.
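The three metrics can be sketched as follows. This is our reading of the definitions above, with `ranks` holding the rank of the target token per example (1 = top prediction) and `probs` the per-example probability distributions over the vocabulary; normalizing the entropy by log |V| (so that 1 corresponds to a uniform distribution) is an assumption of this sketch:

```python
import numpy as np

def mp_at_1(ranks):
    """Mean precision at 1: share of examples whose top prediction is the target."""
    return float(np.mean(np.asarray(ranks) == 1))

def mean_target_rank(ranks):
    """Average rank of the target token; values close to 1 indicate high precision."""
    return float(np.mean(ranks))

def normalized_mean_entropy(probs):
    """Mean entropy of the predictive distributions, normalized to [0, 1]."""
    probs = np.asarray(probs)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(np.mean(entropy / np.log(probs.shape[-1])))
```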
Next, we manually evaluate 10 the actual predictions according to three criteria: animacy (agreement between antecedent and relativizer), plausibility (semantically plausible sentence), and grammaticality (Table 4b). The results show that the actual felicity of the predictions is much stronger than MP@1 would suggest, since non-target predictions are not necessarily infelicitous. This is especially true for which, due to its high interchangeability with that. While both relativizers are synonymous, which is primarily used in non-restrictive RCs. Since the sample sentences are restrictive, the models seem to predict the most frequent relativizer in restrictive RCs, which is that (see example (3)).

(3) The action [MASK] it contemplates is command. (all models) (object RC, target = which, prediction = that)

Animacy is predicted reliably by BERT and RoBERTa, while ALBERT has issues with who object predictions. This is due to ALBERT's preference for predicting that instead of who, especially in fuzzy cases of general nouns describing humans (person, people, family, etc.; see example (4)). Sometimes animacy is predicted correctly but case marking is infelicitous; the model seems to take into account only the subject and verb of the clause, ignoring possible clausal complements (see example (5)).
(4) He spared no expense in moving, reassembling, and restoring buildings of people [MASK] he felt were the backbone of our nation. (ALBERT) (object RC, target = who, prediction = that)

(5) Jennifer had been having an online affair with a person [MASK] she believed was a man named Christopher. (ALBERT) (object RC, target = who, prediction = whom)

Accuracy for plausibility is very high, too, with ALBERT making the least plausible predictions. Note that plausibility entails grammaticality: example (5) shows animacy agreement, but is neither grammatical nor plausible. Grammaticality, in contrast, does not entail plausibility (see example (6)), where BERT predicts the most frequent collocate of whirl according to COCA (Davies, 2015).
(6) This is something else, for I saw a whirl [MASK] I knew was a large bass. (BERT) (object RC, target = which, prediction = wind)

Of all criteria, predictions are most accurate for grammaticality; even ALBERT reaches a precision of 90% on average. Finally, we consider the ratio of (any) relativizers predicted vs. other word classes, indicating how well the models recognize a syntagmatic environment typical for an RC to occur in. RoBERTa, again, is most successful here (95% of predictions are relativizers), followed by BERT (94%) and ALBERT (90%). In summary, BERT and RoBERTa perform equally well at target prediction, while RoBERTa is most successful qualitatively (grammaticality, plausibility, and relativizer prediction). Overall, our qualitative analysis shows that all models largely make grammatical, plausible, and animacy-conform predictions.

Antecedent Prediction
Here, we test all three models by masking the antecedent (see example (7)), considering again mean precision (MP@1), mean target rank (MTR), and normalized mean entropy (NME) (Table 5a). Due to the higher variation in lexical choices for antecedents, MP@1 is, expectedly, much lower than for relativizers. The best @1 predictions are achieved by all models for who and that object RCs. Comparing models, BERT achieves the best performance in all cases. MTR lies between 2.5 and 3.5, with antecedents in who subject RCs being predicted most easily by all models (rank around 2.5), which is also reflected in NME. The models are equally confident in who object RCs, while they are less confident in the other cases, with ALBERT being the least and RoBERTa the most confident model.

(7) Rheumatologists have to be medical detectives, because so many of the [MASK] that we treat are obscure. (object RC, masked target: diseases)

Considering the first predicted antecedent, we qualitatively evaluate whether (a) the animacy constraint is met, i.e., who RCs should have animate antecedents, and that and which most prominently inanimate ones 11 (see (8)), (b) the sentence is plausible, and (c) the sentence is grammatical (see example (9), which is ungrammatical due to noun-verb agreement between animals (plural) and was (singular) 12 ). Overall, BERT and RoBERTa show very high accuracy for these cases (Table 5b). In a few cases, the animacy of the antecedent is changed (see example (10)), and sentences are grammatical but semantically implausible, or ungrammatical and implausible (see examples (10) and (11)). ALBERT performs worst with respect to plausibility, especially for which in subject RCs (see example (12)), with most cases being repetitions of words in the sentence (see example (13)).

(10) The only thing that brightened the gloom stretching out before him was the goodness of the [MASK] that had offered him refuge before his trial. (target = family, prediction = darkness)

(11) Coffee, the cash, and the goods they purchased, even when used in traditional exchange, were devoid of the social relations with [MASK] which were present in traditional products. (target = predecessors, prediction = humans)

(12) He and Carolyn saw all of Kevin's college [MASK] that were within a day's drive of McIntyre. (target = games, prediction = campuses)

(13) The bill makes it illegal to adopt or enforce any law or [MASK] which allows gays to claim discrimination. (target = policy, prediction = law)

In summary, all models have problems with the relativizer which, while they perform best with who. The models suffer most on animacy and plausibility, especially for subject RCs, but perform quite well on grammaticality, indicating a better awareness of grammatical than of semantic knowledge. We further evaluate the models' semantic knowledge considering the predicted types of antecedents: (a) identical to the target, (b) synonym, (c) hypernym, general noun, determiner, or pronoun, (d) hyponym 13, or (e) not directly related (i.e., no direct hierarchical relationship) or completely unrelated 14. The results (see Table 10, §A.3) show that all models perform less well on subject than on object RCs. ALBERT performs worst, as the majority of its predictions fall into not directly related/unrelated antecedents or hypernyms (more general words). RoBERTa is quite good at predicting identical targets, outperforming the other two models, especially in object RCs with who. In who subject RCs, BERT and ALBERT most often predict more general antecedents given more specific targets (hypernyms, e.g., person instead of girl, workers, etc.).
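While this typing of predicted antecedents was done manually, the decision scheme can be approximated automatically; the following is a hypothetical sketch using WordNet via NLTK (not the procedure behind our reported numbers):

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

def antecedent_type(predicted, target):
    """Approximate the relation between a predicted and a target antecedent."""
    if predicted == target:
        return "identical"
    p_syns = wn.synsets(predicted, pos=wn.NOUN)
    t_syns = wn.synsets(target, pos=wn.NOUN)
    if not p_syns or not t_syns:
        return "not directly related/unrelated"
    p, t = p_syns[0], t_syns[0]  # simplification: most frequent sense only
    if p == t or predicted in t.lemma_names():
        return "synonym"
    if t in p.closure(lambda s: s.hypernyms()):
        return "hyponym"   # prediction is more specific than the target
    if p in t.closure(lambda s: s.hypernyms()):
        return "hypernym"  # prediction is more general than the target
    return "not directly related/unrelated"

print(antecedent_type("person", "girl"))  # -> "hypernym"
```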

Discussion and Conclusion
With our work, we have moved towards tackling some issues in the evaluation of the linguistic capabilities of pre-trained transformer-based masked language models, as, e.g., proposed by Rogers et al. (2020). Most prominently, we try to better understand how strong performance on supervised probing tasks is reflected in the predictions of the language models. To do so, we create a dataset based on naturalistic (not artificially generated) data and perform an extensive evaluation of masked language predictions in the context of RCs. Moreover, rather than considering one model only, we compare three models to investigate the extent to which findings for BERT generalize to other transformer-based models.
Our probing results show a significant improvement of all three transformer-based models over the baselines at almost all layers, suggesting that the models indeed encode linguistic knowledge relevant for grammaticality classification, as shown in previous work (e.g., Goldberg (2019)). Performance is similar across models, with BERT and RoBERTa performing slightly better than ALBERT. While contextualization improves results overall, we have shown that it helps immensely for less frequent RC modification types such as which → who, on which non-contextualized baselines learn simple heuristics. Evaluation on a diagnostic set, however, clearly reveals weaknesses of probing classifiers and model-specific behavior. RoBERTa is quite confident in cases of wrong agreement between antecedent and relativizer, which are problematic for BERT and ALBERT; considering distance between relativizer and antecedent/RC verb, the latter two clearly outperform RoBERTa. Based on these insights, we conclude that viewing probing results in isolation can lead to overestimating the linguistic capabilities of a model.
Our masked language modeling evaluation provided deeper insights into model-specific differences. We evaluated relativizer as well as antecedent prediction. Overall, all models show better performance on grammatical than on semantic knowledge (animacy and plausibility). Regarding relativizer prediction, all models perform worst on the target word which (plausible, as it is the most versatile of the relativizers). Comparing models, BERT is best at predicting the actual targets, RoBERTa outperforms the others in capturing grammatical and semantic knowledge, while ALBERT performs worst overall. Evaluation on semantic types of antecedents shows predictions of unrelated or not directly related antecedents, especially for ALBERT in which RCs. Interestingly, both BERT and ALBERT predict hierarchically more general antecedents in who RCs, while RoBERTa better captures specificity.
We believe that more work in this direction will (a) raise awareness of the complexity of the linguistic knowledge that such models might have to capture, considering, e.g., generalization tasks, and (b) enhance evaluation strategies for capturing linguistic knowledge, such as combinations of probing, diagnostic, and cloze tests, as well as the development of best practices for evaluation.

A Appendix
A.1 Probing Dataset

Fig. 2 shows how we identify sentences containing RCs using a dependency parse tree obtained with SpaCy. A sentence with an RC can be identified if the antecedent has an outgoing edge with the tag RELCL. The RC itself can be further extracted, since the main verb of the RC has an incoming edge with the tag RELCL, and the subtree headed by the RC main verb constitutes the RC. This procedure enables us to extract grammatical sentences with RCs from text (the example sentence in Fig. 2 is I find the book which you brought yesterday interesting.).

However, probing classifiers for grammatical acceptability require both grammatical and ungrammatical sentences. Since the grammaticality of RCs in English depends on restrictiveness, animacy of the head noun, and whether the relativizer occupies the subject or the object position, we populate three meta-data variables for each grammatical sentence with respect to these aspects of RCs: ANIMATE, RESTRICTIVE, and SUBJRC. The variables ANIMATE and RESTRICTIVE are populated using a set of hand-crafted rules for the usage of RCs in American English (illustrated in Fig. 3), while the variable SUBJRC is populated using the incoming edge to the relativizer in the parse tree (e.g., NSUBJ for subject RCs vs. DOBJ for object RCs). We discard all sentences where at least one meta-data variable cannot be populated with certainty.

Figure 3: Annotation decision process for the meta-data variables: (a) Animate: the relativizers who and which are directly categorized as animate and non-animate, respectively. The relativizer that can be either; thus, we categorize these sentences based on the antecedent. We compile two disjoint sets of antecedents that exclusively occur either with who or with which, and the decision is made based on the membership of the antecedent. If the antecedent is not a member of either set, the sentence is discarded.
(b) Restrictive can be identified easily, since non-restrictive RCs in American English are always preceded by a comma.
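The decision process of Fig. 3 can be summarized in the following sketch; the antecedent sets shown here are short illustrative stand-ins for the disjoint sets compiled from the corpus:

```python
# Illustrative (incomplete) stand-ins for the corpus-derived disjoint sets.
WHO_ONLY = {"man", "woman", "person", "child"}    # occur exclusively with who
WHICH_ONLY = {"book", "house", "idea", "dress"}   # occur exclusively with which

def annotate(relativizer, antecedent, preceded_by_comma):
    """Populate the ANIMATE and RESTRICTIVE meta-data variables, or discard."""
    if relativizer == "who":
        animate = True
    elif relativizer == "which":
        animate = False
    elif antecedent in WHO_ONLY:     # disambiguate "that" via the antecedent
        animate = True
    elif antecedent in WHICH_ONLY:
        animate = False
    else:
        return None  # discard: animacy cannot be determined with certainty
    # Non-restrictive RCs in American English are preceded by a comma.
    return {"ANIMATE": animate, "RESTRICTIVE": not preceded_by_comma}
```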
Using the aforementioned annotation procedure, and given the values of the three meta-data variables, each sentence can be manipulated to create an ungrammatical sentence that forms a minimal pair with the original sentence. The resulting paradigms from the three meta-data variables and the set of all possible modifications that can be applied to each paradigm are presented in Table 6. Each grammatical sentence in the dataset is manipulated using all possible modifications of its paradigm to generate a "bag-of-sentences" associated with that sentence (of which the original grammatical sentence is a member). To create the final dataset, we sample one sentence from each bag and construct a balanced dataset of 48,060 sentences with grammatical acceptability labels. Tables 7 and 8 show summary statistics of the dataset.

A.2 Probing Results

Fig. 4 shows layer-wise probing accuracies for all models considered in this paper. CLS-pooling clearly leads to worse sentence representations, performing on par with the non-contextualized GloVe and fastText mean-embeddings for most layers.

A.2.1 ALBERT-base-v1 vs. ALBERT-xxlarge-v1

Fig. 5 shows layer-wise probing accuracies for ALBERT-base-v1 (ALBERT) and ALBERT-xxlarge-v1. ALBERT-xxlarge-v1 contains 235M parameters and hence roughly 20x as many as ALBERT-base-v1 and 2x as many as BERT and RoBERTa. As the results in Fig. 5 show, the larger number of parameters indeed results in significantly higher probing accuracy, outperforming all other models by a large margin. However, it should be noted that the larger number of parameters results from the larger dimensionality of ALBERT-xxlarge-v1's embedding space, which is more than 5x larger than that of ALBERT-base-v1, BERT, and RoBERTa (4096 vs. 768). This makes the obtained probing results not directly comparable. We leave a more detailed investigation of the role of the size of the embedding space for future work.

Table 9 shows the results of the best ALBERT-xxlarge-v1 probing classifier compared to all other models, grouped by modification.

Table 9: Test accuracy (in %) grouped by modification type (cf. Table 8 for statistics). For BERT, RoBERTa, ALBERT-base-v1 (ALBERT), and ALBERT-xxlarge-v1, we select the best model according to the probing results shown in Fig. 1. Numbers in parentheses show the accuracy of the non-contextualized baseline (layer 0) for each model. ALBERT-xxlarge-v1 performs especially well on the which → who modification.

A.3 Antecedent Prediction

Table 10 shows percentages for the predicted types of antecedents: (a) identical to the target, (b) synonym, (c) hypernym, general noun, determiner, or pronoun, (d) hyponym (i.e., semantically lower in the hierarchy and thus more specific), or (e) not directly related (i.e., no direct hierarchical relationship) or completely unrelated. The results show ALBERT to perform worst, as the majority of its predictions fall into not directly related/unrelated antecedents or hypernyms (more general words). RoBERTa is quite good at predicting identical targets, outperforming the other two models, especially in object RCs with who. While the which case in object RCs is harder for all models, RoBERTa still makes the best predictions considering identical and synonym predictions. All models perform less well on subject RCs, especially for which and that (40% to >50% unrelated/not directly related targets). Interestingly, for animate antecedents (who relativizer), the majority for all models falls into hypernyms, i.e., more general variants. This is especially the case for BERT and ALBERT.
Table 10: Predicted types of antecedents by relativizer, in percent. The type hypernym also encompasses general nouns, determiners, and pronouns.