Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs

Though state-of-the-art sentence representation models can perform tasks requiring significant knowledge of grammar, it is an open question how best to evaluate their grammatical knowledge. We explore five experimental methods inspired by prior work evaluating pretrained sentence representation models. We use a single linguistic phenomenon, negative polarity item (NPI) licensing, as a case study for our experiments. NPIs like any are grammatical only if they appear in a licensing environment like negation (Sue doesn’t have any cats vs. *Sue has any cats). This phenomenon is challenging because of the variety of NPI licensing environments that exist. We introduce an artificially generated dataset that manipulates key features of NPI licensing for the experiments. We find that BERT has significant knowledge of these features, but its success varies widely across different experimental methods. We conclude that a variety of methods is necessary to reveal all relevant aspects of a model’s grammatical knowledge in a given domain.


Introduction
Recent sentence representation models have attained state-of-the-art results on language understanding tasks, but standard methodology for evaluating their knowledge of grammar has been slower to emerge. Recent work evaluating grammatical knowledge of sentence encoders like BERT (Devlin et al., 2018) has employed a variety of methods. For example, Shi et al. (2016), Ettinger et al. (2016), and Tenney et al. (2019) use probing tasks to target a model's knowledge of particular grammatical features. Marvin and Linzen (2018) and Wilcox et al. (2019) compare language models' probabilities for pairs of minimally different sentences differing in grammatical acceptability. Linzen et al. (2016), Warstadt et al. (2018), and Kann et al. (2019) use Boolean acceptability judgments inspired by methodologies in generative linguistics. However, we have not yet seen any substantial direct comparison between these methods, and it is not yet clear whether they tend to yield similar conclusions about what a given model knows.
We aim to better understand the trade-offs in task choice by comparing different methods inspired by previous work to evaluate sentence understanding models in a single empirical domain. We choose as our case study negative polarity item (NPI) licensing in English, an empirically rich phenomenon widely discussed in the theoretical linguistics literature (Kadmon and Landman, 1993; Giannakidou, 1998; Chierchia, 2013, a.o.). NPIs are words or expressions that can only appear in environments that are, in some sense, negative (Fauconnier, 1975; Ladusaw, 1979; Linebarger, 1980). For example, any is an NPI because it is acceptable in negative sentences (1) but not positive sentences (2); negation thus serves as an NPI licensor. NPIs furthermore cannot be outside the syntactic scope of a licensor (3). Intuitively, a licensor's scope is the syntactic domain in which an NPI is licensed, and it varies from licensor to licensor. A sentence with an NPI present is only acceptable in cases where (i) there is a licensor, as in (1) but not (2), and (ii) the NPI is within the scope of that licensor, as in (1) but not (3).
(3) *Any cookies haven't been eaten.
We compare five experimental methods to test BERT's knowledge of NPI licensing. We consider: (i) a Boolean acceptability classification task to test BERT's knowledge of sentences in isolation, (ii) an absolute minimal pair task evaluating whether the absolute Boolean outputs of acceptability classifiers distinguish between pairs of minimally different sentences that differ in acceptability and each isolate a single key property of NPI licensing, (iii) a gradient minimal pair task evaluating whether the gradient outputs of acceptability classifiers distinguish between minimal pairs, (iv) a cloze test evaluating the grammatical preferences of BERT's masked language modeling head, and (v) a probing task directly evaluating BERT's representations for knowledge of specific grammatical features relevant to NPI licensing.
We find that BERT does have knowledge of all the key features necessary to predict the acceptability of NPI sentences in our experiments. However, our five methods give meaningfully different results. While the gradient minimal pair experiment and, to a lesser extent, the acceptability classification and cloze tests indicate that BERT has systematic knowledge of all NPI licensing environments and relevant grammatical features, the absolute minimal pair and probing experiments show that BERT's knowledge is in fact not equal across these domains. We conclude that each method depicts different relevant aspects of a model's grammatical knowledge; comparing both gradient and absolute measures of performance of models gives a more complete picture. We recommend that future studies use multiple complementary methods to evaluate model performance.

Related Work
Evaluating Sentence Encoders The success of sentence encoders and broader neural network methods in NLP has prompted significant interest in understanding the linguistic knowledge encapsulated in these models.
A portion of related work focuses on Boolean classification tasks of English sentences to evaluate the grammatical knowledge encoded in these models. The objective of this task is to predict whether a single input sentence is acceptable or not, abstracting away from gradience in acceptability judgments. Linzen et al. (2016) train classifiers on this task using data with manipulated verbal inflection to investigate whether LSTMs can identify subject-verb agreement violations, and therefore a (potentially long distance) syntactic dependency. Warstadt et al. (2018) train models on this task using the CoLA corpus of acceptability judgments as a method for evaluating domain-general grammatical knowledge, and Warstadt and Bowman (2019) analyze how these domain-general classifiers perform on specific linguistic phenomena. Kann et al. (2019) use this task to evaluate whether sentence encoders represent information about verbal argument structure classes.
A related method employs minimal pairs, consisting of two sentences that differ minimally in their content and differ in linguistic acceptability, to judge whether a model is sensitive to a single grammatical feature. Marvin and Linzen (2018) and Wilcox et al. (2019) apply this method to phenomena such as subject-verb agreement, NPI licensing, and reflexive licensing.
Another branch of work uses probing tasks in which the objective is to predict the value of a particular linguistic feature given an input sentence. Probing tasks have been used to investigate whether sentence embeddings encode syntactic and surface features such as tense and voice (Shi et al., 2016), sentence length and word content (Adi et al., 2016), or syntactic depth and morphological number (Conneau et al., 2018). Giulianelli et al. (2018) use diagnostic classifiers to track the propagation of information in RNN-based language models. Ettinger et al. (2018) and Dasgupta et al. (2018) use automatic data generation to evaluate compositional reasoning. Tenney et al. (2019) introduce sub-sentence level probing tasks derived from common NLP tasks.
Negative Polarity Items In the theoretical literature on NPIs, proposals have been made to unify the properties of the diverse NPI licensing environments. For example, a popular view states that NPIs are licensed if and only if they occur in downward entailing (DE) environments (Fauconnier, 1975; Ladusaw, 1979), i.e., environments that license inferences from sets to subsets. For instance, (4) shows that the environment under the scope of negation is DE. The property been to Paris is a subset of been to France, because those who have been to Paris are a subset of those who have been to France. While the inference from set to subset is normally invalid (4-b), it is valid if the property is embedded under negation (4-a). According to the DE theory of NPI licensing, this explains the contrast between (1) and (2).
(4) a. I haven't been to France.
       → I haven't been to Paris. (DE)
    b. I have been to France.
       ↛ I have been to Paris. (not DE)
This view does not capture licensing in some environments, for example questions. No theory is yet accepted as identifying the unifying properties of all NPI licensing environments.
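The set-to-subset reasoning behind (4) can be checked mechanically with toy sets. The sketch below is only an illustration: the individuals and the helper `entails` are hypothetical, and modeling a property as the set of people it holds of is a simplifying assumption.

```python
def entails(premise_set, conclusion_set):
    """'x is in premise_set' entails 'x is in conclusion_set' iff subset."""
    return premise_set <= conclusion_set

# Hypothetical toy domain; Paris visitors are a subset of France visitors.
people = {"ann", "bo", "cy"}
been_to_france = {"ann", "bo"}
been_to_paris = {"ann"}

# Positive environment (4-b): France does not entail Paris ("bo" is a counterexample).
positive_ok = entails(been_to_france, been_to_paris)
# Under negation (4-a): not-France entails not-Paris, so the environment is DE.
negated_ok = entails(people - been_to_france, people - been_to_paris)

print(positive_ok, negated_ok)  # False True
```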
Within computational linguistics, NPIs are used as a testing ground for neural network models' grammatical knowledge. Marvin and Linzen (2018) find that LSTM LMs do not systematically prefer sentences with licensed NPIs (1) over sentences with unlicensed NPIs (2). Jumelet and Hupkes (2018) show that LSTM LMs find a relation between the licensing context and the negative polarity item, and appear to be aware of the scope of this context. Wilcox et al. (2019) use NPIs and filler-gap dependencies, as instances of non-local grammatical dependencies, to probe the effect of supervision with hierarchical structure. They find that structurally-supervised models outperform state-of-the-art sequential LSTM models, showing the importance of structure in learning non-local dependencies like NPI licensing.
CoLA We use the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) in our experiments to train supervised acceptability classifiers. CoLA is a dataset of over 10k syntactically diverse example sentences from the linguistics literature with Boolean acceptability labels. As is conventional in theoretical linguistics, sentences are taken to be acceptable if native speakers judge them to be possible sentences in their language. Such sentences are widely used in linguistics publications to illustrate phenomena of interest. The examples in CoLA are gathered from diverse sources and represent a wide array of syntactic, semantic, and morphological phenomena. As measured by the GLUE benchmark, acceptability classifiers trained on top of BERT and related models reach near-human performance on CoLA.

Methods
We experiment with five approaches to the evaluation of grammatical knowledge of sentence representation models like BERT using our generated NPI acceptability judgment dataset (§4). Each data sample in the dataset contains a sentence, a Boolean label which indicates whether the sentence is grammatically acceptable or not, and three Boolean meta-data variables (licensor, NPI, and scope; Table 2). We evaluate four model types: BERT-large, BERT with fine-tuning on one of two tasks, and a simple bag-of-words baseline using GloVe word embeddings (Pennington et al., 2014).
Boolean Acceptability We test the model's ability to judge the grammatical acceptability of the sentences in the NPI dataset. Following standards in linguistics, sentences for this task are assumed to be either totally acceptable or totally unacceptable. We fine-tune the sentence representation models to perform these Boolean judgments. For BERT-based sentence representation models, we add a classifier on top of the [CLS] embedding of the last layer. For BoW, we use a max pooling layer followed by an MLP classifier. The performance of the models is measured by the Matthews Correlation Coefficient (MCC; Matthews, 1975) between the predicted label and the gold label.
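For readers unfamiliar with MCC, the following is a minimal sketch of how it can be computed from Boolean gold and predicted labels. The function name and the convention of returning 0 when the denominator vanishes are our assumptions for illustration, not a description of the toolkit used in the experiments.

```python
import math

def matthews_corrcoef(gold, pred):
    """Matthews correlation coefficient for Boolean (0/1) labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

gold = [1, 1, 0, 0]
print(matthews_corrcoef(gold, [1, 1, 0, 0]))  # 1.0  (perfect agreement)
print(matthews_corrcoef(gold, [0, 0, 1, 1]))  # -1.0 (perfect disagreement)
print(matthews_corrcoef(gold, [1, 1, 1, 1]))  # 0.0  (constant prediction)
```

Unlike plain accuracy, a classifier that always outputs the majority label scores 0 here, which matters when acceptable and unacceptable sentences are unevenly represented.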
Absolute Minimal Pair We conduct a minimal pair experiment to analyze Boolean acceptability classifiers on minimally different sentences. Two sentences within a paradigm form a minimal pair if they differ in exactly one feature (licensor, NPI, or scope) and have different acceptability. For example, the sentences in (1) and (2) differ in whether an NPI licensor (negation) is present. We evaluate the models trained on acceptability judgments with the minimal pairs. In absolute minimal pair evaluation, a model needs to correctly classify both sentences in the pair for the pair to be counted as correct.

Gradient Minimal Pair
The gradient minimal pair evaluation is a more lenient version of absolute minimal pair evaluation: Here, we count a pair as correctly classified as long as the Boolean classifier's output for the acceptable class is higher for the acceptable sentence than for the unacceptable sentence. In other words, the classifier need only predict the acceptable sentence has the higher likelihood of being acceptable, but need not correctly predict that it is acceptable.
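The difference between the absolute and gradient criteria can be made concrete with a short sketch. Here each pair is represented by the classifier's probability of the acceptable class for the acceptable and the unacceptable sentence; the 0.5 decision threshold, the function names, and the example scores are assumptions made for illustration.

```python
def absolute_correct(p_good, p_bad, threshold=0.5):
    """Both sentences must receive the right Boolean label."""
    return p_good >= threshold and p_bad < threshold

def gradient_correct(p_good, p_bad):
    """The acceptable sentence only needs the higher score."""
    return p_good > p_bad

def pair_accuracy(pairs, criterion):
    """Fraction of (p_good, p_bad) pairs judged correct by `criterion`."""
    return sum(criterion(g, b) for g, b in pairs) / len(pairs)

# Made-up scores: (P(acceptable | good sentence), P(acceptable | bad sentence))
pairs = [(0.9, 0.1), (0.8, 0.6), (0.4, 0.2), (0.3, 0.7)]
print(pair_accuracy(pairs, absolute_correct))  # 0.25
print(pair_accuracy(pairs, gradient_correct))  # 0.75
```

Every pair that passes the absolute criterion also passes the gradient one, so gradient accuracy can never be lower than absolute accuracy for the same classifier.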
Cloze Test In the cloze test, a standard sentence-completion task, we use the masked language modeling (MLM) component in BERT (Devlin et al., 2018) and evaluate whether it assigns a higher probability to the acceptable sentence in a minimal pair, following Linzen et al. (2016). An MLM predicts the probability of a single masked token based on the rest of the sentence. The minimal pairs tested are a subset of those in the absolute and gradient minimal pair experiments, where both sentences must be equal in length and differ in only one token. This differing token is replaced with [MASK], and the minimal pair is taken to be classified correctly if the MLM assigns a higher probability to the token from the acceptable sentence. In contrast with the other minimal pair experiments, this experiment is entirely unsupervised, using BERT's native MLM functionality.
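The cloze procedure can be sketched independently of any particular MLM. In the sketch below, `token_prob` is a hypothetical callable standing in for BERT's MLM head (it returns the probability of a candidate token at a masked position), and the toy probabilities are invented purely so the example runs.

```python
from typing import Callable, Sequence

MASK = "[MASK]"

def cloze_correct(good: Sequence[str], bad: Sequence[str],
                  token_prob: Callable[[Sequence[str], int, str], float]) -> bool:
    """True if the MLM prefers the token from the acceptable sentence."""
    if len(good) != len(bad):
        raise ValueError("sentences must have equal length")
    diff = [i for i, (g, b) in enumerate(zip(good, bad)) if g != b]
    if len(diff) != 1:
        raise ValueError("sentences must differ in exactly one token")
    i = diff[0]
    masked = list(good)
    masked[i] = MASK  # the context is identical for both candidates
    return token_prob(masked, i, good[i]) > token_prob(masked, i, bad[i])

# Hypothetical stand-in for an MLM: fixed, made-up probabilities.
def toy_token_prob(masked, position, candidate):
    return {"often": 0.20, "ever": 0.01}.get(candidate, 0.0)

good = ["Sue", "has", "often", "left"]   # NPI replacement, acceptable
bad = ["Sue", "has", "ever", "left"]     # unlicensed NPI, unacceptable
result = cloze_correct(good, bad, toy_token_prob)
print(result)  # True
```

Because both candidates are scored in the same masked context, the comparison isolates the single differing token, which is why pairs of unequal length or with several differing tokens are excluded.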
Feature Probing We use probing classifiers as a more fine-grained approach to the identification of grammatical variables. We freeze the sentence encoders both with and without fine-tuning from the acceptability judgment experiments and train lightweight classifiers on top of them to predict meta-data labels corresponding to the key properties a model must learn in order to judge the acceptability of NPI sentences. These properties are whether a licensor is present, whether an NPI is present, and whether the NPI (or a similar item) is in the syntactic scope of the licensor (or a similar item). Crucially, each individual meta-data label by itself does not decide acceptability (i.e., these probing experiments test a different but related set of knowledge from acceptability experiments).
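As a rough illustration of the probing setup, the sketch below fits a logistic-regression probe on fixed feature vectors. The pure-Python training loop, the hyperparameters, and the toy three-dimensional "frozen encoder" outputs are all assumptions for illustration; they do not reproduce the lightweight classifiers or encoders used in the experiments.

```python
import math

def train_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe by stochastic gradient descent.
    Only the probe's weights change; the features stay fixed (frozen encoder)."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_predict(w, b, x):
    """Boolean prediction for one feature vector."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Made-up frozen representations; the label is the 'licensor' meta-data bit.
features = [[1, 0, 0.2], [1, 1, 0.1], [0, 1, 0.3], [0, 0, 0.5]]
licensor = [1, 1, 0, 0]
w, b = train_probe(features, licensor)
predictions = [probe_predict(w, b, x) for x in features]
print(predictions)  # [1, 1, 0, 0]
```

In an actual probing experiment the probe would be scored on held-out sentences rather than on its own training data, and one probe would be trained per meta-data label (licensor, NPI, scope).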

Data
In order to probe BERT's performance on sentences involving NPIs, we generate a set of sentences and acceptability labels for the experiments in this paper. We use generated data so that we can assess minimal pairs, and so that there are sufficient unacceptable sentences. We release all our data and our generation code and vocabulary.
Licensing Features We create a controlled set of 136,000 English sentences using an automated sentence generation procedure, inspired in large part by previous work by Ettinger et al. (2016, 2018), Marvin and Linzen (2018), Dasgupta et al. (2018), and Kann et al. (2019). The set contains nine NPI licensing environments (Table 1), and two NPIs (any, ever). All but one licensor-NPI pair follows a 2×2×2 paradigm, which manipulates three Boolean NPI licensing features: licensor presence, NPI presence, and the occurrence of an NPI within a licensor's scope. Each 2×2×2 paradigm forms 5 minimal pairs. Table 2 shows an example paradigm.
The Licensor feature indicates whether an NPI licensor is present in the sentence. For many environments, there are multiple lexical items that serve as a licensor (e.g., the adverbs environment contains rarely, hardly, seldom, barely, and scarcely as NPI licensors). When the licensor is not present, we substitute it with a licensor replacement that has a similar syntactic distribution but does not license NPIs, again using multiple appropriate lexical items as replacements. For example, licensor replacements for quantifier licensors like every include quantifiers like some, many, and more than three that do not license NPIs.
The NPI feature indicates whether an NPI is in the sentence or if it is substituted by an NPI replacement with similar structural distribution. For example, NPI replacements for ever include other adverbs such as often, sometimes, and certainly.
The Scope feature indicates whether the NPI/NPI replacement is within the scope of the licensor/licensor replacement. As illustrated earlier in (3), a sentence containing an NPI is only acceptable when the NPI falls within the scope of the licensor. What constitutes the scope of a licensor is highly dependent on the type of licensor.
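The count of five minimal pairs per 2×2×2 paradigm follows from the three features. The sketch below enumerates the paradigm under one stated assumption: a sentence is acceptable if it contains no NPI, or if it contains an NPI together with a licensor that has the NPI in its scope. The function names are ours.

```python
from itertools import product

def acceptable(licensor, npi, scope):
    # Assumption: sentences without an NPI are acceptable; with an NPI,
    # a licensor must be present and the NPI must be in its scope.
    return npi == 0 or (licensor == 1 and scope == 1)

def minimal_pairs(cells):
    """(unacceptable, acceptable) pairs that differ in exactly one feature."""
    pairs = []
    for bad in cells:
        for good in cells:
            n_diff = sum(a != b for a, b in zip(bad, good))
            if n_diff == 1 and not acceptable(*bad) and acceptable(*good):
                pairs.append((bad, good))
    return pairs

paradigm = list(product((0, 1), repeat=3))  # (licensor, NPI, scope)
pairs = minimal_pairs(paradigm)
print(len(pairs))  # 5
for bad, good in pairs:
    print(bad, "->", good)
```

Of the five pairs, three differ in NPI presence, one in licensor presence, and one in scope.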
The exception to the 2×2×2 paradigm is the Simple Questions licensing condition, with a reduced 2×2 paradigm. It lacks a scope manipulation because the question takes scope over the entire clause, and in Simple Questions the clause is the whole sentence. The paradigm for Simple Questions is given in Table 3 in the Appendix; it forms only 2 minimal pairs.
Data Generation To generate sentences, we create sentence templates for each condition. Templates follow the general structure illustrated in example (5), in which the part-of-speech (auxiliary verb, determiner, noun, verb) as well as the instance number are specified. For example, N2 is the second instance of a noun in the template. We use these labels here for illustrative purposes; in reality, the templates include more fine-grained properties, such as verb tense and noun number.
Table 2: Example 2×2×2 paradigm using the Questions environment. The licensor (whether) or licensor replacement (that) is in bold. The NPI (ever) or NPI replacement (often) is in italics. When licensor=1, the licensor is present rather than its replacement word. When NPI=1, the NPI is present rather than its replacement. The scope of the licensor/licensor replacement is shown in square brackets (brackets, italicization, and boldface are not present in the actual data). When scope=1, the NPI/NPI replacement is within the scope of the licensor/licensor replacement. Unacceptable sentences are marked with *. The five minimal pairs are connected by arrows that point from the unacceptable to the acceptable sentence.
Given the specifications encoded in the sentence templates, words were sampled from a vocabulary of over 1000 lexical items annotated with 30 syntactic, morphological, and semantic features. The annotated features allow us to encode selectional requirements of lexical items, e.g., what types of nouns a verb can combine with. This avoids blatantly implausible sentences.
For each environment, the training set contains 10K sentences, and the dev and test sets contain 1K sentences each. Sentences from the same paradigm are always in the same set.
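Keeping a paradigm within a single split can be sketched as follows. The function name, the split fractions, the seed, and the `(paradigm_id, sentence)` representation are all hypothetical; the released generation code may organize this differently.

```python
import random

def split_by_paradigm(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole paradigms to train/dev/test.
    `examples` is a list of (paradigm_id, sentence) tuples."""
    ids = sorted({pid for pid, _ in examples})
    random.Random(seed).shuffle(ids)
    n_dev = int(len(ids) * dev_frac)
    n_test = int(len(ids) * test_frac)
    dev_ids = set(ids[:n_dev])
    test_ids = set(ids[n_dev:n_dev + n_test])
    splits = {"train": [], "dev": [], "test": []}
    for pid, sent in examples:
        name = "dev" if pid in dev_ids else "test" if pid in test_ids else "train"
        splits[name].append((pid, sent))
    return splits

# 20 hypothetical paradigms with 8 placeholder sentences each.
examples = [(pid, f"sentence-{pid}-{k}") for pid in range(20) for k in range(8)]
splits = split_by_paradigm(examples)
print({name: len(items) for name, items in splits.items()})
# {'train': 128, 'dev': 16, 'test': 16}
```

Splitting at the paradigm level prevents a model from seeing one member of a minimal pair during training and its partner at test time.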
In addition to our data set, we also test BERT on a set of 104 handcrafted sentences from the NPI sub-experiment in Wilcox et al. (2019), who use a paradigm that partially overlaps with ours, but has an additional condition where the NPI linearly follows its licensor while not being in the scope of the licensor. This is included as an additional test set for evaluating acceptability classifiers in §6.

Data validation We use Amazon Mechanical Turk (MTurk) to validate a subset of our data to ensure that the generated sentences contrast in acceptability as intended. We randomly sample 500 sentences in approximately equal amounts from each environment, NPI and paradigm. Each sentence is rated on a Likert scale of 1-6, with 1 being "the sentence is not possible in English" and 6 being "the sentence is possible in English", by 20 unique participants in the US who self-identified as native English speakers. Participants are compensated $0.25 per HIT and are often able to complete a HIT of 5 sentences in just under 1 minute. Table 4 in the Appendix shows the participants' scores transformed into a Boolean judgment of 0 (unacceptable, score ≤ 3) or 1 (acceptable, score ≥ 4) and presented as the percentage of 'acceptable' ratings assigned to the sentences in each of the NPI-licensing environments. Across all NPI-licensing environments, 81.3% of the sentences labeled acceptable are assigned an acceptable rating by the MTurk raters, and 85.2% of sentences labeled unacceptable are assigned an unacceptable rating. This gives an overall agreement rating of 82.8%. We observe that sentences with 'any' get over-accepted: overall agreement for sentences with 'ever' is 87.3%, while agreement for those with 'any' is 78.3%. We believe this is due to a free-choice interpretation of 'any' occurring more easily than is often reported in the semantics literature. A Wilcoxon signed-rank test (Wilcoxon, 1945) shows that within each environment and for each NPI, the acceptable sentences are more often rated as acceptable by our MTurk validators than the unacceptable sentences (all p-values < 0.001). This contrast holds considering both the raw Likert-scale responses and the responses transformed to a Boolean judgment.

Experimental Settings
We conduct our experiments with the jiant 0.9 (Wang et al., 2019; https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/bert_npi) multitask learning and transfer learning toolkit, the AllenNLP platform (Gardner et al., 2018), and the BERT implementation from HuggingFace (https://github.com/huggingface/pytorch-pretrained-BERT).
Models We study the following sentence understanding models: (i) GloVe BoW: a bag-of-words baseline obtained by max-pooling 300-dimensional GloVe word embeddings trained on 840B tokens (Pennington et al., 2014) and (ii) BERT (Devlin et al., 2018): we use the cased version of the BERT-large model, which works best for our tasks in pilot experiments. In addition, since recent work (Liu et al., 2019; Stickland and Murray, 2019) has shown that intermediate training on related tasks can meaningfully impact BERT's performance on downstream tasks, we also explore two additional BERT-based models: (iii) BERT→MNLI: BERT fine-tuned on the Multi-Genre Natural Language Inference corpus (Williams et al., 2018), motivated both by prior work on pretraining sentence encoders on MNLI (Conneau et al., 2017) and by work showing significant improvements to BERT on downstream semantic tasks (Phang et al., 2018); and (iv) BERT→CCG: BERT fine-tuned on the Combinatory Categorial Grammar Bank corpus (Hockenmaier and Steedman, 2007), motivated by Wilcox et al.'s (2019) finding that structural supervision may improve an LSTM-based sentence encoder's knowledge of non-local syntactic dependencies.

Training-Evaluation Configurations
We are interested in whether sentence representation models learn NPI licensing as a unified property. Can the models generalize from trained environments to previously unseen environments? To answer these questions, for each NPI environment, we extensively test the performance of the models in the following configurations: (i) CoLA: training on CoLA, evaluating on the environment. (ii) 1 NPI: training and evaluating on the same NPI environment. (iii) Avg Other NPI: training independently on every NPI environment except one, averaged over the evaluation results on that environment. (iv) All-but-1 NPI: training on all environments except for one environment, evaluating on that environment. (v) All NPI: training on all environments, evaluating on the environment.
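The five configurations can be summarized as a mapping from a held-out target environment to the training runs it requires. This is a minimal sketch: the function name and data layout are hypothetical, and the three environment names in the usage example are placeholders rather than the full set of nine.

```python
def training_runs(target, environments):
    """For each configuration, list the training runs evaluated on `target`.
    Each run is itself a list of training datasets."""
    others = [e for e in environments if e != target]
    return {
        "CoLA": [["CoLA"]],
        "1 NPI": [[target]],
        # One model per other environment; their scores on `target` are averaged.
        "Avg Other NPI": [[e] for e in others],
        # A single model trained on all other environments pooled together.
        "All-but-1 NPI": [others],
        "All NPI": [list(environments)],
    }

envs = ["adverbs", "conditionals", "questions"]  # placeholder subset
runs = training_runs("questions", envs)
print(runs["Avg Other NPI"])   # [['adverbs'], ['conditionals']]
print(runs["All-but-1 NPI"])   # [['adverbs', 'conditionals']]
```

Avg Other NPI and All-but-1 NPI both test generalization to an unseen environment; they differ in whether the other environments are used one at a time or pooled into a single training set.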

Results
Acceptability Judgments The results in Fig. 1 show that BERT outperforms the BoW baseline on all test data with all fine-tuning settings. Within each BERT variant, MCC reaches 1.0 on all test data in the 1 NPI setting. In the All-but-1 NPI training-evaluation configuration, the performance on all NPI environments for BERT drops. While the MCC value on environments like conditionals and sentential negation remains above 0.9, on the simple question environment it drops to 0.58.
Figure 2: Results from the minimal pair test. The top section shows the average accuracy for detecting the presence of the NPI, the middle section shows average accuracy for detecting the presence of the licensor, and the bottom shows average accuracy of minimal pair contrasts that differ in whether the NPI is in scope of the licensor. Within each section, we show performance of GloVe BoW and BERT models under both absolute preference and gradient preference evaluation methods. The rows represent the training-evaluation configuration, while the columns represent different licensing environments.
Compared with fine-tuning on NPI data, CoLA fine-tuning yields lower BERT performance on most of the NPI environments but better performance on the data from Wilcox et al. (2019).
In comparing the three BERT variants (see full results in Figure 5 in the Appendix), the Avg Other NPI configuration shows that on 7 out of 9 NPI environments, plain BERT outperforms BERT→MNLI and BERT→CCG. Even in the remaining two environments, plain BERT performs about as well as BERT→MNLI and BERT→CCG, indicating that MNLI and CCG fine-tuning brings no obvious gain to acceptability judgments.

Absolute and Gradient Minimal Pairs
The results (Fig. 2) show that models' performance hinges on how minimal pairs differ. When tested on minimal pairs differing by the presence of an NPI, BoW and plain BERT obtain (nearly) perfect accuracy on both absolute and gradient measures across all settings. For minimal pairs differing by licensor and scope, BERT again achieves near perfect performance on the gradient measure, while BoW does not. On the absolute measure, both BERT and BoW perform worse. Overall, this shows that absolute judgment is more challenging when targeting the licensor, which involves a larger pool of lexical items and syntactic configurations than NPIs, and when targeting scope, which requires nontrivial syntactic knowledge about NPI licensing.
As in the acceptability experiment, we find that intermediate fine-tuning on MNLI and CCG does not improve performance (see full results in Figures 6-8 in the Appendix).
Cloze Test The results (Fig. 3) show that even without supervision on NPI data, the BERT MLM can distinguish between acceptable and unacceptable sentences in the NPI domain. Performance is highly dependent on the NPI-licensing environment and type of minimal pair. Accuracy for detecting NPI presence falls between 0.76 and 0.93 for all environments. Accuracy for detecting licensor presence is much more variable, with the BERT MLM achieving especially high performance in conditional, sentential negation, and only environments; and low performance in quantifier and superlative environments.
Feature Probing Results (Fig. 4) show that plain BERT outperforms the BoW baseline in detecting whether the NPI is in the scope of the licensor (henceforth 'scope detection'). As expected, BoW is nearly perfect in detecting the presence of the NPI and the licensor, as these tasks do not require knowledge of syntax or word order. Consistent with results from previous experiments, detecting the presence of a licensor is slightly more challenging for models fine-tuned with CoLA or NPI data. However, the overall lower performance in scope detection compared with detecting the presence of the licensor is not found in the minimal-pair experiments.
CoLA fine-tuning improves the performance for BERT, especially for detecting NPI presence. Fine-tuning on NPI data improves scope detection. Inspection of environment-specific results shows that models struggle when the superlative, quantifiers, and adverb environments are the held-out test sets in the All-but-1 NPI fine-tuning setting.
Unlike in the other experiments, BERT and BERT→MNLI have comparable performance across many settings and tasks, beating BERT→CCG especially in scope detection (see full results in Figure 9 in the Appendix).

Discussion
We find that BERT systematically represents all features relevant to NPI licensing across most individual licensing environments, according to certain evaluation methods. However, these results vary widely across the different methods we compare. In particular, BERT performs nearly perfectly on the gradient minimal pairs task across all minimal pair configurations and nearly all licensing environments. Based on this method alone, we might conclude that BERT's knowledge of this domain is near perfect. However, the other methods show a more nuanced picture.
BERT's knowledge of which expressions are NPIs and NPI licensors is generally stronger than its knowledge of the licensors' scope. This is especially apparent from the probing results (Fig. 4). BERT without acceptability fine-tuning performs close to ceiling on detecting the presence of a licensor, but is inconsistent at scope detection. Tellingly, the BoW baseline is also able to perform at ceiling on telling whether a licensor is present. For BoW to succeed at this task, the GloVe embeddings for NPI-licensors must share some common property, most likely the fact that licensors co-occur with NPIs. It is possible that BERT is able to succeed using a similar strategy. By contrast, identifying whether an NPI is in the scope of a licensor requires at the very least word order information and not just co-occurrences.
The contrast in BERT's performance on the gradient and absolute tasks tells us that these evaluations reveal different aspects of BERT's knowledge. The gradient task is strictly easier than the absolute task. On the one hand, BERT's high performance on the gradient task reveals the presence of systematic knowledge in the NPI domain. On the other hand, due to ceiling effects, the gradient task fails to reveal actual differences between NPI-licensing environments that we clearly observe based on absolute, cloze, and probing tasks.
While BERT has systematic knowledge of acceptability contrasts, this knowledge varies across different licensing environments and is not categorical. Generative models of syntax (Chomsky, 1965, 1981, 1995) model human knowledge of natural language as categorical: in that sense BERT fails at attaining human performance. However, some have argued that acceptability is inherently gradient (Lau et al., 2017), and results from the human validation study on our generated dataset show evidence of gradience in the acceptability of sentences in our dataset.
Supplementing BERT with additional pretraining on CCG and MNLI does not improve performance, and even lowers performance in some cases. While results from Phang et al. (2018) lead us to hypothesize that intermediate pretraining might help, this is not what we observe on our data. This result is in direct contrast with the results from Wilcox et al. (2019), who find that syntactic pretraining does improve performance in the NPI domain. This difference in findings is likely due to differences in models and training procedure, as their model is an RNN jointly trained on language modeling and parsing over the much smaller Penn Treebank (Marcus et al., 1993).
Future studies would benefit from employing a variety of different methodologies for assessing model performance within a specified domain. In particular, a result showing generally good performance for a model should be regarded as possibly hiding actual differences in performance that a different task would reveal. Similarly, generally poor performance for a model does not necessarily mean that the model does not have systematic knowledge in a given domain; it may be that an easier task would reveal systematicity.

Conclusion
We have shown that within a well-defined domain of English grammar, evaluation of sentence encoders using different tasks will reveal different aspects of the encoder's knowledge in that domain. By considering results from several evaluation methods, we demonstrate that BERT has systematic knowledge of NPI licensing. However, this knowledge is unequal across the different features relevant to this phenomenon, and does not reflect the Boolean effect that these features have on acceptability.