Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly

Building on Petroni et al. (2019), we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated ("Birds cannot [MASK]") and non-negated ("Birds can [MASK]") cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add "misprimes" to cloze questions ("Talk? Birds can [MASK]"). We find that PLMs are easily distracted by misprimes. These results suggest that PLMs still have a long way to go to adequately learn human-like factual knowledge.


Introduction
PLMs like Transformer-XL (Dai et al., 2019), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have emerged as universal tools that capture a diverse range of linguistic and factual knowledge. Recently, Petroni et al. (2019) introduced LAMA (LAnguage Model Analysis) to investigate whether PLMs can recall factual knowledge that is part of their training corpus. Since the PLM training objective is to predict masked tokens, question answering (QA) tasks can be reformulated as cloze questions. For example, "Who wrote 'Dubliners'?" is reformulated as "[MASK] wrote 'Dubliners'." In this setup, Petroni et al. (2019) show that PLMs outperform automatically extracted knowledge bases on QA. In this paper, we investigate this capability of PLMs in the context of (1) negation and what we call (2) mispriming.
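For concreteness, the following is a minimal sketch of this cloze-style querying using the HuggingFace transformers library; the model name and decoding details are our assumptions, not the exact LAMA implementation.

```python
# Minimal sketch of cloze-style probing of a PLM (assumed setup, not the
# exact LAMA code): ask BERT for the most probable fillers of [MASK].
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
model = BertForMaskedLM.from_pretrained("bert-large-cased")
model.eval()

def top_fillers(cloze, k=5):
    """Return the k most probable fillers for the [MASK] token in `cloze`."""
    inputs = tokenizer(cloze, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos[0]].topk(k).indices
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())

print(top_fillers("[MASK] wrote 'Dubliners'."))  # "Joyce" should rank highly
```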
(1) Negation. To study the effect of negation on PLMs, we introduce the negated LAMA dataset. We insert negation elements (e.g., "not") in LAMA cloze questions (e.g., "The theory of relativity was not developed by [MASK]."); this gives us positive/negative pairs of cloze questions.
Querying PLMs with these pairs and comparing the predictions, we find that the predicted fillers have high overlap. Models are equally prone to generate facts ("Birds can fly") and their incorrect negation ("Birds cannot fly"). We find that BERT handles negation best among the PLMs we test, but it still fails badly on most negated probes. In a second experiment, we show that BERT can in principle memorize both positive and negative facts correctly if they occur in training, but that it generalizes poorly to unseen sentences (positive and negative). However, after finetuning, BERT does learn to correctly classify unseen facts as true/false.
(2) Mispriming. We use priming, a standard experimental method in human psychology (Tulving and Schacter, 1990), where a first stimulus (e.g., "dog") can influence the response to a second stimulus (e.g., "wolf" in response to "name an animal"). Our novel idea is to use priming for probing PLMs, specifically mispriming: we give automatically generated misprimes to PLMs that would not mislead humans. For example, we add the misprime "Talk?" to the LAMA cloze question "Birds can [MASK]", yielding "Talk? Birds can [MASK]". A human would ignore the misprime, stick to what she knows and produce a filler like "fly". We show that, in contrast, PLMs are misled and fill in "talk" for the mask.
We could have manually generated more natural misprimes. For example, the misprime "regent of Antioch" in "Tancred, regent of Antioch, played a role in the conquest of [MASK]" tricks BERT into choosing the filler "Antioch" (instead of "Jerusalem"). Our automatic misprimes are less natural, but automatic generation allows us to create a large misprime dataset for this initial study.
Contribution. We show that PLMs' ability to learn factual knowledge is, in contrast to human capabilities, extremely brittle for negated sentences and for sentences preceded by distracting material (i.e., misprimes). Data and code will be published at https://github.com/norakassner/LAMA_primed_negated.

Data and Models
LAMA's cloze questions are generated from subject-relation-object triples from knowledge bases (KBs) and from question-answer pairs. For KB triples, cloze questions are generated, for each relation, from a templatic statement that contains variables X and Y for subject and object (e.g., "X was born in Y"). We then substitute the subject for X and MASK for Y. In a question-answer pair, we MASK the answer.
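A minimal sketch of this template instantiation follows; the template strings are illustrative, not the exact LAMA templates.

```python
# Turn a KB triple into a cloze question by filling the relation template
# (illustrative templates; the real LAMA templates differ in detail).
TEMPLATES = {
    "place_of_birth": "[X] was born in [Y].",
    "capital_of": "The capital of [X] is [Y].",
}

def to_cloze(relation, subject):
    """Substitute the subject for [X] and mask the object slot [Y]."""
    return TEMPLATES[relation].replace("[X]", subject).replace("[Y]", "[MASK]")

print(to_cloze("place_of_birth", "James Joyce"))
# -> "James Joyce was born in [MASK]."
```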
Negated LAMA. We created negated LAMA by manually inserting a negation element in each template or question. For ConceptNet we only consider an easy-to-negate subset (see appendix).
Misprimed LAMA. We misprime LAMA by inserting an incorrect word and a question mark at the beginning of a statement, e.g., "Talk?" in "Talk? Birds can [MASK]." We only misprime questions that are answered correctly by BERT-large. To make sure the misprime is misleading, we manually remove correct primes for SQuAD and ConceptNet and automatically remove primes that are the correct filler for a different instance of the same relation for T-REx and ConceptNet. We create four versions of misprimed LAMA (A, B, C, D) as described in the caption of Table 3; Table 1 gives examples.
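The construction of a misprimed question is simple string manipulation; the sketch below shows modes (A) and (B) under our assumptions about the data layout (the helper names are hypothetical).

```python
# Sketch of misprime construction: prepend an incorrect object plus "?" to
# a cloze question. Mode A uses a random word; mode B draws from correct
# fillers of other instances of the same relation (data layout assumed).
import random

def misprime(cloze, wrong_object):
    return f"{wrong_object.capitalize()}? {cloze}"

def misprime_mode_b(cloze, correct_object, relation_fillers):
    """relation_fillers: correct objects of other instances of the relation."""
    candidates = [f for f in relation_fillers if f != correct_object]
    return misprime(cloze, random.choice(candidates))

print(misprime("Birds can [MASK].", "talk"))  # -> "Talk? Birds can [MASK]."
```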

Results
Negated LAMA. Table 2 gives the Spearman rank correlation ρ and the % overlap in rank-1 predictions between original and negated LAMA.
Our assumption is that the correct answers for a positive question and its negated counterpart should not overlap, so high values indicate a lack of understanding of negation. The two measures are complementary and yet agree very well. The correlation measure is sensitive in distinguishing cases where negation has a small effect from those where it has a larger effect (see the sanity check below). % overlap is a measure that is direct and easy to interpret. In most cases, ρ > 85%; overlap in rank-1 predictions is also high. ConceptNet results are most strongly correlated, but T-REx 1-1 results are less overlapping. Table 4 gives examples (lines marked "N"). BERT has slightly better results than the other PLMs. Google-RE date of birth is an outlier because the pattern "X (not born in [MASK])" rarely occurs in corpora and predictions are often nonsensical.
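For clarity, both measures can be computed from the two score vectors over the vocabulary; the following is a sketch (assuming scipy is available), with per-relation numbers obtained by averaging over queries.

```python
# Sketch of the two agreement measures between a positive query and its
# negated counterpart (scipy assumed; scores are vocabulary-sized vectors).
import numpy as np
from scipy.stats import spearmanr

def agreement(scores_pos, scores_neg):
    """Return Spearman rho and whether the rank-1 predictions coincide."""
    rho, _ = spearmanr(scores_pos, scores_neg)
    same_top1 = int(np.argmax(scores_pos) == np.argmax(scores_neg))
    return rho, same_top1

# Averaging same_top1 over all queries of a relation gives the % overlap
# reported in Table 2; averaging rho gives the mean correlation.
```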
In summary, PLMs poorly distinguish positive and negative sentences.
We give two examples of the few cases where PLMs make correct predictions, i.e., they solve the cloze task as human subjects would. For "The capital of X is not Y" (T-REx, 1-1), the top-ranked predictions are "listed", "known", "mentioned" (vs. cities for "The capital of X is Y"). This is appropriate since such sentences are more common than sentences like "The capital of X is not Paris". For "X was born in Y", cities are predicted, but for "X was not born in Y", sometimes countries are predicted. This also seems natural: for the positive sentence, cities are more informative; for the negative, countries.

Sanity check. A reviewer observes that Spearman correlation is generally high and wonders whether high Spearman correlation is really a reliable indicator of negation not changing the answer of the model. As a sanity check, we also randomly sampled, for each query correctly answered by BERT-large (e.g., "Einstein born in [MASK]"), another query with a different answer but the same template relation (e.g., "Newton born in [MASK]") and computed the Spearman correlation between the predictions for the two queries. In general, these positive-positive Spearman correlations were significantly lower than those between positive ("Einstein born in [MASK]") and negative ("Einstein not born in [MASK]") queries (t-test, p < 0.01). There were two exceptions (not significantly lower): T-REx 1-1 and Google-RE birth-date.

Table 2: PLMs do not distinguish positive and negative sentences. Mean Spearman rank correlation (ρ) and mean percentage of overlap in first-ranked predictions (%) between the original and the negated queries for Transformer-XL large (Txl), ELMo original (Eb), ELMo 5.5B (E5B), BERT-base (Bb) and BERT-large (Bl).
Balanced corpus. Investigating this further, we train BERT-base from scratch on a synthetic corpus. Hyperparameters are listed in the appendix. The corpus contains as many positive sentences of the form "x_j is a_n" as negative sentences of the form "x_j is not a_n", where x_j is drawn from a set of 200 subjects S and a_n from a set of 20 adjectives A. The 20 adjectives form 10 pairs of antonyms (e.g., "good"/"bad"). S is divided into 10 groups g_m of 20 subjects each. Finally, there is an underlying KB that defines valid adjectives for groups. For example, assume that g_1 has the property a_m = "good". Then for each x_i ∈ g_1, the sentences "x_i is good" and "x_i is not bad" are true. The training set is generated to contain all positive and negative sentences for 70% of the subjects. For the other 30% of subjects, it contains either only the positive sentences (in that case the negative sentences are added to the test set) or vice versa. Cloze questions are generated in the format "x_j is [MASK]"/"x_j is not [MASK]". We test whether (i) BERT memorizes positive and negative sentences seen during training and (ii) it generalizes to the test set. As an example, a correct generalization would be "x_i is not bad" if "x_i is good" was part of the training set. The question is: does BERT learn, based on the patterns of positive/negative sentences and within-group regularities, to distinguish facts from non-facts?
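A minimal sketch of how such a balanced corpus can be generated follows; the concrete antonym pairs, the one-valid-pair-per-group simplification, and the randomized 70/30 split are our assumptions based on this description.

```python
# Sketch of the balanced synthetic corpus: 200 subjects in 10 groups, 10
# antonym pairs, one valid pair per group (a simplifying assumption).
import random

ANTONYMS = [("good", "bad"), ("big", "small"), ("hot", "cold"),
            ("old", "new"), ("fast", "slow"), ("hard", "soft"),
            ("light", "dark"), ("rich", "poor"), ("loud", "quiet"),
            ("wet", "dry")]
SUBJECTS = [f"x{j}" for j in range(200)]
GROUPS = [SUBJECTS[m * 20:(m + 1) * 20] for m in range(10)]

train, test = [], []
for group, (pos_adj, neg_adj) in zip(GROUPS, ANTONYMS):
    for subj in group:
        facts = [f"{subj} is {pos_adj}", f"{subj} is not {neg_adj}"]
        if random.random() < 0.7:   # 70% of subjects: both facts in train
            train += facts
        else:                       # 30%: one fact in train, the other in test
            random.shuffle(facts)
            train.append(facts[0])
            test.append(facts[1])
```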
In this experiment, BERT memorizes the positive and negative sentences seen during training but does not generalize to the unseen test set (Figure 1). However, when we finetune BERT ("finetuned BERT") on the task of classifying sentences as true/false, its test accuracy is 100%. (Recall that false sentences simply correspond to true sentences with a "not" inserted or removed.) So BERT easily learns negation if supervision is available, but fails without it. This experiment demonstrates the difficulty of learning negation through unsupervised pretraining. We suggest that the inability of pretrained BERT to distinguish true from false is a serious impediment to accurately handling factual knowledge.
Table 3: Absolute precision drop (from 100%; lower is better) when mispriming BERT-large, for the LAMA subset that was answered correctly in its original form. We insert objects that (A) are randomly chosen, (B) are randomly chosen from correct fillers of different instances of the relation (not done for SQuAD as it is not organized in relations), or (C) were top-ranked fillers for the original cloze question but have at least a 30% lower prediction probability than the correct object. (D) investigates the effect of distance, manipulating (C) further by inserting a concatenation of 20 neutral sentences (e.g., "Good to know."; see appendix) between misprime and cloze question.

Misprimed LAMA. Table 3 shows the effect of mispriming on BERT-large for questions answered correctly in original LAMA; recall that Table 1 gives examples of sentences constructed in modes A, B, C and D. In most cases, mispriming with a highly ranked incorrect object causes a precision drop of over 60% (C). Example predictions can be found in Table 4. The sensitivity to misprimes still exists when the distance between misprime and cloze question is increased: the drop persists when 20 sentences are inserted (D). Striking are the results for Google-RE, where the model recalls almost no facts (C). Table 4 (lines marked "M") shows predicted fillers for these misprimed sentences. BERT is less, but still badly, affected by misprimes that match selectional restrictions (B). The model is more robust against priming with random words (A): the precision drop is on average more than 35% lower than for (D). We included baseline (A) as a sanity check for the precision-drop measure. These baseline results show that the presence of a misprime per se does not confuse the model; a less distracting misprime (a different type of entity or a completely implausible answer) often results in a correct answer by BERT.

Discussion
Whereas Petroni et al. (2019)'s results suggest that PLMs are able to memorize facts, our results indicate that PLMs largely do not learn the meaning of negation. They mostly seem to predict fillers based on co-occurrence of subject (e.g., "Quran") and filler ("religious") and to ignore negation.
A key problem is that in the LAMA setup, not answering (i.e., admitting ignorance) is not an option. While the prediction probability is generally somewhat lower for the negated than for the positive answer, there is no threshold across cloze questions that could be used to distinguish valid positive from invalid negative answers (cf. Table 4).
We suspect that a possible explanation for PLMs' poor performance is that negated sentences occur much less frequently in training corpora. Our synthetic corpus study (Table 5) shows that BERT is able to memorize negative facts that occur in the corpus. However, the PLM objective encourages the model to predict fillers based on similar sentences in the training corpus, and if the most similar statement to a negative sentence is positive, then the filler is generally incorrect. However, after finetuning, BERT is able to correctly classify sentences as true/false, demonstrating that negation can be learned through supervised training.
The mispriming experiment shows that BERT often handles random misprimes correctly (Table 3 A). There are also cases where BERT does the right thing for difficult misprimes, e.g., it robustly attributes "religious" to Quran (Table 4). In general, however, BERT is highly sensitive to misleading context (Table 3 C) that would not change human behavior in QA. It is especially striking that a single word suffices to distract BERT. This may suggest that it is not knowledge that is learned by BERT, but that its performance is mainly based on similarity matching between the current context on the one hand and sentences in its training corpus and/or recent context on the other hand. Poerner et al. (2019) present a similar analysis.
Our work is a new way of analyzing differences between PLMs and human-level natural language understanding. We should aspire to develop PLMs that, like humans, can handle negation and are not easily distracted by misprimes.

Related Work

A wide range of literature analyzes linguistic knowledge stored in pretrained embeddings (Jumelet and Hupkes, 2018; Gulordava et al., 2018; Giulianelli et al., 2018; Dasgupta et al., 2018; Marvin and Linzen, 2018; Kann et al., 2019). Our work analyzes factual knowledge. McCoy et al. (2019) show that BERT finetuned to perform natural language inference heavily relies on syntactic heuristics, also suggesting that it is not able to adequately acquire common sense. Warstadt et al. (2019) investigate BERT's understanding of how negative polarity items are licensed. Our work, focusing on factual knowledge stored in negated sentences, is complementary since grammaticality and factuality are mostly orthogonal properties. Kim et al. (2019) investigate understanding of negation particles when PLMs are finetuned. In contrast, our focus is on the interaction of negation and factual knowledge learned in pretraining. Ettinger (2019) defines and applies psycholinguistic diagnostics for PLMs. Our use of priming is complementary; their data consists of two sets of 72 and 16 sentences, whereas we create 42,867 negated sentences covering a wide range of topics and relations. Ribeiro et al. (2018) test for comprehension of minimally modified sentences in an adversarial setup while trying to keep the overall semantics the same. In contrast, we investigate large changes of meaning (negation) and context (mispriming). In contrast to adversarial work (e.g., Wallace et al., 2019), we do not focus on adversarial examples for a specific task, but on pretrained models' ability to robustly store factual knowledge.

Conclusion
Our results suggest that pretrained language models address open domain QA in datasets like LAMA by mechanisms that are more akin to relatively shallow pattern matching than the recall of learned factual knowledge and inference.
Implications for future work on pretrained language models. (i) Both factual knowledge and logic are discrete phenomena in the sense that sentences with similar representations in current pretrained language models differ sharply in factuality and truth value (e.g., "Newton was born in 1641" vs. "Newton was born in 1642"). Further architectural innovations in deep learning seem necessary to deal with such discrete phenomena. (ii) We found that PLMs have difficulty distinguishing "informed" best guesses (based on information extracted from training corpora) from "random" best guesses (made in the absence of any evidence in the training corpora). This implies that better confidence assessment of PLM predictions is needed. (iii) Our premise was that we should emulate human language processing and that therefore tasks that are easy for humans are good tests for NLP models. To the extent this is true, the two phenomena we have investigated in this paper, that PLMs seem to ignore negation in many cases and that they are easily confused by simple distractors, seem to be good vehicles for encouraging the development of PLMs whose performance on NLP tasks is closer to that of humans.

Figure 1: Training loss and test accuracy when pretraining BERT-base on a balanced corpus. The model is able to memorize positive and negative sentences seen during training but is not able to generalize to an unseen test set, for both positive and negative sentences.
batch size            32
learning rate         4e-5
number of epochs      20
max. sequence length  7

Table 7: Hyperparameters for finetuning on the task of classifying sentences as true/false.

A.2 Details on the balanced corpus
We pretrain BERT-base from scratch on a corpus with equally many negative and positive sentences. We concatenate multiple copies of the same training data into one training file to compensate for the small amount of data. Hyperparameters for pretraining are listed in Table 6. The full vocabulary contains 349 tokens. Figure 1 shows that training loss and test accuracy are uncorrelated: test accuracy stagnates around 0.5, which is no better than random guessing since, for each relation, half of the adjectives hold. We then finetune on the task of classifying sentences as true/false. As in pretraining, we concatenate multiple copies of the same training data into one training file. Hyperparameters for finetuning are listed in Table 7.
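A sketch of the finetuning setup with the Table 7 hyperparameters, using the transformers API; the checkpoint and dataset wiring are illustrative assumptions, not the authors' code.

```python
# Sketch of true/false finetuning with the Table 7 hyperparameters.
# In the paper, the starting checkpoint is BERT-base pretrained from
# scratch on the synthetic corpus; we load a generic checkpoint here.
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,  # batch size 32
    learning_rate=4e-5,              # learning rate 4e-5
    num_train_epochs=20,             # number of epochs 20
)

def encode(sentences):
    # True sentences like "x3 is good" (label 1) vs. false ones with "not"
    # inserted or removed (label 0); max. sequence length 7 as in Table 7.
    return tokenizer(sentences, truncation=True, max_length=7,
                     padding="max_length", return_tensors="pt")
```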
We use source code provided by Wolf et al. (2019).