Understanding by Understanding Not: Modeling Negation in Language Models

Negation is a core construction in natural language. Despite being very successful on many tasks, state-of-the-art pre-trained language models often handle negation incorrectly. To improve language models in this regard, we propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences from a raw text corpus. By training BERT with the resulting combined objective we reduce the mean top 1 error rate to 4% on the negated LAMA dataset. We also see some improvements on the negated NLI benchmarks.


Introduction
Negation is an important property in many language understanding tasks, such as sentiment analysis, question answering, knowledge base completion and natural language inference (Kassner and Schütze, 2019; Naik et al., 2018). While Pretrained Language Models (PLMs) such as BERT have pushed the state of the art on these tasks (Devlin et al., 2019; Petroni et al., 2019), they fail dramatically on instances that require understanding negation.
Kassner and Schütze (2019) show that current PLMs cannot correctly distinguish between the negated and non-negated forms of fill-in-the-blank tests. For instance, when asked to predict the [MASK] token in sentences such as "The capital of Cuba is [MASK]" and "The capital of Cuba is not [MASK]", BERT often generates the same answer, "Havana", indicating that it may not be appropriately modeling the distribution of negated sentences. Additional evidence is given by the fact that, when fine-tuned on natural language inference tasks, PLMs tend to mis-classify examples which contain not or no as contradiction when the true label is neutral or entailment (Naik et al., 2018). Recently, Hossain et al. (2020b) proposed new natural language inference test sets that specifically target a model's understanding of negation and showed that current state-of-the-art models perform poorly on these test sets.

Figure 1: A generic sentence is negated using our data augmentation method, and an unlikelihood token is chosen and replaced with [MASK]. This new sentence is concatenated with the original sentence and fed into the model. The unlikelihood loss is computed using p(improvements) from the language modeling head of BERT.
In this work, we investigate whether we can alleviate the modeling bias of PLMs on negated sentences. Our approach is composed of two core contributions: i) a syntactic data augmentation scheme to automatically generate negated sentences; ii) a new training paradigm, dubbed unlikelihood training with reference (Fig. 1), based on the recently proposed unlikelihood training (Welleck et al., 2020).
First, we generate a large number of negated sentences by negating sentences mined from an openly available text corpus (Wikipedia). Our sentence negator uses the dependency parse of the sentence, the part-of-speech tags, and the morphological features of each word, and deterministically negates the sentence. Given the negated version of a sentence, we replace its object with the [MASK] token and use unlikelihood training to make the object unlikely under the PLM distribution (e.g. we minimize the probability of "improvements" as depicted in Fig. 1). Importantly, in order to ensure that the negated sentence is factually false, we use the positive sentence as context (i.e., as a reference) for the unlikelihood prediction task. Concretely, we provide the concatenation of the positive sentence and the masked negated sentence as input to the PLM. Our method can be thought of as a type of data augmentation, which has been shown to be effective at improving robustness across many language tasks, such as text classification (Wei and Zou, 2019), natural language inference (Min et al., 2020; McCoy et al., 2019) and semantic parsing (Andreas, 2019).
For our negation experiments, we fine-tune pretrained BERT with our new objective and a knowledge distillation objective. We test our model on the negated LAMA dataset (Kassner and Schütze, 2019), the negated version of the knowledge probing dataset LAMA introduced in Petroni et al. (2019). Our model achieves a mean error rate of 4% (an improvement of 5 points) on the negated LAMA dataset while maintaining its performance on the original LAMA dataset, without any direct training on the negated LAMA sentences. We also fine-tune BERT on the RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) tasks and achieve better results on the natural language inference benchmark including negation from Hossain et al. (2020b).

Related Work
Pre-trained language models have shown impressive results across many tasks, such as question answering (Alberti et al., 2019) and natural language inference (Liu et al., 2019). These models are also known to encode factual and commonsense knowledge (Radford et al., 2019; Petroni et al., 2019; Bosselut et al., 2019). Despite these abilities, by analysing negated factual statements, Kassner and Schütze (2019) found that these models fail to understand negation.
Noji and Takamura (2020) propose taking advantage of negative examples and unlikelihood training to improve the syntactic abilities of language models. Similarly, Min et al. (2020) show the effectiveness of syntactic data augmentation for robustness in NLI. Neither of these works focuses on negation.

Syntactic Negation Augmentation
We generate the negated versions of sentences using a syntactic augmentation method. The method takes as input the dependency parse of the sentence, the POS tags, and morphological information for each word, and negates the sentence using a set of rules. Each rule has a dependency tree regular expression pattern (Semgrex; Chambers et al. 2007). We use Semgrex patterns to identify different syntactic templates, and then transform the sentence based on a list of actions defined in the rule. These actions can be move, replace, insert and lemmatize. The unlikelihood token, which will be discussed later, is also chosen using Semgrex patterns (see Appendix C for some examples).

We use Stanza (Qi et al., 2020) to obtain the dependency parse of the sentences, part-of-speech tags, lemmas, and morphological features of the words. We also filter out sentences with more than 20 words.
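As a rough illustration of how insert- and lemmatize-style rules operate, here is a toy sketch. The actual implementation matches Semgrex patterns over Stanza dependency parses; the two surface-level rules, the auxiliary list, and the suffix-stripping stand-in for a lemmatizer below are simplifications for illustration, not our real rule set.

```python
# Toy sketch of two negation rules. The real system matches Semgrex patterns
# over dependency parses; here we only pattern-match on surface tokens.
AUX = {"is", "are", "was", "were", "can", "could", "will", "would",
       "may", "might", "must", "should"}

def negate(sentence: str) -> str:
    """Negate a simple declarative sentence with two crude rules."""
    tokens = sentence.rstrip(".").split()
    # Rule 1 (insert): "X is Y" -> "X is not Y"
    for i, tok in enumerate(tokens):
        if tok.lower() in AUX:
            return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:]) + "."
    # Rule 2 (insert + lemmatize): "X <verb> ..." -> "X do/does/did not <lemma> ..."
    # The main verb is naively assumed to be the second token.
    verb = tokens[1]
    if verb.endswith("ed"):
        aux, lemma = "did", verb[:-2]    # crude past-tense "lemma"
    elif verb.endswith("s"):
        aux, lemma = "does", verb[:-1]   # crude 3rd-person "lemma"
    else:
        aux, lemma = "do", verb
    return " ".join([tokens[0], aux, "not", lemma] + tokens[2:]) + "."
```

The real negator additionally handles move and replace actions (e.g. removing NPI words such as "Nowhere", as shown in Appendix B), which this sketch omits.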
To test the coverage of our Semgrex patterns, we randomly sampled 930 sentences from Wikipedia. Only 31 of them did not match any of our Semgrex patterns (see the number of matches for each rule in our rule set on these 930 sentences). In addition, to get a better sense of the correctness of our method, 100 random sentences (from Wikipedia) were negated and reviewed by a native English speaker. The precision of these negations is 94.00%.

Applying unlikelihood to a word in any random sentence is problematic unless the sentence is a factual statement (e.g., applying unlikelihood to "improvements" in "He did not advocate navigational improvements on the Sangamon River." in Fig. 1 is problematic, as this sentence is not grounded in reality). Moreover, using solely factual sentences limits the applicability of this method. To be able to use any generic (not necessarily factual) sentence and pick an unlikelihood token in it, there needs to be some form of grounding or context. In this setup, each training example is of the form <sentence A, sentence B>, where sentence A is the reference for sentence B and provides the grounding or context for it.
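A minimal sketch of how such a reference-based training example might be assembled. The helper name and the standard BERT [CLS]/[SEP] sentence-pair layout are our assumptions here, not code from the paper.

```python
def make_ul_example(positive: str, negated: str, ul_token: str):
    """Pair the positive sentence (A, the reference) with its negation (B),
    masking the unlikelihood token in B, as in Fig. 1."""
    masked_b = negated.replace(ul_token, "[MASK]", 1)
    return f"[CLS] {positive} [SEP] {masked_b} [SEP]", ul_token

text, target = make_ul_example(
    "He advocated navigational improvements on the Sangamon River.",
    "He did not advocate navigational improvements on the Sangamon River.",
    "improvements",
)
# `target` ("improvements") is the token to be made unlikely at the [MASK]
# position, with the positive sentence available as grounding context.
```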

Unlikelihood and knowledge distillation
The unlikelihood loss was recently proposed by Welleck et al. (2020) to mitigate the problem of repetition in neural text generation. Noji and Takamura (2020) also adopted this loss to penalize the probability of an incorrect token in a sentence.
We adopt this method to penalize the likelihood of a token in sentence B that makes this sentence contradictory with the reference sentence A.
(1) A: Humans have a rational soul.
    B: Humans do not have a rational soul.
In example (1), assuming that sentence A is true, we want the model to avoid assigning a high probability to "soul" in sentence B. To this end, the probability of the unlikelihood token x_u = "soul" is penalized with the unlikelihood loss L_UL:

L_UL = -log(1 - p(x_u | x_1:T)),   (Eq. 1)

where x_1:T is the whole input sequence (sentence A concatenated with sentence B, which is the negated version of sentence A, as illustrated in Fig. 1). To have a balanced augmentation data set, we also include examples where sentence B is a copy of sentence A and therefore not contradictory with it.
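Numerically, the single-token unlikelihood term -log(1 - p(x_u | x_1:T)) from Welleck et al.'s formulation behaves as follows (a scalar sketch; the probabilities are illustrative):

```python
import math

def unlikelihood_loss(p_ul: float) -> float:
    """-log(1 - p(x_u | x_1:T)): small when the model already assigns the
    unlikelihood token little mass, growing without bound as the model
    becomes confident in the contradicted token."""
    return -math.log(1.0 - p_ul)

# A model putting 90% of its mass on "soul" in the contradictory sentence B
# is penalized heavily; one putting 1% of its mass on it is barely penalized.
high = unlikelihood_loss(0.90)  # ~2.30
low = unlikelihood_loss(0.01)   # ~0.01
```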
For these examples, we want the model to behave as it did before any fine-tuning. The KL-divergence knowledge distillation loss is used for these examples on the same token:

(2) A: Humans have a rational soul.
    B: Humans have a rational [MASK].
The loss L_KL for the token x_l = "[MASK]" is written as:

L_KL = KL( p_LM(. | x_1:T) || p_theta(. | x_1:T) ),   (Eq. 2)

where p_LM is the probability distribution over the vocabulary for the masked token x_l under the LM before any fine-tuning, and p_theta is the corresponding distribution under the model being fine-tuned.
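A scalar sketch of this distillation term, assuming the conventional direction KL(p_LM || p_theta). The three-word vocabulary and its probabilities are made up for illustration; the real loss runs over BERT's full vocabulary at the masked position.

```python
import math

def kl_distillation_loss(p_lm, p_theta):
    """KL(p_LM || p_theta) at the masked position: zero when the fine-tuned
    model reproduces the original LM's distribution exactly, positive when
    it drifts away from it."""
    return sum(p * math.log(p / q) for p, q in zip(p_lm, p_theta) if p > 0)

frozen = [0.7, 0.2, 0.1]    # original BERT's distribution (illustrative)
drifted = [0.4, 0.4, 0.2]   # a fine-tuned model that has drifted

kl_distillation_loss(frozen, frozen)   # no penalty when unchanged
kl_distillation_loss(frozen, drifted)  # positive penalty for drift
```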
In our experiments, we use the BERT-base model and further train it with two objectives, the unlikelihood objective (Eq. 1) and the knowledge distillation objective (Eq. 2). We also use original Wikipedia sentences for the latter to prevent catastrophic forgetting of language modeling. The probability of the unlikelihood token p(x_u | x_1:T) and the distribution for the masked token x_l are computed using the language modeling head of the BERT model by replacing x_u and x_l in the input sequences with the [MASK] token. Examples for each objective are sampled uniformly. We refer to our model as BERTNOT.

Table 3 (fragment): queries such as "iOS is developed by [MASK].", "The majority of the Amazon forest is in [MASK]." and "Mac OS is developed by [MASK].", with the top 3 words and log probabilities from BERT and from BERTNOT.

Experiments
We report our main results on LAMA and negated LAMA for knowledge base completion. The cloze statements in LAMA are facts or commonsense knowledge generated from either subject-relation-object triples (X, rel, Y) or question-answer pairs. The cloze statements for the triples are generated using a template for each relation which includes the placeholders X and Y (e.g. "X is located in Y"). X is replaced with the subject and Y is replaced with the [MASK] token to be predicted by the model. In the question-answer pairs, the answer is replaced with the [MASK] token. The facts in the LAMA dataset come from multiple sources, including: 1) Google-RE relations, namely "place of birth", "date of birth" and "place of death"; and 2) T-REx. Following Petroni et al. (2019), we use mean precision at k (P@k) for LAMA. For negated LAMA we report the mean top 1 error rate.
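The evaluation metric can be sketched as follows. The query data is made up for illustration, and the comment on the negated-LAMA error rate reflects our reading of that metric rather than code from the paper.

```python
def precision_at_k(ranked_words, gold, k=1):
    """P@k for one cloze query: 1 if the gold filler is in the top k
    predictions, else 0. Mean P@k averages this over all queries."""
    return 1.0 if gold in ranked_words[:k] else 0.0

# For negated LAMA, the top 1 "error" counts a negated query as wrong when
# the model's top prediction is the answer to the corresponding *positive*
# query (our reading of the metric), so the error rate is 1 - mean P@1
# with that answer treated as the token to avoid.
queries = [(["Havana", "Cuba", "Miami"], "Havana"),
           (["Paris", "Rome", "Berlin"], "Rome")]
mean_p1 = sum(precision_at_k(r, g) for r, g in queries) / len(queries)  # 0.5
```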

Knowledge Base Completion
As discussed in section 4.2, we train a pre-trained BERT base cased model for 5 epochs, with 20k examples for each objective, a maximum sequence length of 128 and a learning rate of 1e-5. To see the effects of the unlikelihood objective more clearly, we also train a pre-trained BERT base cased model with only the KL knowledge distillation objective, using the same data and hyper-parameters. Tables 1 and 2 respectively show the mean precision at rank 1 (averaged over all the relations) for LAMA, and the mean top 1 error rate for negated LAMA queries. The mean error rate on the negated LAMA queries decreases to below 4%, while the results on the original LAMA stay the same. These results are achieved without any direct training on LAMA queries (negated or non-negated). Table 3 shows the top 3 predicted words for a pre-trained BERT model and for the model trained with our method. Pre-trained BERT seems to ignore negation and mostly predicts based on the subject of the query, although the prediction probabilities for the negated queries are generally lower. Our method is as good as the vanilla model (BERT) on the original queries. For the negated queries, our model's predictions are far superior to the vanilla model's. We also tried our method on BERT-large; see Appendix E for results and discussion.

Natural Language Inference
We fine-tune our model with a language inference objective on the RTE, SNLI and MNLI tasks.

(Table fragment: an example premise/hypothesis pair from the negation benchmark, with premise "The prosecutor told the court that the incident had caused "distress" to one of the children." and hypothesis "The prosecutor did not tell the court that "distress" in one of the children is associated with the incident.")

Our model achieves superior results on RTE (low-resource setting) and slightly better accuracies on SNLI and MNLI (high-resource setting) on all the new splits containing negation, while keeping roughly the same scores on the original dev splits. We conjecture that fine-tuning on large amounts of data (SNLI and MNLI) may have resulted in catastrophic forgetting of the negation knowledge, decreasing the gap between BERT and BERTNOT. We tried to alleviate this catastrophic forgetting by mixing some unlikelihood training and knowledge distillation into the NLI training, but that did not help; see Appendix D for these results on MNLI. We leave further exploration of fine-tuning objectives that preserve pretrained knowledge for future work. A table of example premise/hypothesis pairs from Hossain et al. (2020b), along with the predictions from BERT and BERTNOT, shows failure cases of BERTNOT in examples 4 and 6. For the fifth example, the gold label is incorrect, but BERTNOT predicts the correct label for this pair of premise and hypothesis.

Conclusion
In this work, we propose a combination of the unlikelihood objective with a reference-based setup for input sentences to model negation. This allows us to utilize generic sentences, negate them with our data augmentation method, and use them as examples for the unlikelihood objective. Our method notably improves the error rate on the negated LAMA dataset while keeping the same performance on the original LAMA queries.

We also test our method on the original development sets and on the new splits containing negation from Hossain et al. (2020b) for the RTE, SNLI and MNLI tasks. We see large improvements on the negated splits in the low-resource setting (RTE) and slight improvements in the high-resource setting (SNLI and MNLI), while also maintaining results similar to BERT's on the original splits.

A Training details
Here are the hyper-parameters used in our fine-tuning runs.

Algorithm 1: Details of the training procedure of BERTNOT. The unlikelihood loss and knowledge distillation loss are first computed with the <sentence A, sentence B> inputs. These inputs are contradictory for the UL loss, and non-contradictory for knowledge distillation (sec. 4.2). We use γ = 0.4 in our experiments to sum these losses and compute the gradient g_1. Then, we compute the knowledge distillation loss for inputs sampled from Wikipedia. These inputs do not have our reference-based format. The parameters are updated again using the gradient from this knowledge distillation loss (g_2).
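The loss combination in Algorithm 1 can be sketched with scalar stand-ins. The γ-weighted sum shown here, γ·L_UL + (1-γ)·L_KL, is our reading of "sum these losses", and the function name is ours; the actual combination in the paper's implementation may differ.

```python
def bertnot_losses(l_ul, l_kl_pairs, l_kl_wiki, gamma=0.4):
    """Return the two loss values behind gradients g_1 and g_2:
    g_1 comes from the gamma-weighted sum of the unlikelihood loss and the
    pair-level distillation loss (both computed on <sentence A, sentence B>
    inputs); g_2 comes from distillation on plain Wikipedia sentences,
    applied as a second parameter update."""
    g1_loss = gamma * l_ul + (1 - gamma) * l_kl_pairs
    g2_loss = l_kl_wiki
    return g1_loss, g2_loss
```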

B Examples of negated sentences
Here are some examples and details of our syntactic negation method.

Original: That tournament helped demonstrate the high caliber of play in women's soccer.

Negated: This was not broadcast live on Norway's main national TV carrier NRK. (unlikelihood token: Norway)

Original: The latter may occur implicitly through the use of a construct like DEFVAR or DEFPARAMETER.
Negated: The latter may not occur implicitly through the use of a construct like DEFVAR or DEFPARAMETER. (unlikelihood token: latter)

Original: When Arjuna was fighting Karna, the latter's chariot's wheels sank into the ground.
Negated: When Arjuna was fighting Karna, the latter's chariot's wheels did not sank into the ground. (unlikelihood token: wheels)

Original: It also prohibits or restricts the use of certain accounts held at financial institutions.
Negated: It also does not prohibit or restricts the use of certain accounts held at financial institutions. (unlikelihood token: use)

Table 9: Examples of how the syntactic negation augmentation method works. For the first sentence, the matched rule has two actions, move and replace. The move action has moved the token B = did before token A = mention.
The replace action has replaced npiword = Nowhere with an empty token, which means removing this token. The token object = letter is chosen as the unlikelihood token in this sentence.
In the second sentence, the matched rule has three actions: two inserts and one lemmatize. The insert actions add the tokens "did not" before A = made, and the token A = made is replaced with its lemma by the lemmatize action. The token object = leg is chosen as the unlikelihood token in the negated sentence.

D Mixing negation unlikelihood training and knowledge distillation with NLI training
In order to reduce the catastrophic forgetting behavior of the model during NLI training, we added the unlikelihood, knowledge distillation and MLM objectives to the original NLI classification objective and trained the model with the same hyper-parameters on the MNLI task. We also trained one version with only the original NLI classification objective and the MLM objective.

Table 11: Mean precision at k = 1 (P@1) for original LAMA queries (higher is better) for BERT with unlikelihood and distillation objectives without references for sentences, BERT-large, and BERT-large with unlikelihood and distillation objectives with different learning rates.

Table 12: Mean top 1 error rate for negated LAMA queries (lower is better) for BERT with unlikelihood and distillation objectives without references for sentences, BERT-large, and BERT-large with unlikelihood and distillation objectives with different learning rates.
As the results in Table 12 show, pre-trained BERT-large performs worse than pre-trained BERT-base on negated LAMA queries. We decreased the batch size to be able to fine-tune BERT-large. As the scores for negated LAMA queries in Table 12 show, fine-tuning BERT-large with our method using the same or a slightly larger learning rate does not improve the results. We observe a decrease in the mean top 1 error rate for negated LAMA queries when we use a larger learning rate (1e-5), but this also hinders the performance of the model on the original LAMA queries (Table 11). This requires some hyper-parameter tuning and further investigation.