An Analysis of Natural Language Inference Benchmarks through the Lens of Negation

Negation is underrepresented in existing natural language inference benchmarks. Additionally, one can often ignore the few negations in existing benchmarks and still make the right inference judgments. In this paper, we present a new benchmark for natural language inference in which negation plays an important role. We also show that state-of-the-art transformers struggle to make inference judgments with the new pairs.


Introduction
Natural language understanding remains an elusive goal except in limited scenarios. It is arguably the ultimate problem in natural language processing: to empower machines to understand language as generated by humans. The state of the art has seen tremendous progress in recent years, and has moved from symbolic representations (Bos et al., 2004; Artzi and Zettlemoyer, 2013) to distributional representations often learned from massive datasets (Devlin et al., 2019). Recognizing entailments (Dagan et al., 2006), identifying paraphrases (Das and Smith, 2009), determining semantic textual similarity (Agirre et al., 2012), and sentiment analysis (Pang and Lee, 2008) are but a few problems that require natural language understanding to a lesser or greater degree.
There are many benchmarks targeting the problems above, and they usually cast them as classification problems. Two popular evaluation platforms, GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), aggregate benchmarks for some of these problems and provide a single score for many tasks under the umbrella of natural language inference. State-of-the-art models are close to or even surpass human performance (Wang et al., 2019). This, however, is true only when evaluating models and humans with existing benchmarks. Indeed, researchers have pointed out weaknesses in benchmarks, suggesting that we are evaluating models with examples that are much simpler than what humans are capable of (Section 3). Source text selection, annotation artifacts (Gururangan et al., 2018), and asking annotators (either experts or crowd workers) to write examples as opposed to retrieving real examples from previously generated language are a few of the culprits.
In this paper, we investigate the role of negation in a core natural language understanding task: natural language inference, which in its most basic form consists of determining whether a text entails a hypothesis. Recognizing entailments has many applications, including question answering (Trivedi et al., 2019), summarization (Pasunuru et al., 2017) and machine translation evaluation (Padó et al., 2009).
Negation relates an expression e to another expression with a meaning that is in some way opposed to the meaning of e (Horn and Wansing, 2017); thus it plays an important role in natural language understanding. Additionally, negation is ubiquitous in regular English texts: approximately 25% of English sentences contain negation, depending on the domain and genre (Section 4). Despite these facts, negation is underrepresented and mostly irrelevant in existing benchmarks: one can literally disregard the negations and still make correct inference judgments in popular datasets. The work presented here addresses these shortcomings and makes the following contributions:
1. We show that negation is underrepresented and often irrelevant in existing benchmarks.
2. We create new benchmarks for natural language inference in which negation plays a critical role in making inference judgments.
3. We demonstrate that state-of-the-art transformers trained with the original benchmarks are not robust when negation is present.
4. We provide empirical evidence that transformers may be unable to learn the intricacies of negation in the most challenging benchmark, which includes longer texts from many genres.

Background
The task of natural language inference, or recognizing textual entailment, consists in determining whether a hypothesis is true given a text. The original task considers two labels: entailment or no entailment (Dagan et al., 2006); a newer formulation considers three labels: entailment, contradiction or neutral (Giampiccolo et al., 2007). For example, the text "A person on a horse jumps over an airplane" entails the hypothesis "A person is outdoors, on a horse," contradicts "A person is at a diner, ordering an omelette," and is neutral with respect to "A person is training his horse for a competition."

We work with three existing benchmarks: a collection of RTE datasets (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). The RTE datasets are smaller (5,767 text-hypothesis pairs) than SNLI and MNLI (569,033 and 431,997 pairs). MNLI is more challenging than RTE and SNLI: texts are longer and were selected from 10 genres, including fiction and non-fiction as well as conversation transcripts. The texts in SNLI, on the other hand, were selected from image captions. The hypotheses in SNLI and MNLI were crowdsourced, i.e., manually generated by non-experts. Tables 2 and 4 show examples from the RTE, SNLI and MNLI benchmarks. For convenience, we work with the formatted versions of these datasets in the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks.
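The two formulations are related: the three-way labels can be collapsed into the two-way ones, since both contradiction and neutral count as no entailment. A minimal sketch (the function name is ours):

```python
# Map the three-way NLI labels (entailment / contradiction / neutral)
# onto the original two-way formulation (entailment / no entailment):
# contradiction and neutral both become "no entailment".
def collapse_to_two_way(label: str) -> str:
    if label == "entailment":
        return "entailment"
    if label in ("contradiction", "neutral"):
        return "no entailment"
    raise ValueError(f"unknown NLI label: {label!r}")
```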

Previous Work
Previous work has revealed weaknesses in the benchmarks we work with and shown that adversarial examples can break models for many natural language processing tasks. Adversarial examples consist of arguably trivial modifications to inputs that trick computational models. Some of them include misspellings (Pruthi et al., 2019), syntactically controlled paraphrases (Iyyer et al., 2018), lexical substitutions (Alzantot et al., 2018), and more elaborate substitutions (Ribeiro et al., 2018). More recently, Ribeiro et al. (2020) propose CHECKLIST, a task-agnostic strategy for testing NLP models. Their strategy can be used to identify which linguistic capabilities a model lacks. For example, they show that commercial systems for sentiment analysis are not robust when negation is present.
Regarding natural language inference, Poliak et al. (2018) show that models taking into account only hypotheses significantly outperform majority baselines, and Gururangan et al. (2018) discuss annotation artifacts, e.g., negation cues (not, never, etc.) are a strong indicator of contradictions. Glockner et al. (2018) show that models trained with SNLI fail to resolve new pairs that require simple lexical substitution, e.g., holding a saxophone contradicts holding an electric guitar. Naik et al. (2018) conclude that models are not robust to negation, but their only test is concatenating the tautology "and false is not true" to hypotheses. Wallace et al. (2019) introduce universal triggers and show that concatenating negation cues to SNLI hypotheses decreases accuracy to almost zero when the gold label is entailment or neutral.
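The perturbation tested by Naik et al. (2018) can be sketched as a simple string operation; note that Wallace et al. (2019) instead learn their trigger token sequences automatically, so the fixed string below only illustrates the form of the attack:

```python
# Illustrative sketch of the perturbation of Naik et al. (2018): append
# the label-preserving tautology "and false is not true" to a hypothesis.
# The tautology does not change the correct inference judgment, yet such
# concatenations can flip model predictions.
def append_tautology(hypothesis: str, trigger: str = "and false is not true") -> str:
    return f"{hypothesis.rstrip('. ')} {trigger}."
```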
The task of identifying paraphrases consists in determining whether two sentences have the same meaning, and can be cast, at least from a definitional perspective, as recognizing bidirectional entailments. Pruthi et al. (2019) show that computational models underperform in MRPC (Dolan et al., 2004) with adversarial misspellings, and Kovatchev et al. (2019) present a qualitative analysis of 11 state-of-the-art models (overall accuracies: 68-84%). When negation is present, however, accuracies drop to 33% (6 models), 67% (4 models) and 1% (1 model). Finally, Zhang et al. (2019) present a dataset for paraphrase identification including adversarial sentence pairs that are not paraphrases but have high word overlap. The new pairs help train models that are robust to word scrambling.
The aforecited works do not investigate the role of negation in depth. Regarding paraphrase identification, previous work has only shown that models underperform with negation. Regarding natural language inference and negation, previous work considers negations only in the hypotheses, not the texts. Additionally, they only work with unrealistic negations that do not require models to do anything but ignore the negations. Indeed, they concatenate tokens, including negation cues, that are label-preserving and unrelated to the original texts and hypotheses. Unlike them, we show that negation is underrepresented and often irrelevant in existing benchmarks, create new pairs in which negation is critical to make the correct inference judgment, and demonstrate that state-of-the-art transformers are not robust when negation is present.

Negation in English and Natural Language Inference Benchmarks
Negation is pervasive in English (Morante and Sporleder, 2012), although there is limited empirical evidence from previous work (Councill et al., 2010; Elkin et al., 2005). In order to conduct a large-scale analysis and compare how often negation is present in English texts and in existing natural language inference benchmarks, we employ a negation cue detector using a Bi-LSTM neural architecture with an additional CRF layer (Hossain et al., 2020). Trained and tested with CD-SCO, a publicly available corpus, it obtains 0.92 F1. The supplemental materials provide more details regarding the architecture of the negation cue detector and the negation cues it detects.

Table 1 details the percentage of sentences with at least one negation in several large general-purpose English corpora. We work with online reviews (Wan et al., 2019; Maas et al., 2011), conversations (Chang et al., 2019), Wikipedia (50,000 pages with at least 20 views), 500 books from Project Gutenberg (Lahiri, 2014), and OntoNotes (Hovy et al., 2006) as released by Pradhan et al. (2011). The percentage of sentences containing negation is high: it ranges from 8.69% to 29.92% across corpora, and is over 17% in all but Wikipedia. We note that negation is pervasive across domains and genres, including informal texts such as online reviews and both oral and written conversations (22.64-29.92%). Perhaps surprisingly, the percentage is very high in books (28.45%).

Table 1 also presents the percentage of sentences with negation in the three natural language inference benchmarks. Negation is clearly underrepresented in all of them except MNLI. These percentages do not invalidate the benchmarks. They show, however, that SNLI and RTE do not account for intricate linguistic phenomena such as negation. The reason for the low percentage in SNLI is that its texts come from picture captions (Section 3), and captions describe pictures with affirmative statements (see examples in Tables 2 and 4).
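Our counts rely on the BiLSTM-CRF cue detector; as a rough, self-contained approximation, a keyword matcher can estimate the same per-corpus statistic (the cue list below is illustrative and, unlike the detector, misses affixal cues such as un- and -less):

```python
import re

# Rough keyword-based approximation of the statistic in Table 1: the
# percentage of sentences with at least one negation cue. The actual
# detector is a BiLSTM-CRF trained on CD-SCO; this cue list is
# illustrative, not the detector's inventory.
_CUE_RE = re.compile(r"\b(?:no|not|never|nothing|nobody|none|without)\b|n't",
                     re.IGNORECASE)

def pct_sentences_with_negation(sentences):
    """Percentage of sentences containing at least one (approximate) cue."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if _CUE_RE.search(s))
    return 100.0 * hits / len(sentences)
```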
The Role of Negation in Existing Natural Language Inference Benchmarks

We conduct a manual qualitative analysis in order to (a) characterize the negations in RTE, SNLI and MNLI, and (b) assess how critical negation is to solve the few text-hypothesis pairs that include at least one negation in these benchmarks. We conduct the analysis with 100 text-hypothesis pairs containing negation from each benchmark (300 pairs total). From a linguistic perspective, most negations:
• are particles (no, not, n't, etc.) whose only function is to indicate negation (RTE: 62%, SNLI: 60%, MNLI: 84%),
• grammatically modify a verb (RTE: 62%, SNLI: 55%, MNLI: 81%), and
• scope over the main predicate (RTE: 52%, SNLI: 53%, MNLI: 62%).
These percentages are roughly uniformly distributed across labels.
In addition to looking at the negation cues in isolation, we also analyze the role of negation in making judgments. The first key distinction is whether dropping the negation changes the inference judgment (entailment or no entailment in RTE; and entailment, neutral or contradiction in SNLI and MNLI). If it does not, we say the negation is unimportant (important otherwise). The second key distinction is whether the negation is aligned, i.e., whether there is a semantic alignment between what is negated in the text (or hypothesis) and a chunk of the hypothesis (or text). We further identify negated alignments, i.e., alignments in which the alignment is also negated. Table 2 exemplifies this classification with the three benchmarks.

Table 2: Example text-hypothesis pairs containing negation.
RTE
1) T: Mr Lopez Obrador, who lost July's presidential election by less than one percentage point, declared himself Mexico's "legitimate" president. H: Mr Lopez Obrador didn't loose the presidential election in July.
2) T: If toxic waste containing cyanide is not disposed of properly, it may drain into ponds, streams, sewers, and reservoirs. H: Leaks into environment are caused by bad disposal of toxic waste containing cyanide.
3) T: Toshiba has produced a fuel cell with no moving parts. H: Toshiba has no moving parts.
MNLI
7) T: It was summertime the air conditioner was on the door was closed and i couldn't knock because i had to hold the jack with the other hand i finally with my elbow rang the doorbell and mother came to the door. H: The wintertime is when the air conditioning was on, I couldn't ring the doorbell because it was frozen.
8) T: It runs advertisements for its supporters at the top of shows and strikes business deals with MCI, TCI, and Disney, but still insists it's not commercial. H: It runs ads for its supporters at shows and strikes business deals, but insists it is not commercial.

Regarding SNLI, the negation in the hypothesis of Example (4) is important: landed entails not moving, at least according to the SNLI annotators, who were describing pictures and thus (presumably) could not really tell whether the plane was completely stopped or taxiing after landing (and thus still moving). The negation in the text of Example (5), however, is unimportant: A man with no shirt on is performing with a baton entails A man is doing things with a baton regardless of whether the man has a shirt. Simply put, the negation plays no role in making the correct inference judgment. In Examples (4) and (6), the negations align, but in Example (5) the negation does not align. Specifically, the alignments of the negations in the text and hypothesis of Example (6) are negated: homeless aligns with does not have a home, and both are negated. The alignment of the negation in the hypothesis of Example (4), on the other hand, is not negated: not moving aligns with landed, and the latter is not negated.

The categorization of the negations in the text-hypothesis pairs from the RTE and MNLI examples is as follows:
• RTE. The negation in the hypothesis of Example (1) is important, and it aligns but the alignment is not negated (didn't loose – lost). In Example (2), the negation in the text falls under the same categories: important and aligned, and the alignment is not negated (not disposed of properly – bad disposal). In Example (3), on the other hand, the negations are unimportant and aligned; in fact, there is an identical (and negated) alignment (no moving parts in both the text and hypothesis).
• MNLI. The negation in the text of Example (7), I couldn't knock, is unimportant and not aligned. Indeed, the first clauses of the text and hypothesis, which do not contain negation, are sufficient to solve the pair: the air conditioning being on in wintertime is not entailed by the air conditioning being on in summertime. The negation in the hypothesis of Example (7) is also unimportant but aligned (I couldn't ring the doorbell – my elbow rang the doorbell), although the alignment is not negated. This negation is unimportant for the same reason: one can make the correct inference judgment disregarding the negation altogether. The negations in Example (8) are aligned, and the alignment is negated (not commercial appears in both the text and the hypothesis).

Table 3: Analysis of the few negations in the text-hypothesis pairs from the three natural language inference corpora we work with (RTE: 7.16% of pairs, SNLI: 1.19%, MNLI: 22.63%; Table 1). E stands for entailment, ¬E for no entailment, C for contradiction and N for neutral. Many negations are unimportant, i.e., one can ignore them and still make the correct inference judgment.

Table 3 presents the analysis of the role of negation based on these categories.
First, we note that one can often ignore negations without consequences: 76% of negations are unimportant in RTE, 48% in SNLI and 52% in MNLI. In RTE, negations are unimportant in text-hypothesis pairs regardless of the inference judgment (75-76%). In SNLI and MNLI, however, negations are almost always unimportant in neutral text-hypothesis pairs (93% in SNLI and 83% in MNLI), and they tend to be unimportant when the text entails the hypothesis (78% in MNLI and 38% in SNLI). Second, we note that few negations align in RTE (entailment: 25%, no entailment: 17%), but about half of them align in SNLI and MNLI (55% and 53%). The percentage of aligned negations heavily depends on the inference judgment in SNLI and MNLI, and in RTE to a lesser degree (entailment is 50% more likely). More interestingly, whether the alignment is negated is a clear sign of the inference judgment. In RTE, the alignments are rarely negated in no entailment pairs (4% overall, 23.5% of aligned pairs), but that is not the case with entailment pairs (15% overall, 60% of aligned pairs). In SNLI, the differences are larger: 40.3% of aligned pairs labeled entailment are negated. We observe a similar pattern in the negations from MNLI: alignments are rarely negated in contradictions (2.6% of aligned pairs), and most alignments are negated in entailment pairs (66.7% of aligned pairs).

A Benchmark for Natural Language Understanding with Negation
We create new benchmarks in which negation plays an important role for natural language inference. The starting points are the original benchmarks; more specifically, we selected at random 500 text-hypothesis pairs from each of RTE, SNLI and MNLI (1,500 text-hypothesis pairs total). We work with pairs from the training and development splits, as GLUE and SuperGLUE do not include gold labels for some test splits. In the remainder of the paper, we use T and H to refer to texts and hypotheses in RTE, SNLI and MNLI. Then, we follow three steps for each of the selected original pairs: (1) manually add negation to the main verbs of the text and hypothesis, (2) automatically generate three new pairs by combining the original and negated texts and hypotheses (T_neg-H, T-H_neg and T_neg-H_neg), and (3) manually label the new pairs with inference judgments. These steps result in 4,500 new pairs and their judgments (3 per original pair, 1,500 from each of RTE, SNLI and MNLI). Note that the negations are rather simple (adding not to the main verb, and adding auxiliaries and fixing verb tense if needed) but realistic, in the sense that the resulting texts and hypotheses follow proper English grammar. Additionally, the new pairs including negation are not more difficult than the original pairs except for the presence of negation. In particular, they do not require additional lexical inference, and the overall topic described does not change.

T_neg: Thursday's judge, the Honorable Charles Adams of the Coconino County Superior Court, did not agree, but highly discouraged self-representation.
H: Self-representation was encouraged by the Honorable Charles Adams.
H_neg: Self-representation was not encouraged by the Honorable Charles Adams.
Judgments: T-H: contradiction, T_neg-H: contradiction, T-H_neg: entailment, T_neg-H_neg: entailment

Table 4: Examples of original pairs and new pairs generated after we manually introduce negation. Note that we (a) generate three new pairs after combining texts and hypotheses with and without negation (T-H is the original pair), and (b) manually annotate inference judgments for the three new pairs.

Depending on the original text and hypothesis, the three new text-hypothesis pairs may receive different inference judgments (in particular, the judgments for T_neg-H and T_neg-H_neg above are the opposite). The same is true across text-hypothesis pairs including negation and generated from different natural language inference benchmarks. For example, the text entails the hypothesis in the first examples shown from SNLI and MNLI, but the three new pairs including negation receive different judgments: neutral, contradiction and neutral; and contradiction, contradiction and entailment. The second examples created from SNLI and MNLI show the same phenomenon, but with an original T-H pair labeled contradiction.
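The mechanical part of this generation (Step 2) can be sketched as follows; the inference judgments themselves cannot be derived automatically and come from the manual annotation in Step 3:

```python
# Step 2: given an original pair (t, h) and its manually negated variants
# (t_neg, h_neg) from Step 1, form the three new pairs. Judgments for the
# new pairs are assigned by human annotators in Step 3, not computed here.
def make_new_pairs(t, h, t_neg, h_neg):
    return [(t_neg, h), (t, h_neg), (t_neg, h_neg)]
```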
Annotation Process and Agreements. Three annotators and an additional adjudicator did the annotations described above in two phases.
In the first phase, the three annotators added negation to the main verbs of texts and hypotheses (Step 1). After a short training session, we decided to have only one annotator add negation in each original pair as the task is relatively straightforward. Any issues in this phase were detected during Phase 2. Text-hypothesis pairs with issues were discarded (only 5%) and additional pairs were collected to account for the discarded pairs (and still have 1,500 text-hypothesis pairs including negation and generated from each of the three benchmarks, 4,500 new text-hypothesis pairs in total).
In the second phase, the three annotators read the new pairs including negation (automatically generated in Step 2: T_neg-H, T-H_neg and T_neg-H_neg) and manually labeled them with inference judgments (Step 3). In this phase, each pair was annotated by two annotators independently, and the adjudicator resolved any disagreements. We calculated inter-annotator agreements prior to adjudication using Cohen's κ (Cohen, 1960). κ coefficients were 0.85 (RTE), 0.81 (SNLI) and 0.72 (MNLI). κ coefficients between 0.6 and 0.8 are considered substantial, and between 0.8 and 1.0 nearly perfect (Artstein and Poesio, 2008).
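Cohen's κ compares observed agreement with the agreement expected by chance from each annotator's label distribution; a self-contained sketch:

```python
from collections import Counter

# Cohen's kappa (Cohen, 1960) for two annotators labeling the same items:
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
# and p_e is the chance agreement implied by the annotators' label
# distributions. Assumes p_e < 1 (annotators are not both constant).
def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```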
Label Distributions. The original RTE, SNLI and MNLI benchmarks contain, by design, text-hypothesis pairs with roughly uniform judgment distributions. Thus, the majority baseline obtains roughly 50% accuracy in RTE (2 labels) and 33% in SNLI and MNLI (3 labels).
Our new benchmarks including negation do not have a uniform judgment distribution (Table 5), although the pairs generated from MNLI are close (entailment: 24.8%, contradiction: 35.9%, and neutral: 39.3%). We acknowledge that the label distributions in the new pairs generated from RTE (majority baseline: 78.9%) and, to a certain degree, SNLI (majority baseline: 56.5%) are not as challenging as the label distributions in the original pairs. As we shall see in Section 6, however, our experiments show that the pairs generated from MNLI are a challenge for state-of-the-art transformers.
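The majority baselines quoted above follow directly from the label distributions, since the baseline always predicts the most frequent label; using the distribution of the new pairs generated from MNLI:

```python
# Majority-class baseline accuracy equals the share of the most frequent
# label. The percentages below are those reported for the new pairs
# generated from MNLI (Table 5).
def majority_baseline(label_pct):
    return max(label_pct.values())

mnli_neg_dist = {"entailment": 24.8, "contradiction": 35.9, "neutral": 39.3}
```

For instance, majority_baseline(mnli_neg_dist) yields 39.3, i.e., always predicting neutral.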
The Role of Negation. Table 6 presents the analysis of the role of negation in the new benchmarks using the categories presented in Section 4. We analyze 100 text-hypothesis pairs generated from each original benchmark (RTE, SNLI and MNLI). There are fewer unimportant negations in our new benchmarks than in the original corpora (Table 3). While many negations in the new pairs generated from RTE are unimportant (entailment: 52%, no entailment: 56%), few negations in the pairs generated from SNLI and MNLI are unimportant, especially when the text entails or contradicts the hypothesis (SNLI: 17% and 24%, MNLI: 12% and 19%). Unsurprisingly, the percentage of aligned negations is higher in our corpus due to the steps we use to introduce negation, especially in the new pairs generated from RTE (62% vs. 20%).

Experiments and Results
In order to assess whether state-of-the-art systems can solve the task of natural language inference when negation is present, we experiment with three state-of-the-art transformers: BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019). We use the implementations and pretrained models by Wolf et al. (2019), and tune them to solve each benchmark. The supplemental materials provide details about (a) the hyperparameter settings we use to fine-tune these transformers, and (b) other implementation decisions.
We conduct two experiments. First, we assess whether these transformers tuned with the original train splits in RTE, SNLI and MNLI are capable of solving our new benchmarks including negation (Section 6.1). Second, we investigate whether tuning with the new text-hypothesis pairs including negation improves the results (Section 6.2).

Table 8 (caption excerpt): None of the transformers benefit from training with a portion of the pairs that include negation when tested with MNLI, which contains longer and more diverse text-hypothesis pairs.

Training with Existing Benchmarks
Can transformers solve the new text-hypothesis pairs including negation if trained with existing benchmarks? No, they cannot (Table 7). Indeed, the three transformers obtain worse results with the new pairs including negation, especially with SNLI (≈50% drop with the three transformers). These results might be unsurprising with SNLI and RTE, since the original text-hypothesis pairs included few negations (1.19% and 7.16%, Table 1). The pattern also holds, however, with MNLI: we observe relative drops ranging from 23.0% to 24.2% despite 22.63% of text-hypothesis pairs containing a negation in MNLI (Table 1). Comparing with the results obtained with the majority baseline, we observe that the transformers do not learn to solve pairs with negation unless they are tuned with pairs including negation (Section 6.2). Indeed, all of them obtain worse results than the majority baseline in RTE and SNLI, but not in MNLI.

We make a couple of additional observations from the results in Table 7. First, the transformers solve the few text-hypothesis pairs including negation in the original benchmarks (dev_neg) as well as (SNLI, MNLI) or better than (RTE) all pairs (dev). In other words, as our analysis of the role of negation in existing benchmarks points out (Section 4), negations do not bring additional complexity in these benchmarks. Second, RoBERTa and XLNet obtain roughly the same results with the new pairs including negation, but BERT falls slightly behind.
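The relative drops above are computed in the usual way; a one-line helper (the accuracy values in the test below are illustrative, not taken from Table 7):

```python
# Relative accuracy drop (%) when moving from the original dev pairs to
# the new pairs with negation: 100 * (orig - new) / orig.
def relative_drop(orig_acc, new_acc):
    return 100.0 * (orig_acc - new_acc) / orig_acc
```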

Fine-Tuning with New Pairs Containing Negation
Can transformers solve the new text-hypothesis pairs including negation if retrained with some of the new pairs including negation? Only to a certain degree: with SNLI, they benefit but underperform with respect to the original pairs; and with MNLI, they only benefit slightly. In order to investigate whether the transformers can learn to make inference judgments when negation must be considered, we divide the new text-hypothesis pairs containing negation into training (70%) and test (30%) splits. Table 8 shows the results obtained with the new test split and the three transformers trained with (a) the training split in the original benchmarks and (b) the training split in the original benchmarks combined with the training split of pairs containing negation. We observe that the transformers learn to solve the new pairs including negation only in the latter training scenario, and only partially. Indeed, we only observe a large improvement (59.3-64.4% vs. 83.8-88.2%) with the new pairs generated from RTE, which are also the only pairs that obtain higher accuracies than the original development split (83.8-88.2% vs. 66.1-75.8%). With the new pairs generated from SNLI, there is a substantial improvement after fine-tuning (43.3-53.1% vs. 69.1-75.3%), but the three transformers still obtain substantially worse results than with the original development split (69.1-75.3% vs. 89.9-91.6%). Finally, the transformers only benefit marginally from fine-tuning with the new pairs including negation and generated from MNLI. Similar to the results obtained with pairs generated from SNLI, the transformers obtain substantially worse results than with the original development split in MNLI. These results lead to the conclusion that natural language inference when negation is present remains an unsolved challenge.
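The 70/30 division of the new pairs can be sketched as follows; the ratio is from the text, while the seed and shuffling procedure here are ours for illustration:

```python
import random

# 70/30 train/test split of the new text-hypothesis pairs containing
# negation. The ratio is as described in the text; the seed value is an
# arbitrary choice for reproducibility of this sketch.
def split_pairs(pairs, train_frac=0.7, seed=13):
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(round(train_frac * len(pairs)))
    return pairs[:cut], pairs[cut:]
```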

Conclusions
Negation is ubiquitous in English and critical to understanding language and making inferences, as it denies or inverts meaning. Despite this, negation is underrepresented in some natural language inference benchmarks (RTE and SNLI). Additionally, one can ignore negation and still make the correct inference judgment for many text-hypothesis pairs in existing natural language inference benchmarks (RTE, SNLI and MNLI).
In this paper, we have presented a new benchmark of text-hypothesis pairs containing negation (4,500 pairs). We generate and annotate these pairs after systematically adding negation to the main verb of the texts and hypotheses (either one or both) from RTE, SNLI and MNLI; thus they are as difficult to solve as the original pairs except for the presence of negation. State-of-the-art transformers trained with the original training splits from RTE, SNLI and MNLI obtain much worse results with the new benchmark than with the original pairs, including the few original text-hypothesis pairs that do contain negation. In addition, our experimental results show that transformers struggle even after fine-tuning with new pairs containing negation.


A Negation Cue Detector

Figure 1: The BiLSTM-CRF architecture to identify negation cues. The input is a sentence. Each token is the concatenation of the word and its universal part-of-speech tag. The model outputs a sequence of labels indicating negation presence (S_C, P_C, SF_C or N_C). The example input sentence is "Holmes/NOUN would/VERB not/ADV listen/VERB to/ADP such/ADJ fancies/NOUN ,/PUNCT and/CCONJ I/PRON am/VERB his/DET agent/NOUN."

The model assigns each token one of four labels: S_C (single-word cue), P_C (prefixal negation, e.g., inconsistent), SF_C (suffixal negation, e.g., emotionless), and N_C (not a cue).

Training details. We merge the train and development instances from CD-SCO, and use 85% of the result as training and the remaining 15% as development. We evaluate our cue detector with the original test split from CD-SCO. We use the stochastic gradient descent algorithm with the RMSProp optimizer (Tieleman and Hinton, 2012) to tune weights. We set the batch size to 32, and the dropout and recurrent dropout to 30% for the LSTM layers.
We stop the training process when the accuracy in the development split has not increased for 20 epochs, and the final model is the one that yields the highest development accuracy during the training process (not necessarily the model from the last epoch). The neural model has nearly 4.3 million parameters and takes 30 minutes on average to train on a CPU machine (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) with 64 GB of RAM.
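The early-stopping rule can be sketched over a sequence of per-epoch development accuracies (checkpointing of model weights is omitted for brevity):

```python
# Early stopping as described: halt once development accuracy has not
# improved for `patience` consecutive epochs, and keep the best epoch,
# which is not necessarily the last one.
def best_epoch_with_patience(dev_accuracies, patience=20):
    best_epoch, best_acc = 0, float("-inf")
    for epoch, acc in enumerate(dev_accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_acc
```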

B Fine-tuning Hyperparameters for State-of-the-Art Systems
For all the transformer models, we set the maximum sequence length to 128. We use the Hugging Face implementation and pretrained models (Wolf et al., 2019). We work with the default settings for most hyperparameters, except a few used to fine-tune to each benchmark. Table 9 shows the fine-tuned hyperparameters for the three transformers: per transformer and benchmark, batch sizes range from 8 to 32, learning rates are 1e-5 or 2e-5, and training runs for 3 to 50 epochs. We use the base architectures for all the transformers (12 layers, 768 hidden units, 12 attention heads).
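The shared settings can be summarized as a configuration fragment; only values stated in this appendix are included, and the per-benchmark batch sizes, learning rates and epochs of Table 9 are deliberately left out:

```python
# Fine-tuning settings shared by BERT, XLNet and RoBERTa in our
# experiments; values are those stated in this appendix.
SHARED_FINETUNE_CONFIG = {
    "max_seq_length": 128,
    "architecture": "base",     # 12 layers, 768 hidden units, 12 heads
    "layers": 12,
    "hidden_size": 768,
    "attention_heads": 12,
}
```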