On the Robustness of Language Encoders against Grammatical Errors

We conduct a thorough study to diagnose the behaviors of pre-trained language encoders (ELMo, BERT, and RoBERTa) when confronted with natural grammatical errors. Specifically, we collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. We use this approach to facilitate debugging models on downstream applications. Results confirm that the performance of all tested models is affected but the degree of impact varies. To interpret model behaviors, we further design a linguistic acceptability task to reveal their abilities in identifying ungrammatical sentences and the position of errors. We find that fixed contextual encoders with a simple classifier trained on the prediction of sentence correctness are able to locate error positions. We also design a cloze test for BERT and discover that BERT captures the interaction between errors and specific tokens in context. Our results shed light on understanding the robustness and behaviors of language encoders against grammatical errors.


Introduction
Pre-trained language encoders have achieved great success in facilitating various downstream natural language processing (NLP) tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019b). However, they usually assume training and test corpora are clean, and it is unclear how the models behave when confronted with noisy input. Grammatical error is an important type of noise since it naturally and frequently occurs in natural language, especially in spoken and written materials from non-native speakers. Dealing with such noise reflects a model's robustness in representing language and grammatical knowledge. It would also have a positive social impact if language encoders can model texts from non-native speakers appropriately.
Recent work on evaluating models' behaviors against grammatical errors employs various methods, including (1) manually constructing minimal edited pairs on specific linguistic phenomena (Marvin and Linzen, 2018; Goldberg, 2019; Warstadt et al., 2019a,b); (2) labeling or creating acceptability judgment resources (Linzen et al., 2016; Warstadt and Bowman, 2019; Warstadt et al., 2019a); and (3) simulating noise for specific NLP tasks such as neural machine translation (Lui et al., 2018; Anastasopoulos, 2019) and sentiment classification (Baldwin et al., 2017). These studies either focus on specific phenomena and mainly conduct experiments on designated corpora, or rely heavily on human annotations and expert knowledge in linguistics. In contrast, our work automatically simulates naturally occurring grammatical errors of various types and systematically analyzes how this noise affects downstream applications. This holds more practical significance for understanding the robustness of several language encoders against grammatical errors.
Specifically, we first propose an effective approach to simulating diverse grammatical errors, which applies black-box adversarial attack algorithms based on real errors observed in the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013), a grammatical error correction benchmark. This approach transforms clean corpora into corrupted ones and facilitates debugging language encoders on downstream tasks. We demonstrate its flexibility by evaluating models on four language understanding tasks and a sequence tagging task. We next quantify models' capacities for identifying grammatical errors by probing individual layers of pre-trained encoders through a linguistic acceptability task. We construct separate datasets for eight error types. Then, we freeze encoder layers and add a simple classifier on top of each layer to predict the correctness of input texts and locate error positions. This probing task assumes that if a simple classifier performs well on a designated type of error, then the encoder layer is likely to contain knowledge of that error (Conneau et al., 2017; Adi et al., 2017).
Finally, we investigate how models capture the interaction between grammatical errors and contexts. We use BERT as an example and design an unsupervised cloze test to evaluate its intrinsic functionality as a masked language model (MLM).
Our contributions are summarized as follows:
1. We propose a novel approach to simulating various grammatical errors. The proposed method is flexible and can be used to verify the robustness of language encoders against grammatical errors.
2. We conduct a systematic analysis of the robustness of language encoders and extend previous work by studying the performance of models on downstream tasks under various grammatical error types.
3. We demonstrate that: (1) the robustness of existing language encoders against grammatical errors varies; (2) the contextual layers of language encoders acquire stronger abilities in identifying and locating grammatical errors than token embedding layers; and (3) BERT captures the interaction between errors and specific tokens in context, in particular the neighboring tokens of errors.

The code to reproduce our experiments is available at: https://github.com/uclanlp/ProbeGrammarRobustness


Related Work

Probing Pre-trained Language Encoders The recent success of pre-trained language encoders across a diverse set of downstream tasks has stimulated significant interest in understanding their advantages. A portion of past work on analyzing pre-trained encoders is based mainly on clean data. As mentioned in Tenney et al. (2019a), these studies can be roughly divided into two categories: (1) designing controlled tasks to probe whether a specific linguistic phenomenon is captured by models (Conneau et al., 2018; Peters et al., 2019; Tenney et al., 2019b; Liu et al., 2019a; Kim et al., 2019), or (2) decomposing the model structure and exploring what linguistic property is encoded (Tenney et al., 2019a; Jawahar et al., 2019; Clark et al., 2019). However, these studies do not analyze how grammatical errors affect model behaviors.
Our work is related to studies on analyzing models with manually created noise. For example, Linzen et al. (2016) evaluate whether LSTMs capture the hierarchical structure of language by using verbal inflection to violate subject-verb agreement. Marvin and Linzen (2018) present a new dataset consisting of minimal edited pairs with opposite linguistic acceptability on three specific linguistic phenomena and use it to evaluate RNNs' syntactic ability. Goldberg (2019) adapts this method to evaluate BERT. Warstadt et al. (2019a) further compare five analysis methods under a single phenomenon. Despite the diversity in methodology, these studies share common limitations. First, they cover only a single or a few specific aspects of linguistic knowledge; second, their experiments are mainly based on constructed datasets instead of real-world downstream applications. In contrast, we propose a method that covers a broader range of grammatical errors and evaluate on downstream tasks. A concurrent work (Warstadt et al., 2019b) facilitates diagnosing language models by creating linguistic minimal pair datasets for 67 individual grammatical paradigms in English using linguist-crafted templates. In contrast, we do not rely heavily on artificial vocabulary and templates.

Synthesized Errors
To evaluate and promote the robustness of neural models against noise, some studies manually create new datasets with specific linguistic phenomena (Linzen et al., 2016; Marvin and Linzen, 2018; Goldberg, 2019; Warstadt et al., 2019a). Others have introduced various methods to generate synthetic errors on clean downstream datasets, in particular machine translation corpora. Belinkov and Bisk (2018) and Anastasopoulos (2019) demonstrate that synthetic grammatical errors induced by character manipulation and word substitution can degrade the performance of NMT systems. Baldwin et al. (2017) augment original sentiment classification datasets with syntactically (reordering) and semantically (word substitution) noisy sentences and achieve higher performance. Our method is partly inspired by Lui et al. (2018), who synthesize semi-natural ungrammatical sentences by maintaining confusion matrices for five simple error types.
Another line of studies uses black-box adversarial attack methods to create adversarial examples for debugging NLP models (Ribeiro et al., 2018;Jin et al., 2019;Alzantot et al., 2018;Burstein et al., 2019). These methods create a more challenging scenario for target models compared to the above data generation procedure. Our proposed simulation benefits from both adversarial attack algorithms and semi-natural grammatical errors.

Method
We first explain how we simulate ungrammatical scenarios. Then, we describe target models and the evaluation design.

Grammatical Error Simulation
Most downstream datasets contain only clean and grammatical sentences. Although recent language encoders achieve promising performance on them, it is unclear whether they perform equally well on text with grammatical errors. Therefore, we synthesize grammatical errors on clean corpora to test the robustness of language encoders. We use a controllable rule-based method to collect and mimic errors observed in NUCLE, following previous work (Lui et al., 2018; Sperber et al., 2017), and apply two ways to introduce errors into clean corpora: (1) we sample errors based on the frequency distribution of NUCLE and introduce them at plausible positions; (2) inspired by the literature on adversarial attacks (Ribeiro et al., 2018; Jin et al., 2019; Alzantot et al., 2018), we conduct search algorithms to introduce grammatical errors that cause the largest performance drop on a given downstream task.
Mimic Error Distribution on NUCLE We first describe how to extract the error distribution on NUCLE (Dahlmeier et al., 2013). NUCLE is constructed from naturally occurring data (student essays at NUS) annotated with error tags. Each ungrammatical sentence is paired with its correction that differs only in local edits. The two sentences make up a minimal edited pair. For example:

1. Will the child blame the parents after he growing up? (×)
2. Will the child blame the parents after he grows up? (✓)

The NUCLE corpus contains around 59,800 sentences with an average length of 20.38 tokens. About 6% of tokens in each sentence contain grammatical errors. There are 27 error tags, including Prep (preposition errors), ArtOrDet (article or determiner errors), Vform (incorrect verb form) and so forth.
We consider eight frequently occurring, token-level error types in NUCLE, as shown in Table 1.
These error types perturb a sentence in terms of syntax (SVA, Worder), semantics (Nn, Wchoice, Trans), or both (ArtOrDet, Prep, Vform), and thus cover a wide range of noise in natural language. Then, we construct a confusion set for each error type based on observations from NUCLE. Each member of a confusion set is a token. We assign a weight w_ij between tokens t_i and t_j in the same set to indicate the probability that t_i will be replaced by t_j. In particular, for ArtOrDet, Prep and Trans, the confusion set consists of tokens that frequently occur as errors or corrections on NUCLE. For each token t_i in the set, we compute w_ij based on how many times t_i is replaced by t_j in minimal edited pairs on NUCLE.
Notice that we add a special token ø to represent deletion and insertion. For Nn, when we find a noun, we add it and its singular (SG) or plural (PL) counterpart to the set. For SVA, when we find a verb in the present tense, we add it and its third-person-singular (3SG) or non-third-person (not 3SG) counterpart to the set. For Worder, we exchange the position of an adverb with its neighboring adjective, participle, or modal. For Vform, we use NLTK (Bird and Loper, 2004) to extract the present, past, progressive, and perfect forms of a verb and add them to the set. For Wchoice, we select ten synonyms of a target word from WordNet. The substitution weights are set to be uniform for both Vform and Wchoice.
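As a rough sketch, a weighted confusion set of the kind described above might be built as follows. All names and counts here are hypothetical toy values, with the string "<del>" standing in for the special deletion/insertion token ø:

```python
import random

# Hypothetical sketch of a weighted confusion set, assuming substitution
# counts have been extracted from NUCLE minimal edited pairs.
class ConfusionSet:
    def __init__(self):
        # counts[t_i][t_j] = how often t_i was replaced by t_j on NUCLE
        self.counts = {}

    def observe(self, original, replacement):
        """Record one minimal-edited-pair substitution."""
        row = self.counts.setdefault(original, {})
        row[replacement] = row.get(replacement, 0) + 1

    def weight(self, t_i, t_j):
        """w_ij: probability that t_i is replaced by t_j."""
        row = self.counts.get(t_i, {})
        total = sum(row.values())
        return row.get(t_j, 0) / total if total else 0.0

    def sample_error(self, token, rng=random):
        """Sample a substitute for `token` according to the weights."""
        row = self.counts.get(token)
        if not row:
            return token  # confusion set not applicable to this token
        choices, weights = zip(*row.items())
        return rng.choices(choices, weights=weights, k=1)[0]

# Toy ArtOrDet set: "a" was corrected to "the" 3 times, deleted once.
artordet = ConfusionSet()
for _ in range(3):
    artordet.observe("a", "the")
artordet.observe("a", "<del>")
```

With these toy counts, `artordet.weight("a", "the")` is 0.75 and `artordet.weight("a", "<del>")` is 0.25, mirroring how w_ij is estimated from substitution frequencies.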
Grammatical Error Introduction We introduce errors in two ways. The first is called probabilistic transformation. Similar to Lui et al. (2018), we first obtain the parse tree of the target sentence using the Berkeley syntactic parser (Petrov et al., 2006). Then, we sample an error type from the error type distribution estimated from NUCLE and randomly choose a position that can apply this type of error according to the parse tree. Finally, we sample an error token based on the weights from the confusion set of the sampled error type and introduce the error token to the selected position.
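A minimal sketch of the probabilistic transformation, with toy stand-ins for the NUCLE-estimated error-type distribution and confusion sets (all names and numbers here are hypothetical, and the parse-tree check for applicable positions is simplified to a dictionary lookup):

```python
import random

# Toy error-type distribution and per-type confusion sets; in the real
# pipeline these are estimated from NUCLE and positions are validated
# against a parse tree.
ERROR_TYPE_DIST = {"ArtOrDet": 0.4, "Prep": 0.35, "SVA": 0.25}
CONFUSION = {
    "ArtOrDet": {"a": {"the": 0.75, "<del>": 0.25}},
    "Prep": {"on": {"in": 0.6, "at": 0.4}},
    "SVA": {"runs": {"run": 1.0}},
}

def probabilistic_transform(tokens, rng=random):
    """Sample an error type, pick an applicable position, substitute."""
    etype = rng.choices(list(ERROR_TYPE_DIST),
                        weights=ERROR_TYPE_DIST.values(), k=1)[0]
    sets = CONFUSION[etype]
    positions = [i for i, t in enumerate(tokens) if t in sets]
    if not positions:
        return tokens  # sentence cannot host this error type
    i = rng.choice(positions)
    subs = sets[tokens[i]]
    error = rng.choices(list(subs), weights=subs.values(), k=1)[0]
    return tokens[:i] + ([] if error == "<del>" else [error]) + tokens[i + 1:]
```

For instance, `probabilistic_transform(["she", "runs", "on", "a", "track"])` introduces one sampled error (e.g., "runs" → "run" or "a" → "the"), while a sentence with no applicable positions is returned unchanged.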
However, probabilistic transformation only represents the average case. To debug and analyze the robustness of language encoders, we consider another, more challenging setting: worst-case transformation, where we leverage search algorithms from the black-box adversarial attack literature to determine error positions. More concretely, we obtain an operation set for each token in a sentence by considering all possible substitutions based on all confusion sets. Note that some confusion sets are not applicable, for example the confusion set of Nn to a verb. Each operation in the operation set either replaces the target token or changes its position. Then, we apply a search algorithm to select operations from these operation sets that change the prediction of the tested model and apply them to generate error sentences. Three search algorithms are considered: greedy search, beam search, and a genetic algorithm.

Greedy search attack is a two-step procedure. First, we evaluate the importance of tokens in a sentence. The importance of a token is represented by the likelihood decrease of the model prediction when the token is deleted: the larger the decrease, the more important the token. After comparing all tokens, we obtain a list of tokens sorted in descending order of importance. Then, we walk through the list. For each token in the list, we try all operations from the operation set associated with that token and apply the operation that degrades the likelihood of the model prediction the most. We repeat this second step until the prediction changes or a budget (e.g., number of operations per sentence) is reached.
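The greedy procedure can be sketched as follows, assuming hypothetical black-box helpers `score(tokens)` (the model's likelihood of the gold label) and `operations(tokens, i)` (candidate substitutions for position i drawn from the applicable confusion sets). For simplicity, the prediction-flip stopping check is reduced to a fixed edit budget:

```python
def greedy_attack(tokens, score, operations, budget=3):
    # Step 1: rank tokens by importance = likelihood drop when deleted.
    base = score(tokens)
    importance = []
    for i in range(len(tokens)):
        deleted = tokens[:i] + tokens[i + 1:]
        importance.append((base - score(deleted), i))
    order = [i for _, i in sorted(importance, reverse=True)]

    # Step 2: walk the sorted list; at each position apply the single
    # operation that degrades the gold-label likelihood the most.
    edits = 0
    for i in order:
        if edits >= budget:
            break
        best, best_tok = score(tokens), None
        for cand in operations(tokens, i):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            s = score(trial)
            if s < best:
                best, best_tok = s, cand
        if best_tok is not None:
            tokens = tokens[:i] + [best_tok] + tokens[i + 1:]
            edits += 1
    return tokens
```

With a toy scorer that rewards the token "good", the attack replaces the important tokens first while leaving neutral ones untouched.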
Beam search is similar to greedy search. The only difference is that when we walk through the sorted list of tokens, we maintain a beam of fixed size k containing the top k operation streams with the highest global degradation.
Genetic algorithm is a population-based iterative method for finding well-suited adversarial candidates. We start by randomly selecting operations to build a generation and then use a combination of crossover and mutation to find better candidates. We refer readers to Alzantot et al. (2018) for details of the genetic algorithm in adversarial attacks. Comprehensive descriptions of all methods can be found in Appendix C.

Target Models
We evaluate the following three pre-trained language encoders. Detailed descriptions of models and training settings are in Appendix B.
ELMo (Peters et al., 2018) is a three-layer LSTM-based model pre-trained on the bidirectional language modeling task on the 1B Word Benchmark (Chelba et al., 2014). We fix ELMo as a contextual embedding and add two layers of BiLSTM with an attention mechanism on top of it.
BERT (Devlin et al., 2019) is a transformer-based (Vaswani et al., 2017) model pre-trained on masked language modeling and next sentence prediction tasks. It uses 16GB of English text and adapts to downstream tasks by fine-tuning. We use BERT-base-cased for Named Entity Recognition (NER) and BERT-base-uncased for other tasks and perform task-specific fine-tuning.
RoBERTa (Liu et al., 2019b) is a robustly pre-trained BERT model using larger pre-training data (160GB in total), longer pre-training time, a dynamic masking strategy, and other optimized pre-training methods. We use RoBERTa-base and perform task-specific fine-tuning.

Evaluation Methods
We design the following three evaluation methods to systematically analyze how language encoders are affected by grammatical errors in input.

Simulate Errors on Downstream Tasks
Using the simulation methods discussed in Section 3.1, we are able to perform evaluation on existing benchmark corpora. In our experiments, we consider the target models independently. The whole procedure is: given a dataset, the target model is first trained (fine-tuned) and evaluated on the clean training and development sets. Then, we discard the wrongly predicted examples from the development set and apply simulation methods to perturb each remaining example. We compute the attack success rate (number of successfully attacked examples / number of all examples) as an indicator of model robustness against grammatical errors. The smaller the rate, the more robust the model.
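The evaluation loop can be sketched as follows, with `predict` and `perturb` as hypothetical stand-ins for the task model's prediction function and one of the transformations described above:

```python
def attack_success_rate(model, dev_set, predict, perturb):
    """Fraction of correctly-predicted examples whose perturbed
    version flips the model's prediction."""
    # Keep only examples the model already predicts correctly.
    correct = [(x, y) for x, y in dev_set if predict(model, x) == y]
    attacked = sum(1 for x, y in correct if predict(model, perturb(x)) != y)
    return attacked / len(correct) if correct else 0.0
```

Note that, because the rate is computed only over each model's own correctly-predicted examples, rates are not directly comparable across models (a point revisited in the results discussion).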
Linguistic Acceptability Probing We design a linguistic acceptability probing task to evaluate each individual error type. We consider two aspects: (1) whether the model can tell if a sentence is grammatically correct (i.e., a binary classification task); (2) whether the model can locate error positions at the token level. We fix the target model and train a self-attention classifier to perform both probing tasks.
Cloze test for BERT We design an unsupervised cloze test to evaluate the masked language model component of BERT based on minimal edited pairs. For each minimal pair that differs in only one token, we quantify how the probability of predicting a single masked token in the rest of the sentence is affected by this grammatical error. This method analyzes how the error token affects the clean context, which is complementary to Goldberg (2019), who focuses on SVA errors and discusses how clean contexts influence the prediction of the masked error token.

How Do Grammatical Errors Affect Downstream Performance?
In this section, we simulate grammatical errors and analyze performance drops on downstream tasks. We compare ELMo, BERT, RoBERTa and a baseline model, InferSent (Conneau et al., 2017).

Attack Settings For all tasks, we limit the maximum percentage of allowed modifications in a sentence to 15% of tokens, which is a reasonable rate according to statistics estimated from real data. As shown in Table 3, the worst-case transformation modifies only around 9% of tokens overall under this limitation. For MNLI and QNLI, we only modify the second sentence, i.e., the hypothesis and the answer, respectively. For MRPC, we only modify the first sentence. We do not apply the genetic algorithm to MNLI and QNLI due to the relatively large number of examples in their development sets, which would induce an extremely long attack time. For NER, we keep the named entities and only modify the remaining tokens.

Table 2 presents the test performance of the four target models on the standard development set of each task. Table 3 summarizes the attack success rates on language understanding tasks, the decreases in F1 score on NER, and the mean percentage of modified tokens (numbers in brackets). All numbers are percentages. As shown in Table 3, with the probabilistic transformation, the attack success rates fall between 2% (RoBERTa, QNLI) and 10% (ELMo, MRPC). With the worst-case transformation, we obtain the highest attack success rate of 81.1% (ELMo, genetic algorithm, MRPC) and an average attack success rate across all tasks of 29% by perturbing only around 9% of tokens. This result confirms that all models are influenced by ungrammatical inputs. The NER task is in general harder to influence with grammatical errors. With the probabilistic transformation, the drop in F1 score ranges from 2% to 4%. For the worst-case transformation, the highest drop for NER is 18.33% (ELMo, beam search).

Results and Discussion
Considering different target models, we observe that the impact of grammatical errors varies among models. Specifically, RoBERTa exhibits strong robustness against the impact of grammatical errors, with consistently lower attack success rates (20.28% on average) and F1 score decreases (17.50% on average) across all tasks, especially on MRPC and MNLI. On the other hand, BERT, ELMo, and InferSent experience average attack rates of 26.03%, 33.06%, and 36.07%, respectively, on NLU tasks. Given the differences in pre-training strategies, we speculate that pre-training with more data might benefit model robustness against noisy data. This speculation is consistent with (Warstadt et al., 2019b), where the authors also give a lightweight demonstration on LSTM and Transformer-XL (Dai et al., 2019) with varying amounts of training data. We leave further exploration of this speculation and a detailed analysis of model architecture to future work.
Note that in this experimental setting, for each model, we follow the literature and compute the attack success rate only on the instances where the model makes correct predictions. Therefore, the attack success rates across different models are not directly comparable. To compare the robustness of different encoders, we further examine attack success rates on the common subset of the development set on which all models make correct predictions. We find that the overall trend is similar to that in Table 3. For example, the greedy attack success rates of RoBERTa, BERT, and ELMo on MRPC and SST-2 are 14.4%, 22.1%, 46.8% and 28.2%, 30.0%, 33.9%, respectively.
To better understand the effect of grammatical errors, we also analyze (1) which error type harms performance the most and (2) how different error rates affect performance. For the first question, we quantify the harm of an error type by the total number of times it is chosen in successful greedy attack examples. We conduct experiments analyzing BERT and RoBERTa on the development sets of MRPC, MNLI-m, and SST-2, as shown in Table 4. Among all types, Wchoice is the most harmful and Worder the least; SVA ranks as the second most harmful. Notice that although Nn changes a token in a similar way to SVA (both mostly add or drop -s or -es), they have different influences on the model. As for errors related to function words, Prep plays a more important role in general, but ArtOrDet harms MNLI more.
For the second question, we increase the allowed modifications of the greedy attack from 15% to 45% of tokens per sentence, resulting in an actual percentage of modified tokens under 20%. We evaluate all models on the development set of MNLI-m. Results are shown in Fig 1. We find that all attack success rates grow almost linearly as we increase modifications. ELMo and BERT perform almost the same, while InferSent grows faster at the beginning and RoBERTa grows more slowly toward the end. The average attack success rate reaches 70% when the error rate is around 20%.

To What Extent Do Models Identify Grammatical Errors?
Our goal in this section is to assess the ability of the pre-trained encoders to identify grammatical errors. We use a binary linguistic acceptability task to test a model's ability to judge the grammatical correctness of a sentence. We further study whether the model can precisely locate error positions, which reflects token-level ability.
Data We construct separate datasets for each specific type of grammatical error. For each dataset, we extract 10,000 sentences whose lengths fall within 10 to 60 tokens from the 1B Word Benchmark (Chelba et al., 2014). Then, we introduce the target error type to half of these sentences using the probabilistic transformation and keep the error rate over each dataset around 3% (resulting in one or two errors per sentence).

Models We study individual layers of ELMo (2 layers), BERT-base-uncased (12 layers) and RoBERTa-base (12 layers). In particular, we fix each layer and attach a trainable self-attention layer on top of it to obtain a sentence representation. The sentence representation is fed into a linear classifier to output the probability of whether the sentence is linguistically acceptable. See details about the self-attention layer and the linear classifier in Appendix B.3. We next extract the two positions with the heaviest weights from the trained self-attention layer. If the position of the error token is among them, we consider the error correctly located by the model at the token level. This suggests whether contextual encoders provide enough information for the classifier to identify error locations. For comparison, we also evaluate the input embedding layer (non-contextualized, layer 0) of each model as a baseline. We compute accuracy for both sentence-level and token-level evaluations.
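The token-level check described above can be sketched as follows. This is a simplified, hypothetical version that takes the trained self-attention weights as plain lists of floats:

```python
def located(attn_weights, error_pos, k=2):
    """True if the error position is among the k heaviest attention
    weights (k=2 mirrors the top-two check in the probing task)."""
    top_k = sorted(range(len(attn_weights)),
                   key=lambda i: attn_weights[i], reverse=True)[:k]
    return error_pos in top_k

def token_level_accuracy(examples, k=2):
    """examples: list of (attention_weights, error_position) pairs."""
    hits = sum(located(w, p, k) for w, p in examples)
    return hits / len(examples) if examples else 0.0
```

For example, with weights [0.1, 0.5, 0.3, 0.1], an error at position 1 counts as located (it is among the top two) while one at position 3 does not.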

Results and Discussion
We visualize the results of four layers of BERT on four error types (ArtOrDet, Nn, SVA, and Worder) in Fig 2. Complete results for all layers and other error types are in Appendix D. We find that the mean sentence-level accuracy of the best contextual layers of BERT, ELMo, and RoBERTa across error types is 87.8%, 84.3%, and 90.4%, respectively, while the input embedding layers achieve 64.7%, 65.8%, and 66.0%. At the token level, despite being trained only to predict whether a sentence is acceptable, classifiers on the best layers of BERT, ELMo, and RoBERTa achieve mean accuracies of 79.3%, 63.3%, and 80.3%, compared to 48.6%, 18.7%, and 53.4% for the input embedding layers. These two facts indicate that the pre-trained encoder layers possess stronger abilities to detect and locate grammatical errors than the input embedding layers.
We also observe model-specific patterns. The middle layers (layers 7-9) of BERT are better at identifying errors than lower or higher layers, as shown in Fig 2, but higher layers of BERT locate errors related to long-range dependencies and verbs, such as SVA and Vform, more accurately. To further investigate BERT's knowledge of error locations, we conduct the same token-level evaluation on the 144 attention heads of BERT. Results for Prep and SVA are visualized in Fig 3. We find that even in a completely unsupervised manner, some attention heads achieve 50%-60% accuracy in locating errors. Consistent with the self-attention layers, attention heads from the middle layers perform best. See Appendix F for all error types.
Due to space limits, we present results for RoBERTa and ELMo in Appendix D and summarize the observations here. RoBERTa exhibits a better ability to detect and locate errors in lower layers compared to BERT and achieves its best performance in the top layers (layers 10-11). It is also good at capturing verb and dependency errors. On the other hand, the first layer of ELMo consistently gives the highest sentence-level classification accuracy, but its best-performing layer for locating errors depends on the error type and varies between the first and the second layer. In particular, the second layer of ELMo exhibits a strong ability to locate Nn and outperforms BERT in accuracy. This is surprising given that Nn is not obvious from the character embeddings of layer 0 of ELMo. We further notice that, for all models, Worder is the hardest type to detect at the sentence level, and ArtOrDet and Worder are the hardest types to locate at the token level. We hypothesize that this is related to the locality of these errors, which induces a weak signal for models to identify them. Appendix E demonstrates some examples of the token-level evaluation of BERT.

[Fig 4 caption: Each row represents a target error type; each column, the distance from the error position; each number, the mean likelihood drop over all pairs. We find that specific tokens are affected more by error tokens.]

How BERT Captures the Interaction between Tokens When Errors Are Present
We aim to reveal the interaction between grammatical errors and their nearby tokens through studying the masked language model (MLM) component of BERT. We investigate BERT as it is a typical transformer-based encoder. Our analysis can be extended to other models.

Experimental Settings
We conduct experiments on minimal edited pairs from NUCLE. We extract pairs with error tags ArtOrDet, Prep, Vt, Vform, SVA, Nn, Wchoice, and Trans, and keep those that have only one token changed. This gives us eight collections of minimal edited pairs with sizes of 586, 1525, 1817, 943, 2513, 1359, 3340, and 452, respectively. Given a minimal edited pair, we consider tokens within six tokens of the error token. We replace the same token in the grammatical and ungrammatical sentences with [MASK], one at a time, and use BERT as an MLM to predict its likelihood. We then compute the likelihood drop in the ungrammatical sentence and obtain the average drop over all minimal edited pairs.

[Table 5 example (ArtOrDet): "The inexpensive fuel cost and the sheer volume of energy produced by a (the) nuclear reactor far outweighs the cost of research and development." Likelihood drops for "produced by a (the) nuclear reactor": 0.05, -0.02, -0.31, 0.42.]

Given the fact that certain dependencies between tokens, such as subject-verb and determiner-noun dependencies, are accurately modeled by BERT, as demonstrated in prior work (Jawahar et al., 2019), we suspect that the presence of an error token will mostly affect its neighboring tokens (both syntactic and physical neighbors). This is consistent with our observation in Fig 4 that in the case of SVA, where a subject is mostly the token preceding a verb (although agreement attractors can exist between subject and verb), the tokens preceding error positions show the largest likelihood decreases overall. In the case of ArtOrDet, where an article or a determiner can be both an indicator and a dependent of the subsequent noun, predicting the tokens following error positions becomes much harder. We provide two running examples with ArtOrDet in Table 5 to further illustrate this point.
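The per-distance likelihood-drop computation can be sketched as follows, with `mlm_log_prob(tokens, i)` as a hypothetical stand-in for BERT's masked-LM score of tokens[i] given the rest of the sentence (the real experiment would call a pre-trained MLM here):

```python
def likelihood_drops(good, bad, error_pos, mlm_log_prob, window=6):
    """Per-position drop in MLM score caused by the error token.

    `good` and `bad` are token lists differing only at `error_pos`
    (single-token substitutions, as in the filtered NUCLE pairs).
    Returns {distance_from_error: drop} for shared context tokens
    within `window` tokens of the error.
    """
    drops = {}
    for i in range(len(good)):
        d = i - error_pos
        if d == 0 or abs(d) > window:
            continue  # skip the error token itself and far-away tokens
        # Mask position i in both sentences and compare scores.
        drops[d] = mlm_log_prob(good, i) - mlm_log_prob(bad, i)
    return drops
```

Averaging these per-distance drops over all minimal edited pairs of an error type yields one row of the Fig 4 heatmap.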

Adversarial Training
Finally, we explore a data augmentation method based on the proposed grammatical error simulations. We apply the greedy search algorithm to introduce grammatical errors into the training examples of a target task and retrain the model on the augmented training set. Results are shown in Figure 5. We find that by adding a small number of adversarial examples, accuracy recovers from 46% to 82%. As the proportion of augmented adversarial examples increases, accuracy continues to increase on the corrupted set, with negligible changes to the original validation accuracy. This also demonstrates that our simulated examples are potentially helpful for reducing the effect of grammatical errors.

Conclusion
In this paper, we conducted a thorough study to evaluate the robustness of language encoders against grammatical errors. We proposed a novel method for simulating grammatical errors to facilitate our evaluations. We studied three pre-trained language encoders, ELMo, BERT, and RoBERTa, and concentrated on three aspects of their behavior under grammatical errors: performance on downstream tasks with noisy texts, ability to identify errors, and how they capture the interaction between tokens in the presence of errors. This study sheds light on understanding the behavior of language encoders against grammatical errors, and we encourage future work to enhance the robustness of these models.

A Downstream Task Details
We test on four language understanding datasets and one sequence labeling dataset. Statistics of these datasets are listed in Table 6.
MRPC The Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) is a paraphrase detection task which aims to predict a binary label for whether two sentences are semantically equivalent.

MNLI
The Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018) is a broad-domain natural language inference task to predict the relation (entailment, contradiction, neutral) between premise and hypothesis. MNLI contains both the matched (in-domain) and mismatched (cross-domain) sections.
QNLI The Question-answering NLI task (QNLI) is recast from the Stanford Question Answering Dataset (Rajpurkar et al., 2016) and aims to determine whether a context sentence contains the answer to the question (entailment, not entailment).

SST-2 The Stanford Sentiment Treebank two-way class split (SST-2) (Socher et al., 2013) is a binary classification task which assigns positive or negative labels to movie review sentences.
CoNLL-2003 NER The shared task of CoNLL-2003 Named Entity Recognition (NER) (Sang and Meulder, 2003) is a token-level sequence labeling task to recognize four types of named entities: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups.

[Table 6: Statistics of the downstream task datasets.]

B.2 Training and Fine-tuning Details
For BERT and RoBERTa, we set the maximum input length to 128, the maximum number of epochs to 3, and the dropout to 0.1 for all tasks. We use Adam (Kingma and Ba, 2015) with an initial learning rate of 2e-5, a batch size of 16, and no warm-up steps for training. For ELMo, we train the BiLSTM using Adam (Kingma and Ba, 2015) with an initial learning rate of 1e-4 and a batch size of 32. We set the dropout to 0.2 and the maximum number of epochs to 10, and divide the learning rate by 5 when the performance does not improve for 2 epochs.
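The ELMo schedule above (divide the learning rate by 5 after 2 epochs without improvement) can be sketched as a small patience-based scheduler; the class and variable names here are our own:

```python
class PatienceLRSchedule:
    """Divide the learning rate by `factor` whenever the validation
    metric fails to improve for `patience` consecutive epochs."""

    def __init__(self, lr=1e-4, factor=5.0, patience=2):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        if val_metric > self.best:      # improvement: reset the counter
            self.best = val_metric
            self.bad_epochs = 0
        else:                           # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor  # decay after `patience` bad epochs
                self.bad_epochs = 0
        return self.lr

sched = PatienceLRSchedule(lr=1e-4, factor=5.0, patience=2)
lrs = [sched.step(m) for m in [0.70, 0.72, 0.71, 0.71, 0.73]]
```

After two consecutive epochs without improvement (the two 0.71 readings), the learning rate drops from 1e-4 to 2e-5 and then stays there once the metric improves again.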

B.3 Probing Model Details
We use a self-attention layer and a linear classifier to compose the probing component in section 5. The self-attention layer takes as input the hidden representations from a fixed layer $i$ of an encoder, denoted as $h^i = \{h^i_1, h^i_2, \ldots, h^i_n\}$, and outputs a sentence representation $s^i$:
$$\alpha = \mathrm{softmax}\big(v_b^\top \tanh(W_a [h^i_1, \ldots, h^i_n])\big), \qquad s^i = \sum_{t=1}^{n} \alpha_t h^i_t,$$
where $W_a$ is a weight matrix and $v_b$ is a vector of parameters. $s^i$ is fed to the classifier to output the probability of the sentence being linguistically acceptable. The self-attention layer has a hidden dimension of 100 and a dropout of 0.1. The classifier has one layer and a dropout of 0.1. The probing model is trained with Adam (Kingma and Ba, 2015) using a learning rate of 0.001, a batch size of 8, and an $L_2$ weight decay of 0.001 for 10 epochs, with an early-stopping patience of 2.
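Assuming the probe uses standard additive self-attention pooling, a minimal NumPy sketch of its forward pass looks like the following; the dimensions, random initialization, and sigmoid output layer are illustrative assumptions, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def probe_forward(h, W_a, v_b, w_cls, b_cls=0.0):
    """h: (n, d) hidden states of one frozen encoder layer.
    Returns (p_acceptable, attention weights alpha)."""
    scores = np.tanh(h @ W_a.T) @ v_b   # (n,) unnormalized attention scores
    alpha = softmax(scores)             # (n,) attention distribution over tokens
    s = alpha @ h                       # (d,) pooled sentence representation s^i
    logit = s @ w_cls + b_cls           # one-layer classifier
    return 1.0 / (1.0 + np.exp(-logit)), alpha

n, d, a = 7, 16, 100                    # tokens, encoder dim, attention hidden dim (100 as in B.3)
h = rng.normal(size=(n, d))             # stand-in for frozen layer-i representations
W_a = rng.normal(scale=0.1, size=(a, d))
v_b = rng.normal(scale=0.1, size=a)
w_cls = rng.normal(scale=0.1, size=d)

p, alpha = probe_forward(h, W_a, v_b, w_cls)
top2 = np.argsort(alpha)[-2:]           # the two most-attended positions
```

The last line mirrors the token-level evaluation: the two positions with the highest attention weights are compared against the true error positions.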

C Attack Algorithms
We use three search algorithms, greedy search, beam search, and a genetic algorithm, in adversarial attacks based on the real errors in NUCLE (Dahlmeier et al., 2013). For beam search, we set the beam size to 5. For the genetic algorithm, we set the population in each generation to 60 and the maximum number of generations to 23% of the corresponding sentence length. For example, if a sentence has 100 tokens, the genetic algorithm iterates for at most 23 generations. Algorithms 1, 2, and 3 give detailed descriptions of the greedy attack, the beam search attack, and the genetic algorithm attack, respectively.

D Complete Probing Results

Table 7 shows complete results for probing individual layers of ELMo, BERT, and RoBERTa across eight error types in the sentence-level binary classification task. We fix the parameters of the pre-trained encoders and train a self-attention classifier on top of each layer to judge the binary linguistic acceptability of a sentence. We find that layer 1 of ELMo, the middle layers of BERT, and the top layers of RoBERTa perform best in this evaluation.

Table 8 shows complete results for probing individual layers of ELMo, BERT, and RoBERTa across eight error types in the token-level evaluation. We first fix the parameters of the pre-trained encoders and train a self-attention classifier for each layer to judge the binary linguistic acceptability of a sentence. Then, we extract the two positions with the highest attention weights of the self-attention layer and check whether the error tokens are included.

E Case Study of Locating Error Positions
We show some examples of the token-level evaluation in section 5. We randomly select one example for each error type and visualize the attention weights of the self-attention layer built upon different layers of BERT. A deeper purple under a token means the self-attention layer puts more attention on that token.

Algorithm 1 Greedy attack
Input: original sentence X_ori = {w_1, w_2, ..., w_n}, ground-truth prediction Y_ori, target model F, all confusion sets P, budget b.
Output: adversarial example X_adv.
1: Initialization: X_adv ← X_ori
2: for each w_i in X_ori do
3:   Delete w_i and compute the drop in the likelihood of Y_ori to obtain the importance score of w_i, denoted as S_{w_i}.
4:   Apply all substitutions of P to w_i to obtain the operation pool of w_i, denoted as W^sub_i.
5: end for
6:
7: Get the index list of X_ori in decreasing order of token importance: I ← argsort_{w_i ∈ X_ori}(S_{w_i})
8: for each i in I do
9:   p_ori ← F(X_adv)|_{Y = Y_ori}
10:  for each w in W^sub_i do
11:    Substitute w_i with w in X_adv (or swap their positions),
12:    Y_adv ← argmax F(X_adv), p_adv ← F(X_adv)|_{Y = Y_ori}
13:    if not Y_ori = Y_adv then return X_adv
14:    else
15:      if p_adv < p_ori then
16:        Substitute w_i with w_select in X_adv,
23: end for
24: return X_ori

F The Token-level Evaluation on Attention Heads of BERT
As mentioned in section 5, we also conduct the same token-level probing on the 144 attention heads of BERT. In this experiment, the parameters of BERT are completely frozen. We observe that even in this unsupervised setting, some attention heads are still capable of precisely locating error positions. The middle layers of BERT perform best. We further observe that some attention heads might be associated with specific types of errors. For example, head 2 in layer 9 and head 6 in layer 10 are good at capturing SVA and Vform. Both of these error types are related to verbs.

Algorithm 3 Genetic algorithm attack (excerpt)
6: Randomly select a position j and an operation from W^sub_j to modify X_ori, then add the result to P^0.
7: end for
8:
9: for g = 1, 2, 3, ..., G − 1 do
10:  for i = 1, 2, 3, ..., p_s do
11:    if not Y_adv = Y_ori then return P^{g−1}_i
13:  else
14:    X_elite ← argmin(p_adv)
15:    prob ← normalize sample probabilities with F(P^{g−1}_i)
17:  for i = 2, 3, ..., p_s do
18:    Sample parent_1 from P^{g−1} with probabilities prob
19:    Sample parent_2 from P^{g−1} with probabilities prob
20:    child ← Crossover(parent_1, parent_2)
21:    child_mut ← Randomly select a position and an operation from W^sub_j to modify child

Figure 6: Visualization of attention weights of self-attention layers. Each figure represents a sentence with a specific error type. Errors in a sentence are highlighted in red. Each column represents one layer of BERT that the self-attention layer is built upon.
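The greedy attack of Algorithm 1 can be condensed into a short Python sketch; the toy scoring model, confusion sets, and example sentence below are our own illustrative stand-ins for a real encoder and the NUCLE-derived confusion sets:

```python
def greedy_attack(tokens, confusion_sets, p_true, budget):
    """Greedy attack: rank tokens by importance (confidence drop when the
    token is deleted), then greedily apply the confusion-set substitution
    that most reduces the model's confidence in the true label, stopping
    when the prediction flips (confidence < 0.5 here) or the budget is spent."""
    base = p_true(tokens)
    # importance score S_{w_i}: drop in confidence when token i is deleted
    importance = [base - p_true(tokens[:i] + tokens[i + 1:])
                  for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: -importance[i])
    adv, edits = list(tokens), 0
    for i in order:
        if edits >= budget:
            break
        best_w, best_p = None, p_true(adv)
        for w in confusion_sets.get(adv[i], []):
            p = p_true(adv[:i] + [w] + adv[i + 1:])
            if p < best_p:
                best_w, best_p = w, p
        if best_w is not None:
            adv[i] = best_w
            edits += 1
            if best_p < 0.5:  # the toy model's prediction has flipped
                break
    return adv

# Toy "model": confidence = fraction of tokens in a small grammatical
# vocabulary (purely illustrative, not the paper's encoder).
GOOD = {"he", "has", "two", "cats", "a"}
def toy_p_true(tokens):
    return sum(t in GOOD for t in tokens) / max(len(tokens), 1)

confusions = {"has": ["have"], "cats": ["cat"], "two": ["too"]}
adv = greedy_attack(["he", "has", "two", "cats"], confusions, toy_p_true, budget=2)
```

With this toy scorer, the attack greedily substitutes the confusable tokens that most reduce the model's confidence, analogous to applying simulated grammatical errors one at a time.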