Improving Grammatical Error Correction Models with Purpose-Built Adversarial Examples

Sequence-to-sequence (seq2seq) learning with neural networks has empirically proven to be an effective framework for grammatical error correction (GEC), which takes a sentence with errors as input and outputs the corrected one. However, the performance of GEC models built on the seq2seq framework relies heavily on the size and quality of the available corpus. We propose a method inspired by adversarial training that generates more meaningful and valuable training examples by continually identifying the weak spots of a model, and that enhances the model by gradually adding the generated adversarial examples to the training set. Extensive experimental results show that such adversarial training can improve both the generalization and robustness of GEC models.


Introduction
The goal of Grammatical Error Correction (GEC) is to identify and correct different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors. GEC has been widely used in speech-based dialogue, web information extraction, and text editing software.
A popular solution treats grammatical error correction as a monolingual machine translation task in which ungrammatical sentences are regarded as the source language and corrected sentences as the target language (Ji et al., 2017; Chollampatt and Ng, 2018a). GEC can therefore be modeled with relatively mature machine translation models, such as the sequence-to-sequence (seq2seq) paradigm (Sutskever et al., 2014).
There are many types of grammatical errors, and all of them can occur in a single sentence, which makes it impossible to construct a corpus that covers all kinds of errors and their combinations. Deep learning so far is data-hungry, and it is hard to train a seq2seq model with good performance without sufficient data. Therefore, recent studies have turned their focus to methods of generating high-quality training samples (Xie et al., 2018; Lichtarge et al., 2019; Zhao et al., 2019). Methods for generating pseudo training data from unlabeled corpora can be roughly divided into direct noise and back translation (Kiyono et al., 2019). The former applies text editing operations such as substitution, deletion, insertion and shuffle to introduce noise into original sentences, and the latter trains a clean-to-noise model for error generation. However, the noise-corrupted sentences are often poorly readable and quite different from those made by humans. The sentences generated by back translation usually cover only a few limited types of errors, and it is difficult for back translation to generate errors that did not occur in the training set. Although these methods can produce many ungrammatical examples, most of them contribute little to improving performance.

Clean: His reaction should give you an idea as to whether this matter is or is not your business .
Direct Noise: His reaction should give you an as idea to whether this matter is or so is not your business .
Back Translation: His reaction should give you the idea as to whether this matter is or is not your business .
Adversarial Example: His reaction should gave you an idea as to whether this matter is or is n't their business .

Table 1: Example ungrammatical sentences generated by different methods. The direct noise method seems to add some meaningless noise to the original sentence. Most of the grammatical errors generated by back translation are those produced by replacing prepositions, and inserting or deleting articles. The example generated by our adversarial attack algorithm looks more meaningful and valuable.
We also found that the resulting models are still quite vulnerable to adversarial examples, even though they are trained with data augmented by these methods. Taking the state-of-the-art system of Zhao et al. (2019) on CoNLL-2014 (Ng et al., 2014) as an example, we generate adversarial samples by intentionally introducing a few grammatical errors into the original clean sentences under a white-box setting. The model's $F_{0.5}$ score drops from 0.592 to 0.434 if just one grammatical error is added to each sentence, and to 0.317 if three errors are added. To our knowledge, this study is the first to show that adversarial examples also exist for grammatical error correction models. Inspired by adversarial attack and defense in NLP (Jia and Liang, 2017; Zhao et al., 2017; Cheng et al., 2018), we explore the feasibility of generating more valuable pseudo data via adversarial attacks that target the weak spots of the models, which can improve both the quality of the pseudo data for training GEC models and the models' robustness against adversarial attacks. We propose a simple but efficient method for adversarial example generation: we first identify the most vulnerable tokens, namely those with the lowest generation probabilities estimated by a pre-trained seq2seq GEC model, and then replace these tokens with grammatical errors that people may make to construct the adversarial examples.
Once the adversarial examples are obtained, they can either be merged with the original clean data to train a GEC model or, thanks to their large number, be used to pre-train a model. The examples generated by our adversarial attack method are more meaningful and valuable than those produced by recent representative methods, such as direct noise and back translation (see Table 1). Through extensive experimentation, we show that such adversarial examples can improve both the generalization and robustness of GEC models. If a model pre-trained with large-scale adversarial examples is further fine-tuned by adversarial training, its robustness can be improved by about 9.5% while suffering only a small loss (less than 2.4%) on the clean data.

Grammatical Error Correction
The rise of machine learning methods in natural language processing (NLP) has led to a rapid increase in data-driven GEC research. The predominant paradigm for data-driven GEC is arguably sequence-to-sequence learning with neural networks (Yuan and Briscoe, 2016; Xie et al., 2016; Schmaltz et al., 2017; Ji et al., 2017), which is also a popular solution for machine translation (MT).
Some task-specific techniques have been proposed to tailor seq2seq models to the GEC task. Ji et al. (2017) proposed a hybrid neural model using word-level and character-level attention to correct both global and local errors. Zhao et al. (2019) explicitly applied the copy mechanism to the GEC model, reflecting the fact that most words in a sentence are grammatically correct and should not be changed. Diverse ensembles (Chollampatt and Ng, 2018a), rescoring (Chollampatt and Ng, 2018b), and iterative decoding (Ge et al., 2018; Lichtarge et al., 2018) have also been tried to tackle the problem of incomplete correction.
Although GEC has advanced impressively, the lack of training data is still the main bottleneck. Very recently, data augmentation techniques have come to the fore (Xie et al., 2018; Zhao et al., 2019). The GEC models that achieve competitive performance (Kiyono et al., 2019; Zhao et al., 2019; Lichtarge et al., 2019; Grundkiewicz et al., 2019) are usually pre-trained on large unlabeled corpora and then fine-tuned on the original training set.

Textual Adversarial Attack
Fooling a model by perturbing its inputs, also called an adversarial attack, has become an essential means of exploring model vulnerabilities. Going a step further, incorporating adversarial samples in the training stage, also known as adversarial training, can effectively improve models' robustness. Depending on the degree of access to the target model, adversarial examples can be constructed in two different settings: white-box and black-box. An adversary can access the model's architecture, parameters, and input feature representations in the white-box setting but not in the black-box one. White-box attacks typically yield a higher success rate because knowledge of the target model can guide the generation of adversarial examples. However, black-box attacks do not require access to target models, making them more practicable for many real-world attacks.
Textual adversarial attacks have been applied to several NLP tasks such as text classification (Ebrahimi et al., 2017; Samanta and Mehta, 2017), machine translation (Zhao et al., 2017; Cheng et al., 2018), reading comprehension (Jia and Liang, 2017), dialogue systems, and dependency parsing (Zheng et al., 2020). Textual adversarial example generation can be roughly divided into two steps: identifying weak spots and token substitution. Many methods have been proposed to select the vulnerable tokens, including random selection (Alzantot et al., 2018), trial-and-error testing at each possible point (Kuleshov et al., 2018), analyzing the effects on the model of masking input text (Samanta and Mehta, 2017; Gao et al., 2018; Jin et al., 2019), comparing attention scores, and gradient-based methods (Ebrahimi et al., 2017; Lei et al., 2018; Wallace et al., 2019). The selected tokens are then replaced with similar ones to change the model prediction. Such substitutes can be chosen from nearest neighbors in embedding spaces (Alzantot et al., 2018; Jin et al., 2019), synonyms in a prepared dictionary (Samanta and Mehta, 2017; Ebrahimi et al., 2017), typos, paraphrases (Lei et al., 2018), or randomly selected tokens (Gao et al., 2018).

Baseline Model
We formally define the GEC task and then briefly introduce the seq2seq baseline. As mentioned above, GEC can be modeled as an MT task by viewing an ungrammatical sentence $x$ as the source sentence and a corrected one $y$ as the target sentence. Let $D = \{(x, y)\}_n$ be a GEC training dataset. The seq2seq model first encodes a source sentence of $N$ tokens into a sequence of context-aware hidden representations $h^s_{1:N}$, and then decodes the target hidden representation $h^t_i$ from $h^s_{1:N}$. Finally, the target hidden representations are used to produce the generation probability $p(y_i \mid y_{1:i-1})$, also called the positional score $g(y_i)$, and to generate the output sequence through the projection matrix $W^H$ and a softmax layer.
The negative log-likelihood of the generation probabilities is used as the objective function, where $\theta$ denotes all the model parameters to be trained.
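For concreteness, the baseline described in this section can be written in the usual seq2seq form. This is a sketch derived from the definitions in the text, not necessarily the exact notation or equation numbering of the baseline; the function names $\mathrm{encoder}$ and $\mathrm{decoder}$ and the explicit conditioning on $x$ are assumptions made for illustration.

```latex
\begin{align*}
h^s_{1:N} &= \mathrm{encoder}(x_{1:N}) \\
h^t_i &= \mathrm{decoder}(y_{1:i-1},\, h^s_{1:N}) \\
p(\cdot \mid y_{1:i-1}, x) &= \mathrm{softmax}(W^H h^t_i) \\
\mathcal{L}(\theta) &= -\sum_{(x, y) \in D} \sum_{i} \log p(y_i \mid y_{1:i-1}, x; \theta)
\end{align*}
```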

Adversarial Example Generation
We found that adversarial examples also exist in the GEC models and up to 100% of input examples admit adversarial perturbations. Adversarial examples yield broader insights into the targeted models by exposing them to such maliciously crafted examples. We try to identify the weak spots of GEC models by a novel adversarial example generation algorithm that replaces the tokens in a sentence with the grammatical mistakes people may make. Our adversarial example generation algorithm also uses the two-step recipe: first determining the important tokens to change and then replacing them with the grammatical mistakes that most likely occur in the contexts.

Identifying Weak Spots
We use the positional scores to find the vulnerable tokens (or positions) whose modification is most likely to cause the model to make mistakes. The lower the positional score of a token, the less confident the model is in its prediction, and the more likely this prediction is to change. Using the positional scores brings another advantage: it helps reduce the bias in the generated pseudo data, where too many grammatical errors are caused by the misuse of function words such as prepositions and articles. We found that the words with relatively low positional scores are lexical or open-class words such as nouns, verbs, adjectives and adverbs. Rare and out-of-vocabulary words are also given low positional scores. By adding pseudo examples generated by making small perturbations to those tokens, we can force a GEC model to better explore cases that it may not have encountered before. If function words are used correctly, the model usually gives them higher positional scores; otherwise, the model lowers its confidence in the prediction by decreasing such scores, which is precisely what we expect.
We now formally describe how to use the positional scores to locate the weak spots of a sentence. Following Bahdanau et al. (2014) and Ghader and Monz (2017), we first use the attention weights $\alpha_{i,j}$ of a seq2seq-based model to obtain the soft word alignment between the target token $y_i$ and the source token $x_j$ (Equations (6) and (7)), where $W^Q$ and $W^K$ denote the projection matrices required to produce the representations of a query $q_i$ and a key $k_j$ from which an attention score can be computed, and $d_k$ is the dimension of $h^s_j$. We can then obtain a word alignment matrix $A$ from the attention weights $\alpha_{i,j}$. When $A_{i,j} = 1$, we know that $y_i$ is aligned to $x_j$. If $y_i$ is identified as a vulnerable token, we perturb $x_j$ to form an attack. The positional scores $g(y_i)$ are obtained from the GEC model trained on the original training set. If a token's positional score $g(y_i)$ is less than a given threshold, we take $x_j$ as a candidate to be modified to fool the target GEC model.
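The selection procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the alignment is taken as the argmax over each row of attention weights, positional scores are assumed to be log-probabilities (consistent with a negative threshold), and the function names `align_by_attention` and `find_weak_spots` are hypothetical. The default values mirror the threshold of −0.2 and the limit of 6 attacked tokens reported in the experimental settings.

```python
from typing import List


def align_by_attention(attn: List[List[float]]) -> List[int]:
    """For each target position i, return the source index j with the
    largest attention weight (assumed hard-alignment rule)."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn]


def find_weak_spots(
    pos_scores: List[float],
    attn: List[List[float]],
    threshold: float = -0.2,
    max_spots: int = 6,
) -> List[int]:
    """Return source positions to attack, most vulnerable first.

    pos_scores[i] is the positional score g(y_i) of target token i and
    attn[i][j] is the attention weight between target i and source j.
    """
    alignment = align_by_attention(attn)
    # Target positions scoring below the threshold, lowest score first.
    weak_targets = sorted(
        (i for i, g in enumerate(pos_scores) if g < threshold),
        key=lambda i: pos_scores[i],
    )
    spots: List[int] = []
    for i in weak_targets:
        j = alignment[i]
        if j not in spots:  # several targets may align to one source token
            spots.append(j)
        if len(spots) == max_spots:
            break
    return spots
```

Capping the number of attacked positions keeps the perturbed sentence close to the original, which matters because the attacked sentence is later paired with its clean version as a training example.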

Word Substitution-based Perturbations
Although adversarial examples have recently been studied for NLP tasks, previous work has focused almost exclusively on semantic tasks, where the attacks aim to alter the semantic prediction of models (e.g., sentiment prediction or question answering) without changing the meaning of the original texts. Once a vulnerable position is determined, the token at that position is usually replaced with one of its synonyms. However, generating adversarial examples through such synonym-based replacement is not applicable to the GEC task, since replacing a word with a synonym does not, in general, introduce a grammatical error. Motivated by this, we propose two methods to replace vulnerable tokens. The first creates a correction-to-error mapping from a GEC training set and obtains a suitable substitute from this mapping. If there are multiple choices, we sample a substitute from them according to the similarity of their contexts. The second generates a substitute based on a set of rules that imitate human errors. We give higher priority to the former when generating adversarial examples for GEC.
Context-Aware Error Generation. From a parallel GEC training set, we can build a correction-to-error mapping by which, given a token, we can obtain its candidate substitutes (with grammatical errors) and their corresponding sentences. Assuming that a token is selected to be replaced and its candidate substitutes are retrieved from the mapping, we want the selected substitute to fit well with the token's context and to maintain both semantic and syntactic coherence. Therefore, we define a function $s$ based on the edit distance (Marzal and Vidal, 1993) to estimate the similarity score between two sentences. This function allows us to estimate how similar a substitute's context sentence $c'_i$ is to the original sentence $c_i$ that is to be intentionally modified. To encourage diversity in the generated examples, we sample a substitute from the candidates according to weights $w_i$ derived from the similarity scores between their sentences and the original one (Equation (11)).
Equation (11) describes a weighted random sampling process (Efraimidis and Spirakis, 2006) in which the weights are calculated by the function $s(c_i, c'_i)$ and truncated by a threshold $\lambda$. Note that polysemous words should be handled carefully. For example, the word "change" can be used as either a noun or a verb, and the two uses produce different errors. Therefore, we remove the candidates that do not have the same part of speech as the original token.
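A minimal sketch of this sampling step follows. Several details are assumptions made for illustration rather than the authors' definitions: the similarity is taken as one minus the token-level Levenshtein distance divided by the longer sentence length (a simpler normalization than that of Marzal and Vidal), the truncation is read as a lower bound $w_i = \max(\lambda, s)$ so that no candidate receives zero weight, and a single draw with `random.choices` stands in for the Efraimidis and Spirakis sampling scheme. The names `similarity` and `sample_substitute` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed similarity s: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def sample_substitute(
    original: Sequence[str],
    candidates: List[Tuple[str, Sequence[str]]],
    lam: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one erroneous substitute; each candidate is a pair
    (substitute token, context sentence it was observed in)."""
    rng = rng or random.Random()
    # Assumed truncation: every candidate keeps at least weight lam.
    weights = [max(lam, similarity(original, ctx)) for _, ctx in candidates]
    substitutes = [sub for sub, _ in candidates]
    return rng.choices(substitutes, weights=weights, k=1)[0]
```

Sampling instead of always taking the most similar context trades a little contextual fit for diversity, which is the stated motivation for using weights at all.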
Rule-based Error Generation. If we cannot find any candidate substitute, we use a set of predefined rules to generate the substitute. Table 2 lists diverse word transformations for error generation according to the tokens' parts of speech. To maintain the sentence's semantics to the greatest extent, we only transform nouns between their singular and plural forms instead of searching for synonyms in dictionaries. For verbs, we randomly choose among their present, progressive, past, perfect, and third-person-singular forms. Adjectives and their corresponding adverbs are switched into each other, and we also allow them to be replaced by their synonyms to make the model select more suitable adjectives or adverbs. For numbers and proper nouns, the safe strategy is to keep the word unchanged. Articles or determiners, prepositions, conjunctions, and pronouns have all been handled by the context-aware error generation strategy beforehand; the remaining rare words or symbols can be deleted directly or labeled as <unk>.
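The rules in this paragraph can be illustrated with a toy dispatcher keyed on part-of-speech tags. Everything concrete here is an assumption for illustration: the tiny lookup tables `VERB_PARADIGMS` and `ADJ_TO_ADV`, the naive trailing-"s" noun rule, and the function name `rule_based_substitute` are hypothetical, and a real system would use a morphological lexicon. Synonym replacement for adjectives and adverbs is omitted.

```python
import random
from typing import Optional

# Hypothetical mini-lexicons; a real system would use a full morphology resource.
VERB_PARADIGMS = [
    ["give", "gives", "gave", "given", "giving"],
    ["be", "is", "was", "been", "being"],
]
FORM_TO_PARADIGM = {form: p for p in VERB_PARADIGMS for form in p}
ADJ_TO_ADV = {"quick": "quickly", "careful": "carefully"}
ADV_TO_ADJ = {adv: adj for adj, adv in ADJ_TO_ADV.items()}


def rule_based_substitute(
    word: str, pos: str, rng: Optional[random.Random] = None
) -> str:
    """Return an erroneous variant of `word` given a coarse POS tag."""
    rng = rng or random.Random()
    if pos in ("CD", "NNP"):
        return word  # numbers and proper nouns are kept unchanged
    if pos == "NN":
        # Naive number toggle; it mishandles words such as "business".
        return word[:-1] if word.endswith("s") and len(word) > 1 else word + "s"
    if pos == "VB" and word in FORM_TO_PARADIGM:
        others = [f for f in FORM_TO_PARADIGM[word] if f != word]
        return rng.choice(others)
    if pos == "JJ" and word in ADJ_TO_ADV:
        return ADJ_TO_ADV[word]
    if pos == "RB" and word in ADV_TO_ADJ:
        return ADV_TO_ADJ[word]
    return "<unk>"  # remaining rare words or symbols
```

For example, `rule_based_substitute("give", "VB")` returns one of the other four forms of the verb, which is the kind of error shown in the adversarial row of Table 1.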
In addition, when a sentence contains more than one vulnerable point, we can either integrate all the errors into the sentence to obtain a single sentence with multiple grammatical errors, or generate the errors separately to obtain more adversarial examples. In our experience, the former is more suitable for later adversarial training. Finally, the GEC model is asked to correct the crafted sentences. If its output differs from the unmodified sentence, the adversarial example is considered to have been generated successfully. Our adversarial example generation algorithm for GEC is shown in Algorithm 1.

Adversarial Training
We also show that GEC models' robustness can be improved by crafting high-quality adversarial examples and including them in the training stage, with little to no performance drop on clean input data. In this section, we conduct adversarial training with sentence pairs generated from large unlabeled corpora and adopt a pre-training and fine-tuning strategy.
Leveraging Large Unlabeled Corpora. The standard BEA-19 training set (Bryant et al., 2019) has only about 640,000 sentence pairs, which is far from sufficient for the GEC task. Thus, clean sentences in large unlabeled corpora, such as Wikipedia, Gigaword (Napoles et al., 2012), and the One Billion Word Benchmark (Chelba et al., 2013), are usually used as seeds to generate ungrammatical sentences for data augmentation. Some studies found that the more unlabeled data is used, the more the GEC model improves (Kiyono et al., 2019). Unlabeled corpora also help to correct errors not seen in the training set: if a sentence in the test set contains such unseen errors, a GEC model trained without external corpora will find them hard to correct. Therefore, we also leverage unlabeled corpora to obtain large-scale adversarial examples for later training.

Algorithm 1: Adversarial examples generation
Input: x: a grammatical sentence with n words; f: a target GEC model.
Training strategy. Adversarial training, which adds adversarial examples to the training set, can effectively improve models' robustness. However, some studies show that models tend to overfit the noise, and accuracy on clean data drops if adversarial examples dominate the training set. Zhao et al. (2019) and Kiyono et al. (2019) adopt a pre-training and fine-tuning strategy to alleviate this noise-overfitting problem. Similarly, our model is pre-trained with adversarial examples and then fine-tuned on the original training set, instead of adding the large-scale data to the training set directly. The training strategy can be formally divided into four steps:
(1) Train a base model $f$ on the training set $D_t$.
(2) Generate the adversarial example set $D_e$ from unlabeled sentences by attacking $f$.
(3) Pre-train the model $f$ on $D_e$.
(4) Fine-tune it on the training set $D_t$.
We can also use adversarial training: two further steps, (5) and (6), can be run alternately to further improve the model's robustness.
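The four steps, together with optional adversarial rounds, can be sketched as a small driver function. This is an illustrative skeleton only: `train_fn` and `attack_fn` are hypothetical callables standing in for a real trainer and for the attack algorithm, and the adversarial rounds follow the description in the hyper-parameter settings (regenerate one adversarial example per training sentence against the current model, then keep training), under the assumption that the attack is applied to the corrected side of each training pair.

```python
from typing import Any, Callable, List, Tuple

Pair = Tuple[str, str]  # (ungrammatical source, corrected target)


def pretrain_finetune(
    train_fn: Callable[[Any, List[Pair]], Any],
    attack_fn: Callable[[Any, str], str],
    model: Any,
    train_set: List[Pair],
    unlabeled: List[str],
    adv_rounds: int = 0,
) -> Any:
    """Run steps (1)-(4), then `adv_rounds` rounds of adversarial training."""
    model = train_fn(model, train_set)                      # (1) base model on D_t
    d_e = [(attack_fn(model, s), s) for s in unlabeled]     # (2) attack clean text
    model = train_fn(model, d_e)                            # (3) pre-train on D_e
    model = train_fn(model, train_set)                      # (4) fine-tune on D_t
    for _ in range(adv_rounds):
        # Assumed reading: attack the clean target of every training pair
        # against the *current* model, then continue training on the result.
        d_adv = [(attack_fn(model, tgt), tgt) for _, tgt in train_set]
        model = train_fn(model, d_adv)
    return model
```

Regenerating the adversarial set in every round is what distinguishes adversarial training from one-off data augmentation: each round targets the weak spots of the model as it currently stands.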

Datasets and Evaluations
Like previous studies of GEC models, we use the official dataset of the BEA-2019 workshop (Bryant et al., 2019) as our training and validation data. We remove the sentence pairs with identical source and target sentences from the training set and sample about 1.2M sentences without numerals and proper nouns from the Gigaword dataset as our unlabeled data for pre-training. Table 3 shows the statistics of the datasets. We report MaxMatch ($M^2$) scores (Dahlmeier and Ng, 2012) on CoNLL-2014 (Ng et al., 2014) and use the GLEU metric for JFLEG (Napoles et al., 2017).
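Since the results below are reported as $F_{0.5}$, a short reminder of how an F-beta score is computed from edit counts may help. This is the generic formula, not the $M^2$ scorer itself, which additionally handles the matching of system edits against gold edits; the function name `f_beta` and the convention of returning 0.0 for empty denominators are assumptions of this sketch.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true positives, false positives and false negatives.

    With beta = 0.5, precision is weighted more heavily than recall.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0
```

With 6 correct edits, 2 spurious edits and 6 missed edits, precision is 0.75 and recall is 0.5, so `f_beta(6, 2, 6)` is 15/22, or about 0.682, closer to the precision than the balanced F1 of 0.6 would be.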

Models and Hyper-parameter Settings
We adopt the Transformer (Vaswani et al., 2017), as implemented in the fairseq toolkit (Ott et al., 2019), as our baseline seq2seq model, and apply byte-pair encoding (BPE) (Sennrich et al., 2015) to source and target sentences, with the number of merge operations set to 8,000.
The base model is trained for 20 epochs on the training set. For adversarial example generation, we set the threshold to −0.2 and use the edit distance to calculate the context similarity, with the minimum weight $\lambda = 0.1$. To avoid changing sentences beyond recognition, we attack at most 6 tokens with the lowest positional scores. After generation, we use these data to pre-train the base model for 10 epochs and then fine-tune it on the training set for 15 epochs. Our model is also further fine-tuned by adversarial training for several epochs. Here, "one epoch" means that we generate an adversarial example against the current model for each sentence in the training set, and the model is then trained on those generated examples for three normal epochs.

We also implement other data augmentation methods as comparison models on the same unlabeled data and with the same training settings. For direct noise, the operation choice is made according to the categorical distribution $(\mu_{add}, \mu_{del}, \mu_{replace}, \mu_{keep}) = (0.1, 0.1, 0.1, 0.7)$; the tokens are then shuffled by adding a bias drawn from the normal distribution $\mathcal{N}(0, 0.5^2)$ to their positions and re-sorting them. For back translation, we train a reverse model on the original sentence pairs and generate the erroneous sentences with diverse beam strength $\beta = 6$, following Kiyono et al. (2019).

To measure a model's robustness, we attack it with our adversarial example generation method, adding one or three errors to each test sentence (to ensure that errors are always generated during the attack, we set the threshold to 0). We measure robustness by the drop in $F_{0.5}$ scores and by the correction rates of the newly added errors. Due to the randomness of the attacks, we average the results over 10 attacks.

Table 5: Results of the adversarial attacks on the CoNLL-2014 dataset. ATK-1 and ATK-3 denote the attacks that add one and three errors, respectively, to each test example. Drop denotes the performance drop in $F_{0.5}$ on the adversarial examples, and Corr. Rate% denotes the correction rate of newly added errors. The model indicated by ADV is first pre-trained with our adversarial examples and then fine-tuned on the training set. The model ADV† is initialized with ADV and then adversarially trained for one epoch; the model ADV†† is also initialized with ADV but adversarially trained for three epochs. Our models show less performance degradation than the others under both attacks while achieving higher correction rates for newly added errors.
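The direct noise baseline described in this section is simple enough to sketch directly. The following is an illustrative reading, not the comparison system itself: the function name `direct_noise` and the `vocab` argument are hypothetical, "add" is interpreted as keeping the token and inserting a random vocabulary token after it, and "replace" draws the replacement uniformly from `vocab`.

```python
import random
from typing import List, Optional, Sequence


def direct_noise(
    tokens: Sequence[str],
    vocab: Sequence[str],
    probs: Sequence[float] = (0.1, 0.1, 0.1, 0.7),  # (add, del, replace, keep)
    sigma: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Corrupt a clean sentence with token-level edits and a local shuffle."""
    rng = rng or random.Random()
    noised: List[str] = []
    for tok in tokens:
        op = rng.choices(("add", "del", "replace", "keep"), weights=probs, k=1)[0]
        if op == "add":
            noised.extend([tok, rng.choice(vocab)])  # keep token, insert noise
        elif op == "del":
            continue
        elif op == "replace":
            noised.append(rng.choice(vocab))
        else:
            noised.append(tok)
    # Shuffle: perturb each position with Gaussian noise, then re-sort.
    keys = [i + rng.gauss(0.0, sigma) for i in range(len(noised))]
    return [tok for _, tok in sorted(zip(keys, noised), key=lambda p: p[0])]
```

With a standard deviation of 0.5, most tokens stay in place and swaps are mostly between neighbors, which matches the kind of local reordering seen in the direct noise row of Table 1 ("an as idea to").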

Results of GEC Task
We compare our model, pre-trained with adversarial examples, to models trained with data generated by other data augmentation methods. Table 4 shows the results. We achieve a 7.3% improvement in $F_{0.5}$ over the base model and also outperform direct noise (+3.1%) and back translation (+2.8%) by a large margin.

We also conducted ablation experiments to evaluate the different components of our adversarial error generation model. If a random strategy is used to select the weak spots, the $F_{0.5}$ score drops to 55.0. If only the context-aware generator is used for word substitution, the resulting $F_{0.5}$ score is 55.8, and the model achieves 56.2 in $F_{0.5}$ with the rule-based generator only. The experimental results show that our strategy for identifying vulnerable tokens is more effective than random selection, and that both error generators contribute to improving the models' performance and robustness.

Table 6: The correction rates of the newly added errors for different parts of speech. "NN" denotes noun, "VB" verb, "JJ" adjective, "RB" adverb, "IN" preposition, and "PRP" personal and possessive pronoun. ATK-1 and ATK-3 denote the attacks that add one and three errors, respectively, to each test example. ADV, ADV† and ADV†† denote the same models as in Table 5.

Analysis of Adversarial Attack
We conduct the attack experiments on the base, pre-trained and adversarially trained models, including the model provided by Zhao et al. (2019). Table 5 shows the results of the adversarial attack experiments. Under attack, our models show the smallest drop in $F_{0.5}$ scores and achieve the highest correction rate for newly added errors, which indicates that our adversarial examples can be used to substantially improve GEC models' robustness. The results in Table 5 support several other conclusions: (i) Current seq2seq-based GEC models, including some state-of-the-art models, are vulnerable to adversarial examples. (ii) The models using the direct noise method for data augmentation, Zhao et al. (2019) for example, are even less robust to adversarial attacks than their vanilla versions. This is likely because the editing operations of direct noise inject a lot of task-irrelevant noise, which might be detrimental to the robustness of the model. (iii) Combined with adversarial training, the robustness of the model can be improved continually at the cost of an acceptable loss in performance. We also analyzed the trade-off between generalization and robustness at the adversarial training stage. The results are visualized in Figure 1.

We would also like to know which types of words are most likely to form a successful attack when modified. Therefore, we calculate the correction rates of the newly added errors for different parts of speech. Table 6 shows that: (i) The robustness of the model after adversarial training is significantly improved against attacks that replace lexical words such as nouns and verbs. This shows that the examples generated by our adversarial attack algorithm cover a variety of grammatical errors involving various POS types. (ii) Errors involving adjectives and adverbs are less likely to be corrected without adversarial examples. Whether an adjective or adverb is properly used relies heavily on the context, making it difficult for GEC systems to correct them all. (iii) Not surprisingly, most preposition errors are inherently hard to correct, even for native speakers.

Conclusion
In this paper, we proposed a data augmentation method for training a GEC model by continually adding to the training set adversarial examples generated specifically to compensate for the weaknesses of the current model. To generate such adversarial examples, we first determine an important position to change and then modify it by introducing specific grammatical issues that maximize the GEC model's prediction error. The samples generated by our adversarial attack algorithm are more meaningful and valuable than those produced by recently proposed methods, such as direct noise and back translation. Experimental results demonstrate that GEC models trained with data augmented by these adversarial examples achieve substantially better generalization and robustness.