Adversarial Grammatical Error Correction

Recent work in Grammatical Error Correction (GEC) has leveraged progress in Neural Machine Translation (NMT) to learn rewrites from parallel corpora of grammatically incorrect and corrected sentences, achieving state-of-the-art results. At the same time, Generative Adversarial Networks (GANs) have been successful in generating realistic text across many different tasks by learning to directly minimize the difference between human-generated and synthetic text. In this work, we present an adversarial learning approach to GEC, using the generator-discriminator framework. The generator is a Transformer model trained to produce grammatically correct sentences given grammatically incorrect ones. The discriminator is a sentence-pair classification model trained to judge a given pair of grammatically incorrect-correct sentences on the quality of the grammatical correction. We pre-train both the discriminator and the generator on parallel texts, and then fine-tune them further using a policy gradient method that assigns high rewards to sentences that could be true corrections of the grammatically incorrect text. Experimental results on the FCE, CoNLL-14, and BEA-19 datasets show that Adversarial-GEC can achieve competitive GEC quality compared to NMT-based baselines.


Introduction
Grammatical Error Correction (GEC) has grown into a popular NLP task that deals with building systems for automatically correcting errors in written text (Ng et al., 2013). Evolving from approaches that build error-specific machine learning classifiers (Tetreault and Chodorow, 2008; De Felice and Pulman, 2008; Tetreault et al., 2010; Dahlmeier and Ng, 2011; Rozovskaya and Roth, 2014), it has gained popularity as a monolingual Machine Translation (MT) problem, where the system learns to "translate" a given erroneous text into its corrected form (Brockett et al., 2006; Felice et al., 2014). Initially, Statistical phrase-based Machine Translation (SMT) techniques were successfully applied to the task (Yuan and Felice, 2013; Junczys-Dowmunt and Grundkiewicz, 2016) as a way to handle all error types concurrently. More recently, several Neural Machine Translation (NMT) systems have been developed with promising results (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014), and their successful application to GEC, either in combination with SMT models (Chollampatt et al., 2016; Yannakoudakis et al., 2017) or strictly as neural models, has emerged as the new state-of-the-art (Xie et al., 2016; Schmaltz et al., 2017; Sakaguchi et al., 2017; Ji et al., 2017; Ge et al., 2018; Chollampatt and Ng, 2018a,b; Zhao et al., 2019).
Despite the successes of NMT-based models for GEC, a major challenge still lies in the definition of the evaluation metrics. Ideally, the metric should be able to quantify the (a) lexical overlap, (b) semantic similarity, and (c) grammaticality of a generated sentence, given a grammatically incorrect input sentence. In a straightforward application of NMT-based models to the GEC task, one would minimize a surrogate loss (e.g., cross-entropy), which is an upper bound on the true loss, and hence a loose approximation of these complex criteria. Moreover, NMT-based GEC models try to maximize n-gram or edit-based metrics, such as M² (Dahlmeier and Ng, 2012), I-Measure (Felice and Briscoe, 2015), or GLEU (Napoles et al., 2015), pushing the NMT-based models to generate sentences with n-gram precisions as high as possible, which may not necessarily lead to high-quality generation for the GEC task. In order to avoid these issues, we take a different approach, inspired by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), which provide a framework that can be leveraged to directly model the task based on the differences in the input-output distributions and the complex criteria mentioned above. Moreover, GANs have shown remarkable ability to generate coherent and semantically meaningful text in many natural language processing tasks, such as machine translation (Wu et al., 2018), dialogue generation (Li et al., 2017), and abstractive summarization (Wang and Lee, 2018), among others.
We propose a GAN-based generator-discriminator framework for grammatical error correction. The generator is a Sequence-to-Sequence (Seq2Seq) model, which is trained to "translate" a grammatically incorrect sentence to its grammatically correct rewrite. The discriminator, a deep neural sentence-pair classification model, is trained to evaluate the probability of the generated sentence being a lexically-similar, meaning-preserving, and grammatically correct rewrite of the incorrect input sentence. Adversarial training between the two models is set up as optimizing a min-max objective, where the discriminator learns to distinguish whether a given input is sampled from the ground-truth (human-generated) or generator (artificially-generated) distribution, maximizing the difference between them. The generator, on the other hand, learns to trick the discriminator by producing high-quality correction candidates, thus minimizing the difference between its output and a ground-truth corrected sentence. Further, the discriminator is used to fine-tune the generator using a policy gradient (Williams, 1992; Yu et al., 2017; Wu et al., 2018), rewarding high-quality generated text when conditioned on the source, thus improving the generation results. By minimizing the difference between the human- and the artificially-generated distributions, we aim at directly optimizing the task based on the criteria mentioned above.
We evaluate the effectiveness of our approach on three standard datasets for the task, observing that the discriminator can provide reasonably consistent guidance to the generator and further help improve its performance. Experimental results indicate that our model can achieve significantly better performance than strong NMT-based baselines. In summary, we make the following contributions:
• This work is, to the best of our knowledge, the first to apply generative adversarial training to the GEC task.

Figure 1: Adversarial-GEC training. Left: D is trained over the real data and the data generated by a pre-trained G. Right: G is further trained by policy gradient, where the final reward is provided by D and is passed back to the generator.
• We propose a sentence-pair classification-based discriminator that can better distinguish grammatical text from ungrammatical text by learning to directly optimize the task rather than constructing or relying on n-gram or edit-based metrics. We analyze different formulations of the discriminator, and provide insights into how its setup, pre-training, and integration into the framework can be leveraged for stable training and better performance.
• We conduct extensive experiments on standard GEC datasets and evaluate the system against strong baselines, showing that the proposed model consistently achieves better results in a self-contained, single-model setting, without relying on any resources other than the training data.

Generator
Following recent NMT-based state-of-the-art GEC systems, we treat a grammatically incorrect sentence as the source and its grammatically corrected counterpart as the target. Formally, given a sequence x = [x_1, x_2, ..., x_S], we aim to generate another sequence y = [y_1, y_2, ..., y_T], which is the grammatically corrected form of x. We denote a pair of incorrect-correct sentences as (x, y). Given a sequence x, the generator learns to produce another sequence y′ ≈ y.
While the generator can be any Seq2Seq model, we use two common Encoder-Decoder architectures for GEC: an attention-based RNN (Luong et al., 2015) and a Transformer (Vaswani et al., 2017).
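As a concrete illustration, the Transformer generator can be sketched with PyTorch's built-in nn.Transformer. The vocabulary size, model dimensions, and layer counts below are illustrative assumptions, not the paper's configuration:

```python
# A minimal sketch of the Transformer generator: encode the incorrect
# sentence, decode the corrected one, and emit token logits.
import torch
import torch.nn as nn

class GECGenerator(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=4 * d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)  # per-position token logits

    def forward(self, src, tgt):
        # Causal mask so each target position only attends to its prefix.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=tgt_mask)
        return self.out(h)  # shape: (batch, tgt_len, vocab_size)

src = torch.randint(0, 1000, (2, 7))  # grammatically incorrect token ids
tgt = torch.randint(0, 1000, (2, 5))  # shifted corrected token ids
logits = GECGenerator()(src, tgt)
```

In training, the logits are fed to a cross-entropy loss against the reference correction (the MLE objective); in the adversarial stage, corrections are sampled from them instead.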

Discriminator
In this framework, a critical component is the discriminator, which is responsible for providing an appropriate reward to the generator based on the quality of the generated text. Most GAN architectures typically use a single-sentence real-vs-fake classifier as the discriminator (Yu et al., 2017). However, we argue that such a formulation does not accurately express the GEC task objective. A conventional GAN discriminator would provide the probability of a sentence being grammatically correct as the reward. However, it would be especially hard for such a classifier to differentiate between a ground-truth correction and a generated sentence that fits the distribution of real-world text (and is thus far from typical generated data), but does not make the intended corrections or changes the semantics of the source. Moreover, it would also be unable to provide a proportionate reward to a partially corrected sentence. Due to the lack of contextual knowledge about what has been corrected, such a classifier would struggle to differentiate between low-quality or unsuitably corrected sequences. Consequently, it will end up giving them rewards comparable to sentences which are truly the corrected forms of the given incorrect source sentences.
In the GEC task, we ultimately want the generator to generate corrected sentences that fit the constraints mentioned in Section 1. Hence, we formulate the objective of the discriminator as two-fold: first, to evaluate the quality of the generated text in terms of its validity compared to the ground-truth distribution, and second, to measure its quality as the appropriate rewrite for a given input sentence. In summary, the discriminator needs to be able to measure the degree of "grammatical correctness" of an output sentence given its corresponding input sentence, instead of only distinguishing between real and fake. Therefore, instead of training a single-sentence classifier, we train on incorrect-correct sentence pairs. We consider ground-truth data (x, y) as high-quality corrections (positive examples), and data sampled from the generator (x, y′) as low-quality corrections (negative examples). We experiment with two discriminator models for both the single-sentence and sentence-pair formulations, CNN- and RNN-based, due to their simplicity, widespread use in sentence-pair modeling tasks, and ease of implementation.
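The pairing scheme above can be sketched in a few lines. Here `sample_correction` is a hypothetical stand-in for decoding from the pre-trained generator:

```python
# Assemble training examples for the sentence-pair discriminator:
# ground-truth (source, reference) pairs are positives (label 1),
# (source, generated) pairs are negatives (label 0).
def build_discriminator_data(parallel_data, sample_correction):
    examples = []
    for src, ref in parallel_data:
        examples.append((src, ref, 1))                     # high-quality pair
        examples.append((src, sample_correction(src), 0))  # low-quality pair
    return examples

data = [("She go home .", "She goes home .")]
# Identity function stands in for a (bad) pre-trained generator.
examples = build_discriminator_data(data, lambda src: src)
```

The discriminator is then trained with binary cross-entropy on these labeled pairs, so its output can be read as the probability that a candidate is a true correction of its source.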

Adversarial Training
Adversarial training between G and D (parameterized by θ and φ, respectively) is set up as optimizing a min-max objective, formulated as the following objective function V(G_θ, D_φ):

min_θ max_φ V(G_θ, D_φ) = E_{(x,y)∼P_data}[log D_φ(x, y)] + E_{x∼P_data, y′∼G_θ(·|x)}[log(1 − D_φ(x, y′))]  (1)

where P_data is the underlying training data distribution and P_{G_θ}(·|x) is the distribution of the generator output.
With this objective function, the discriminator learns to predict whether a given sentence pair has been sampled from the ground-truth data (x, y) or from G_θ: (x, y′). G_θ tries to confuse D_φ by generating high-quality corrected samples y′ ≈ y, given a ground-truth input sentence x. Formally, the objective function of D_φ is defined as the standard binary cross entropy (BCE) loss:

L_d = −E_{(x,y)∼P_data}[log D_φ(x, y)] − E_{x∼P_data, y′∼G_θ(·|x)}[log(1 − D_φ(x, y′))]  (2)

The objective of the generator can be formulated as optimizing the following loss:

L_g = E_{x∼P_data, y′∼G_θ(·|x)}[log(1 − D_φ(x, y′))]  (3)

However, since the generator performs discrete sampling to obtain y′, we cannot directly use a gradient-based approach to backpropagate the gradients, making V(G_θ, D_φ) non-differentiable with respect to θ. To address this issue, borrowing from prior work (Wu et al., 2018), we use single-sample REINFORCE (Williams, 1992), a Monte-Carlo policy gradient method, to optimize G_θ. In Reinforcement Learning (RL) terms, the generator G_θ acts as the agent under the policy G_θ(·|x), and the generated grammatically corrected sentence y′ is the action. The environment is characterized by the input sequence x and the discriminator D_φ, which provides the reward −log(1 − D_φ(x, y′)) based on the discriminative loss of D_φ(x, y′). The generator improves itself by maximizing the reward returned from the environment. The gradients ∇_φ L_d and ∇_θ L_g can thus be estimated by sampling a correction y′ ∼ G_θ(·|x) as follows:

∇_φ L_d ≈ −∇_φ log D_φ(x, y) − ∇_φ log(1 − D_φ(x, y′))
∇_θ L_g ≈ log(1 − D_φ(x, y′)) · ∇_θ log G_θ(y′|x)

where φ and θ can be updated as per the REINFORCE algorithm.
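The single-sample REINFORCE update for the generator can be sketched as follows. The discriminator output is stubbed out as a constant probability here; in the full system it would be the sentence-pair classifier's score D_φ(x, y′):

```python
# Policy gradient step for the generator: sample a correction, score it
# with the discriminator, and push up the log-probability of the sample
# in proportion to the (baseline-adjusted) reward -log(1 - D).
import torch

def policy_gradient_loss(token_log_probs, d_prob, baseline=0.0):
    """token_log_probs: log G_theta(y'_t | y'_<t, x) for each sampled token.
    d_prob: discriminator probability that the sample is a true correction."""
    reward = -torch.log(1.0 - d_prob)
    # REINFORCE surrogate loss: minimizing it ascends the expected reward.
    return -(reward.detach() - baseline) * token_log_probs.sum()

logits = torch.randn(5, 100, requires_grad=True)  # 5 decoding steps, 100 tokens
dist = torch.distributions.Categorical(logits=logits)
sample = dist.sample()                  # the "action": one sampled correction
log_probs = dist.log_prob(sample)       # per-token log-probabilities
loss = policy_gradient_loss(log_probs, d_prob=torch.tensor(0.6))
loss.backward()                         # gradients flow into the generator
```

Note that `reward.detach()` keeps the discriminator's score out of the generator's computation graph: the reward is treated as a scalar coming from the environment, exactly as in the RL framing above.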

Training Strategies
While REINFORCE provides a framework where the reward function does not have to be differentiable, the discrete reward space due to the use of a single sampled y′ to perform the Monte-Carlo estimation leads to the problem of high variance, resulting in unstable training, a widely acknowledged limitation of RL methods. In practice, we find that adversarially training the generator solely with Eq. 3 is unstable, even when it is pre-trained. This is due to the sparsity of the rewards provided to the generator, which happens only once it has fully generated a sentence. This is further compounded by the fact that we do not generate multiple samples, for computational efficiency. Hence, the generator training becomes brittle and finds it extremely difficult to get out of bad local minima or mode collapse. To alleviate this issue, we leverage the following measures to train the generator: a baseline reward, and teacher forcing/interleaved training.
Baseline Reward A popular technique to alleviate the variance issue is the subtraction of a baseline value from the original reward. The baseline reward can be computed using various approaches; for example, Liu et al. (2017) use an MLP to estimate it. However, such methods rely on approximating the terminal reward using intermediate states, or on incorporating word-level rewards via rollout strategies for better credit assignment. Moreover, such approaches have been found to be extremely time-consuming, given the large decoding space. Based on prior works on RL for modeling dialog systems, which also have discrete action-reward spaces (Sankar and Ravi, 2019; Su et al., 2015), we use a moving average of the historical reward values as the baseline, which stabilizes the training process and is computationally tractable.
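A moving-average baseline of this kind is a few lines of code. The decay constant below is an illustrative choice, not a value from the paper:

```python
# Moving-average baseline: the quantity fed to REINFORCE is the raw
# discriminator reward minus a running average of past rewards, which
# reduces the variance of the single-sample gradient estimate.
class MovingAverageBaseline:
    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None  # no baseline until the first reward is seen

    def adjust(self, reward):
        if self.value is None:
            self.value = reward
        advantage = reward - self.value
        # Exponential moving average of the historical rewards.
        self.value = self.decay * self.value + (1 - self.decay) * reward
        return advantage

baseline = MovingAverageBaseline()
advantages = [baseline.adjust(r) for r in [1.0, 1.0, 2.0]]
```

A reward equal to the recent average yields an advantage near zero, so only samples the discriminator scores unusually well (or badly) move the generator, without the cost of rollouts or a learned critic.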
Interleaved Training Following Guo et al. (2018) and Wu et al. (2018), we interleave MLE and Policy Gradient training. This combination of an adversarial objective with MLE is an important factor in successfully training G. With some probability λ (more details in Section 5.3), randomly chosen mini-batches are trained with the Policy Gradient objective, while the remaining mini-batches are trained with MLE. We report F0.5 scores evaluated by the M² scorer (Dahlmeier and Ng, 2012) for the test datasets (Bryant et al., 2017).

Baselines
We use the two generators introduced in Section 2.1 as baseline generators. Building on these baselines, we develop GAN frameworks in combination with the following discriminator setups: a) SS: CNN- and RNN-based Single-Sentence classifiers, and b) SP: CNN- and RNN-based Sentence-Pair classifiers (Section 2.2). We also experiment with using the GLEU score directly as the reward for an input-output sentence pair. This setting overlaps with the work of Sakaguchi et al. (2017).

Data
Following Junczys-Dowmunt et al. (2018), we use byte-pair encoding (BPE) sub-word units (Sennrich et al., 2016), which also addresses the issue of out-of-vocabulary words. The vocabulary consists of the 35k most frequent BPE sub-word units, shared between the source and target sides.

Generators
We refer to prior work for our training setup, which laid out extensive guidelines for adapting NMT-based models to the GEC task. The full procedure is summarized in Algorithm 1.

Algorithm 1 Adversarial-GEC
1: Initialize G_θ, D_φ with random weights θ, φ.
2: Pre-train G_θ on the ground-truth dataset D = (X, Y) with the MLE loss.
3: Generate negative samples D′ = (X, Y′) using G_θ for training D_φ.
4: Pre-train D_φ on D and D′ until initial accuracy ε with the BCE loss.
5: while not converged do
6:   Sample (X, Y) ∼ P_data
7:   Sample Y′ ∼ G_θ(·|X)
8:   Sample ρ ∼ [0, 1] to determine interleaving
9:   if ρ ≤ λ then
10:    Compute rewards R for (X, Y′) using D_φ
11:    Update G_θ via Policy Gradient using R
12:  else
13:    Update G_θ via teacher forcing using MLE
14:  Train D_φ using Eq. 2, on (X, Y) and (X, Y′)

Parameter updates for G_θ and D_φ follow Eqs. 2 and 3. These settings worked well in practice when tuned on the development sets.
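The co-training loop of Algorithm 1 can be rendered as the following skeleton, in which each step function is a stub standing in for the corresponding model update:

```python
# Skeleton of the adversarial co-training loop: per mini-batch, a coin
# flip with probability lam selects a policy-gradient update for the
# generator, otherwise a teacher-forced MLE update; the discriminator is
# updated every step on real vs. generated pairs.
import random

def adversarial_gec(steps, lam, pg_step, mle_step, d_step, seed=0):
    rng = random.Random(seed)
    counts = {"pg": 0, "mle": 0}
    for _ in range(steps):
        rho = rng.random()          # interleaving coin flip (line 8)
        if rho <= lam:
            pg_step()               # reward from D, REINFORCE update (10-11)
            counts["pg"] += 1
        else:
            mle_step()              # teacher-forced cross-entropy update (13)
            counts["mle"] += 1
        d_step()                    # BCE update for the discriminator (14)
    return counts

counts = adversarial_gec(100, 0.4, lambda: None, lambda: None, lambda: None)
```

With λ = 0.4, roughly 40% of mini-batches receive the adversarial update, matching the interleaving scheme analyzed in Section 5.3.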

Sentence-Pair Discriminators
The RNN-based discriminator model is set up as a siamese network, sharing the same embeddings and weights, each processing one of the two sentences. The RNN-based model, for each sentence in the pair, consists of a word embedding layer of size 300, followed by two layers of bi-directional GRU, with hidden size of 128. There are residual connections at each time step between the layers. The bi-directional outputs of the last recurrent layer of both the sentences in the pair are concatenated, and used as input to a dense feed-forward layer with an output of size 128, followed by a sigmoid. We use dropout on the recurrent units and between layers (both with probability 0.2). For the CNN-based discriminator, we use the convolutional matching model used by Wu et al. (2018) since Hu et al. (2014) found it to have a superior performance to the siamese architecture.
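The siamese RNN discriminator described above can be sketched as follows. The dimensions follow the text (300-d embeddings, two bi-directional GRU layers of hidden size 128, a 128-d dense layer); the final scalar projection is our assumption, since the text only specifies a sigmoid output:

```python
# Siamese sentence-pair discriminator: the same embeddings and GRU stack
# encode both sentences; their final states are concatenated and scored.
import torch
import torch.nn as nn

class SiameseGRUDiscriminator(nn.Module):
    def __init__(self, vocab_size=1000, emb=300, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)  # shared embeddings
        self.gru1 = nn.GRU(emb, hidden, bidirectional=True, batch_first=True)
        self.gru2 = nn.GRU(2 * hidden, hidden, bidirectional=True,
                           batch_first=True)
        self.dense = nn.Linear(4 * hidden, hidden)
        self.score = nn.Linear(hidden, 1)           # assumed scalar head
        self.drop = nn.Dropout(0.2)

    def encode(self, ids):
        h1, _ = self.gru1(self.embed(ids))
        h2, _ = self.gru2(self.drop(h1))
        h2 = h2 + h1                  # residual connection between layers
        return h2[:, -1, :]           # bi-directional output at the last step

    def forward(self, src_ids, cand_ids):
        # Same weights process both sentences; outputs are concatenated.
        pair = torch.cat([self.encode(src_ids), self.encode(cand_ids)], dim=-1)
        return torch.sigmoid(self.score(torch.relu(self.dense(pair))))

d = SiameseGRUDiscriminator()
prob = d(torch.randint(0, 1000, (2, 6)), torch.randint(0, 1000, (2, 8)))
```

The sigmoid output is interpreted as D_φ(x, y′), the probability that the candidate is a true correction of the source, which is what feeds the reward −log(1 − D_φ(x, y′)).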

Training
A major challenge with GANs is that the joint training between the generator and the discriminator needs to be carefully coordinated in order to stabilize the training (Yu et al., 2017; Li et al., 2017; Wu et al., 2018; Fedus et al., 2018; Wang and Lee, 2018). Therefore, we first pre-train the generator model G_θ using maximum likelihood estimation (MLE) on the ground-truth training dataset until convergence. This stage is crucial for stable adversarial training. The pre-trained model is then used to decode the training data x using beam search (size 4) and generate the output sentences y′, essentially building the negative examples in the training data for the discriminator (x, y′). The discriminator is initially pre-trained on a combination of the ground-truth parallel data (x, y) and the machine-generated data (x, y′), where y′ is sampled from the pre-trained generator model. The discriminator is trained until its classification accuracy reaches ε (further analysis in Section 5.2). Once the generator and the discriminator have been pre-trained, they are adversarially co-trained, where the generator is trained with a combination of MLE and Policy Gradient (and teacher forcing), until the performance of G_θ no longer improves on the development set (more details in Appendix A).
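The discriminator pre-training schedule, stopping once held-out accuracy reaches ε, can be sketched as a simple loop. Here `train_epoch` is a stub that trains for one epoch and returns the held-out accuracy:

```python
# Pre-train the discriminator only until its held-out classification
# accuracy reaches epsilon: stopping early keeps D from becoming too
# strong before adversarial co-training begins (see Section 5.2).
def pretrain_discriminator(train_epoch, epsilon, max_epochs=50):
    accuracy = 0.0
    for _ in range(max_epochs):
        accuracy = train_epoch()
        if accuracy >= epsilon:   # strong enough: stop before it saturates
            break
    return accuracy

accs = iter([0.55, 0.62, 0.71, 0.83])       # stubbed per-epoch accuracies
final = pretrain_discriminator(lambda: next(accs), epsilon=0.7)
```

With ε = 0.7, training stops at the first epoch whose accuracy clears the threshold, rather than running the discriminator to convergence.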

Results
In contrast to related works on Neural GEC, we do not use many of the heuristics that most recent systems leverage to enhance model performance before and after training. These heuristics include spellcheckers to correct spelling errors in the data, language models pre-trained on large quantities of external data, synthetic data generation, and re-ranking systems to sort the outputs of the generator model, among others. We chose to keep our framework simple compared to most contemporary works in that we do not leverage anything beyond what the raw training data and the baseline architectures have to offer, which makes it simple and self-contained. This decision was made in the interest of system complexity, training time, and clear evaluations. The goal of this work is not to build a state-of-the-art GEC system but to demonstrate the value of adversarial training. Hence, we report results in a single-model setting, without the use of any external data or resources beyond the training data.
The results of Adversarial-GEC compared to the baseline models are presented in Table 2. These results are based on the best-performing (on the development set) parameters ε = 0.7 and λ = 0.4, using the CNN sentence-pair discriminator. The results demonstrate a substantial improvement in F0.5 for both adversarially trained models, across all evaluation datasets. Overall, the RNN model achieves greater gains in precision than the Transformer, which achieves greater gains in recall. We carry out statistical significance tests with bootstrap resampling and, correcting for multiple comparisons, obtain significant gains over the baselines (p < 0.01).
As mentioned in Sections 2.2 and 3.2, we experiment with three discriminator formulations (SS, SP, GLEU) in the Adversarial-GEC setting to provide the rewards that guide the generators. Table 3 reports the results of using the two kinds of discriminators (CNN, RNN) in each formulation of the discriminative task; the choice of architecture does not show a significant difference in either formulation.

Discussion
In this section, we describe experimental results on adversarial training strategies, based on validation data splits. There are three parts to making the training work: (a) formulating the discriminator task to compute the reward, (b) reducing the variance in rewards for better gradient estimation, and (c) combining the MLE and adversarial objectives for more stable training.

Discriminator Formulation
We observe in Table 3 that the single-sentence discriminator (SS) performs the worst among all discriminator formulations. Furthermore, SS performs even worse than the baseline generators, which suggests that it acts as a barrier to their ability to generalize. We attribute this performance limitation to two factors. First, since the model does not consider the original sentence, it lacks the ability to learn which parts of the sentence make it ungrammatical, rewarding marginally correct and highly incorrect sentences similarly. We investigate this idea by feeding the discriminator incorrect sentences sampled from P_data and observe that they get nearly the same reward from SS despite their varying degrees of incorrectness. This impedes generator improvement, as any inaccuracies are penalized disproportionately. Second, producing grammatically correct sequences is not enough to solve the task: a generated sequence can be grammatically correct yet semantically or lexically different. A discriminator that lacks the contextual information provided by the original sentence can assign such sequences a high reward, propagating such false starts. Consequently, even a generator that collapses to producing a single grammatical sentence would receive a high reward from the discriminator.
On the other hand, GLEU achieves better performance than SS, but weaker performance than SP. This corroborates the above argument: GLEU, essentially being a special case of the SP formulation, is able to provide a higher-quality reward, since it tries to account for fluency and grammaticality in evaluation against references. SP, on the other hand, is able to go beyond the GLEU score's low-level n-gram matching criteria, learning latent characteristics of the GEC task and providing a more appropriate reward to the generator. First, acting in this way provides a much smoother objective compared with GLEU, since the latter is quite sensitive to slight translation differences at the word or phrase level. Second, the generator and discriminator co-evolve. The dynamics of the discriminator make the generator grow in an adaptive way rather than being controlled by a fixed evaluation metric such as GLEU, achieving better distributional alignment, which is further verified by its superior performance.

Balancing Discriminator Pre-Training
Since GAN training is a min-max loss optimization with alternating updates to the generator and the discriminator, it is hard to reach the global optimum, which is a saddle point. To successfully reach the saddle point, balancing the generator and discriminator co-training is essential. But the discriminator usually converges faster than the generator, so it is hard to achieve that balance. Failure to do so often leads to problems like mode collapse or an inability to learn altogether. While the generator is pre-trained to reach the best development-set performance, we control the discriminator pre-training to balance the adversarial training. Hence, we evaluate the impact of the pre-trained discriminator's accuracy ε as a tunable hyperparameter. We pre-train seven RNN discriminators to reach accuracies in the range [0.6, 0.9]. With these discriminators, we train corresponding Adversarial-GEC models (using a Transformer generator, λ = 0.4) and evaluate their performance on the development set at regular intervals. Fig. 2 shows that the initial accuracy of the discriminator significantly impacts the final performance and needs to be set carefully. If it is either too high (0.85 and 0.9) or too low (0.6 and 0.65), the model performs poorly. This points to the need for a balanced relationship between the generator and the discriminator. If the discriminator is too strong, the generator is heavily penalized for its erroneous predictions, and the performance progressively gets worse. On the other hand, if the discriminator is too weak, it is unable to give the most appropriate guidance to the generator. Empirically, we pre-train the discriminator until its accuracy reaches the 0.7-0.75 range.

Combining MLE and Adversarial Objectives
As noted in Section 2.4, a key factor in successfully training G_θ is the combination of adversarial and MLE objectives, where we define the hyperparameter λ to control the trade-off between MLE and adversarial training. That is, each mini-batch is optimized with either the MLE objective or the adversarial objective, as determined by the probability λ, to improve the stability of model training. We experiment with the range [0.2, 0.8] for λ. The results in Fig. 3 show that combining the MLE objective with the adversarial objective helps stabilize the training and improve model performance, as expected. This confirms prior findings that MLE acts as a regularizer that guarantees smooth model updates, alleviating the negative effects brought about by the high gradient-estimation variance of the one-step Monte-Carlo sample in REINFORCE. However, further increasing λ does not bring more gains. The best trade-off between the MLE and adversarial objectives in our experiments is λ = 0.4, which is the value we use in our experiments.

Experiments with Language Models
In the SS setting, we also experimented with a locally-normalized language model as a discriminator. The intuition was that a language model with token-level, locally normalized probabilities could offer a more direct training signal to the generator. If a generated sentence does not match the distribution of ground-truth data, it will have high perplexity when evaluated by a language model trained on ground-truth data. Not only can such a model provide an overall evaluation score for the whole sentence, but it can also assign a probability to each token, thus providing more information on which word is to blame if the overall perplexity is very high. However, in spite of all the training strategies described in Section 2.4, training with a language-model discriminator was highly unstable, due to the use of a single sample to approximate the expected gradient, leading to high variance in gradient estimates. In future work, we aim to explore this idea using better generator models and better, larger-scale language models such as BERT (Devlin et al., 2018) and GPT-3 (Brown et al., 2020).

Related Work
Our choice of a sentence-pair discriminator is closest to Wu et al. (2018), who built an RNNSearch-based generator (Bahdanau et al., 2015) and a CNN-based sentence-pair discriminator for NMT. Our work differs in that their learning objective combines the discriminator reward (D) with a smoothed sentence-level BLEU (Papineni et al., 2002) as a static reward (Q); we do not combine rewards from D and Q. While incorporating Q in the objective stems from the motivation to directly optimize for the evaluation metric, we choose not to force an evaluation-metric-based reward into the objective, since most GEC metrics are reference-based and have been shown to be limiting for the task (Choshen and Abend, 2018; Chollampatt and Ng, 2018c). Among existing works for GEC, our work is closest to Sakaguchi et al. (2017), but they directly maximize GLEU in training their GEC system, using a REINFORCE-based approach similar to ours. We instead let the model learn the latent nuances of the objective directly from the data and provide the appropriate reward to the generator, preserving the learning objective as in Yu et al. (2017), albeit with a different discriminator framework.

Conclusion
We propose a task-appropriate training objective for GEC, using an adversarial training framework consisting of a generator and a discriminator, based on the Adversarial-NMT framework of Wu et al. (2018). The generator is modeled as a Seq2Seq model, and the discriminator is modeled as a deep sentence-pair matching model, which provides rewards for the generator's input-output pairs. The framework supervises the generator to reflect the mapping between source and target sentences, and uses an efficient policy gradient algorithm to tackle the optimization difficulty brought about by the discrete nature of generation. Experiments on standard GEC test datasets demonstrate the effectiveness of our framework for the task. Additionally, we provide insights into how the discriminator's setup, pre-training, and integration into the framework can be optimized for stable training and better performance. We show that the proposed framework consistently achieves better results in a self-contained, single-model setting, without relying on any external resources. In the future, we plan to improve the task-specific framework and training techniques based on recent state-of-the-art methods (Grundkiewicz et al., 2019; Choe et al., 2019), and to address issues with sparse rewards by exploring better credit assignment techniques.