A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction

Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected. To address this, we designed a self-refinement method whose key idea is to denoise these datasets by leveraging the prediction consistency of existing models; it outperformed strong denoising baseline methods. We further applied task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. We then analyzed the effect of the proposed denoising method and found that our approach leads to improved coverage of corrections and facilitates fluency edits, both of which are reflected in higher recall and overall performance.


Introduction
Grammatical error correction (GEC) is often considered a variant of machine translation (MT) (Brockett et al., 2006) due to their structural similarity: "translating" from ungrammatical source text to grammatical target text. At present, several neural encoder-decoder (EncDec) approaches have been introduced for this task and have achieved remarkable results (Chollampatt and Ng, 2018; Zhao et al., 2019; Kiyono et al., 2019). EncDec models tend to further improve in performance with increasing data size (Koehn and Knowles, 2017; Sennrich and Zhang, 2019); however, this is not necessarily true in GEC. For example, Lo et al. (2018) reported that an EncDec-based GEC model trained on EFCamDat (Geertzen et al., 2013), the largest publicly available learner corpus as of today (two million sentence pairs), was outperformed by a model trained on a smaller dataset (e.g., 720K pairs). They hypothesized that this may be due to the noisiness of EFCamDat, i.e., the presence of sentence pairs whose correction still contained grammatical errors due to inappropriate edits or being left uncorrected. For example, in Table 1, "discuss about" should most likely have been corrected to "discuss" and "are discussing", respectively. We confirmed that there is a non-negligible amount of noise in commonly used GEC datasets (Section 3).

(1) Errors are inappropriately edited
Source: I want to discuss about the education.
Target: I want to discuss of the education.

(2) Errors are left uncorrected
Source: We discuss about our sales target.
Target: We discuss about our sales target.

Table 1: Example of an inappropriately corrected error and an unchanged error in EFCamDat. We consider these types of errors to be dataset noise that might hinder GEC model performance.
We recognise data noise as a generally overlooked issue in GEC, and consider the question of whether a better GEC model can be built by reducing noise in GEC corpora. To this end, we designed a self-refining approach: an effective denoising method in which residual errors left by careless or unskilled annotators are corrected by an existing GEC model. This approach relies on the consistency of the GEC model's predictions (Section 4).
We evaluated the effectiveness of our method over several GEC datasets, and found that it considerably outperformed baseline methods, including three strong denoising baselines based on a filtering approach, which is common in MT (Bei et al., 2018; Junczys-Dowmunt, 2018; Rossenbach et al., 2018). We further improved the performance by applying task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. Finally, through our analysis, we found unexpected benefits to our approach: (i) it benefits from the advantages of self-training in neural sequence generation due to its structural similarity (Section 6.3), (ii) it results in a significant increase in recall while maintaining equal precision, indicating improved coverage of corrections (Section 6.4), and (iii) it tends to produce more fluent edits, possibly leading to more native-sounding corrections (Section 6.5). The last is reflected in performance on the JFLEG benchmark, which focuses on fluency edits.
In summary, we present a data denoising method which improves GEC performance, verify its effectiveness by comparing it to both strong baselines and current best-performing models, and analyze how the method affects both GEC performance and the data itself.

Related Work
In GEC, previous studies have generally focused on typical errors, such as the use of articles (Han et al., 2006), prepositions (Felice and Pulman, 2008), and noun numbers (Nagata et al., 2006). More recently, many studies have addressed GEC as an MT problem where ungrammatical text is expected to be translated into grammatical text. This approach allows the adoption of sophisticated sequence-to-sequence architectures (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) that have achieved strong performance but require a large amount of data (Chollampatt and Ng, 2018; Kiyono et al., 2019). In GEC, the data are usually manually built by experts, which leads to an underlying assumption that the data are noise-free. Therefore, to the best of our knowledge, noise in existing common datasets remains largely under-explored, and no previous research has investigated the effectiveness of denoising GEC datasets. Recently, Lichtarge et al. (2020) proposed a method for filtering large and noisy synthetic pre-training data in GEC by deriving example-level scores on their pre-training data. However, what they regard as noise consists of instances in the source sentences (i.e., not the target sentences) of the synthetic data that fall outside the genuine learner error distribution, and they perform data selection based on the small, higher-quality genuine data (namely, the learner corpora we attempt to denoise in this study). Therefore, our methods are not comparable, and combining both methods can be expected to further improve performance, which we plan to investigate in future work.
In contrast, data noise is becoming an increasingly important topic in MT, where it is common to use parallel data automatically acquired via web crawling in addition to high-quality curated data. As a result, the MT field faces various data quality issues such as misalignment and incorrect translations, which may significantly impact translation quality. A straightforward solution is to apply a filtering approach, where noisy data are filtered out and a smaller subset of high-quality sentence pairs is retained (Bei et al., 2018; Junczys-Dowmunt, 2018; Rossenbach et al., 2018). Nevertheless, it is unclear whether such a filtering approach can be successfully applied to GEC, where commonly available datasets tend to be far smaller than those used in recent neural MT research. Hence, in this study, we investigate its effectiveness by conducting a comparative experiment with the proposed denoising approach.

Noise in GEC Datasets
In this study, we define noise as two types of residual grammatical errors in target sentences: inappropriate edits and errors left uncorrected (Table 1). Most learner corpora, such as EFCamDat and Lang-8 (Mizumoto et al., 2011; Tajiri et al., 2012), are constructed based on correction logs in which the source texts are provided by human language learners and the corresponding corrected target texts are provided by editors (annotators). Unless each annotator has 100% accuracy, all corpora inevitably contain noise.
The presence of noise in GEC data was uncovered by previous work such as Lo et al. (2018), but its exact nature was left unexplored. To confirm it, we manually assessed how much noise was contained in the following three commonly used training datasets: the BEA official training dataset (henceforth, BEA-train) provided in the BEA-2019 workshop (Bryant et al., 2019), EFCamDat, and the non-public Lang-8 corpus (henceforth, Lang-8). For 300 target sentences Y from each dataset, one expert reviewed them and we obtained denoised versions Ỹ (Table 2). We then calculated the word edit rate (WER) between the original target sentences Y and the denoised target sentences Ỹ. WER is defined as follows:

    WER = (1/n) Σ_i d(Y_i, Ỹ_i) / |Y_i|    (1)

where |Y_i| is the total number of words in each original target sentence Y_i and d(·) is the word-based Levenshtein distance. Table 3 shows the amount of noise in the datasets estimated by WER.

(1) BEA-train
X: I will make a poet to kill this pain.
Y: I will make a poem to kill this pain.
Ỹ: I will write a poem to get rid of this pain.

(2) EFCamDat
X: The restaurant in front of movie teather.
Y: The restaurant in front of movie theater.
Ỹ: The restaurant is located opposite the movie theater.

(3) Lang-8
X: Coordinate with product support team for potential customer show site visit ;
Y: Coordinate with product support team for potential customer show site visits;
Ỹ: Please coordinate with the product support team to escort potential customers to site visits.

Table 2: Examples of source sentences X, original target sentences Y, and denoised target sentences Ỹ from each dataset.
Here, the WER values are slightly higher than expected, but this is most likely caused by fluency edits made by the editor, which make a sentence more native-like. Thus, we found that (i) there is a non-negligible amount of "noise" in the most commonly used training data for GEC, and (ii) EFCamDat is much noisier than the other two training datasets.
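The WER measure above can be sketched in a few lines of Python. The macro-average over sentences below is one plausible reading of the definition (noise estimates in the text may have been computed slightly differently), and `levenshtein` operates on word lists, matching the word-based distance d(·):

```python
def levenshtein(a, b):
    # Word-based edit distance d(a, b) via single-row dynamic programming.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def word_edit_rate(originals, denoised):
    # WER = (1/n) * sum_i d(Y_i, Y~_i) / |Y_i|, macro-averaged per sentence.
    rates = [levenshtein(y.split(), d.split()) / len(y.split())
             for y, d in zip(originals, denoised)]
    return sum(rates) / len(rates)
```

For instance, `word_edit_rate(["a b c d"], ["a x c d"])` gives 0.25 (one substituted word out of four).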

Proposed Denoising Method
The supervised learning problem for GEC is formally defined as follows. Let θ be all trainable parameters of a GEC model, and D be training data consisting of pairs of an ungrammatical source sentence X and a grammatical target sentence Y, i.e., D = {(X_i, Y_i)} for i = 1, ..., n. Then, the objective is to find the optimal parameters θ̂ that minimize the following loss function L(D, θ) on training data D:

    L(D, θ) = − Σ_{(X,Y) ∈ D} log p(Y | X; θ)    (2)

Conventionally, training data D is assumed to be "clean" parallel data. However, as argued in Section 3, this assumption typically does not hold in GEC. Here, we assume that training data D is "noisy" and, for clarity, we use the notation D̃ to represent "clean" parallel data, where "clean" means "denoised" in this context. The goal is, first, to obtain a new set D̃ by denoising D, and then to obtain a GEC model θ̃ on the new training data D̃.
To deal with data noise, a straightforward solution is to apply a filtering approach, as employed in MT, where noisy data are filtered out and a smaller subset of high-quality sentence pairs is retained. However, filtering may not be the best choice in GEC for two reasons: (i) GEC is a low-resource task compared to MT, so further reducing data size by filtering may be critically ineffective; (ii) even noisy instances may still be useful for training, since they might contain some correct edits as well (note that these correct edits would also be lost to filtering, further decreasing the amount of informative cues in training).
As an alternative to filtering, we propose a self-refinement (SR) approach for denoising GEC training data (Algorithm 1). The main idea is to train a GEC model (henceforth, base model) on the noisy parallel data D and to use it to refine the target sentences in D. Noisy annotations are potentially caused by carelessness or insufficient skill of annotators, which leads to inconsistent corrections in similar contexts. In contrast, machine learning-based GEC models, such as EncDec, tend to be reliably consistent given similar contexts. Given the noisy parallel data D, we generate new target sentences Ŷ_i from the original target sentences Y_i and pair them with their original source sentences X_i (line 4 in Algorithm 1). The consistency of the base model's predictions ensures that the resulting parallel data D̃ = {(X_i, Ŷ_i)} contain noise to a lesser extent. It is worth noting that SR can be regarded as a variant of self-training due to its structural similarity, except that it takes the target sentences rather than the source sentences as input to the model. This is the key difference from existing methods based on self-training (Wang, 2019; Nie et al., 2019; Xie et al., 2020).
One challenge of this approach is that the base model may consistently make inaccurate corrections. We thus incorporate a fail-safe mechanism as a sub-component that restores the original target sentence if the GEC model makes an incorrect correction (lines 5-9). For example, in cases such as in Table 1, the base model may predict every instance as "discuss about". In this step, to determine whether to accept the output Ŷ of the base model as a new target sentence, we compare the perplexity of the model output PPL(Ŷ) with that of the original target sentence PPL(Y). Language models are trained on native-written corpora, meaning they can reasonably be assumed to contain the information needed to estimate grammaticality. We believe that perplexity is a straightforward way to exploit this information.
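As a minimal sketch, the self-refinement loop with its fail-safe can be written as below. Here `correct` and `ppl` are hypothetical stand-ins for the trained base GEC model and the length-normalized language-model perplexity, neither of which is reproduced here:

```python
def self_refine(pairs, correct, ppl):
    # Denoise (X, Y) training pairs with a base GEC model.
    # `correct`: base GEC model applied to the *target* side of each pair.
    # `ppl`: length-normalized LM perplexity used by the fail-safe check.
    refined = []
    for x, y in pairs:
        y_hat = correct(y)            # refine the (possibly noisy) target
        if ppl(y_hat) <= ppl(y):      # fail-safe: accept only if not worse
            refined.append((x, y_hat))
        else:
            refined.append((x, y))    # restore the original target
    return refined
```

The denoised model is then trained on the returned pairs; the comparison direction (accept when PPL of the refinement is no larger) follows the fail-safe description above.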

Experiments
We evaluate the proposed method in two ways. First, we exclusively focus on investigating the effectiveness of the proposed denoising method (Section 5.3). Then, we compare our strongest model trained with denoised data (henceforth, denoised model) with current best-performing models to investigate whether the proposed method has a complementary effect on existing task-specific techniques (Section 5.4).

Configurations
Dataset For the training dataset, we used the same datasets as mentioned in Section 3: BEA-train, EFCamDat, and Lang-8. In addition, we used the BEA official validation set (henceforth, BEA-valid) provided in the BEA-2019 workshop as validation data. The characteristics of the datasets are summarized in Table 4. For preprocessing, we tokenized the training data using the spaCy tokenizer. Then, we removed sentence pairs where both sentences were identical or longer than 80 tokens. Finally, we acquired subwords from the target sentences via the byte-pair-encoding (BPE) algorithm (Sennrich et al., 2016b). We used the subword-nmt implementation and applied the learned BPE to split both the source and target texts. The number of merge operations was set to 8,000.
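The pair-removal step can be sketched as follows. This is an assumption-laden reading of the filtering rule: it counts whitespace tokens (rather than spaCy or BPE tokens) and drops a pair when either side exceeds the length limit:

```python
def filter_pairs(pairs, max_len=80):
    # Drop pairs that carry no correction signal (identical sides) and
    # pairs that are overly long; the 80-token limit follows the text,
    # while the either-side reading and whitespace tokens are assumptions.
    kept = []
    for src, tgt in pairs:
        if src == tgt:
            continue
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue
        kept.append((src, tgt))
    return kept
```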
Evaluation To investigate the effectiveness of the proposed method, we followed prior work and evaluated the performance of the GEC models across various GEC datasets in terms of the same evaluation metrics, including M² and GLEU (Napoles et al., 2017). All reported results (except those corresponding to the ensemble models) are the average of three distinct trials using three different random seeds. Let us emphasize that our focus is on denoising the training data; denoising the test data is out of the scope of this study. The commonly used test sets, such as CoNLL-2014 and JFLEG, have multiple references, which can lower the noise factor. In addition to having multiple references, both JFLEG and CoNLL-2014 were specifically constructed for GEC evaluation, while the training data (Lang-8 and EFCamDat) are more of an organic collection of learner and editor interactions. We therefore believe it is reasonable to assume that the test data are considerably cleaner.
Model We employed the "Transformer (big)" settings of Vaswani et al. (2017) using the implementation in the fairseq toolkit. Details on the hyper-parameters are listed in Appendix B. As the language model for the fail-safe mechanism, we used the PyTorch implementation of GPT-2 (Radford et al., 2019). Note that, to avoid a preference for shorter phrases, we normalized the perplexity by sentence length.
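Length normalization of perplexity can be sketched as below; `token_logprob` is a hypothetical stand-in for the per-token log-probability of a trained language model such as GPT-2:

```python
import math

def normalized_ppl(tokens, token_logprob):
    # Length-normalized perplexity: exp(-(1/|s|) * sum_i log p(w_i | w_<i)).
    # Dividing by the sentence length removes the bias toward shorter
    # phrases that raw sequence probability would introduce.
    total = sum(token_logprob(tokens[:i], tokens[i])
                for i in range(len(tokens)))
    return math.exp(-total / len(tokens))
```

Under a uniform toy model assigning every token probability 0.5, sentences of any length score the same normalized perplexity (2.0), illustrating the length invariance.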

Baselines
As argued in Section 4, we hypothesized that filtering-based denoising approaches are not well-suited for GEC. To verify this hypothesis, we employed the following three filtering-based denoising baselines in addition to a base model trained on the noisy parallel data D (henceforth, no denoising).
Cross-entropy filtering (CE filtering) The dual conditional cross-entropy filtering method was proposed by Junczys-Dowmunt (2018) and achieved the highest performance on the noisy parallel corpus filtering task at WMT 2018.
In this study, we prepared forward and reverse pre-trained models using the BEA-train dataset to adapt the filtering method to GEC. We obtained the filtered data by removing the 20% of sentence pairs with the highest scores from the training data and used the remainder for training.
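A sketch of the scoring and the 20% cut is shown below. It assumes the per-token cross-entropies `h_fwd` (forward model, target given source) and `h_rev` (reverse model, source given target) have already been computed; the penalty form follows Junczys-Dowmunt (2018), where lower scores indicate cleaner pairs:

```python
def dual_ce_score(h_fwd, h_rev):
    # Dual conditional cross-entropy: penalize disagreement between the
    # two directions plus their mean; lower is cleaner.
    return abs(h_fwd - h_rev) + 0.5 * (h_fwd + h_rev)

def ce_filter(scored_pairs, drop_ratio=0.2):
    # scored_pairs: list of (score, pair); drop the worst-scoring 20%.
    keep_n = len(scored_pairs) - int(len(scored_pairs) * drop_ratio)
    ranked = sorted(scored_pairs, key=lambda t: t[0])
    return [pair for _, pair in ranked[:keep_n]]
```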
Sentence-level error detection filtering (SED filtering) Asano et al. (2019) demonstrated the effectiveness of a sentence-level error detection (SED) model as a filtering tool to preprocess GEC input. Considering these findings, we adopted SED as a filtering-based denoising method for the training data. More specifically, we discarded a source-target sentence pair in the noisy parallel data D whenever the SED model predicted the target sentence to be incorrect. Following Asano et al. (2019), we obtained binary-labeled data from the BEA-train dataset to prepare a training set for the SED model, and then fine-tuned BERT (Devlin et al., 2019) on the prepared data.
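The SED filter reduces to a simple predicate over pairs; `target_is_incorrect` below is a hypothetical stand-in for the fine-tuned BERT SED classifier:

```python
def sed_filter(pairs, target_is_incorrect):
    # Discard a pair whenever the SED classifier flags its target sentence
    # as still containing a grammatical error.
    return [(x, y) for x, y in pairs if not target_is_incorrect(y)]
```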
Language model filtering (LM filtering) Language model-based filtering rests on the hypothesis that if the perplexity of a target sentence is larger than that of the source sentence, the target sentence is more likely to contain noise. LM filtering thus has the same motivation as the fail-safe mechanism. We used GPT-2 as the pre-trained language model.

Results Table 5 shows the results. The filtering methods, such as SED and LM filtering, generally achieved better results compared to the no-denoising baseline; however, they resulted in lower performance on smaller datasets such as BEA-train. This could be because these filtering methods filtered out training instances containing not only noise but also many correct corrections that may still be partially useful for training. Table 6 shows the size of each training dataset after filtering. Figure 1 shows the increases and decreases in precision and recall when the performance without denoising is set to 0. The experimental results reveal a consistent pattern in the denoising effect: reducing noise with SR has little impact on precision but significantly improves recall, indicating improved coverage of corrections. We provide a detailed analysis of this observation in Section 6.4.
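The LM filtering hypothesis also reduces to a one-line predicate; `ppl` is a stand-in for the (length-normalized) GPT-2 perplexity, not the real model:

```python
def lm_filter(pairs, ppl):
    # Keep a pair only when the target is at least as fluent as the source
    # under the language model; a higher-perplexity target is treated as
    # likely noise, per the hypothesis stated in the text.
    return [(x, y) for x, y in pairs if ppl(y) <= ppl(x)]
```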

Comparison with Existing Models
In the second experiment, we compared our best denoised model with the current best-performing models to investigate whether SR works well with existing task-specific techniques. We incorporated task-specific techniques that have been widely used in shared tasks such as BEA-2019 and WMT-2019 into the proposed denoised model to further improve performance. Concerning the task-specific techniques, we followed the work reported by Kiyono et al. (2019), as detailed below.
Pre-training with pseudo data (PRET) Kiyono et al. (2019) investigated the applicability of incorporating pseudo data into the model and confirmed the reliability of their proposed settings by showing acceptable performance on several datasets. We trained the proposed model using their pre-trained "PRETLARGE+SSE" settings.
Right-to-left re-ranking (R2L) R2L is a common approach in MT for improving model performance by re-ranking with right-to-left models trained in the reverse direction (Sennrich et al., 2016a, 2017). More recently, previous studies confirmed the effectiveness of this approach when applied to GEC (Ge et al., 2018; Grundkiewicz et al., 2019). We adapted R2L to the proposed model. Specifically, we generated n-best hypotheses using an ensemble of four left-to-right (L2R) models and then re-scored the hypotheses using the R2L models.
We then re-ranked the n-best hypotheses based on the sum of the two scores.
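The re-ranking step amounts to picking the hypothesis with the highest combined score. In the sketch below, `l2r_score` and `r2l_score` are stand-ins for the (ensemble) model scores, assumed to be log-probabilities where higher is better:

```python
def rerank(nbest, l2r_score, r2l_score):
    # Choose the hypothesis maximizing the sum of the left-to-right
    # (generation) score and the right-to-left (reverse-model) score.
    return max(nbest, key=lambda h: l2r_score(h) + r2l_score(h))
```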
Sentence-level error detection (SED) SED is used to identify whether a given sentence contains any grammatical errors. Following the work presented by Asano et al. (2019), we employed a strategy that reduces the number of false positives by applying the GEC model only to sentences that the SED model identifies as containing grammatical errors. We implemented the same model employed for SED filtering.
We evaluated the performance of the proposed best denoised model combined with the task-specific techniques on three existing benchmarks: CoNLL-2014, JFLEG, and BEA-test, and then compared the scores with those of existing best-performing models. Table 7 shows the results for both the single and the ensemble models after applying PRET, SED, and R2L to SR (see Appendix F for an ablation study of SED). Improved results on CoNLL-2014 and BEA-2019 appeared on arXiv less than three months before our submission (Kaneko et al., 2020; Omelianchuk et al., 2020) and are considered contemporaneous with it; more detailed experimental results, including a comparison with these models, are presented in Appendix E. Since the references of BEA-test are publicly unavailable,
we evaluated the models on CodaLab under the rules of the BEA-2019 workshop. We confirmed that our best denoised model works complementarily with existing task-specific techniques, as compared with the performance presented in Table 5. As a result, our best denoised model achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. Notably, the proposed model achieved state-of-the-art results on the JFLEG benchmark in terms of both the single (GLEU = 63.3) and ensemble (GLEU = 63.7) results. We provide a detailed analysis of this observation in Section 6.5.

Noise Reduction
To evaluate the quality of the dataset after denoising, a researcher with a high level of English proficiency (not involved with this work) manually evaluated 500 triples of source sentences X, original target sentences Y, and generated target sentences Ŷ, obtained by applying SR to EFCamDat and satisfying X ≠ Y ≠ Ŷ (Table 8). We can see that 73.6% of the replaced samples were determined to be appropriate corrections, including cases where both were correct. For reference, we provide examples of a confusion set before and after denoising in Appendix D.

(1) Improved by denoising (66.4%)
X: how about to going to movie .
Y: How about to going to movie .
Ŷ: How about going to a movie .

(2) Both are correct (7.2%)
X: I'm twenty-nine old.
Y: I'm twenty-nine years old.
Ŷ: I'm 29 years old.

(3) Meaning is not preserved (10.4%)
X: you need keep calm.
Y: You need to keep calm.
Ŷ: You need to be calm.

(4) Added unnecessary information (8.8%)
X: The are a few of chair and desk.
Y: There are a few chairs and desks.
Ŷ: There are a few chairs and desks too.

Table 8: Examples of each category in the manual evaluation of denoised EFCamDat samples.

Effect of the Fail-safe Mechanism
Next, we quantitatively and qualitatively analyzed the effectiveness of the fail-safe mechanism integrated into SR.
Quantitatively, Table 9 provides the results of the ablation study of the fail-safe mechanism on CoNLL-2014. Our main proposal was to include a self-refining step to clean up training data, but we found that the added fail-safe mechanism serves as a sub-component to further improve performance.
Qualitatively, we directly observed the decisions of the fail-safe mechanism and how it affected denoising. Table 10 provides examples for cases when SR activates and deactivates the fail-safe mechanism in EFCamDat. In the upper example (Table 10-1), *discuss of in the source sentence should have been corrected to discuss; however, it was inaccurately edited to *discuss about in the target sentence. In this case, SR succeeded in selecting the correct model output with a lower perplexity without activating the fail-safe mechanism. On the other hand, in the lower example, the model made an incorrect "correction" (*in → at). However, SR successfully activated the fail-safe mechanism and thus retained the correct original target sentence.

Benefits from Self-training
SR performed surprisingly well considering its simplicity. One reason might be that SR benefits from the advantages of self-training, as it can be regarded as a variant of self-training (Section 4). He et al. (2020) investigated the effect of self-training in neural sequence generation and found that dropout in the pseudo-training step (namely, the training step of the denoised model in this study) plays an important role in providing a smoothing effect, meaning that semantically similar inputs are mapped to the same or similar targets. As GEC also ideally assumes consistent targets for similar contexts, this smoothing effect could help avoid overfitting and improve the fit to the target distribution in the pseudo-training step. In fact, we confirmed that performance deteriorated when dropout was not applied in the training step of the denoised model, as shown in Table 11. For relatively noisy data such as EFCamDat and Lang-8, the performance was still better than without denoising, even without dropout. This can be explained by the presence of the denoising effect that was the objective of this study.

On the Increase of Recall
A pattern emerged when denoising with SR: recall significantly increased, while precision was mostly maintained (Figure 1). To clarify this observation, we manually assessed the amount of noise before and after denoising. Specifically, in the same way as in Section 3, we asked the expert to review 500 samples of the target sentences before denoising, Y, and after denoising, Ŷ. We then calculated the amount of noise using WER (Eq. 1). As a result, we observed that WER decreased from 43.2% before denoising to 31.3% after denoising. This can be interpreted as follows: (i) a large part of the noise consisted of uncorrected errors, and (ii) the effect on model training was to correct the bias towards leaving errors unedited, resulting in higher recall.

Facilitating Fluency Edits
The results presented in Table 7 indicate that the proposed denoised model tends to (i) perform better on JFLEG and (ii) be rated especially highly in GLEU compared to other best-performing models. JFLEG was proposed by Napoles et al. (2017) for the development and evaluation of GEC models in terms of fluency as well as grammaticality, i.e., making a sentence more native sounding. Moreover, they showed that GLEU correlates more strongly with human judgments than M² on JFLEG. The fact that SR is rated higher on JFLEG in GLEU than other best-performing models can be interpreted as its achieving more fluent editing. One reason might be that SR performs a perplexity check on both the original target sentences and the new ones obtained after denoising, which always results in PPL(Ŷ) ≤ PPL(Y) between D and D̃. Therefore, SR can be expected to refine not only the grammaticality but also the fluency of the target sentences, and as a result, the proposed denoised model is capable of making more native-sounding corrections.

Conclusion and Future Work
In this study, we focused on the quality of GEC datasets. The motivation behind our study was the hypothesis that carelessness or insufficient skill of the annotators involved in data annotation often leads to noisy datasets. To address this problem, we presented a self-refinement approach as a simple but effective denoising method that improves GEC performance, and verified its effectiveness by comparing it to both strong baselines based on a filtering approach and current best-performing models. Furthermore, we analyzed how SR affects both GEC performance and the data itself.
Recently, several methods that incorporate pre-trained masked language models such as BERT, XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) into EncDec-based GEC have been proposed and have achieved remarkable results (Kaneko et al., 2020; Omelianchuk et al., 2020). These approaches modify the model architecture and do not directly compete with the data-driven approach discussed in this study. Thus, combining these methods can be expected to further improve performance, which we plan to investigate in future work.

C Preliminary experiment of the cross-entropy filtering
We investigated the effectiveness of changing the threshold of CE filtering by evaluating model performance on BEA-valid. In this experiment, we prepared forward and reverse pre-trained models using BEA-train and CoNLL-2013 as the training and validation sets, respectively.

D Examples of a confusion set before and after denoising

Table 13 provides examples of a confusion set before and after applying the denoising method to EFCamDat. We confirmed that the proposed denoising succeeded in reducing noisy confusion sets in the target sentences, such as (*discuss about, *discuss about) and (*enter in, *enter in).

E Note on contemporaneous work

Kaneko et al. (2020) and Omelianchuk et al. (2020) appeared on arXiv less than three months before our submission and are considered contemporaneous with it.