An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction

The incorporation of pseudo data in the training of grammatical error correction models has been one of the main factors in improving the performance of such models. However, consensus is lacking on experimental configurations, namely, choosing how the pseudo data should be generated or used. In this study, these choices are investigated through extensive experiments, and state-of-the-art performance is achieved on the CoNLL-2014 test set (F0.5=65.0) and the official test set of the BEA-2019 shared task (F0.5=70.2) without making any modifications to the model architecture.


Introduction
To date, many studies have tackled grammatical error correction (GEC) as a machine translation (MT) task, in which ungrammatical sentences are regarded as the source language and grammatical sentences are regarded as the target language. This approach allows cutting-edge neural MT models to be adopted. For example, the encoder-decoder (EncDec) model (Sutskever et al., 2014; Bahdanau et al., 2015), which was originally proposed for MT, has been applied widely to GEC and has achieved remarkable results in the GEC research field (Ji et al., 2017; Chollampatt and Ng, 2018). However, a challenge in applying EncDec to GEC is that EncDec requires a large amount of training data (Koehn and Knowles, 2017), but the largest set of publicly available parallel data in GEC has only two million sentence pairs (Mizumoto et al., 2011). Consequently, the method of augmenting the data by incorporating pseudo training data has been studied intensively (Xie et al., 2018; Ge et al., 2018; Lichtarge et al., 2019; Zhao et al., 2019).
* Current affiliation: Future Corporation
When incorporating pseudo data, several decisions must be made about the experimental configurations, namely, (i) the method of generating the pseudo data, (ii) the seed corpus for the pseudo data, and (iii) the optimization setting (Section 2). However, consensus on these decisions in the GEC research field is yet to be formulated. For example, Xie et al. (2018) found that a variant of the backtranslation (Sennrich et al., 2016b) method (BACKTRANS (NOISY)) outperforms the generation of pseudo data from raw grammatical sentences (DIRECTNOISE). By contrast, the current state-of-the-art model (Zhao et al., 2019) uses the DIRECTNOISE method.
In this study, we investigate these decisions regarding pseudo data, our goal being to provide the research community with an improved understanding of the incorporation of pseudo data. Through extensive experiments, we determine suitable settings for GEC. We justify the reliability of the proposed settings by demonstrating their strong performance on benchmark datasets. Specifically, without any task-specific techniques or architecture, our model outperforms not only all previous single-model results but also all ensemble results except for the ensemble result by Grundkiewicz et al. (2019). By applying task-specific techniques, we further improve the performance and achieve state-of-the-art performance on the CoNLL-2014 test set and the official test set of the BEA-2019 shared task.

Problem Formulation and Notation
In this section, we formally define the GEC task discussed in this paper. Let D be the GEC training data that comprise pairs of an ungrammatical source sentence X and a grammatical target sentence Y, i.e., D = {(X_n, Y_n)}_n. Here, |D| denotes the number of sentence pairs in the dataset D.
Let Θ represent all trainable parameters of the model. Our objective is to find the optimal parameter set Θ that minimizes the following objective function L(D, Θ) for the given training data D:
L(D, \Theta) = -\frac{1}{|D|} \sum_{(X, Y) \in D} \log p(Y \mid X, \Theta)    (1)
where p(Y | X, Θ) denotes the conditional probability of Y given X.
In the standard supervised learning setting, the parallel data D comprise only "genuine" parallel data D_g (i.e., D = D_g). However, in GEC, it is common to incorporate pseudo data D_p that are generated from grammatical sentences Y ∈ T, where T represents the seed corpus (i.e., a set of grammatical sentences) (Xie et al., 2018; Zhao et al., 2019; Grundkiewicz et al., 2019).
Our interest lies in the following three nontrivial aspects of Equation 1.
Aspect (i): multiple methods for generating pseudo data D_p are available (Section 3).
Aspect (ii): options for the seed corpus T are numerous. To the best of our knowledge, how the seed corpus domain affects the model performance is yet to be shown. As a first trial, we compare three corpora, namely, Wikipedia, Simple Wikipedia (SimpleWiki), and English Gigaword. Wikipedia and SimpleWiki have similar domains but different grammatical complexities. Therefore, we can investigate how grammatical complexity affects model performance by comparing these two corpora. We assume that Gigaword contains the smallest amount of noise among the three corpora. We can therefore use Gigaword to investigate whether clean text improves model performance.
Aspect (iii): at least two major settings for incorporating D_p into the optimization of Equation 1 are available. One is to use the two datasets jointly by concatenating them as D = D_g ∪ D_p, which hereinafter we refer to as JOINT. The other is to use D_p for pretraining, namely, minimizing L(D_p, Θ) to acquire Θ′, and then fine-tuning the model by minimizing L(D_g, Θ′); hereinafter, we refer to this setting as PRETRAIN.
We investigate these aspects through our extensive experiments (Section 4).
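The difference between the two optimization settings can be sketched in a few lines of Python. This is only an illustration of the data schedule, not the training code used in this study: `train` is a hypothetical stand-in for any routine that minimizes L(D, Θ) on the given sentence pairs and returns the updated parameters, and `toy_train` merely records how many pairs each stage sees.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (ungrammatical source X, grammatical target Y)
Trainer = Callable[[List[Pair], list], list]  # hypothetical: (data, params) -> params


def joint(genuine: List[Pair], pseudo: List[Pair], train: Trainer, theta: list) -> list:
    """JOINT: a single stage on the concatenation of D_g and D_p."""
    return train(genuine + pseudo, theta)


def pretrain(genuine: List[Pair], pseudo: List[Pair], train: Trainer, theta: list) -> list:
    """PRETRAIN: train on D_p first, then fine-tune the result on D_g only."""
    theta_prime = train(pseudo, theta)
    return train(genuine, theta_prime)


def toy_train(data: List[Pair], theta: list) -> list:
    """Toy stand-in for optimization: append the number of pairs seen in this stage."""
    return theta + [len(data)]


genuine = [("He go home .", "He goes home .")]
pseudo = [
    ("She <mask> apples .", "She likes apples ."),
    ("I book read a .", "I read a book ."),
]

joint_log = joint(genuine, pseudo, toy_train, [])        # one stage over 3 pairs
pretrain_log = pretrain(genuine, pseudo, toy_train, [])  # 2 pseudo pairs, then 1 genuine pair
```

In JOINT the genuine and pseudo pairs share every update, whereas in PRETRAIN the final stage sees genuine pairs only.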

Methods for Generating Pseudo Data
In this section, we describe three methods for generating pseudo data. In Section 4, we experimentally compare these methods.

BACKTRANS (NOISY) and BACKTRANS (SAMPLE)
Backtranslation for the EncDec model was proposed originally by Sennrich et al. (2016b). In backtranslation, a reverse model, which generates an ungrammatical sentence from a given grammatical sentence, is trained. The output of the reverse model is paired with the input and then used as pseudo data.
BACKTRANS (NOISY) is a variant of backtranslation that was proposed by Xie et al. (2018). This method adds rβ_random to the score of each hypothesis in the beam at every time step. Here, the noise r is sampled uniformly from the interval [0, 1], and β_random ∈ R≥0 is a hyper-parameter that controls the noise scale. If we set β_random = 0, then BACKTRANS (NOISY) is identical to standard backtranslation.
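The score perturbation can be illustrated with a single pruning step of beam search. The sketch below is a simplified assumption of how such a step might look, not the implementation of Xie et al. (2018): the function name `noisy_beam_step`, the `Hypothesis` type, and the use of log-probabilities as scores are all introduced here for illustration.

```python
import random
from typing import List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens so far, model score, e.g. log-probability)


def noisy_beam_step(
    candidates: List[Hypothesis],
    beam_size: int,
    beta_random: float,
    rng: random.Random,
) -> List[Hypothesis]:
    """Add r * beta_random (r drawn uniformly from [0, 1)) to each score, then prune."""
    if beta_random < 0:
        raise ValueError("beta_random must be non-negative")
    perturbed = [
        (tokens, score + rng.random() * beta_random) for tokens, score in candidates
    ]
    # Keep the beam_size highest-scoring hypotheses after perturbation.
    perturbed.sort(key=lambda hyp: hyp[1], reverse=True)
    return perturbed[:beam_size]


candidates = [(["He", "go"], -1.0), (["He", "goes"], -0.5), (["He", "gone"], -2.0)]
plain_beam = noisy_beam_step(candidates, 2, 0.0, random.Random(0))  # standard pruning
noisy_beam = noisy_beam_step(candidates, 2, 4.0, random.Random(0))  # ranking may change
```

With `beta_random = 0.0` the step reduces to ordinary beam pruning; larger values let lower-scoring (and typically noisier) hypotheses survive.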
BACKTRANS (SAMPLE) is another variant of backtranslation, which was proposed by Edunov et al. (2018) for MT. In BACKTRANS (SAMPLE), sentences are decoded by sampling from the distribution of the reverse model.

DIRECTNOISE
Whereas BACKTRANS (NOISY) and BACKTRANS (SAMPLE) generate ungrammatical sentences with a reverse model, DIRECTNOISE injects noise "directly" into grammatical sentences (Edunov et al., 2018; Zhao et al., 2019). Specifically, for each token in the given sentence, this method probabilistically chooses one of the following operations: (i) masking with a placeholder token <mask>, (ii) deletion, (iii) insertion of a random token, and (iv) keeping the original. For each token, the choice is made based on the categorical distribution (µ_mask, µ_deletion, µ_insertion, µ_keep).
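The token-level noising procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the `<mask>` string, and the toy vocabulary are hypothetical, and the sketch assumes that "insertion" places a random token before the original token and keeps the original, which the text does not specify.

```python
import random
from typing import List, Sequence

MASK = "<mask>"  # placeholder token (the exact string is an assumption)


def direct_noise(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mu_mask: float,
    mu_deletion: float,
    mu_insertion: float,
    mu_keep: float,
    rng: random.Random,
) -> List[str]:
    """Apply one operation per token, drawn from a categorical distribution."""
    ops = ["mask", "delete", "insert", "keep"]
    weights = [mu_mask, mu_deletion, mu_insertion, mu_keep]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("the four probabilities must sum to 1")
    noised: List[str] = []
    for token in tokens:
        op = rng.choices(ops, weights=weights, k=1)[0]
        if op == "mask":
            noised.append(MASK)
        elif op == "delete":
            continue  # drop the token
        elif op == "insert":
            # Assumption: the random token is inserted before the kept original.
            noised.append(rng.choice(vocab))
            noised.append(token)
        else:
            noised.append(token)
    return noised


target = "She likes green apples .".split()
toy_vocab = ["the", "of", "went", "blue"]
# The probabilities below are illustrative values, not the settings used in the paper.
source = direct_noise(target, toy_vocab, 0.1, 0.1, 0.1, 0.7, random.Random(0))
pseudo_pair = (" ".join(source), " ".join(target))
```

The resulting `(source, target)` pair plays the role of one element of D_p, with the seed sentence as the grammatical target.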

Experiments
The goal of our experiments is to investigate aspects (i)-(iii) introduced in Section 2. To ensure that the experimental findings are applicable to GEC in general, we design our experiments by using the following two strategies: (i) we use an off-the-shelf EncDec model without any task-specific architecture or techniques; (ii) we conduct hyper-parameter tuning, evaluation, and comparison of each method or setting on the validation set. At the end of the experiments (Section 4.5), we summarize our findings and propose suitable settings. We then perform a single-shot evaluation of their performance on the test set.
Table 1: Summary of datasets used in our experiments. A dataset marked with "*" is a seed corpus T.

Experimental Configurations
Dataset: The BEA-2019 workshop official dataset is the source of the training and validation data of our experiments. Hereinafter, we refer to the training data as BEA-train. We create validation data (BEA-valid) by randomly sampling sentence pairs from the official validation split.
As a seed corpus T, we use SimpleWiki, Wikipedia, or Gigaword. We apply the noising methods described in Section 3 to each corpus and generate pseudo data D_p. The characteristics of each dataset are summarized in Table 1.
Evaluation: We report results on BEA-valid, the official test set of the BEA-2019 shared task (BEA-test), the CoNLL-2014 test set (CoNLL-2014) (Ng et al., 2014), and the JFLEG test set (JFLEG) (Napoles et al., 2017). All reported results (except ensemble) are the average of five distinct trials using five different random seeds. We report the scores measured by ERRANT (Bryant et al., 2017; Felice et al., 2016) for BEA-valid, BEA-test, and CoNLL-2014. As the reference sentences of BEA-test are not publicly available, we evaluate the model outputs on CodaLab for BEA-test. We also report results measured by the M2 scorer (Dahlmeier and Ng, 2012) on CoNLL-2014 to compare them with those of previous studies. We use the GLEU metric (Napoles et al., 2015, 2016) for JFLEG.
Model: We adopt the Transformer EncDec model (Vaswani et al., 2017) using the fairseq toolkit (Ott et al., 2019) and use the "Transformer (big)" settings of Vaswani et al. (2017).
Optimization: For the JOINT setting, we optimize the model with Adam (Kingma and Ba, 2015). For the PRETRAIN setting, we pretrain the model with Adam and then fine-tune it on BEA-train using Adafactor (Shazeer and Stern, 2018).
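Because most scores in this paper are F0.5 values, a short worked example of the general F-beta formula may help when reading the tables. The sketch below computes the score from counts of correct, spurious, and missed edits; it is not a reimplementation of ERRANT or the M2 scorer, whose edit extraction and matching steps are not shown, and the function name and counts are illustrative.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit counts: true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts: 6 correct edits, 2 spurious edits, 6 missed edits.
high_precision = f_beta(tp=6, fp=2, fn=6)  # P = 0.75, R = 0.50
# Swapping the error types gives P = 0.50, R = 0.75 and a lower F0.5.
high_recall = f_beta(tp=6, fp=6, fn=2)
```

With beta = 0.5, precision is weighted more heavily than recall, so a system that proposes fewer but more reliable corrections scores higher than one with the same counts reversed.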
The results are summarized in Table 2. BACKTRANS (NOISY) and BACKTRANS (SAMPLE) achieve comparable values of F0.5. Given this result, we exclusively use BACKTRANS (NOISY) and discard BACKTRANS (SAMPLE) for the rest of the experiments. The advantage of BACKTRANS (NOISY) is that its effectiveness in GEC has already been demonstrated by Xie et al. (2018). In addition, in our preliminary experiment, BACKTRANS (NOISY) decoded ungrammatical sentences 1.2 times faster than BACKTRANS (SAMPLE) did. We also use DIRECTNOISE because it achieved the best value of F0.5 among all the methods.

Aspect (ii): Seed Corpus T
We investigate the effectiveness of the seed corpus T for generating pseudo data D_p. The three corpora (Wikipedia, SimpleWiki, and Gigaword) are compared in Table 3. We set |D_p| = 1.4M. The difference in F0.5 is small, which implies that the seed corpus T has only a minor effect on the model performance. Nevertheless, Gigaword consistently outperforms the other two corpora. In particular, DIRECTNOISE with Gigaword achieves the best value of F0.5 among all the configurations.

Aspect (iii): Optimization Setting
We compare the JOINT and PRETRAIN optimization settings. We are interested in how each setting performs when the scale of the pseudo data D_p compared with that of the genuine parallel data D_g is (i) approximately the same (|D_p| = 1.4M) and (ii) substantially larger (|D_p| = 14M). Here, we use Wikipedia as the seed corpus T instead of SimpleWiki or Gigaword for two reasons. First, SimpleWiki is too small for experiment (b) (see Table 1). Second, the fact that Gigaword is not freely available makes it difficult for other researchers to replicate our results.
(a) Joint Training or Pretraining: Table 4 presents the results. The most notable result here is that PRETRAIN benefits from more pseudo data (more data yields better performance), whereas JOINT does not. For example, in BACKTRANS (NOISY), increasing |D_p| (1.4M → 14M) improves F0.5 on PRETRAIN (41.1 → 44.5). By contrast, F0.5 does not improve on JOINT (40.4 → 40.3). An intuitive explanation for this case is that when the pseudo data D_p substantially outnumber the genuine data D_g, the teaching signal from D_p becomes dominant in JOINT. PRETRAIN alleviates this problem because the model is trained with only D_g during fine-tuning. We therefore suppose that PRETRAIN is crucial for utilizing extensive pseudo data.
(b) Amount of Pseudo Data: We investigate how increasing the amount of pseudo data affects the PRETRAIN setting. We pretrain the model with different amounts of pseudo data {1.4M, 7M, 14M, 30M, 70M}. The results in Figure 1 show that BACKTRANS (NOISY) is more sample-efficient than DIRECTNOISE. The best model (pretrained with 70M BACKTRANS (NOISY)) achieves F0.5 = 45.9.

Comparison with Current Top Models
The present experimental results show that the following configurations are effective for improving the model performance: (i) the combination of JOINT and Gigaword (Section 4.3), (ii) the amount of pseudo data D_p not being too large in JOINT (Section 4.4(a)), and (iii) PRETRAIN with BACKTRANS (NOISY) using large pseudo data D_p (Section 4.4(b)). We summarize these findings and attempt to combine PRETRAIN and JOINT. Specifically, we pretrain the model using 70M pseudo data of BACKTRANS (NOISY). We then fine-tune the model by combining BEA-train and a relatively small amount of DIRECTNOISE pseudo data generated from Gigaword (we set |D_p| = 250K). However, the performance does not improve on BEA-valid. Therefore, the best approach available is simply to pretrain the model with large (70M) BACKTRANS (NOISY) pseudo data and then fine-tune using BEA-train, which hereinafter we refer to as PRETLARGE. We use Gigaword for the seed corpus T because it has the best performance in Table 3. We evaluate the performance of PRETLARGE on test sets and compare the scores with the current top models. Table 5 shows a remarkable result, that is, PRETLARGE outperforms not only all previous single-model results but also all ensemble results except for that by Grundkiewicz et al. (2019).
To further improve the performance, we incorporate the following techniques that are widely used in shared tasks such as BEA-2019 and WMT (http://www.statmt.org/wmt19/):
Synthetic Spelling Error (SSE): Lichtarge et al. (2019) proposed the method of probabilistically injecting character-level noise into the source sentence of pseudo data D_p. Specifically, one of the following operations is applied randomly at a rate of 0.003 per character: deletion, insertion, replacement, or transposition of adjacent characters.
Right-to-left Re-ranking (R2L): Following Sennrich et al. (2016a, 2017) and Grundkiewicz et al. (2019), we train four right-to-left models. The ensemble of four left-to-right models generates n-best candidates and their corresponding scores (i.e., conditional probabilities). We then pass each candidate to the ensemble of the four right-to-left models and compute the score. Finally, we re-rank the n-best candidates based on the sum of the two scores.
Sentence-level Error Detection (SED): SED classifies whether a given sentence contains a grammatical error. Asano et al. (2019) proposed incorporating SED into the evaluation pipeline and reported improved precision. Here, the GEC model is applied only if SED detects a grammatical error in the given source sentence. The motivation is that SED could potentially reduce the number of false-positive errors of the GEC model. We use the re-implementation of the BERT-based SED model (Asano et al., 2019).
Table 5 presents the results of applying SSE, R2L, and SED. It is noteworthy that PRETLARGE+SSE+R2L achieves state-of-the-art performance on both CoNLL-2014 (F0.5 = 65.0) and BEA-test (F0.5 = 69.8), both of which are better than the scores of the best system of the BEA-2019 shared task (Grundkiewicz et al., 2019). In addition, PRETLARGE+SSE+R2L+SED further improves the performance on BEA-test (F0.5 = 70.2). Unfortunately, however, incorporating SED decreased the performance on CoNLL-2014 and JFLEG.
This implies that SED is sensitive to the domain of the test set, since the SED model is fine-tuned with the official validation split of the BEA dataset. We leave this sensitivity issue for future work.
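The three task-specific techniques are simple enough to sketch in Python. The code below is an illustration under stated assumptions rather than the pipeline used in this study: all function names are hypothetical, the character alphabet for SSE is assumed to be lowercase ASCII, and the scoring, detection, and correction models are passed in as plain callables.

```python
import random
import string
from typing import Callable, List, Optional, Tuple


def synthetic_spelling_errors(
    sentence: str,
    rate: float = 0.003,
    rng: Optional[random.Random] = None,
    alphabet: str = string.ascii_lowercase,  # assumption: noise characters are a-z
) -> str:
    """SSE: with probability `rate` per character, apply one random character edit."""
    if rng is None:
        rng = random.Random()
    chars = list(sentence)
    out: List[str] = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if rng.random() < rate:
            op = rng.choice(["delete", "insert", "replace", "transpose"])
            if op == "delete":
                pass  # emit nothing for this character
            elif op == "insert":
                out.extend([rng.choice(alphabet), c])
            elif op == "replace":
                out.append(rng.choice(alphabet))
            elif i + 1 < len(chars):
                out.extend([chars[i + 1], c])  # swap with the next character
                i += 1  # the next character has already been emitted
            else:
                out.append(c)  # last character: nothing to transpose with
        else:
            out.append(c)
        i += 1
    return "".join(out)


def rerank_r2l(
    candidates: List[Tuple[str, float]], r2l_score: Callable[[str], float]
) -> str:
    """R2L: pick the candidate maximizing left-to-right score + right-to-left score."""
    return max(candidates, key=lambda cand: cand[1] + r2l_score(cand[0]))[0]


def correct_with_sed(
    source: str, has_error: Callable[[str], bool], gec: Callable[[str], str]
) -> str:
    """SED: run the GEC model only when the detector flags the sentence."""
    return gec(source) if has_error(source) else source


# Toy usage with made-up scores in place of real models.
n_best = [("He go home .", -1.0), ("He goes home .", -1.2)]
toy_r2l = {"He go home .": -2.0, "He goes home .": -0.5}
best = rerank_r2l(n_best, lambda sent: toy_r2l[sent])
```

In the toy example the second candidate wins after re-ranking because its summed score (-1.7) exceeds that of the first (-3.0), even though the left-to-right models alone preferred the first.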

Conclusion
In this study, we investigated several aspects of incorporating pseudo data for GEC. Through extensive experiments, we found the following to be effective: (i) utilizing Gigaword as the seed corpus, and (ii) pretraining the model with BACKTRANS (NOISY) data. Based on these findings, we proposed suitable settings for GEC. We demonstrated the effectiveness of our proposal by achieving state-of-the-art performance on the CoNLL-2014 test set and the BEA-2019 test set.