Improving Grammatical Error Correction with Machine Translation Pairs

We propose a novel data synthesis method to generate diverse error-corrected sentence pairs for improving grammatical error correction, which is based on a pair of machine translation models (e.g., Chinese to English) of different qualities (i.e., poor and good). The poor translation model can resemble the ESL (English as a second language) learner and tends to generate translations of low quality in terms of fluency and grammaticality, while the good translation model generally generates fluent and grammatically correct translations. With the pair of translation models, we can generate unlimited numbers of poor to good English sentence pairs from text in the source language (e.g., Chinese) of the translators. Our approach can generate various error-corrected patterns and nicely complement the other data synthesis approaches for GEC. Experimental results demonstrate the data generated by our approach can effectively help a GEC model to improve the performance and achieve the state-of-the-art single-model performance in BEA-19 and CoNLL-14 benchmark datasets.


Introduction
Most successful approaches for automated grammatical error correction (GEC) treat the task as a sequence-to-sequence problem, regarding the input ungrammatical sentences as the source language and the output corrected sentences as the target language. Building on recent advances in neural sequence-to-sequence learning architectures (Sutskever et al., 2014; Vaswani et al., 2017), neural machine translation (NMT) based GEC models (Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018; Ge et al., 2018) achieved state-of-the-art performance on this task. While neural models yield good performance in the English GEC task with relatively large parallel corpora containing over 1 million sentence pairs, this amount is still insufficient for training big transformer models. More importantly, the data scarcity problem is more severe for automated grammatical error correction in other languages, and unsupervised training of GEC models is thus an important problem. Considerable previous study (Ge et al., 2018 [...] (Watcharapunyawong and Usaha, 2013; Bhela, 1999; Derakhshan and Karimi, 2015) which investigate the interference of the native language of ESL (English as a second language) learners on English learning. They [...]

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Source (Chinese): [text not recoverable]
Beginner Translator: You have difficulty to go to the police.
Advanced Translator: Go to the police if you have trouble.
Reference: You can turn to the police when having trouble.
Table 1: Examples of translation results generated by the beginner and advanced translators, respectively. The beginner translator is implemented with a phrase-based statistical machine translation model with a lower language model weight, while the advanced translator is a transformer-based neural machine translation model. The beginner translator tends to translate its source language literally into English, which resembles the way an English learner writes English sentences, while the advanced translator is capable of generating fluent and grammatically correct sentences.

Figure 1: Examples of translations generated by the beginner and advanced translators. The beginner translator is implemented with a phrase-based statistical machine translation model with a lower language model weight, while the advanced translator is a state-of-the-art neural machine translation model. The beginner translator tends to translate its source language literally into English, which resembles the way an English learner writes English sentences, while the advanced translator is capable of generating fluent and grammatically correct sentences. By pairing the results of the beginner and advanced translators, we can harvest unlimited grammatically improved sentence pairs, as the red dashed arrow shows.

Recent work on grammatical error correction (GEC) has shown that synthetic error-corrected data is helpful for improving GEC models (Junczys-Dowmunt et al., 2018; Ge et al., 2018; Zhao et al., 2019; Lichtarge et al., 2019). However, the error patterns generated by the existing data synthesis approaches tend to be limited by either pre-defined rule sets or the seed error-corrected training data (e.g., for back-translation). To generate more diverse error patterns and further improve GEC training, we propose a novel data synthesis approach for GEC, which employs two machine translation (MT) models of different qualities.
The main idea of our approach is demonstrated in Figure 1: we use a beginner and an advanced machine translation model to translate the same sentence in a bridge language (e.g., Chinese) into English, and pair the two outputs as a pseudo error-corrected sentence pair. This idea is motivated by studies in English language learning theory (Watcharapunyawong and Usaha, 2013; Bhela, 1999; Derakhshan and Karimi, 2015), which found that ESL (English as a second language) learners tend to compose an English sentence by literally translating from their native language, with little awareness of English grammar and idiomatic usage.
In our approach, we develop a phrase-based statistical machine translation (SMT) model but reduce its language model weight to make it act as the beginner translator. With the reduced language model weight, the SMT model becomes less aware of English grammar and idiomatic usage, which simulates the behavior of ESL learners and produces less fluent translations that may contain grammatical errors. On the other hand, we employ a state-of-the-art neural machine translation model as the advanced translator, which tends to produce fluent and grammatically correct translations. In this way, we can generate diverse error patterns without being limited by a pre-defined rule set or seed error-corrected data.
We conduct experiments on both the BEA-19 (Bryant et al., 2019) and the CoNLL-2014 (Ng et al., 2014) benchmark datasets to evaluate our approach. Experimental results show that our approach effectively improves the performance of GEC models when used alone or combined with existing data synthesis approaches.

Background: SMT vs NMT
In this section, we briefly introduce both SMT and NMT models and discuss some of their characteristics which motivate the proposed approach.
The phrase-based statistical machine translation model is based on the noisy channel model. It uses Bayes' rule to reformulate the probability of translating a foreign sentence f into an English sentence e as: argmax_e p(e|f) = argmax_e p(f|e) p(e), where p(e) corresponds to an English language model and p(f|e) is a separate phrase-based translation model. In practice, an SMT model combines the translation model with a language model using weights tuned on a validation set. The role of the language model in SMT is to make the generated translation more natural and grammatically correct. As a result, a lower language model weight may yield translations that are adequate but of low fluency and contain many grammatical errors, which motivates us to manually decrease the language model weight in the SMT model.
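The decomposition above, and the effect of the language model weight, can be written out as follows. The two-feature log-linear form with weights \lambda_{TM} and \lambda_{LM} is a simplification assumed here for illustration; a full phrase-based system typically combines more feature functions.

```latex
\begin{align*}
\hat{e} &= \operatorname*{argmax}_{e} \; p(e \mid f)
         = \operatorname*{argmax}_{e} \; \frac{p(f \mid e)\, p(e)}{p(f)}
         = \operatorname*{argmax}_{e} \; p(f \mid e)\, p(e), \\
\hat{e}_{\lambda} &= \operatorname*{argmax}_{e} \;
   \bigl[\, \lambda_{\mathrm{TM}} \log p(f \mid e) + \lambda_{\mathrm{LM}} \log p(e) \,\bigr].
\end{align*}
```

The denominator p(f) is dropped because it does not depend on e. With \lambda_{TM} = \lambda_{LM} = 1 the second line reduces to the first; shrinking \lambda_{LM} lets the translation model dominate, so adequacy is preserved while fluency is penalized less.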
In contrast, neural machine translation models based on the sequence-to-sequence architecture are optimized by directly maximizing the likelihood p(e|f) of the target sentences given the source sentences, so they learn to model the target language and to translate at the same time. This enables NMT models to outperform SMT models and generate translations that are both adequate and fluent.
We conduct a preliminary experiment to support this intuition. In the experiment, we evaluate the fluency of translations produced by the SMT and NMT models. The two models translate the same set of Chinese sentences into English, and the fluency of the resulting translations is measured by their perplexity under GPT-2 (Radford et al., 2019), a pretrained language model. The perplexity of the translations produced by the SMT and NMT models is 24.4 and 15.7, respectively; the lower perplexity of the NMT output confirms our assumption.
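As a reminder of what the perplexity comparison measures, here is a minimal sketch that turns per-token log-probabilities into a perplexity score. It assumes the natural-log probabilities have already been obtained from a language model such as GPT-2; how they are extracted, and whether the scores in the text were averaged per token or per sentence, is not specified in the text, so the token-weighted corpus average below is an assumption.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sentence: exp of the negative mean log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(sentences: Sequence[Sequence[float]]) -> float:
    """Token-weighted perplexity over several sentences (assumed averaging)."""
    total = sum(sum(s) for s in sentences)
    count = sum(len(s) for s in sentences)
    if count == 0:
        raise ValueError("need at least one token log-probability")
    return math.exp(-total / count)
```

Lower perplexity means the language model finds the text more predictable, which is used here as a proxy for fluency.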

Method
In this paper, motivated by the data scarcity problem in the GEC task and the comparison between SMT and NMT models, we propose a novel data synthesis method for training GEC models with a pair of translators: a beginner translator and an advanced translator. The beginner translator is a relatively poor machine translation model which tends to generate unnatural translations containing many grammatical errors. The advanced translator, in contrast, is well trained and generally outputs fluent and grammatically correct translations. Both translation models are trained to translate sentences from the same bridge language into the target language in which we want to train the GEC model (e.g., English). After training the pair of translation models, we synthesize pseudo-parallel data for training a GEC model by feeding sentences from monolingual corpora in the bridge language into both models and taking the outputs of the beginner translator and the advanced translator as the source and target sentences for the GEC task, respectively. We choose Chinese as the bridge language in our experiments because it is less similar to English and thus may cover more error patterns similar to those produced by non-native English speakers. The overall architecture of the proposed method is illustrated in Figure 2.
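The pairing step described above can be sketched in a few lines. The names `synthesize_pairs`, `beginner`, and `advanced` are hypothetical; the two translators are passed in as plain callables standing in for the SMT and NMT systems, which are not implemented here.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A translator maps one bridge-language sentence to one English sentence.
Translator = Callable[[str], str]


def synthesize_pairs(
    bridge_sentences: Iterable[str],
    beginner: Translator,
    advanced: Translator,
) -> Iterator[Tuple[str, str]]:
    """Yield (erroneous source, corrected target) pairs for GEC training."""
    for sentence in bridge_sentences:
        source = beginner(sentence)   # low-fluency translation: GEC input
        target = advanced(sentence)   # fluent translation: GEC output
        yield source, target


# Toy stand-ins using the English outputs from the Table 1 example.
toy_beginner = {"zh-1": "You have difficulty to go to the police."}
toy_advanced = {"zh-1": "Go to the police if you have trouble."}
pairs = list(synthesize_pairs(["zh-1"], toy_beginner.get, toy_advanced.get))
```

Replacing `advanced` with a lookup into the reference side of a parallel corpus gives the "oracle translator" variant of the same loop.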

Beginner Translator
The beginner translator is expected to produce translations that preserve the meaning of the input sentences but are of low fluency and contain many grammatical errors. Motivated by the observation that phrase-based statistical machine translation models usually yield adequate but non-fluent translations which contain many grammatical errors (Wang et al., 2017), we propose to implement the beginner translator with a phrase-based statistical machine translation model. Indeed, we find that translations generated by SMT models sometimes resemble text written by non-native speakers: both contain meaningful phrases that are combined in an unnatural way and are sometimes grammatically incorrect.
In addition, based on the observation in a previous study (Qiu and Park, 2019) that synthesized parallel data helps GEC training more effectively when the source sentences are of lower fluency, we propose to manually reduce the weight of the language model in the tuned beginner translator (i.e., the statistical machine translation model). The SMT model combines a phrase dictionary, which tends to make translations more adequate, and a language model, which makes translations more fluent and grammatically correct, with tuned weights indicating their relative importance. Reducing the language model weight in the beginner translator results in translations with more grammatical errors, which may help train GEC models. We present an example of translations generated by the same SMT model with different language model weights in Table 1.

Advanced Translator
In contrast to the beginner translator, the advanced translator in the proposed method is expected to produce "valid translations" which are meaning-preserving, fluent, and grammatically correct. Neural machine translation models with a sequence-to-sequence architecture (Vaswani et al., 2017) are thus considered suitable for implementing the advanced translator.
In addition, as parallel corpora for machine translation are generally larger and cheaper to construct, it would be helpful if this large amount of data could be used for training GEC models. Our method can easily convert parallel corpora between the bridge language and English into GEC training data by taking the beginner translator's translations of the bridge-language sentences as source sentences and using the ground-truth translations in the parallel corpora as an "oracle translator", which takes the place of the NMT-based advanced translator. Synthetic data generated with this approach resembles the data used for training automatic post-editing models for machine translation. The difference is that our source sentences are translations of low fluency that contain many grammatical errors, whereas source sentences for training automatic post-editing models are generally less adequate but grammatically correct.

Evaluation
We conduct experiments on the BEA-19 shared task on GEC and also report results on the CoNLL-2014 test set. As our primary goal is to explore and analyze the effect of pretraining with synthetic parallel data generated by the proposed approach, we do not incorporate the extensive tricks used in most state-of-the-art GEC models, including iterative decoding, model ensembling, the edit-weighted MLE objective, right-to-left re-ranking, and external spell checkers. As these tricks are orthogonal to our data synthesis method and most previous GEC models that use them are not open-sourced, we do not compare our model against them and focus only on the influence of data synthesis methods on GEC models. Following previous work (Junczys-Dowmunt et al., 2018; Lichtarge et al., 2019), we evaluate trained GEC models by F0.5 on the test sets, using the official scripts for both datasets.
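For reference, F0.5 combines precision and recall while weighting precision more heavily than recall. The sketch below shows only that final combination; the official scorers derive precision and recall from edit-level matches, which is not reproduced here, and the function name `f_beta` is our own.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a system with precision 0.6 and recall 0.3 scores 0.5, while swapping the two values gives a lower score, reflecting the preference for precise corrections.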

Beginner Translator Model
In this work, we use Moses (Koehn et al., 2007), an open-source toolkit for statistical machine translation, to implement the phrase-based SMT model for the beginner translator. We use MGIZA++ for word alignment and KenLM for training a trigram language model on the target sentences of the parallel corpora. We tune the weights of each component in the Moses system (e.g., phrase table, language model) using MERT (Och, 2003) to optimize the system's BLEU (Papineni et al., 2002) on separate development data. After tuning, we create two replicas of the tuned model by manually increasing or decreasing the weight of the language model by 50%. We denote the tuned SMT model, which achieves the highest BLEU score, by SMT tuned, and the two replicas with decreased and increased language model weight by SMT low and SMT high, respectively.
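The effect of rescaling the language model weight can be illustrated with a toy two-feature log-linear scorer. Everything in this sketch is an assumption for illustration: the feature names `tm` and `lm`, the numeric scores, and the reading of "by 50%" as multiplying the tuned weight by 0.5 or 1.5. It does not reproduce the Moses configuration or decoder.

```python
from typing import Dict, List, Tuple

# (text, translation-model log-score, language-model log-score); toy values.
Hypothesis = Tuple[str, float, float]


def scale_lm_weight(weights: Dict[str, float], factor: float) -> Dict[str, float]:
    """Return a copy of the weights with the language model weight rescaled."""
    scaled = dict(weights)
    scaled["lm"] = weights["lm"] * factor
    return scaled


def best_hypothesis(hypotheses: List[Hypothesis], weights: Dict[str, float]) -> str:
    """Pick the hypothesis with the highest weighted log-linear score."""
    def score(h: Hypothesis) -> float:
        return weights["tm"] * h[1] + weights["lm"] * h[2]
    return max(hypotheses, key=score)[0]


tuned = {"tm": 1.0, "lm": 0.5}            # stand-in for the tuned weights
smt_low = scale_lm_weight(tuned, 0.5)     # language model weight decreased
smt_high = scale_lm_weight(tuned, 1.5)    # language model weight increased

hyps = [
    ("You have difficulty to go to the police.", -2.0, -12.0),  # literal
    ("Go to the police if you have trouble.", -4.0, -6.0),      # fluent
]
```

With these invented scores, the tuned and increased weights select the fluent hypothesis, while the decreased weight lets the more literal, less fluent one win, which is the behavior the beginner translator is meant to exhibit.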

Advanced Translator Model
We use a Transformer-based NMT model as the advanced translator. Specifically, we use the "transformer big" architecture, which uses 6 layers for both the encoder and the decoder, 16 attention heads, embedding size d_model = 1024, and a position-wise feed-forward network at every layer with inner size d_ff = 4096. Chinese sentences are segmented into words, and English word tokens are split into subwords using byte-pair encoding (Sennrich et al., 2015).

GEC Model
In our work, we use the same "transformer big" architecture for our GEC model, with tied output layer, decoder embeddings, and encoder embeddings. Following previous work, we use the Adam optimizer with the learning rate set to 0.0002 and linear warm-up for the first 8k updates. Both input and output sentences are tokenized with byte-pair encoding using shared codes.
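A linear warm-up of this kind can be sketched as below. The peak rate (0.0002) and warm-up length (8k updates) come from the text; holding the rate constant after warm-up is an assumption made here, since the text does not say which schedule follows, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(step: int, peak_lr: float = 2e-4, warmup_steps: int = 8000) -> float:
    """Linear warm-up from 0 to peak_lr, then constant (assumed) afterwards."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr  # assumption: no decay after warm-up
```

In practice the post-warm-up phase is often a decay schedule; swapping the final `return` for such a schedule leaves the warm-up part unchanged.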

Data
We train the translation models on the Chinese-English parallel data in the UN Corpus (Ziemski et al., 2016), which contains approximately 15M Chinese-English parallel sentence pairs with around 400M tokens. The monolingual Chinese corpus used to synthesize GEC data is collected from news2016zh (Xu, 2019), a news corpus containing 2.5M Chinese news articles.
In our experiments, we synthesize 10M pseudo-parallel sentence pairs with the pair of beginner and advanced translators for unsupervised GEC training. With the available Chinese-English parallel corpora, we additionally synthesize 10M sentence pairs by pairing the beginner translator's output with the ground-truth translation, as SMT models decode much faster than large NMT models.
In addition, to compare our approach with the commonly used corruption-based data synthesis approach, we also randomly corrupt monolingual sentences from the NewsCrawl dataset following the approach of Zhao et al. (2019) and synthesize 40M sentence pairs as pseudo-parallel data for unsupervised GEC training. A potential issue is that, in contrast to existing approaches, the target sentences in the corpora synthesized by our approach are generated rather than human-written and thus may also contain grammatical errors. This may introduce some noise on the target side and affect the precision of the trained GEC model.
Following previous work (Ge et al., 2018), we filter the generated corpora based on fluency, measured by perplexity under a language model, on both the source and target sides: we filter out all sentence pairs in which the target sentence is less fluent than the source sentence. We also discard the 20% of the remaining sentence pairs with the lowest fluency improvement from source sentence to target sentence.
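The two filtering steps can be sketched as follows. The function name `filter_by_fluency` and the `ppl` callable are hypothetical, and measuring "fluency improvement" as the drop in perplexity from source to target is an assumption; the text does not give the exact formula.

```python
from typing import Callable, Iterable, List, Tuple


def filter_by_fluency(
    pairs: Iterable[Tuple[str, str]],
    ppl: Callable[[str], float],
    drop_fraction: float = 0.2,
) -> List[Tuple[str, str]]:
    """Keep pairs whose target is at least as fluent as the source, then
    drop the given fraction with the smallest fluency improvement."""
    # Improvement = source perplexity - target perplexity (assumed measure).
    scored = [(ppl(s) - ppl(t), i, s, t) for i, (s, t) in enumerate(pairs)]
    # Step 1: remove pairs where the target is less fluent than the source.
    improved = [x for x in scored if x[0] >= 0.0]
    # Step 2: discard the lowest-improvement fraction of what remains.
    improved.sort(key=lambda x: x[0])
    survivors = improved[int(len(improved) * drop_fraction):]
    # Restore the original corpus order.
    survivors.sort(key=lambda x: x[1])
    return [(s, t) for _, _, s, t in survivors]
```

Any sentence-level perplexity function can be plugged in as `ppl`, for example one built on a pretrained language model.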

Experimental Results
To validate the effectiveness of the proposed data synthesis approach, we train GEC models in different settings: unsupervised training exclusively with synthetic data generated by our approach, unsupervised training with a combination of randomly corrupted monolingual data and data generated by our approach, and fine-tuning of the unsupervised GEC models with GEC parallel corpora.

Performance of translation models
We first investigate the performance of the different translation models used in our experiments to confirm the motivation of our approach. We present the BLEU scores of the different variants of the beginner translator and of the advanced translator on the newstest17 dataset in Table 2. The advanced translator performs much better than the beginner translators. Moreover, manually decreasing the language model weight in the beginner translator results in a worse BLEU score, which may indicate more grammatical errors in the translated sentences.

Results on unsupervised GEC training
We first evaluate the performance of the unsupervised GEC models trained exclusively with data synthesized by our approach. To compare our method fairly against the corruption-based approach, we also train a separate GEC model with the same architecture and hyperparameters on 20M sentence pairs synthesized by corrupting English news articles, which ensures that both the amount and the domain of the training data are as close as possible. We also compare our model with another unsupervised GEC model trained with an unsupervised SMT technique (Katsumata and Komachi, 2019).
In practice, different data synthesis methods are often combined to generate better training data, as they may introduce different error patterns, which may complement each other and improve the coverage of error types in the synthesized corpora. To explore whether the proposed data synthesis approach can be effectively combined with existing approaches, we also conduct an experiment in which we pretrain a GEC model on synthesized data constructed by combining 20M sentence pairs synthesized by our approach with 20M sentence pairs synthesized by random corruption. We also train a GEC model on 40M corrupted sentence pairs for comparison.
The results are shown in Table 3. Unsupervised GEC models trained with our proposed approach outperform both the commonly used corruption-based method and the GEC model based on unsupervised SMT by a large margin. This suggests that error-corrected data synthesized with our approach may contain more realistic errors than data produced with predefined rules. We also find that the synthetic data generated by the two compared approaches complement each other: combining both data sources yields further improvement and outperforms the baseline in which 40M corrupted monolingual sentence pairs are used for unsupervised training.
In addition, comparing beginner translators with different language model weights shows that decreasing the language model weight yields better final performance of the unsupervised GEC models.

Fine-tuning Results
We also conduct experiments to explore to what extent our data synthesis method can improve supervised GEC models, which are trained by fine-tuning the previously trained unsupervised models on parallel corpora. We use the publicly available Lang-8 (Mizumoto et al., 2011) and NUCLE (Dahlmeier et al., 2013) corpora for fine-tuning the pretrained models.
The results are shown in Table 4. The final GEC model pretrained with data generated by our proposed approach outperforms both the GEC model trained exclusively on parallel corpora and the one pretrained with the commonly used corruption-based method by a large margin. This supports the claim that our approach is able to generate more realistic errors. Combining both synthetic data sources yields consistent improvements, which further confirms the usefulness of the proposed approach.
In addition, the influence of the language model weight in the beginner translator on the final performance of GEC models is consistent with that found in the unsupervised training results, which confirms that decreasing the automatically tuned language model weight in the beginner translator may help train GEC models better.

Qualitative Analysis
To better analyze the data synthesized by our approach and understand why it works, we present several examples generated by our approach, together with examples generated by applying random corruption, in Table 5. First, translations produced by the advanced translator are generally of good quality and are very similar to the ground-truth translations. This ensures that target sentences in the synthetic corpora are generally grammatically correct, which is very important for training GEC models. Comparing erroneous sentences generated by our approach and by random corruption, we find that random corruption introduces only a limited set of artificial, token-level errors, such as repetition and deletion of tokens. In contrast, our method is able to introduce much more realistic errors, which resemble those made by ESL learners and include span-level errors that are also very important for training GEC models.
In addition, by comparing translations produced by the beginner translator with different language model weights, we find that when the language model weight in the SMT model is manually increased, the resulting translations tend to be more fluent and contain fewer grammatical errors, but are less adequate and often ignore some information in the source sentence. In contrast, the SMT model with decreased language model weight tends to generate translations which are meaning-preserving but contain many grammatical errors. This may explain why decreasing the language model weight in the beginner translator yields better performance in unsupervised GEC model training.

Ablation Study
As the synthetic data generated with the proposed approach combines target sentences generated by the advanced translator with ground-truth sentences, we perform an ablation study to analyze the relative importance of each component in our method. In this experiment, we use the SMT low configuration for the beginner translator, which tends to yield better performance, and do not add corruption-based synthetic data.
The results are shown in Table 6. Both beginner-advanced translator pairs and beginner-gold translation pairs contribute significantly to GEC model training, because the performance degrades substantially when training without either of them. SMT-gold sentence pairs are slightly more effective, which may be because their target sentences contain less noise than those in SMT-NMT pairs. However, as machine translation parallel corpora are limited in both size and domain, our data synthesis method based on beginner-advanced translator pairs is more general and flexible when used for GEC data synthesis.

Related Work
Parallel error-corrected corpora are limited to fewer than 2 million sentence pairs for English GEC, which is insufficient to train large neural models to achieve better results. Moreover, no existing parallel corpora for GEC in other languages are available. Motivated by the data scarcity problem, various data synthesis methods have been proposed for generating large pseudo-parallel data for pretraining neural GEC models. We introduce these approaches and discuss their pros and cons in this section.

Rule-based Monolingual Corpora Corruption
A straightforward data synthesis method is to corrupt monolingual corpora with pre-defined rules and pretrain neural GEC models as denoising autoencoders (Zhao et al., 2019). The advantage of this approach is that generating parallel data from monolingual corpora is very simple and efficient. However, manually designed rules are limited and cover only a small portion of the grammatical error types produced by ESL learners. Pretraining solely with corrupted monolingual data may thus limit the performance improvement.
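To make the rule-based idea concrete, here is a simplified token-level corruption sketch. It is not the exact procedure of Zhao et al. (2019): the three operations (deletion, repetition, substitution), the shared probability `p`, and the function name `corrupt` are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def corrupt(
    tokens: List[str],
    vocab: List[str],
    p: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Apply deletion, repetition, or substitution to each token,
    each with probability p (illustrative rules only)."""
    if rng is None:
        rng = random.Random()
    out: List[str] = []
    for tok in tokens:
        r = rng.random()
        if r < p:
            continue                       # delete the token
        elif r < 2 * p:
            out.extend([tok, tok])         # repeat the token
        elif r < 3 * p:
            out.append(rng.choice(vocab))  # substitute a random word
        else:
            out.append(tok)                # keep the token unchanged
    return out
```

The clean sentence serves as the target and the corrupted one as the source. Every edit touches a single token, which illustrates why such rules yield token-level rather than span-level errors.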

Back-translation based Error Generation
The main idea of this approach is to train an error generation model by using the existing error-corrected corpora in the opposite direction (Ge et al., 2018). That is, the error generation model is trained to take a correct sentence as input and output an erroneous version of it. The trained error generation model is then used to synthesize error-corrected data by taking monolingual corpora as input. This approach can potentially cover more diverse error types than the rule-based corruption method. However, training a good error generation model also requires a large amount of annotated error-corrected data, so it also suffers from the data scarcity problem.

Data Generation from Round-trip Translations
Round-trip translation (Lichtarge et al., 2019) is an alternative approach to synthesizing GEC data. This approach uses two machine translation models, one from English to a bridge language and the other from the bridge language back to English. The original sentence from the monolingual corpus is the target sentence, and the output of the round-trip translation is the corresponding source sentence. However, according to Lichtarge et al., when good translation models are employed, the resulting source sentences are quite clean and the coverage of error types is thus limited. When relatively poor machine translation models are employed, the resulting source sentences may differ substantially in meaning from the target sentences. Moreover, we observe that the noise introduced by round-trip translation tends to be paraphrasing or information loss rather than grammatical errors.

Data Generation from Wikipedia Revision Histories
Another data synthesis approach is to extract revision histories from Wikipedia. In contrast to the aforementioned approaches, this method is able to collect human-made revisions, which may resemble real error-corrected data better. However, the vast majority of extracted revisions are not grammatical error corrections, which makes the synthesized data quite noisy and requires sophisticated filtering before it can be used for pretraining. Another issue is that the domain of the revision histories is limited and may differ from the target GEC domain.

Discussion
We propose a method to synthesize large parallel corpora for training GEC models. Our approach consists of a pair of translators: a beginner translator and an advanced translator. Unsupervised GEC models can be trained by taking translation pairs generated by the beginner and advanced translators as the source and target sentences.
We conduct experiments and show that unsupervised GEC models trained exclusively on synthetic data generated by our approach yield reasonable performance on the GEC task, and that the performance can be further improved by manually decreasing the weight of the language model in the beginner translator. As our approach can be easily extended to many other languages, it may further benefit GEC in languages for which no GEC corpora are available.
For future work, we plan to investigate the influence of different bridge languages on the performance of unsupervised GEC models. We will also generate larger pseudo-parallel GEC corpora with our approach and investigate whether and to what extent larger synthetic corpora can yield further improvements.

Figure 2: Our approach is composed of a beginner translator, which is modeled by a phrase-based statistical machine translation model, and an advanced translator, which is modeled by a neural machine translation model. The two machine translation models are trained on parallel machine translation corpora from a bridge language to English. Pseudo-parallel GEC corpora are synthesized by taking translation pairs of monolingual corpora in the bridge language from the beginner and the advanced translator as error-corrected sentence pairs for training GEC models. (Best viewed in color.)

Table 4: Performance (i.e., F0.5 score) of supervised GEC models fine-tuned from a model pretrained on various unsupervised synthesized parallel data. GEC models with SMT high, SMT low, and SMT tuned denote variants in which the language model weight in the beginner translator is manually increased, manually decreased, or kept at its tuned value. Ours denotes synthetic data generated with our approach, Corruption denotes synthetic data generated by random corruption, and Fine denotes Lang-8 and NUCLE parallel data.
Translation from SMT high: "In any case , I am satisfied with the performance."
Translation from SMT tuned: "In any event , I have everyone is very satisfactory performance ."
Translation from SMT low: "Regardless of whether such to what , I am very satisfied with both the performance of together."
Translation from NMT: "Anyway , I am satisfied with everyone's performance ."
Ground-truth Translation: "Anyway , I am very satisfied with everyone's performance ."

Translation from SMT high: "In order to participate, in advance to the list ."
Translation from SMT tuned: "For ensure that your regular participants, please sign up the name in advance "
Translation from SMT low: "In order to ensure that the participants in the meeting , requests you to in advance of the list ."

Translation from SMT high: "In general it is truth and unaware of AI."
Translation from SMT tuned: "It is this public when the technology and knowledge of AI is not understand."
Translation from SMT low: "Popular understanding of AI technology not at the time is even more true ."
Translation from NMT: "This is particularly true when the public does not understand AI technology ."
Ground-truth Translation: "This is especially true when the public does not understand AI technology ."
Corrupted Translation: "This was especially true when a public does understand AI technology ."

Source Sentence: "症状严重时, 及时到医院接受治疗."
Translation from SMT high: "Ihe serious symptoms of acute hospital immediately."
Translation from SMT tuned: "When the symptoms of a serious , prompt medical treatment to hospital."
Translation from SMT low: "At the time, in a timely manner, to the serious symptoms to receive treatment in hospitals ."
Translation from NMT: "When symptoms are serious , go to the hospital in time for treatment ."
Ground-truth Translation: "When the symptoms are serious , go to the hospital for treatment in time ."
Corrupted Translation: "When the symptoms serious , go to to the hospital for treatment time ."

Table 5: Examples of translations generated by the beginner translator and the advanced translator, together with the original and the corrupted versions of the ground-truth translation.