Automatic Grammatical Error Correction for Sequence-to-sequence Text Generation: An Empirical Study

Sequence-to-sequence (seq2seq) models have achieved tremendous success in text generation tasks. However, there is no guarantee that they always generate sentences without grammatical errors. In this paper, we present a preliminary empirical study on whether and how much automatic grammatical error correction can help improve seq2seq text generation. We conduct experiments across various seq2seq text generation tasks including machine translation, formality style transfer, sentence compression and simplification. Experiments show that a state-of-the-art grammatical error correction system can improve the grammaticality of generated text and bring task-oriented improvements in tasks where target sentences are in a formal style.


Introduction
Sequence-to-sequence (seq2seq) text generation (Cho et al., 2014; Sutskever et al., 2014) has attracted growing attention in natural language processing (NLP). Despite various advantages of seq2seq models, they tend to have a weakness: there is no guarantee that they always generate sentences without grammatical errors. Table 1 shows examples with grammatical errors generated by seq2seq models in various tasks.
One valid solution to this challenge is conducting grammatical error correction (GEC) for machine-generated sentences. Recent GEC systems (Chollampatt and Ng, 2018; Ge et al., 2018a,b) can achieve human-level performance on GEC benchmarks. We are curious whether they can help improve seq2seq based natural language generation (NLG) models. We therefore propose an empirical study on GEC post editing for various text generation tasks (i.e., machine translation, style transfer, sentence compression and simplification) using both automatic and human evaluation methods. Experimental results demonstrate that a state-of-the-art GEC system is helpful for improving the grammaticality of generated text and that it can bring task-oriented improvements in the tasks where target sentences are in a formal style.
The contributions of this paper are twofold:
• We present an empirical study on GEC post editing for seq2seq text generation. To the best of our knowledge, it is the first work to study improving seq2seq based NLG models using GEC.
• We show some interesting results by thoroughly comparing and analyzing GEC post editing for various seq2seq text generation tasks, shedding light on the potential of GEC for NLG.

Sequence-to-sequence Text Generation
The sequence-to-sequence (seq2seq) framework has been proven successful for many NLP tasks. Given a source sentence x_s, a seq2seq model learns to predict its target sentence x_t. It usually has an encoder to learn the representation of x_s and a decoder to generate x_t based on the encoded representation of x_s. The model is usually trained by minimizing the negative log-likelihood of the training source-target sentence pairs. During inference, an output sequence x_o is generated (one token at a time) with beam search by maximizing P_Θ(x_o | x_s).
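The decoding procedure above can be sketched as a minimal beam search. This is an illustrative stand-in, not the decoder of any system in this paper: `step_fn` is a hypothetical interface returning next-token log-probabilities for a prefix, and the toy model below exists only to make the sketch runnable.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=10):
    """Minimal beam search: keep the beam_size highest-scoring partial
    sequences, extending each with every candidate next token per step.
    step_fn(prefix) must return {token: log_prob} for the next token."""
    beams = [([bos], 0.0)]  # (token sequence, cumulative log-likelihood)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:          # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # approximate argmax of P(x_o | x_s): keep the top-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]

# Toy "model" (pure illustration): prefers token "a", then ends the sentence.
def toy_step(prefix):
    if len(prefix) < 3:
        return {"a": math.log(0.7), "b": math.log(0.3)}
    return {"</s>": math.log(0.9), "a": math.log(0.1)}

best_seq, best_score = beam_search(toy_step, "<s>", "</s>")
```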

Automatic Grammatical Error Correction
Most recent GEC systems are based on the seq2seq framework and are trained with error-corrected sentence pairs. Owing to massive training data, the state-of-the-art GEC system (Ge et al., 2018b) can achieve human-level performance on GEC benchmarks and be practically used for correcting grammatical errors.

Experiments and Evaluation
We use the state-of-the-art GEC system (Ge et al., 2018b) as our GEC model, which is a 7-layer convolutional seq2seq model trained with a fluency boost learning strategy on both the original GEC training data and augmented fluency boost sentence pairs. We use the GEC model to post-edit sentences decoded by a seq2seq model to test whether GEC improves the results. We choose machine translation, style transfer, sentence compression and simplification as typical seq2seq text generation tasks. Due to the page limit, the detailed configurations of the models we implemented in this section are provided in the supplementary notes.
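The post-editing setup can be sketched as a simple pipeline: run the base seq2seq model, then pass each decoded sentence through GEC. The `gec_correct` argument is a placeholder for any sentence-level GEC system (the toy replacement below is a stand-in, not the actual model of Ge et al., 2018b); the edit rate mirrors the "proportion of sentences modified by GEC" reported later.

```python
def postedit_with_gec(generated, gec_correct):
    """Apply GEC as a post-editing step to seq2seq outputs.
    `gec_correct` is any sentence -> sentence correction function.
    Returns corrected sentences plus the fraction actually modified."""
    corrected = [gec_correct(s) for s in generated]
    edited = sum(orig != fixed for orig, fixed in zip(generated, corrected))
    edit_rate = edited / len(generated) if generated else 0.0
    return corrected, edit_rate

# Stand-in GEC that fixes one agreement error, purely for illustration.
toy_gec = lambda s: s.replace("He go ", "He goes ")
outs = ["He go to school .", "She reads a book ."]
fixed, rate = postedit_with_gec(outs, toy_gec)
```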

Machine translation
We take machine translation (MT) as the main task to study whether GEC helps improve translation quality. We conduct experiments by using GEC to edit the results of the state-of-the-art neural machine translation (NMT) system (Google Translate) on the French-English (FR-EN) news test set in WMT14 and the German-English (DE-EN) and Chinese-English (ZH-EN) news test sets in WMT17. Table 2 shows BLEU with/without post-editing by the GEC system. Although GEC post-editing does not improve BLEU much, when we look into the results by analyzing the sentences edited by GEC, we observe that only a small proportion of sentences are modified by the GEC system: approximately 5% in the FR-EN and DE-EN test sets, and 10% in the ZH-EN test set. The sentence-level BLEU of around 50% of the edited sentences is improved, while the remaining sentences suffer a BLEU decrease.
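The per-sentence analysis above amounts to scoring each edited sentence against its reference before and after GEC and counting gains versus losses. The sketch below uses a deliberately simplified sentence-level BLEU (up to bigrams, add-one smoothing) as a stand-in for the real metric; only the comparison logic is the point.

```python
import math
from collections import Counter

def sent_bleu(hyp, ref, max_n=2):
    """Simplified sentence-level BLEU (unigrams + bigrams, add-one
    smoothing, brevity penalty); a stand-in for the standard metric."""
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        precisions.append((match + 1) / (total + 1))  # add-one smoothing
    bp = math.exp(min(0.0, 1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def compare_edits(before, after, refs):
    """Count how many GEC-edited sentences gained or lost BLEU."""
    up = down = 0
    for b, a, r in zip(before, after, refs):
        if b == a:
            continue  # untouched by GEC; not an edited sentence
        delta = sent_bleu(a.split(), r.split()) - sent_bleu(b.split(), r.split())
        up += delta > 0
        down += delta < 0
    return up, down

up, down = compare_edits(["he go home ."], ["he goes home ."], ["he goes home ."])
```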
To understand the reasons for the BLEU changes, we manually check each sentence edited by GEC in the WMT14 FR-EN dataset and show the results in Table 4. The main reason for a BLEU improvement (90.5% of cases) is that GEC corrects errors in the NMT results and improves translation quality. In contrast, the reasons for BLEU decreases vary. First, correcting grammatical errors may decrease BLEU even though it improves the sentence's grammaticality, as shown in Table 4. Second, the GEC system is not perfect: it sometimes edits a sentence that has no grammatical errors. Even though such edits usually bring no adverse effects, they are likely to decrease BLEU. Last, we find that reference sentences occasionally have grammatical errors, as the Reference Error row in Table 4 shows. When GEC fixes the errors in such cases, BLEU decreases.
Moreover, we test the effects of GEC on MT in a low-resource setting. We use the state-of-the-art unsupervised SMT and NMT models in Ren et al. (2019) and use the GEC system to edit their results. According to the results shown in Table 3, the unsupervised MT systems benefit more from GEC than the state-of-the-art supervised NMT system (i.e., Google Translate), because they are more likely to generate disfluent sentences, which GEC can address.
We also conduct experiments on the WMT17 Automatic Post-Editing (APE) task. However, we observe a large number of grammatical errors in the references which make the automatic evaluation less reliable. We include the results in the supplementary notes due to the page limit.

Formality style transfer
In addition to MT, we test GEC on the text style transfer task. We study formality style transfer, which transfers an informal (formal) sentence to a formal (informal) style, and choose the GYAFC corpus (Rao and Tetreault, 2018) as our testbed. We use a 2-layer transformer model as our base model and train a model with approximately 100K parallel sentences in the GYAFC corpus for informal→formal and formal→informal transfer, respectively. We use the GEC model to edit the base models' outputs and show the results in Table 5. While GEC improves BLEU in both transfer directions, we observe differences when we look into style accuracy. For Informal→Formal transfer, accuracy is improved (83.0% → 84.2%) after GEC post editing; while for Formal→Informal transfer, it decreases (68.7% → 47.1%) because grammaticality improvements by GEC may make a sentence read less like an informal sentence.
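Style accuracy as used above is simply the fraction of outputs that a style classifier labels with the intended style. The paper does not specify the classifier, so the sketch below treats it as a pluggable function; the rule-based stand-in (contractions imply informality) is purely illustrative.

```python
def style_accuracy(sentences, classifier, target_style):
    """Fraction of outputs that `classifier` labels with the target style.
    `classifier` stands in for any trained formality classifier."""
    if not sentences:
        return 0.0
    hits = sum(classifier(s) == target_style for s in sentences)
    return hits / len(sentences)

# Illustrative rule-based stand-in: contractions/slang count as informal.
toy_clf = lambda s: "informal" if ("'" in s or "gonna" in s) else "formal"
outs = ["I am going to leave now .", "I'm gonna go ."]
acc = style_accuracy(outs, toy_clf, "formal")
```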

Sentence compression and simplification
We also test the effects of GEC post-editing on sentence compression and simplification. For sentence compression, following Filippova et al. (2015), we train a 2-layer LSTM seq2seq model, which generates a 0/1 sequence to indicate whether to delete a word, as our base model and test on Google's sentence compression dataset (GoogComp). For sentence simplification, we use the state-of-the-art deep reinforcement learning model DRESS (Zhang and Lapata, 2017) as our base model and test on the Newsela text simplification dataset. Table 6 shows the results for the effects of GEC on sentence compression and simplification. For sentence compression, BLEU decreases from 60.38 to 58.77 after GEC post editing. We manually analyze the results and find many grammatical errors in the reference sentences. This is not surprising, since the reference sentences were constructed with an automatic approach (Filippova and Altun, 2013). The grammatical errors in the references affect the BLEU evaluation and make it less reliable.
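The deletion-based compression scheme above reduces decoding to emitting one keep/delete label per input token. A minimal sketch, assuming a predicted 0/1 mask (the mask below is a hypothetical model output, not from the actual LSTM):

```python
def apply_deletion_mask(tokens, mask):
    """Sentence compression as token deletion, in the style of
    Filippova et al. (2015): the model emits a 0/1 label per token
    (1 = keep); this helper just applies such a predicted mask."""
    assert len(tokens) == len(mask), "one label per input token"
    return [t for t, keep in zip(tokens, mask) if keep]

sent = "the quick brown fox jumps over the lazy dog".split()
mask = [1, 0, 0, 1, 1, 1, 1, 0, 1]  # hypothetical model prediction
compressed = " ".join(apply_deletion_mask(sent, mask))
```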
The BLEU decrease is also observed in the sentence simplification task, but for a different reason. In the Newsela dataset, the reference sentences are written by humans and therefore have much fewer grammatical errors compared to GoogComp. In contrast to sentence compression, where reference errors are the main reason for the BLEU decrease, the BLEU decrease in sentence simplification usually happens in cases where the correction of grammatical errors reduces the sentence's n-gram overlap with the reference sentence, as shown in Table 6 (similar to the phenomenon observed in the experiments for MT; see Table 4). In addition, GEC errors and occasional errors in reference sentences lead to a decrease of BLEU after GEC post editing.

[Table 6: Results for sentence compression and sentence simplification. As in Table 4, the numbers in round brackets following the example sentences are sentence-level BLEU. Example (Reference Error, 5.5%) — Base: Richie wrote the winning word "magician." (35.5); GEC: Richie wrote the winning word "magician". (7.9); REF: The winning word was "magician."]

Human Evaluation
In addition to automatic evaluation (e.g., BLEU), we present human evaluation results for GEC post editing on the tasks. The evaluation includes two aspects: first, we evaluate how helpful GEC is for improving the grammaticality of sentences generated by the seq2seq models, which is independent of a specific task; second, we evaluate whether GEC's edits bring task-oriented improvements. The evaluation is done by a human judge through comparing the results with/without GEC's edits. Table 7 shows the human evaluation results. For most sentences edited by GEC, grammaticality is improved, while bad cases account for only a small proportion (≤10%) in all six tasks. In contrast, the task-oriented improvements vary across the tasks. For example, for Informal→Formal style transfer, GEC performs well because most of its edits improve the sentences' grammaticality and make the sentences more formal; in contrast, for Formal→Informal style transfer, GEC improves sentences' grammaticality but affects their styles, making them less informal.
Moreover, by comparing the results of supervised and unsupervised MT, we observe that GEC is more beneficial to seq2seq models trained in a low-resource setting, which is consistent with the results in Table 3. For sentence compression and simplification, many grammatical improvements do not bring task-oriented improvements, because the parts GEC edits are not the content that should be kept in the results. Also, it is notable that except for Formal→Informal style transfer, whose target sentences should be in an informal style, GEC brings far more improvements than adverse effects on the tasks, demonstrating the potential of GEC for NLG.

Related Work and Discussion
The most related work to ours is automatic post editing (APE) (Bojar et al., 2016), which has been extensively studied for MT (e.g., Pal et al., 2016, 2017; Chatterjee et al., 2017; Hokamp, 2017; Tan et al., 2017) in the past few years. These APE approaches are usually trained with source-language input data, target-language MT output and target-language post editing (PE) data. Although these APE models and systems have proven successful in improving MT results, they are task-specific and cannot be used for other NLG tasks.
In contrast, we propose a general post editing approach by applying the current state-of-the-art GEC system to editing the outputs of NLG systems. To the best of our knowledge, this is the first attempt to explore improving seq2seq based NLG models with a state-of-the-art neural GEC system, despite some early studies on post-processing SMT outputs using a (mainly rule-based) grammar checker (Stymne and Ahrenberg, 2010). Experiments show GEC post editing can effectively improve the grammaticality of generated text and lead to a task-oriented improvement in the NLG tasks where target sentences are in a formal style, especially in a low-resource setting.