Fluency Boost Learning and Inference for Neural Grammatical Error Correction

Most neural sequence-to-sequence (seq2seq) models for grammatical error correction (GEC) have two limitations: (1) a seq2seq model may not be well generalized with only limited error-corrected data; (2) a seq2seq model may fail to completely correct a sentence with multiple errors through normal seq2seq inference. We attempt to address these limitations by proposing a fluency boost learning and inference mechanism. Fluency boost learning generates fluency boost sentence pairs during training, enabling the error correction model to learn how to improve a sentence's fluency from more instances, while fluency boost inference allows the model to correct a sentence incrementally through multiple inference steps until the sentence's fluency stops increasing. Experiments show our approaches improve the performance of seq2seq models for GEC, achieving state-of-the-art results on both the CoNLL-2014 and JFLEG benchmark datasets.


1 Introduction
Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014) for grammatical error correction (GEC) have drawn growing attention (Xie et al., 2016; Ji et al., 2017; Schmaltz et al., 2017; Chollampatt and Ng, 2018) in recent years. However, most seq2seq models for GEC have two flaws. First, the seq2seq models are trained with only limited error-corrected sentence pairs like those in Figure 1(a). Limited by the size of the training data, models with millions of parameters may not be well generalized. Thus, it is common that the models fail to correct a sentence perfectly even if the sentence is only slightly different from a training instance, as illustrated by Figure 1(b). Second, the seq2seq models usually cannot perfectly correct a sentence with many grammatical errors through single-round seq2seq inference, as shown in Figures 1(b) and 1(c), because some errors in a sentence may make the context strange, which confuses the model when it corrects the other errors.

[Figure 1: (a) an error-corrected sentence pair for training (She see Tom is catched by policeman in park at last night. → She saw Tom caught by a policeman in the park last night.); (b) if the sentence becomes slightly different, the model fails to correct it perfectly; (c) single-round seq2seq inference cannot perfectly correct the sentence, but multi-round inference can.]
To address the above-mentioned limitations in model learning and inference, this paper proposes a novel fluency boost learning and inference mechanism, illustrated in Figure 2.
For fluency boost learning, a seq2seq model is not only trained with the original error-corrected sentence pairs; during training it also generates less fluent sentences (e.g., from its n-best outputs) and pairs them with their correct sentences to establish new error-corrected sentence pairs, as long as the generated sentences' fluency is below that of their correct sentences, as Figure 2(a) shows. We call the generated error-corrected sentence pairs fluency boost sentence pairs, because the sentence on the target side always improves in fluency over the one on the source side. The fluency boost sentence pairs generated during training are used as additional training instances in subsequent training epochs, allowing the error correction model to see more grammatically incorrect sentences during training and accordingly improving its generalization ability.

[Figure 2: Fluency boost learning and inference: (a) given a training instance (i.e., an error-corrected sentence pair), fluency boost learning establishes multiple fluency boost sentence pairs from the seq2seq's n-best outputs during training. The fluency boost sentence pairs are used as training instances in subsequent training epochs, which helps expand the training set and accordingly benefits model learning; (b) fluency boost inference allows an error correction model to correct a sentence incrementally through multi-round seq2seq inference until its fluency score stops increasing.]
For model inference, the fluency boost inference mechanism allows the model to correct a sentence incrementally with multi-round inference as long as the proposed edits boost the sentence's fluency, as Figure 2(b) shows. For a sentence with multiple grammatical errors, some of the errors are corrected first. The corrected parts make the context clearer, which may help the model correct the remaining errors.
Experiments demonstrate that fluency boost learning and inference enable neural seq2seq models to perform better for GEC and to achieve state-of-the-art results on multiple GEC benchmarks.
Our contributions are summarized as follows:
• We present a novel learning and inference mechanism to address the limitations in previous seq2seq models for GEC.
• We propose and compare multiple novel fluency boost learning strategies, exploring the learning methodology for neural GEC.
• Our approaches prove effective for improving neural seq2seq GEC models, achieving state-of-the-art results on the CoNLL-2014 and JFLEG benchmark datasets.
2 Background: Neural grammatical error correction

As in neural machine translation (NMT), a typical neural GEC approach uses a Recurrent Neural Network (RNN) based encoder-decoder seq2seq model (Sutskever et al., 2014) with an attention mechanism to edit a raw sentence into the grammatically correct sentence it should be, as Figure 1(a) shows. Given a raw sentence x^r = (x^r_1, ..., x^r_M) and its corrected sentence x^c = (x^c_1, ..., x^c_N), in which x^r_M and x^c_N are the M-th and N-th words of x^r and x^c respectively, the error correction seq2seq model learns a probabilistic mapping P(x^c | x^r) from error-corrected sentence pairs through maximum likelihood estimation (MLE), which learns the model parameters \Theta_{crt} that maximize:

    \Theta_{crt} = \arg\max_{\Theta} \sum_{(x^r, x^c) \in S^*} \log P(x^c \mid x^r; \Theta)    (1)

where S^* denotes the set of error-corrected sentence pairs. For model inference, an output sequence x^o = (x^o_1, ..., x^o_i, ..., x^o_L) is selected through beam search, which maximizes:

    x^o = \arg\max_{x} \log P(x \mid x^r; \Theta_{crt})    (2)

3 Fluency boost learning
Conventional seq2seq models for GEC learn model parameters only from original error-corrected sentence pairs. However, such error-corrected sentence pairs are not sufficiently available. As a result, many neural GEC models are not well generalized. Fortunately, neural GEC differs from NMT: its goal is improving a sentence's fluency without changing the sentence's original meaning; thus, any sentence pair that satisfies this condition (we call it the fluency boost condition) can be used as a training instance.
In this paper, we define f(x) as the fluency score of a sentence x:

    f(x) = \frac{1}{1 + H(x)}    (3)

    H(x) = -\frac{\sum_{i=1}^{|x|} \log P(x_i \mid x_{<i})}{|x|}    (4)

where P(x_i | x_{<i}) is the probability of x_i given the context x_{<i}, computed by a language model, and |x| is the length of sentence x. H(x) is the cross entropy of the sentence x, whose range is [0, +∞); accordingly, the range of f(x) is (0, 1]. The core idea of fluency boost learning is to generate fluency boost sentence pairs that satisfy the fluency boost condition during training, as Figure 2(a) illustrates, so that these pairs can further help model learning.
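To make the scoring concrete, the sketch below computes f(x) from per-token log-probabilities produced by any language model. It is a minimal, model-agnostic illustration; the function names are ours, not from the paper's implementation.

```python
import math

def cross_entropy(token_logprobs):
    """H(x): the negative mean log-probability of the sentence's tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def fluency(token_logprobs):
    """f(x) = 1 / (1 + H(x)): maps H(x) in [0, +inf) onto (0, 1]."""
    return 1.0 / (1.0 + cross_entropy(token_logprobs))

# A sentence whose every token the LM predicts with probability 0.9
# is scored as more fluent than one predicted with probability 0.5 per token.
good = fluency([math.log(0.9)] * 5)
bad = fluency([math.log(0.5)] * 5)
```

A perfectly predicted sentence (every token probability 1, so every log-probability 0) attains the maximum score f(x) = 1.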
In this section, we present three fluency boost learning strategies: back-boost, self-boost, and dual-boost, which generate fluency boost sentence pairs in different ways, as illustrated in Figure 3.

3.1 Back-boost learning
Back-boost learning borrows the idea from back translation (Sennrich et al., 2016) in NMT, referring to training a backward model (we call it error generation model, as opposed to error correction model) that is used to convert a fluent sentence to a less fluent sentence with errors. Since the less fluent sentences are generated by the error generation seq2seq model trained with error-corrected data, they usually do not change the original sentence's meaning; thus, they can be paired with their correct sentences, establishing fluency boost sentence pairs that can be used as training instances for error correction models, as Figure 3(a) shows.
Specifically, we first train a seq2seq error generation model Θ_gen with the reversed pairs of S* (i.e., S* with the source and target sentences interchanged). Then, we use the model Θ_gen to predict the n-best outputs x^{o_1}, ..., x^{o_n} given a correct sentence x^c. Given the fluency boost condition, we compare the fluency of each output x^{o_k} (where 1 ≤ k ≤ n) to that of its correct sentence x^c. If an output sentence's fluency score is much lower than that of its correct sentence, we call it a disfluency candidate of x^c.
To formalize this process, we first define Y_n(x; Θ) to denote the n-best outputs predicted by model Θ given the input x. Then, the disfluency candidates of a correct sentence x^c can be derived as:

    D_back(x^c) = { x^{o_k} | x^{o_k} ∈ Y_n(x^c; Θ_gen) ∧ f(x^c) / f(x^{o_k}) ≥ σ }    (5)

where D_back(x^c) denotes the disfluency candidate set for x^c in back-boost learning, and σ is a threshold that determines whether x^{o_k} is less fluent than x^c. σ should be slightly larger than 1.0, which helps filter out sentence pairs with unnecessary edits (e.g., I like this book. → I like the book.).

Algorithm 1 Back-boost learning
1: Train error generation model Θ_gen with the reversed pairs of S*;
2: for each sentence pair (x^r, x^c) ∈ S do
3:   Compute D_back(x^c) according to Eq (5);
4: end for
5: for each training epoch t do
6:   S' ← ∅;
7:   Derive a subset S_t by randomly sampling |S*| elements from S;
8:   for each (x^r, x^c) ∈ S_t do
9:     Establish a fluency boost pair (x', x^c) by randomly sampling x' ∈ D_back(x^c);
10:    S' ← S' ∪ {(x', x^c)};
11:  end for
12:  Update error correction model Θ_crt with S* ∪ S';
13: end for
In the subsequent training epochs, the error correction model will not only learn from the original error-corrected sentence pairs (x^r, x^c), but also learn from the fluency boost sentence pairs (x', x^c). We summarize this process in Algorithm 1, where S* is the set of original error-corrected sentence pairs, and S can be tentatively considered identical to S* when there is no additional native data to help model training (see Section 3.4). Note that we constrain the size of S_t not to exceed |S*| (the 7th line in Algorithm 1) to avoid too many fluency boost pairs overwhelming the effects of the original error-corrected pairs on model learning.
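The candidate-filtering step of Eq (5) reduces to a simple comparison of fluency scores. Below is a hedged sketch under the assumption that `fluency` is a callable that scores a whole sentence (e.g., with the language model of Eq (3)); `sigma` plays the role of σ, and the helper name is ours.

```python
def disfluency_candidates(x_c, n_best, fluency, sigma=1.05):
    """Eq (5): keep an n-best output x_o as a disfluency candidate of the
    correct sentence x_c iff f(x_c) / f(x_o) >= sigma, i.e., x_o is
    noticeably less fluent. sigma slightly above 1.0 filters out pairs
    that differ only by unnecessary edits."""
    f_c = fluency(x_c)
    return [x_o for x_o in n_best
            if x_o != x_c and f_c / fluency(x_o) >= sigma]

# Toy scorer: pretend fluency scores are precomputed for a few sentences.
scores = {
    "She saw Tom caught by a policeman.": 0.80,
    "She see Tom catched by policeman.": 0.55,    # clearly less fluent
    "She saw Tom caught by the policeman.": 0.79, # near-equivalent edit
}
cands = disfluency_candidates("She saw Tom caught by a policeman.",
                              list(scores), scores.get, sigma=1.05)
```

With σ = 1.05, only the clearly disfluent output survives; the near-equivalent edit (fluency ratio about 1.01) is filtered out, as is the correct sentence itself.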

3.2 Self-boost learning
In contrast to back-boost learning, whose core idea originally comes from NMT, self-boost learning is our original proposal, specially devised for neural GEC. The idea of self-boost learning is illustrated by Figure 3(b) and was briefly introduced in Section 1 and Figure 2(a). Unlike back-boost learning, in which an error generation seq2seq model is trained to generate disfluency candidates, self-boost learning allows the error correction model to generate the candidates by itself. Since the disfluency candidates generated by the error correction seq2seq model trained with error-corrected data rarely change the input sentence's meaning, they can be used to establish fluency boost sentence pairs.

Algorithm 2 Self-boost learning
1: for each sentence pair (x^r, x^c) ∈ S do
2:   D_self(x^c) ← ∅;
3: end for
4: S' ← ∅;
5: for each training epoch t do
6:   Update error correction model Θ_crt with S* ∪ S';
7:   S' ← ∅;
8:   Derive a subset S_t by randomly sampling |S*| elements from S;
9:   for each (x^r, x^c) ∈ S_t do
10:    Update D_self(x^c) according to Eq (6);
11:    Establish a fluency boost pair (x', x^c) by randomly sampling x' ∈ D_self(x^c);
12:    S' ← S' ∪ {(x', x^c)};
13:  end for
14: end for
For self-boost learning, given an error-corrected pair (x^r, x^c), the error correction model Θ_crt first predicts the n-best outputs x^{o_1}, ..., x^{o_n} for the raw sentence x^r. Among the n-best outputs, any output that is not identical to x^c can be considered an error prediction. Instead of treating the error predictions as useless, self-boost learning fully exploits them. Specifically, if an error prediction x^{o_k} is much less fluent than its correct sentence x^c, it is added to x^c's disfluency candidate set D_self(x^c), as Eq (6) shows:

    D_self(x^c) ← D_self(x^c) ∪ { x^{o_k} | x^{o_k} ∈ Y_n(x^r; Θ_crt) ∧ f(x^c) / f(x^{o_k}) ≥ σ }    (6)

In contrast to back-boost learning, self-boost generates disfluency candidates from a different perspective: by editing the raw sentence x^r rather than the correct sentence x^c. It is also noteworthy that D_self(x^c) is incrementally expanded because the error correction model Θ_crt is dynamically updated, as shown in Algorithm 2.
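The incremental update of Eq (6) can be sketched as follows. Here `correct_nbest` stands in for decoding the n-best list Y_n(x^r; Θ_crt) with the current model, `fluency` scores a whole sentence, and all names are our own illustration, not the paper's code.

```python
def update_d_self(d_self, x_r, x_c, correct_nbest, fluency, sigma=1.05):
    """Eq (6): grow D_self(x_c) with the current model's n-best
    corrections of x_r that are noticeably less fluent than x_c.
    Because the correction model changes every epoch, repeated calls
    keep adding new candidates."""
    f_c = fluency(x_c)
    bucket = d_self.setdefault(x_c, set())
    for x_o in correct_nbest(x_r):
        if x_o != x_c and f_c / fluency(x_o) >= sigma:
            bucket.add(x_o)
    return d_self

# Toy run: only the clearly less fluent prediction is kept.
scores = {"good sentence": 0.9, "bad sentence": 0.5, "ok sentence": 0.88}
d = {}
update_d_self(d, "raw", "good sentence",
              lambda x: ["bad sentence", "ok sentence"], scores.get)
```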

3.3 Dual-boost learning
As introduced above, back- and self-boost learning generate disfluency candidates from different perspectives to create more fluency boost sentence pairs for training the error correction model. Intuitively, the more diverse the generated disfluency candidates, the more helpful they are for training an error correction model. Inspired by He et al. (2016) and Zhang et al. (2018), we propose a dual-boost learning strategy that combines both the back- and self-boost perspectives to generate disfluency candidates.
As Figure 3(c) shows, disfluency candidates in dual-boost learning come from both the error generation model and the error correction model:

    D_dual(x^c) ← D_dual(x^c) ∪ { x^{o_k} | x^{o_k} ∈ Y_n(x^c; Θ_gen) ∪ Y_n(x^r; Θ_crt) ∧ f(x^c) / f(x^{o_k}) ≥ σ }    (7)

Moreover, the error correction model and the error generation model are dual, and both are dynamically updated so that they improve each other: the disfluency candidates produced by the error generation model benefit training the error correction model, while the disfluency candidates created by the error correction model can be used as training data for the error generation model. We summarize this learning approach in Algorithm 3.

Algorithm 3 Dual-boost learning
1: for each (x^r, x^c) ∈ S do
2:   D_dual(x^c) ← ∅;
3: end for
4: S' ← ∅; S'' ← ∅;
5: for each training epoch t do
6:   Update error correction model Θ_crt with S* ∪ S';
7:   Update error generation model Θ_gen with the reversed pairs of S*, together with S'';
8:   S' ← ∅; S'' ← ∅;
9:   Derive a subset S_t by randomly sampling |S*| elements from S;
10:  for each (x^r, x^c) ∈ S_t do
11:    Update D_dual(x^c) according to Eq (7);
12:    Establish a fluency boost pair (x', x^c) by randomly sampling x' ∈ D_dual(x^c);
13:    Establish a reversed fluency boost pair (x^c, x'') by randomly sampling x'' ∈ D_dual(x^c);
14:    S' ← S' ∪ {(x', x^c)}; S'' ← S'' ∪ {(x^c, x'')};
15:  end for
16: end for
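The candidate-generation pass inside one epoch of dual-boost learning can be sketched as follows. `gen_nbest` and `crt_nbest` stand in for n-best decoding with Θ_gen and Θ_crt respectively, `fluency` scores a sentence as in Eq (3), and for determinism the sketch takes the first qualifying candidate instead of sampling; all names are our own.

```python
def dual_boost_pairs(pairs, gen_nbest, crt_nbest, fluency, sigma=1.05):
    """Eq (7): pool candidates from the error GENERATION model (fed the
    correct sentence) and the error CORRECTION model (fed the raw
    sentence), then emit a fluency boost pair (x', x_c) for the
    corrector and a reversed pair (x_c, x'') for the generator."""
    boost, reversed_ = [], []
    for x_r, x_c in pairs:
        f_c = fluency(x_c)
        cands = [x for x in gen_nbest(x_c) + crt_nbest(x_r)
                 if x != x_c and f_c / fluency(x) >= sigma]
        if cands:  # deterministic stand-in for random sampling
            boost.append((cands[0], x_c))
            reversed_.append((x_c, cands[0]))
    return boost, reversed_

# Toy run with precomputed fluency scores for three sentences.
scores = {"c": 0.9, "g1": 0.5, "r1": 0.6}
boost, rev = dual_boost_pairs([("raw", "c")],
                              lambda x: ["g1"], lambda x: ["r1"], scores.get)
```

The two returned lists correspond to the training data for the error correction model and the error generation model in lines 12-14 of Algorithm 3.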

3.4 Fluency boost learning with large-scale native data
Our proposed fluency boost learning strategies can be easily extended to utilize the huge volume of native data, which has been proven useful for GEC. As discussed in Section 3.1, when there is no additional native data, S in Algorithms 1-3 is identical to S*. When additional native data is available to help model learning, S becomes:

    S = S* ∪ C

where C = {(x^c, x^c)} denotes the set of self-copied sentence pairs from the native data.
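Incorporating native data then amounts to adding identity pairs, as in this small sketch (the helper name is ours):

```python
def extend_with_native(error_corrected_pairs, native_sentences):
    """S = S* ∪ C, where C = {(x_c, x_c)} self-copies each native
    sentence: the corrector learns to keep fluent input unchanged, and
    fluency boost learning can corrupt each x_c into new source sides."""
    return list(error_corrected_pairs) + [(s, s) for s in native_sentences]

S = extend_with_native([("She see Tom.", "She saw Tom.")],
                       ["The sky is blue."])
```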

4 Fluency boost inference
As we discuss in Section 1, sentences with multiple grammatical errors usually cannot be perfectly corrected through normal seq2seq inference, which performs only a single round of inference. Fortunately, neural GEC differs from NMT in that its source and target language are the same. This characteristic allows us to edit a sentence more than once through multi-round model inference, which motivates our fluency boost inference. As Figure 2(b) shows, fluency boost inference allows a sentence to be incrementally edited through multi-round seq2seq inference as long as the sentence's fluency can be improved. Specifically, the error correction seq2seq model first takes a raw sentence x^r as input and outputs a hypothesis x^{o_1}. Instead of regarding x^{o_1} as the final prediction, fluency boost inference takes x^{o_1} as the input to generate the next output x^{o_2}. The process terminates at round t when x^{o_t} does not improve over x^{o_{t-1}} in terms of fluency.
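The stopping rule of fluency boost inference amounts to a short loop. In the sketch below, `decode` stands in for one round of seq2seq inference (treated as a black box), `fluency` scores a sentence as in Eq (3), and the `max_rounds` cap is our own safety guard, not part of the paper's description.

```python
def fluency_boost_inference(x_r, decode, fluency, max_rounds=10):
    """Re-decode the sentence until its fluency stops increasing:
    accept output x_o_t only if f(x_o_t) > f(x_o_{t-1})."""
    current = x_r
    for _ in range(max_rounds):
        candidate = decode(current)
        if fluency(candidate) <= fluency(current):
            break  # fluency stopped improving: terminate
        current = candidate
    return current

# Toy example: each decoding round fixes one error and raises fluency.
chain = {"a b": "a B", "a B": "A B", "A B": "A B"}  # hypothetical edits
scores = {"a b": 0.3, "a B": 0.5, "A B": 0.7}
result = fluency_boost_inference("a b", chain.get, scores.get)
```

In the toy run, the loop performs two accepting rounds and stops at the third, when re-decoding no longer raises the fluency score.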

5.1 Dataset and evaluation
Following previous studies (Ji et al., 2017), we use the public Lang-8 corpus (Mizumoto et al., 2011; Tajiri et al., 2012), the Cambridge Learner Corpus (CLC) (Nicholls, 2003) and the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013) as our original error-corrected training data. Table 1 shows the statistics of the datasets. In addition, we collect 2,865,639 non-public error-corrected sentence pairs from Lang-8.com. The native data we use for fluency boost learning is English Wikipedia, which contains 61,677,453 sentences. We use the CoNLL-2014 shared task dataset with original annotations (Ng et al., 2014), which contains 1,312 sentences, as our main test set for evaluation, and MaxMatch (M^2) precision, recall and F_0.5 (Dahlmeier and Ng, 2012b) as our evaluation metrics. Following previous studies, we use the CoNLL-2013 test data as our development set.

5.2 Experimental setting
We set up experiments to answer the following questions:
• Is the fluency boost learning mechanism helpful for training the error correction model, and which of the strategies (back-boost, self-boost, dual-boost) is the most effective?
• Does our fluency boost inference improve over normal seq2seq inference for GEC?
• Can our approach improve neural GEC to achieve state-of-the-art results?
The training details for our seq2seq error correction model and error generation model are as follows: the encoder of the seq2seq models is a 2-layer bidirectional GRU RNN and the decoder is a 2-layer GRU RNN with the general attention mechanism (Luong et al., 2015). Both the dimensionality of word embeddings and the hidden size of GRU cells are 500. The vocabulary sizes of the encoder and decoder are 100,000 and 50,000 respectively. The models' parameters are uniformly initialized in [-0.1,0.1]. We train the models with an Adam optimizer with a learning rate of 0.0001 up to 40 epochs with batch size = 128. Dropout is applied to non-recurrent connections at a ratio of 0.15. For fluency boost learning, we generate disfluency candidates from 10-best outputs. During model inference, we set beam size to 5 and decode 1-best result with a 2-layer GRU RNN language model (Mikolov et al., 2010) through shallow fusion (Gülçehre et al., 2015) with weight β = 0.15. The RNN language model is trained from the native data mentioned in Section 5.1, which is also used for computing fluency score in Eq (3). UNK tokens are replaced with the source token with the highest attention weight.
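For clarity, the shallow fusion rescoring used during decoding can be sketched as a weighted combination of the two log-probabilities with β = 0.15; the function below is a simplified illustration of the scoring rule, not the paper's decoder code.

```python
import math

def shallow_fusion_score(seq2seq_logprob, lm_logprob, beta=0.15):
    """Hypothesis score under shallow fusion (Gulcehre et al., 2015):
    the seq2seq log-probability plus a beta-weighted LM log-probability."""
    return seq2seq_logprob + beta * lm_logprob

# Two hypotheses with equal seq2seq scores: the LM breaks the tie
# in favor of the one it considers more fluent.
h1 = shallow_fusion_score(math.log(0.30), math.log(0.20))
h2 = shallow_fusion_score(math.log(0.30), math.log(0.05))
```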
We resolve spelling errors with a public spell checker as preprocessing, as Xie et al. (2016) and others do.

Table 2 compares the performance of seq2seq error correction models with different learning and inference methods. Comparing the rows, one can observe that our fluency boost learning approaches improve the performance over normal seq2seq learning, especially in recall, since the fluency boost learning approaches generate a variety of grammatically incorrect sentences, allowing the error correction model to learn to correct many more sentences than the conventional learning strategy does. Among the three proposed fluency boost learning strategies, dual-boost achieves the best result in most cases because it produces more diverse incorrect sentences (average |D_dual| ≈ 9.43) than either back-boost (average |D_back| ≈ 1.90) or self-boost learning (average |D_self| ≈ 8.10). When large amounts of native text data are introduced, the performance of all the fluency boost learning approaches improves. One reason is that our learning approaches produce more error-corrected sentence pairs, letting the model generalize better. In addition, the huge volume of native data helps the decoder learn to generate fluent and error-free sentences.

5.3 Effectiveness of fluency boost learning
We test the effect of the hyper-parameter σ in Eq (5)-(7) on fluency boost learning and show the results in Table 3. When σ is slightly larger than 1.0 (e.g., σ = 1.05), the model achieves the best performance because this setting effectively avoids generating sentence pairs with unnecessary or undesirable edits that hurt performance, as discussed in Section 3.1. As σ continues to increase, the average size of the disfluency candidate set |D_dual| drastically decreases, making dual-boost learning gradually degrade to normal seq2seq learning.

[Table 3: The effect of σ on dual-boost learning with normal seq2seq inference. |D_dual| is the average size of the dual-boost disfluency candidate sets.]

Table 4 shows some examples of disfluency candidates generated in dual-boost learning given a correct sentence in the native data.

[Table 4:
Correct sentence: How autism occurs is not well understood.
Disfluency candidates:
How autism occurs is not good understood.
How autism occur is not well understood.
What autism occurs is not well understood.
How autism occurs is not well understand.
How autism occurs does not well understood.]

It is clear that our approach can generate less fluent sentences with various grammatical errors, most of which are typical mistakes that a human learner tends to make. Therefore, these sentences can be paired with their correct sentences to establish high-quality training data, multiplying the size of the training set, which accounts for the improvement brought by fluency boost learning.

5.4 Effectiveness of fluency boost inference
The effectiveness of the various inference approaches can be observed by comparing the results in Table 2 by column. Compared to the normal seq2seq inference and seq2seq (+LM) baselines, fluency boost inference brings on average a 0.14 and 0.18 gain in F_0.5 respectively, which is a statistically significant improvement, demonstrating that multi-round editing by fluency boost inference is effective. Taking our best system (the last row in Table 2) as an example: among the 1,312 sentences in the CoNLL-2014 dataset, seq2seq inference with the shallow fusion LM edits 566 sentences, while fluency boost inference additionally edits 23 sentences during the second round of inference, improving F_0.5 from 52.59 to 52.72.

5.5 Towards the state-of-the-art for GEC
Now, we answer the last question raised in Section 5.2 by testing whether our approaches achieve state-of-the-art results.
[Footnote 5: We give more details about the disfluency candidates, including the proportion of error types, in the supplementary notes.]
[Footnote 6: p < 0.0005 according to a Wilcoxon signed-rank test.]
[Table 5: Performance of systems on the CoNLL-2014 dataset. The systems in bold are based on seq2seq models; the marked systems use the non-public error-corrected data from Lang-8.com.]

We first compare our best models, i.e., dual-boost learning (+native) with fluency boost inference and a shallow fusion LM, to the following top-performing GEC systems evaluated on the CoNLL-2014 dataset:
• CUUI and VT16: the former system (Rozovskaya et al., 2014) uses a classifier-based approach, which the latter system (Rozovskaya and Roth, 2016) improves by combining it with an SMT-based approach.
• Char-seq2seq: a character-level seq2seq model (Xie et al., 2016). It uses a rule-based method to synthesize errors for data augmentation.
• Adapt-seq2seq: a seq2seq model adapted to incorporate edit operations (Schmaltz et al., 2017).

Table 5 shows the evaluation results on the CoNLL-2014 dataset. Without using the non-public training data from Lang-8.com, our single model obtains 50.04 F_0.5, largely outperforming the other seq2seq models and inferior only to CAMB17 (AMU16 based) and NUS17. It should be noted, however, that CAMB17 and NUS17 are actually re-rankers built on top of an SMT-based GEC system (AMU16's framework); thus, they are ensemble models. When we build our approach on top of AMU16 (i.e., we take AMU16's outputs as the input to our GEC system and edit them further), we achieve a 53.30 F_0.5 score. When the non-public training data is introduced, our single and ensemble systems obtain 52.72 and 54.51 F_0.5 respectively, which is a state-of-the-art result on the CoNLL-2014 dataset.
Moreover, we evaluate our approach on the JFLEG corpus (Napoles et al., 2017). JFLEG is the most recently released dataset for GEC evaluation and contains 1,501 sentences (754 in the dev set and 747 in the test set). To test our approach's generalization ability, we evaluate the single models used for the CoNLL evaluation (in Table 5) on JFLEG without re-tuning. Table 6 shows the JFLEG leaderboard. Instead of the M^2 score, JFLEG uses GLEU (Napoles et al., 2015) as its evaluation metric, which is a fluency-oriented GEC metric based on a variant of BLEU (Papineni et al., 2002) and has several advantages over M^2 for GEC evaluation. Our single models consistently perform well on JFLEG, outperforming most of the CoNLL-2014 top-performing systems and yielding a state-of-the-art result on this benchmark, demonstrating that our models are well generalized and perform stably across multiple datasets.

6 Related work
[Footnote 7: The state-of-the-art result on the CoNLL-2014 dataset has recently been advanced by Chollampatt and Ng (2018) (F_0.5 = 54.79) and Grundkiewicz and Junczys-Dowmunt (2018) (F_0.5 = 56.25), which are contemporaneous with this paper. In contrast to the basic seq2seq model in this paper, they use advanced modeling approaches (e.g., convolutional seq2seq with pre-trained word embeddings, edit operation features, ensemble decoding, and advanced model combinations). It should be noted that their approaches are orthogonal to ours, making it possible to apply our fluency boost learning and inference mechanism to their models.]
[Footnote 8: The recently proposed SMT-NMT hybrid system (Grundkiewicz and Junczys-Dowmunt, 2018), which is tuned towards GLEU on the JFLEG dev set, reports a higher result (GLEU = 61.50 on the JFLEG test set).]
[Table 6: JFLEG leaderboard. Ours denotes the single dual-boost models in Table 5. The systems in bold are based on seq2seq models. * denotes that the system is tuned on JFLEG.]

Most advanced GEC systems are classifier-based (Chodorow et al., 2007; De Felice and Pulman, 2008; Han et al., 2010; Leacock et al., 2010; Tetreault et al., 2010a; Dale and Kilgarriff, 2011)
or MT-based (Brockett et al., 2006; Dahlmeier and Ng, 2011, 2012a; Yoshimoto et al., 2013; Yuan and Felice, 2013; Behera and Bhattacharyya, 2013). For example, the top-performing systems (Rozovskaya et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2014) in the CoNLL-2014 shared task (Ng et al., 2014) use one of these two methods. Recently, many novel approaches (Chollampatt et al., 2016a,b; Rozovskaya and Roth, 2016; Junczys-Dowmunt and Grundkiewicz, 2016; Mizumoto and Matsumoto, 2016; Hoang et al., 2016; Yannakoudakis et al., 2017) have been proposed for GEC. Among them, seq2seq models (Xie et al., 2016; Ji et al., 2017; Schmaltz et al., 2017; Chollampatt and Ng, 2018) have attracted much attention. Unlike the models trained only with original error-corrected data, we propose a novel fluency boost learning mechanism for dynamic data augmentation along with training for GEC, although some previous studies have explored artificial error generation for GEC (Brockett et al., 2006; Foster and Andersen, 2009; Rozovskaya and Roth, 2010, 2011; Rozovskaya et al., 2012; Xie et al., 2016). Moreover, we propose fluency boost inference, which allows the model to repeatedly edit a sentence as long as the sentence's fluency can be improved. To the best of our knowledge, this is the first work to conduct multi-round seq2seq inference for GEC, although similar ideas have been proposed for NMT (Xia et al., 2017).
In addition to the studies on GEC, there is also much research on grammatical error detection (Leacock et al., 2010; Rei and Yannakoudakis, 2016; Kaneko et al., 2017) and GEC evaluation (Tetreault et al., 2010b; Madnani et al., 2011; Dahlmeier and Ng, 2012c; Napoles et al., 2015; Bryant et al., 2017; Asano et al., 2017). We do not introduce them in detail because they are not closely related to this paper's contributions.

7 Conclusion
We propose a novel fluency boost learning and inference mechanism to overcome the limitations of previous neural GEC models. Our proposed fluency boost learning fully exploits both error-corrected data and native data, largely improving performance over normal seq2seq learning, while fluency boost inference exploits the characteristic of GEC that the source and target language are the same to incrementally improve a sentence's fluency through multi-round inference. This learning and inference mechanism enables seq2seq models to achieve state-of-the-art results on both the CoNLL-2014 and JFLEG benchmark datasets.