Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation

The incorporation of data augmentation methods into the grammatical error correction (GEC) task has attracted much attention. However, existing data augmentation methods mainly apply noise to tokens, which limits the diversity of the generated errors. In view of this, we propose a new data augmentation method that applies noise to the latent representation of a sentence. By editing the latent representations of grammatical sentences, we can generate synthetic samples with various error types. Combined with some pre-defined rules, our method can greatly improve the performance and robustness of existing grammatical error correction models. We evaluate our method on public GEC benchmarks, and it achieves state-of-the-art performance on the CoNLL-2014 and FCE benchmarks.


Introduction
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in a sentence. Due to the growing number of language learners of English, there has been increasing attention to the English GEC in the past few years.
Considering the outstanding performance of neural network models in machine translation tasks, many studies have tackled GEC as a machine translation task. They regard ungrammatical sentences as the source language and grammatical sentences as the target language. This approach allows cutting-edge neural machine translation models to be applied to GEC. Many encoder-decoder models, such as recurrent neural networks (RNNs) (Graves et al., 2013) and convolutional neural networks (CNNs) (Kim, 2014), have been widely applied to GEC.
A challenge in applying neural machine translation models to GEC is the requirement of a large amount of training data, i.e., source-target pairs. To address this problem, many data augmentation methods have been proposed. Existing methods, however, are often only able to generate sentences with limited error types, and can only improve the performance of a GEC model on these few error types, while it remains hard for the model to correct sentences with other types of errors.
To address the above problem, we propose a new data augmentation method to generate synthetic samples by editing the latent representations of grammatical sentences. Given a target grammatical error type and the corresponding grammatical error type classifier, we can get a perturbation vector in latent space. Then we add the perturbation vector to the latent representation of input sentence, and use a decoder to generate a sentence with target grammatical error type. In this way, diverse errors can be generated by assigning different target error types. To further improve the performance, we adopt some rules to assist the generation of some local grammatical errors, such as spelling errors, wrong punctuation, etc.
We apply this data augmentation method to the existing GEC model Copy-transformer (Zhao et al., 2019) to evaluate the results. Experiments are conducted on the following widely used benchmarks: CoNLL-2014 (Ng et al., 2014), FCE (Yannakoudakis et al., 2011), and BEA-2019 (Bryant et al., 2019). Experimental results show the efficacy of our proposed method, which outperforms several existing models.
Our contributions are summarized as follows:
1. We propose a new data augmentation method that generates synthetic samples by editing latent representations of grammatical sentences and is able to produce errors with high quality and diversity.
2. The additional synthetic training samples make it possible to train a neural GEC model that detects and corrects most error types, improving the performance and robustness of the model.
3. Our method achieves state-of-the-art performance on the CoNLL-2014 and FCE benchmarks, outperforming not only all previous single models but also all ensemble models. On the BEA-2019 benchmark, our method achieves very competitive performance as well.

Related Work
Early GEC models are mainly based on manually designed grammar rules (Murata and Nagao, 1994; Bond et al., 1996; Siegel, 1996). Han et al. (2006) pointed out the limitation of rule-based methods and proposed a statistical model. Later, some researchers proposed solutions based on statistical machine learning methods (Knight and Chander, 1994; Minnen et al., 2000; Izumi et al., 2003).
With the development of deep learning, recent works proposed a variety of neural network models to deal with the GEC task. Some treated the GEC task as a translation problem and applied neural machine translation models to detect and correct grammatical errors. Yuan and Briscoe (2016) used a classical bidirectional recurrent neural network (Graves et al., 2013) with attention. Chollampatt and Ng (2018) proposed a convolutional neural network (Kim, 2014) to capture the local context. Many recent works (Junczys-Dowmunt et al., 2018) made use of the powerful machine translation architecture Transformer (Vaswani et al., 2017). Zhao et al. (2019) further applied a copying mechanism (Gu et al., 2016; Jia and Liang, 2016) to the Transformer. Considering the tremendous performance of pre-training methods, a pre-trained language model such as BERT (Devlin et al., 2019) can also be incorporated into the encoder-decoder model (Kaneko et al., 2020).
An encoder-decoder GEC model requires a large amount of training data, and the available training corpora are usually insufficient to train a good GEC model. To address this problem, many data augmentation methods have been proposed (Ge et al., 2018). Many works adopted pre-defined rules to generate local grammatical errors. Grundkiewicz et al. (2019) applied a confusion set built by a spellchecker to the corpus. Choe et al. (2019) extracted common text-editing operations from human writing habits and generated synthetic samples with these operations. Lichtarge et al. (2018) made use of models trained on large amounts of weakly supervised text. Inspired by the back-translation procedure for machine translation (Sennrich et al., 2015), Xie et al. (2018) proposed a model that learns to generate erroneous sentences from correct ones. Based on this work, Kiyono et al. (2019) further studied data augmentation methods and drew some empirical conclusions.

Our Data Augmentation Method
Our data augmentation method applies noise in the latent space, and it can generate sentences with various error types by editing latent representations of grammatical sentences. To further improve performance, we adopt some rules to assist the generation of local grammatical errors. The synthetic training samples generated by our method are used to train a neural GEC model, enabling it to detect and correct most error types and improving its performance and robustness.

Editing Latent Representation
Inspired by the adversarial sample generation procedure (Goodfellow et al., 2015), we propose a data augmentation method that applies the noise by editing latent representations. Firstly, we train an encoder, a decoder and an error type classifier. Given the trained models, we can generate synthetic training samples by adding a perturbation vector to the latent representations of sentences. The overall framework of our method is shown in Figure 1.

Training Encoder, Decoder and Classifier
Firstly, we train an encoder and a classifier on the grammatical error classification task. Given an ungrammatical sentence x and its corresponding error type z, we use the encoder φ_E to encode x into its latent representation h_x. Then we use the classifier C to get the prediction z'. This process can be formulated as follows:

h_x = φ_E(x),  z' = C(h_x)

We denote the classification loss as L(h_x, z, z'), where h_x is the latent representation, z' is the predicted label and z is the gold label. In our model, we choose the cross-entropy loss as the classification loss.

Figure 1: The overall framework of our proposed data augmentation method. It contains an encoder, a classifier, and a decoder. Given a source sentence x, we first use the encoder to obtain the latent representation h_x of x. We then pass h_x and the specified error type z to the classifier to compute the classification loss and the direction in which the loss decreases the most. Finally, we project h_x in this direction to get h_x' and decode h_x' into text to obtain the synthetic sample x'.
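As a minimal sketch of the classification step and its cross-entropy loss, the following uses a single linear classification layer in place of the paper's stacked feed-forward layers; all function and variable names here are illustrative, not the authors' implementation:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_loss(logits, gold_label):
    # L(h_x, z, z'): negative log-probability assigned to the gold error type z.
    probs = softmax(logits)
    return -math.log(probs[gold_label])

def classify(h_x, weight, bias):
    # A single linear classification layer over the latent representation h_x;
    # the paper stacks several feed-forward layers, one suffices for the sketch.
    return [sum(w_i * h_i for w_i, h_i in zip(row, h_x)) + b
            for row, b in zip(weight, bias)]
```

The predicted label z' is then the argmax over the returned logits, and the loss above is minimized over encoder and classifier parameters.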
With the encoder φ_E trained in the previous step, we train the decoder φ_D in an auto-encoder manner. The goal is to minimize the negative log-likelihood between the input x and the output x̂:

L_D = -Σ_t log p(x̂_t = x_t | x_<t, h_x)

We choose the powerful Transformer (Vaswani et al., 2017) as both encoder and decoder. Both the encoder and the decoder consist of Transformer blocks with a multi-head self-attention layer followed by a feed-forward layer.
As for the classifier, it has several feed-forward layers as classification layers. The classifier determines whether the sentence is correct; if not, it predicts the specific type of grammatical error. We define six types of grammatical errors based on the 25 main types defined in the automatic annotation tool ERRANT (Bryant et al., 2017). These error types are common in human writing and are more difficult for a GEC model to correct.

Generating Synthetic Training Samples
We add a perturbation vector r to the latent representation h_x of the input sentence x, and use the decoder φ_D to generate additional training samples from h_x + r. The procedure for generating synthetic samples is summarized in Algorithm 1.
Given a correct sentence x and a target grammatical error type z, we obtain an optimal perturbation vector r̂ by minimizing the classification loss L(h_x + r, z, z'). Besides, in order to prevent the output from changing too much, we restrict the L2 norm of the perturbation r. This problem can be formulated as follows:

r̂ = argmin_{||r||_2 ≤ ε} L(h_x + r, z, z')

Algorithm 1: Generating synthetic training samples
Input: latent representation h_x, target error type z, similarity discriminator ψ, classifier C, decoder φ_D, hyper-parameters ε_0, ε_max, λ, t
Output: synthetic erroneous sample x' (paired with input x)

However, it is almost impossible to exactly estimate r̂ in the above problem for a deep neural network model. Following the method of Goodfellow et al. (2015), we apply a linearization technique, linearizing the loss function L(h_x + r, z, z') around h_x, and get the approximate solution:

r̂ = -ε · g / ||g||_2,  where g = ∇_{h_x} L(h_x, z, z')

The hyper-parameter ε determines the degree of semantic change in the latent space. A small value better preserves the semantics, while a large value makes it easier to generate a sample with grammatical errors. We use a heuristic algorithm to select the most appropriate value: we initialize ε with a small value ε_0 and gradually increase it until a sentence with the target grammatical error is produced or the threshold ε_max is reached.
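The heuristic perturbation search can be sketched as below. This is an illustrative reconstruction: the gradient is taken by finite differences rather than backpropagation, the acceptance check stands in for decoding and re-classifying the sentence, and all names are hypothetical:

```python
import math

def numeric_grad(loss_fn, h, delta=1e-5):
    # Finite-difference gradient of the classification loss w.r.t. h_x;
    # in a real model this would come from backpropagation.
    grad = []
    for i in range(len(h)):
        hp = list(h); hp[i] += delta
        hm = list(h); hm[i] -= delta
        grad.append((loss_fn(hp) - loss_fn(hm)) / (2 * delta))
    return grad

def perturb(h_x, loss_fn, eps0, eps_max, lam, accepts):
    # Linearized solution: step against the gradient of L(h_x + r, z, z'),
    # scaled to L2 norm eps. Grow eps by factor lam until the perturbed
    # representation yields the target error (checked by `accepts`) or
    # eps_max is reached, mirroring the heuristic search of Algorithm 1.
    eps = eps0
    while eps <= eps_max:
        g = numeric_grad(loss_fn, h_x)
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        h_new = [h - eps * v / norm for h, v in zip(h_x, g)]
        if accepts(h_new):
            return h_new
        eps *= lam
    return None  # no acceptable perturbation within the epsilon budget
```

In the paper's pipeline, `accepts` would decode h_x + r̂ with φ_D and check that the decoded sentence contains the target error type; here it is an arbitrary predicate.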
Finally, we use the decoder φ_D to decode from h_x + r̂ and generate the corresponding erroneous sentence x':

x' = φ_D(h_x + r̂)

In order to filter out sentence pairs with low similarity, we use the model proposed by Parikh et al. (2016) as a similarity discriminator. Given a synthetic sample x' and its original sentence x, we use the similarity discriminator ψ to get a score p ∈ [0, 1] that reflects the degree of semantic similarity between x and x'. We set a threshold t: if p is greater than this threshold, x' is selected as an augmented sample.
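The filtering step can be sketched as follows. Note that a simple token-overlap (Jaccard) score is substituted for the trained discriminator of Parikh et al. (2016), which this sketch does not reimplement; names and the default threshold are illustrative:

```python
def jaccard_similarity(x, x_prime):
    # Token-level Jaccard overlap, a crude stand-in for the semantic
    # similarity score p produced by the trained discriminator.
    a, b = set(x.lower().split()), set(x_prime.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def keep_sample(x, x_prime, t=0.6, score=jaccard_similarity):
    # Accept the synthetic pair (x, x') only if the similarity score p
    # exceeds the threshold t, as in the paper's filtering step.
    return score(x, x_prime) > t
```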
Our method can generate more natural sentences than methods that directly apply noise to tokens. Since we edit latent representations of sentences, we can obtain more diverse samples with different errors, which cannot be obtained by only applying noise to tokens.

Pre-defined Rules
As shown in previous works (Choe et al., 2019; Lichtarge et al., 2018), rule-based methods can generate local grammatical errors with high quality. We propose five rules to assist in generating synthetic training data.
Delete. Randomly delete a token with a probability of 0.15.

Add. Firstly, randomly select a word from a word list (Google-10000-English 1), then add the selected word at a random position with a probability of 0.15.

Replace. Randomly replace a token with one of its possible forms with a probability of 0.5. If the picked token is a word, we use Word forms 2 to generate all possible forms (adverb, adjective, noun and verb). If not, we select replacements from a punctuation set.

Shuffle. Shuffle the tokens by adding a normally distributed bias to the positions of the words with a probability of 0.1. Specifically, let x = (p_1, ..., p_L) represent the positions of the words, where L is the length of the sentence and p_i is the position of the i-th word. At the beginning, p_i = i. Then we add a bias e_i drawn from a normal distribution, e_i ~ N(0, σ^2), to each position: p'_i = p_i + e_i. Finally, we re-sort the words by the rectified positions p'_i to get the new sequence x'.

Spell Error. Randomly apply a spelling error to a word with a probability of 0.1. We randomly perturb characters using the same operations as the word-level ones, i.e., substitution, deletion, insertion or transposition of characters.
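A minimal sketch of three of these rules (delete, add, shuffle) in Python; the tiny word list stands in for Google-10000-English, the probabilities are the paper's defaults, and all function names are illustrative. The replace and spell-error rules follow the same sampling pattern:

```python
import random

# Stand-in for the Google-10000-English word list used by the Add rule.
WORD_LIST = ["the", "of", "to", "and", "in"]

def delete_tokens(tokens, p=0.15, rng=random):
    # Delete: drop each token independently with probability p.
    return [tok for tok in tokens if rng.random() >= p]

def add_tokens(tokens, p=0.15, rng=random):
    # Add: with probability p, insert a random word from the
    # word list at a random position.
    out = list(tokens)
    if rng.random() < p:
        out.insert(rng.randrange(len(out) + 1), rng.choice(WORD_LIST))
    return out

def shuffle_tokens(tokens, sigma=0.5, p=0.1, rng=random):
    # Shuffle: with probability p, add Gaussian noise e_i ~ N(0, sigma^2)
    # to each position p_i = i, then re-sort tokens by perturbed positions.
    if rng.random() >= p:
        return list(tokens)
    keyed = [(i + rng.gauss(0, sigma), tok) for i, tok in enumerate(tokens)]
    return [tok for _, tok in sorted(keyed)]
```

Passing a seeded `random.Random` instance as `rng` makes the corruption reproducible, which is convenient when regenerating a synthetic corpus.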
Using the above data augmentation methods, we can get synthetic training samples with various grammatical errors. These synthetic training samples can further improve the performance and robustness of the GEC system.

GEC Model
In this study, we choose the copy-augmented Transformer (Zhao et al., 2019) as the GEC model to test our data augmentation method. The copy-augmented Transformer is a Transformer that incorporates an attention-based copy mechanism in the decoder, allowing it to generate words both from a fixed vocabulary and from the source input tokens. Given the high similarity between input and output in GEC, this copy mechanism leads to strong performance.
Datasets

Training data. NUCLE is a collection of essays written by students who are non-native English speakers. Professional English instructors were invited to correct the grammatical errors in these essays. There are 28 common grammatical error types.
The Lang-8 corpus is a cleaned English subset of the Lang-8 language-learning website. FCE and W&I+LOCNESS are public GEC datasets. Bryant et al. (2019) use the automatic annotation tool ERRANT to annotate the types of grammatical errors; there are 25 main grammatical error types.
Evaluation data. We report results on CoNLL-2014 benchmark evaluated by official M2 scorer (Dahlmeier and Ng, 2012), and on BEA-2019 and FCE benchmarks evaluated by ERRANT.
Seed corpus. Following Kiyono et al. (2019), we choose the large English corpus Gigaword as the seed corpus for data augmentation.
The datasets used in the experiments are summarized in Table 1.

Pre-processing
We first tokenize the data with NLTK (Bird et al., 2009). In this way, we can get a large amount of annotated data for training the grammatical error type classifier.

Model Training Details
In this paper, we use the Transformer implementation in the public Fairseq toolkit (Ott et al., 2019). For the Transformer model, the embedding hidden size is 512. The encoder and decoder each have 6 layers and 8 attention heads. The inner layer of the feed-forward network has size 4096. The classifier has 3 feed-forward layers and is trained with the Adam optimizer (Kingma and Ba, 2015). The learning rate is initially set to 0.001, with a decay factor of 0.99 per epoch. To avoid overfitting, we adopt dropout (Srivastava et al., 2014) with a rate of 0.1.
The GEC models are optimized with Nesterov's Accelerated Gradient (Nesterov, 1983). We set the dropout to 0.2, the learning rate to 0.002, the weight decay to 0.5, the patience to 0, the momentum to 0.99, the minimum learning rate to 10^-4, and the beam size for decoding to 5.
From the seed corpus, we generate 16 million training samples in total. Half of the training samples are generated by editing latent representations, and the other half by the pre-defined rules. The five target error types (ADJ/ADV, DET, PREP, NOUN, VERB) are generated with equal probability.
For the GEC model, we follow the default configuration of the copy-augmented Transformer of Zhao et al. (2019). Following Omelianchuk et al. (2020), we train the GEC model in three stages. First, we pre-train the model on synthetic sentences generated by our data augmentation method. Then, we extract sentence pairs containing grammatical errors from the training datasets (NUCLE, FCE and W&I+LOCNESS) and fine-tune the model on these pairs. Finally, we fine-tune the model on the entire training dataset corresponding to each test set.

Post-processing
To further improve the performance, we incorporate the following techniques that are widely used in the GEC task:
Features Re-scoring (FR). Following Chollampatt and Ng (2018), we use edit-operation (EO) features and language-model (LM) features to re-score the final beam candidates. The EO features are three token-level edit-operation features. The LM features include the score of a language model trained on the web-scale Common Crawl corpus (Chollampatt and Ng, 2017; Junczys-Dowmunt and Grundkiewicz, 2016), and the length of the output sequence.
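The re-scoring step can be sketched as a weighted combination of the model score with the LM and EO features. The feature weights and function names below are illustrative placeholders, not the tuned values used in the paper:

```python
def rescore(candidates, lm_score, edit_count, w_lm=0.5, w_eo=-0.1, w_len=0.05):
    # Re-score beam candidates (text, model_score) pairs by combining the
    # model score with language-model (LM) and edit-operation (EO) features,
    # plus an output-length feature, and return the best candidate.
    def total(c):
        text, model_score = c
        return (model_score
                + w_lm * lm_score(text)       # LM feature
                + w_eo * edit_count(text)     # EO feature (edit count)
                + w_len * len(text.split()))  # length feature
    return max(candidates, key=total)
```

In practice the weights would be tuned on a development set, and `edit_count` would summarize the three token-level edit-operation features rather than a single count.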
Right-to-left Re-ranking (R2L). Following Sennrich et al. (2016a), we use the right-to-left re-ranking method to build an ensemble of independently trained models. We pass the n-best candidates generated by four left-to-right models to four right-to-left models, and re-rank the n-best candidates based on their combined scores.
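The re-ranking can be sketched as summing per-direction scores over the n-best list; here a single score per direction stands in for each four-model ensemble, and all names are hypothetical:

```python
def rerank_r2l(candidates, l2r_scores, r2l_scores):
    # Re-rank n-best candidates by summing the left-to-right and
    # right-to-left model scores, returning the highest-scoring one.
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = l2r_scores[cand] + r2l_scores[cand]
        if score > best_score:
            best, best_score = cand, score
    return best
```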
Results and Analysis

Compared with Existing Methods
We evaluate the performance of our method on public benchmarks and compare the scores with the current top models which adopt data augmentation methods. Table 2 shows the results. Our method achieves the best F-scores on CoNLL-2014 and FCE benchmarks. It outperforms not only all previous single models but also all ensemble models. On BEA-2019 benchmark, our method achieves very competitive performance as well.

Table 2: Results of our method and existing methods (columns: Method, Augmented Data Size, scores). The second group shows the results of the ensemble models. Augmented data sizes show the amounts of additional training sentences used in each method. Bold indicates the highest score in each column.

Analysis of Data Augmentation Method
We evaluate the performance of our different data augmentation methods on the benchmarks. Results are shown in Table 3, where 'None' means no data augmentation and 'Both' means using both the representation-editing-based method and the rule-based method.

Table 3: Results of our different data augmentation methods.

As we can see, our data augmentation methods can improve the performance of the GEC model, especially the recall. A large amount of synthetic training samples enable the model to detect and correct more errors.
CoNLL-2014 is a typical benchmark, widely used for evaluating GEC models, and the dataset has been hand-corrected by professional English instructors. In view of this, we use the CoNLL-2014 dataset as an example in the following experiments.
We investigate the influence of the amount of synthetic training samples on performance. We pre-train the ensemble model with different amounts of synthetic samples. Considering the limitation of computing resources, we set the amounts to {1M, 2M, 4M, 8M, 16M}. As mentioned above, the pre-defined rules and the latent-representation-editing method each generate half of the synthetic samples in this experiment. The results in Figure 2 show that increasing the number of synthetic samples improves the performance of the GEC model, but the growth rate declines.

We further analyze the results on different error types. Note that due to the definition of precision on the CoNLL-2014 dataset, we cannot calculate precision for each error type, so we use recall to evaluate our different data augmentation methods. In the CoNLL-2014 dataset, 28 error types are defined; we list the recall on the top 9 error types, and the other 19 types are summarized under 'others'. Results are shown in Table 4.
As can be seen from the table, the correction abilities of the model differ greatly across error types. For example, the model without synthetic data corrects 53.3% of errors of the 'Noun number' type, but only 9.41% of errors of the 'Wrong Collocation/Idiom' type. Data augmentation can address this problem to some degree. The recall of local errors, such as 'Spelling, Punctuation, etc.' and 'Noun number', is improved by the rule-based method, while the representation-editing-based method improves performance on other error types, such as 'Verb tense' and 'Verb form'. With the assistance of pre-defined rules, our method achieves the highest recall on most error types. Our proposed method not only performs well in the overall evaluation score, but also greatly improves performance on most error types. The two data augmentation methods complement each other, and using both covers most error types and generates samples with high quality and diversity. This enables the model to detect and correct various errors, which meets the needs of practical applications.

Case Study
In this section, we use specific cases to analyze the influence of different data augmentation methods.
In Table 5, we present an example of synthetic samples generated by our different data augmentation methods. In this case, the rule-based method generates an ungrammatical sentence with a verb tense error by replacing the verb 'crashed' with its present tense. The editing-latent-representation-based method also generates a sample with a verb tense error. However, this error cannot be generated by simple pre-defined rules, and it is more complicated than a simple verb tense error: in fact, it is a grammatical error related to reported speech. Therefore, the editing-latent-representation-based method can generate errors that could not be produced by rules. With the help of editing latent representations, we can generate grammatical errors with high quality and diversity.

Table 5: Synthetic samples generated by different augmentation methods.
Grammatical sentence: An Israeli military helicopter crashed near the northern town of Afula, army radio said.
Pre-defined Rules: An Israeli military helicopter crashes near the northern town of Afula, army radio said.
Editing Latent Representation: An Israeli military helicopter has been crashed near the northern town of Afula, army radio said.
In Table 6, we present an example of corrections generated by the GEC model with different augmentation methods. The performance of the GEC model without data augmentation is poor: it cannot detect the error in 'would'. The model with the rule-based augmentation method can detect the error, but fails to correct it, only changing the form of the word. The model with the representation-editing-based method successfully corrects the error. However, it is still hard for our proposed method to correct sentences with multiple errors; this problem needs to be addressed in future work.

Table 6: Corrections generated by the GEC model trained with different augmentation methods.
Standard Correction: Although the problem [ would → may ] not be serious, people [ would → might ] still be afraid.
None: Although the [ problem → problems ] would not be serious, people would still be afraid.
Pre-defined Rules: Although the problem [ would → will ] not be serious, people [ would → will ] still be afraid.
Editing Latent Representation: Although the problem [ would → may ] not be serious, people would still be afraid.
Both: Although the problem [ would → may ] not be serious, people would still be afraid.

Conclusion
In this paper, we propose a data augmentation method that applies noise in the latent space. By editing latent representations of grammatical sentences, we can generate synthetic samples with diverse error types. These synthetic training samples further improve the performance and robustness of the GEC model and enable it to detect and correct most errors. We evaluate our method on public GEC benchmarks, and it achieves state-of-the-art performance on the CoNLL-2014 and FCE datasets.