Erroneous data generation for Grammatical Error Correction

Using a monolingual corpus in neural Grammatical Error Correction (GEC) systems has been shown to improve system performance significantly. The previous state-of-the-art neural GEC system is an ensemble of four Transformer models pretrained on a large amount of Wikipedia edits. The Singsound GEC system follows a similar approach but is equipped with a sophisticated erroneous data generating component. Our system achieved an F0.5 of 66.61 in the BEA 2019 Shared Task: Grammatical Error Correction. With our novel erroneous data generating component, the Singsound neural GEC system yielded an M2 of 63.2 on the CoNLL-2014 benchmark (8.4% relative improvement over the previous state-of-the-art system).


Introduction
The most effective approaches to the Grammatical Error Correction (GEC) task are machine translation based methods. Both Statistical Machine Translation (SMT) approaches and Neural Machine Translation (NMT) methods have achieved promising results in the GEC task.
Pretraining a decoder as a language model is an effective method to improve the performance of neural GEC systems (Junczys-Dowmunt et al., 2018). As an extension of this work, Lichtarge et al. (2018) showed pretraining on 4 billion tokens of Wikipedia edits to be beneficial for the GEC task.
In this work, we investigate a similar approach by systematically generating parallel data for pretraining. As shown in Table 1, in addition to spelling errors (price → puice), transposition errors (independent voters → voters independent) and concatenation errors (the man → theman), our method also introduces errors such as ramped → ramping. Our approach obtained competitive results compared to the top systems in the BEA 2019 GEC Shared Task. Both our single model and ensemble models have exceeded the previous state-of-the-art systems on the CoNLL-2014 (Ng et al., 2014) benchmark and our system reaches human-level performance on the JFLEG (Napoles et al., 2017) benchmark.

Table 1
Origin: the primary is open to independent voters .
Generated: the primary is opens to voters independhent .
Origin: the price of alcohol is ramped up at every budget .
Generated: the puice of alchool is ramping up at every budget .
Origin: they say the police shot and killed the man after he had fired at them .
Generated: they say the polices shot and killed theman after he had firing at them .

Related Work
Chollampatt and Ng (2018) used a convolutional sequence-to-sequence (seq2seq) model (Gehring et al., 2017) with a large language model for rescoring. Their model was the first NMT based GEC system that exceeded the strong SMT baseline system (Junczys-Dowmunt and Grundkiewicz, 2016) which combined a Phrase-based Machine Translation (PBMT) with a large language model. Then a hybrid PBMT-NMT system (Grundkiewicz and Junczys-Dowmunt, 2018) appeared to reach the new state-of-the-art on the CoNLL-2014 benchmark. Later, various pure neu-

Data
We list the training data in

Erroneous Data Generation
In this section, we describe our error generating method. For each sentence, we assign a probability distribution (as shown in Table 4) to determine the number of errors according to the sentence length. The parameters in Table 4 are determined empirically, as well as the parameters in Table 5, Table 6 and Table 7. Because of the time limitation of the GEC competition, we did not optimize these parameters.
After the number of errors (E) in a sentence has been determined, we randomly select E tokens from all the tokens of the sentence with equal probability to be errors. For each error, we sample a random variable (Table 5) to determine its error type.
• Substitution: we introduce seven different types of substitutions.
• Transposition: the token exchanges position with an adjacent token.
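The sentence-level procedure above can be sketched as follows. This is a minimal illustration only: the probability values stand in for Tables 4 and 5 and are hypothetical, the error type names are taken from the error kinds mentioned in the text, and `plan_errors` and `sample_from` are names introduced here.

```python
import random

# Hypothetical stand-in for Table 4: (max sentence length, {number of errors: probability}).
ERROR_COUNT_BY_LENGTH = [
    (5, {0: 0.5, 1: 0.5}),
    (20, {0: 0.2, 1: 0.4, 2: 0.4}),
    (float("inf"), {1: 0.3, 2: 0.4, 3: 0.3}),
]

# Hypothetical stand-in for Table 5: probability of each error type.
ERROR_TYPE_PROBS = {
    "concatenation": 0.1,
    "misspelling": 0.3,
    "substitution": 0.4,
    "transposition": 0.2,
}


def sample_from(dist, rng):
    """Draw one outcome from a {outcome: probability} mapping."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def plan_errors(tokens, rng):
    """Return {token index: error type} for one sentence."""
    # Step 1: the number of errors E depends on the sentence length.
    for max_len, dist in ERROR_COUNT_BY_LENGTH:
        if len(tokens) <= max_len:
            num_errors = sample_from(dist, rng)
            break
    num_errors = min(num_errors, len(tokens))
    # Step 2: choose E distinct token positions with equal probability.
    positions = rng.sample(range(len(tokens)), num_errors)
    # Step 3: draw an error type for each selected position.
    return {pos: sample_from(ERROR_TYPE_PROBS, rng) for pos in sorted(positions)}
```

For example, `plan_errors("the primary is open to independent voters .".split(), random.Random(0))` returns a mapping from at most two token positions to error types under these placeholder parameters.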

Misspelling
To generate misspellings, we introduce a random variable that determines the number of errors in a token according to the token length (parameters are shown in Table 6), and we randomly insert errors into the token. For each spelling error, we apply another random variable to determine its error type. We introduce four spelling error types (Table 7 lists the parameters).
• Deletion: delete the character.
• Insertion: insert a random English letter into the current position.
• Transposition: exchange position with the consecutive character.
• Replacement: replace the current character with a random English character.
We only introduce spelling errors into words belonging to a vocabulary list of 32k ordinary words, which does not include numerals (e.g., 2019), tokens that contain digits (e.g., Lang8), URLs or non-word symbols (e.g., ≥ ≤).
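The four character-level operations and the vocabulary restriction can be sketched as below. The uniform probabilities are a hypothetical stand-in for Table 7, the number of errors per token (Table 6) is passed in as an argument rather than sampled, and the function names are introduced here for illustration.

```python
import random
import string

# Hypothetical stand-in for Table 7: probability of each spelling error type.
SPELLING_ERROR_PROBS = {
    "deletion": 0.25,
    "insertion": 0.25,
    "transposition": 0.25,
    "replacement": 0.25,
}


def apply_spelling_error(word, pos, kind, rng):
    """Apply one spelling error of the given kind at character index pos."""
    if kind == "deletion":
        return word[:pos] + word[pos + 1:]
    if kind == "insertion":
        return word[:pos] + rng.choice(string.ascii_lowercase) + word[pos:]
    if kind == "transposition":
        if len(word) < 2:
            return word
        # The last character has no successor, so swap it with its predecessor.
        pos = min(pos, len(word) - 2)
        return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    if kind == "replacement":
        return word[:pos] + rng.choice(string.ascii_lowercase) + word[pos + 1:]
    raise ValueError(f"unknown spelling error type: {kind}")


def misspell(word, num_errors, vocabulary, rng):
    """Insert num_errors spelling errors into word if it is in the vocabulary."""
    if word not in vocabulary:
        return word  # numerals, tokens with digits, URLs, symbols are left alone
    kinds = list(SPELLING_ERROR_PROBS)
    weights = list(SPELLING_ERROR_PROBS.values())
    for _ in range(num_errors):
        if not word:
            break
        pos = rng.randrange(len(word))
        kind = rng.choices(kinds, weights=weights, k=1)[0]
        word = apply_spelling_error(word, pos, kind, rng)
    return word
```

Note that a replacement may by chance draw the original character, so a sampled error does not always change the token in this sketch.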

Substitution
We introduce seven types of substitutions according to the token and its part-of-speech (POS).
• Substitution in a Word Tree (see 4.3 for details).

Word Tree
We want to make substitutions such as going → gone, useful → usable, administration → administrative. To make such substitutions possible, we introduce the Word Tree. A Word Tree represents a group of words that share the same stem but have different suffixes. Figure 1 shows an example Word Tree for "use". A node denotes a word (e.g., usable) and its corresponding Extended part-of-speech (EPOS) (e.g., VBP JJ BLE) (see 4.4 for details), and an edge indicates the root from which the word is derived (e.g., "usable" is derived from "use").
With EPOS, we can easily set rules or assign probability distributions to determine which substitutions are more likely to happen (e.g., singular ↔ plural, VBD ↔ VBZ ↔ VBP ↔ VBN ↔ VBG, adjective ↔ adverb), and which substitutions are less likely to happen (e.g., happiest JJS ↔ happiness JJ NN). In our experiments, due to the time limitation of the competition, we simply assigned a uniform distribution to all existing words in a Word Tree, excluding substitutions that were definitely unlikely to occur such as substitutions between the words in an NN JJ F branch (e.g., careful) and the words in an NN JJ L branch (e.g., carelessness).
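A Word Tree and the uniform substitution over it might be represented as in the sketch below. The class, the helper names, the shape of the example tree and the tags on its nodes are simplified placeholders introduced here; they are not the EPOS labels of Table 11 or the exact tree of Figure 1.

```python
import random
from dataclasses import dataclass, field


@dataclass
class WordTreeNode:
    word: str
    epos: str  # placeholder tag, not an actual EPOS label
    children: list = field(default_factory=list)

    def add(self, word, epos):
        """Attach a word derived from this node and return the new node."""
        child = WordTreeNode(word, epos)
        self.children.append(child)
        return child

    def walk(self):
        """Yield this node and all nodes derived from it."""
        yield self
        for child in self.children:
            yield from child.walk()


def build_use_tree():
    """Build a small illustrative tree for the stem 'use'."""
    root = WordTreeNode("use", "VB")
    root.add("used", "VBD")
    root.add("using", "VBG")
    root.add("user", "NN")
    root.add("usable", "JJ")
    useful = root.add("useful", "JJ")
    useful.add("usefulness", "NN")
    return root


def substitute(tree, word, rng, blocked=frozenset()):
    """Pick a different word from the same tree uniformly at random.

    blocked holds frozenset pairs of words whose substitution is excluded.
    """
    candidates = [
        node.word
        for node in tree.walk()
        if node.word != word and frozenset((word, node.word)) not in blocked
    ]
    return rng.choice(candidates) if candidates else word
```

Calling `substitute(build_use_tree(), "useful", random.Random(0))` returns one of the other six words in the tree. Non-uniform EPOS-based preferences would replace `rng.choice` with a weighted draw.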

Extended part-of-speech
A Word Tree can contain multiple words of the same POS. As shown in the example in Figure 1, use, user and usefulness can all be nouns. Therefore, in order to identify the different roles for words in a Word Tree, we propose EPOS, derived from part-of-speech (POS) and the surface form of the word. POS explains how a word is used (mostly syntactically) in a sentence. Compared to POS, EPOS also reflects some semantic role of a word in a sentence. We define EPOS in Table 11 in the Appendix. We use NLTK (Bird, 2006) as our POS tagger, and use NLTK-style tags in this paper.
We briefly describe our method of creating word trees.   gree. Then we fill words into corresponding entries according to their POS tags. Words that cannot be filled in any of the above tables are filled into a list.
c. Manually check and correct all entries of the three tables, and fill missing entries as well.
d. Define EPOS as listed in Table 11 in the Appendix according to suffix transforming rules.
e. Extract a RAW list from the vocabulary according to the suffix transforming rules.
f. Create an EPOS tree structure for each token in the RAW list, and then fill each word from the vocabulary into the corresponding entry of the corresponding EPOS tree (The full structure of the EPOS Tree is described in Table 12 in the Appendix, and Figure 2 shows the Verb branch); then prune empty entries in the trees.
g. Manually check every entry of every word tree, and fix all incorrect entries.
h. Update the defined EPOS (final version in Table 11) and the EPOS tree (Table 12); recreate word trees.
i. Repeat steps g and h until satisfied.

Experiments
In our experiments, we generated a corpus of 3 billion tokens, of which about 24% were errors.
Following Lichtarge et al. (2018), we also use the Transformer as our encoder-decoder model, using the Tensor2Tensor open source implementation.
The models are trained on words, and rare words are segmented into sub-words with byte pair encoding (BPE) (Sennrich et al., 2015). We use 6 layers for both encoder and decoder, and 4 attention heads. The embedding size and hidden size are 1024, and the filter size for all position-wise feed-forward networks is 4096. We set the dropout rate to 0.3, and source word dropout is set to 0.2 as a noising technique. Following Junczys-Dowmunt et al. (2018), source, target and output embeddings are tied in our models.
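The text does not spell out how source word dropout is applied. The sketch below assumes one common reading, in which the whole embedding vector of a source position is set to zero with the given probability; the function name and the plain-list representation are introduced here for illustration.

```python
import random


def source_word_dropout(embeddings, rate, rng):
    """Zero the entire embedding vector of each source position with probability rate.

    embeddings is a list of per-token vectors (lists of floats).
    This is an assumed interpretation, not a confirmed description of the system.
    """
    return [
        [0.0] * len(vec) if rng.random() < rate else list(vec)
        for vec in embeddings
    ]
```

With `rate=0.2`, roughly one source token in five would be hidden from the encoder during training, which is the noising effect the setting above refers to.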
Following Lichtarge et al. (2018), we first trained our model on an artificially generated parallel corpus with a batch size of approximately 3072 tokens. Then we set the batch size to 2048 tokens and fine-tuned on human annotated data for 20 epochs, and we averaged the 5 best checkpoints. Finally, the averaged model was fine-tuned on the ABCN and FCE training data for 1000 steps as domain adaptation (Junczys-Dowmunt et al., 2018).
About 50% of the sentence pairs in the Lang-8 dataset contain no correction, and we noticed that training with too many error-free sentence pairs had a negative effect. Therefore, we filtered out these error-free sentence pairs in the Lang-8 dataset. Since the NUCLE, FCE and ABCN datasets are much smaller than the Lang-8 set, we did not filter out the error-free sentence pairs in these datasets.
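The filtering step amounts to dropping pairs whose source and target are identical. A minimal sketch, assuming the data is held as (source, target) string pairs and using a function name introduced here:

```python
def drop_error_free_pairs(pairs):
    """Keep only sentence pairs whose target differs from the source."""
    return [(src, tgt) for src, tgt in pairs if src.strip() != tgt.strip()]
```

Under the setup described above, this filter would be applied to the Lang-8 pairs only, with the NUCLE, FCE and ABCN pairs left unfiltered.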
We used beam search for decoding with a beam size of 4 at evaluation time. For the ensemble, we averaged logits from 4 Transformer models with identical hyper-parameters at each decoding step. Following Grundkiewicz and Junczys-Dowmunt (2018), Junczys-Dowmunt et al. (2018) and Lichtarge et al. (2018), we preprocessed the JFLEG dataset with spell-checking. We did not apply spell-checking to the ABCN and CoNLL-2014 datasets.
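Logit averaging at a single decoding step can be sketched as follows. The model interface (a callable mapping a target prefix to a list of vocabulary logits) and all function names are assumptions made for illustration; the sketch shows only the per-step combination, not the full beam search.

```python
def average_logits(per_model_logits):
    """Average vocabulary logits across models, position by position."""
    n = len(per_model_logits)
    return [sum(column) / n for column in zip(*per_model_logits)]


def ensemble_step(models, prefix):
    """Combine the next-token logits of all models for one target prefix.

    Each model is a hypothetical callable: prefix -> list of vocabulary logits.
    """
    return average_logits([model(prefix) for model in models])


def top_candidates(logits, beam_size):
    """Return the indices of the beam_size highest-scoring vocabulary entries."""
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    return order[:beam_size]
```

In a beam search with beam size 4, `ensemble_step` would be called once per live hypothesis at every step, and the averaged scores would be used to expand and prune the beam.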

Results and Discussion
The results of the Singsound System in the GEC competition (Table 8) were obtained by an ensemble of four models. Because of the time limitation, we only trained two independent models from scratch. The other two were based on existing trained models. Concretely, after we got a model trained from scratch, we kept training it on the generated corpus for 0.2 epochs; then fine-tuned the model on the annotated data and ABCN and FCE training sets as before.
We provide the performance of our single model and the ensemble of 4 independently trained models on the ABCN dev and test datasets in Table 9. As shown in Table 9, models pretrained on the generated corpus significantly outperform the models without pretraining.
To compare with previous state-of-the-art GEC systems, we evaluated our systems on the CoNLL-2014 and JFLEG datasets. As shown in Table 10, our single model exceeded previous state-of-the-art systems on the CoNLL-2014 dataset. Our ensemble models achieved 8.4% relative improvement over the latest state-of-the-art results on the CoNLL-2014 benchmark.
We also report the results on the CoNLL-2014 10 annotation dataset (denoted as CoNLL-10) (Bryant and Ng, 2015) which is an extension of the CoNLL-2014 test set with 10 annotators. The human-level scores are calculated by averaging the scores for each annotator with regard to the remaining annotators. Following Chollampatt and Ng (2017), scores on CoNLL-10 (SvH) are calculated by removing one set of human annotations at a time and evaluating the system against the remaining sets. Our models reach human-level performance on both CoNLL-10 and JFLEG benchmarks.
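The two leave-one-out computations can be sketched as below. The `score` argument is a hypothetical callable that compares one set of corrections against a list of reference annotation sets (for instance a wrapper around an F0.5 scorer), and averaging the system scores over the held-out rounds is an assumption of this sketch.

```python
from statistics import mean


def human_level_score(annotations, score):
    """Average the score of each annotator against the remaining annotators."""
    return mean(
        [
            score(annotations[i], annotations[:i] + annotations[i + 1:])
            for i in range(len(annotations))
        ]
    )


def system_vs_human_score(system_output, annotations, score):
    """Score the system against the references with one annotator removed at a time."""
    return mean(
        [
            score(system_output, annotations[:i] + annotations[i + 1:])
            for i in range(len(annotations))
        ]
    )
```

Both functions evaluate against the same number of references in every round, so that the human and system scores are compared under like conditions.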

Conclusion
In this work, we present a novel erroneous data generating method for training English GEC models. Our experiments show that Transformer models pretrained on the generated corpus significantly outperform the previous GEC systems that are also based on the Transformer. We also present a novel tool: the Word Tree, which represents a group of words that share the same stem but have different suffixes; and we show that one possible application of the Word Tree is generating erroneous text for training GEC models.