Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection

Grammatical error correction, like other machine learning tasks, greatly benefits from large quantities of high quality training data, which is typically expensive to produce. While writing a program to automatically generate realistic grammatical errors would be difficult, one could learn the distribution of naturally occurring errors and attempt to introduce them into other datasets. Initial work on inducing errors in this way using statistical machine translation has shown promise; we investigate cheaply constructing synthetic samples, given a small corpus of human-annotated data, using an off-the-rack attentive sequence-to-sequence model and a straightforward post-processing procedure. Our approach yields error-filled artificial data that helps a vanilla bi-directional LSTM to outperform the previous state of the art at grammatical error detection, and a previously introduced model to gain further improvements of over 5% F0.5 score. When attempting to determine if a given sentence is synthetic, a human annotator at best achieves 39.39 F1 score, indicating that our model generates mostly human-like instances.


Introduction
There is an ever-growing number of people learning English as a second language; providing them with quick feedback to facilitate their learning is a crucial, labour-intensive endeavour. Part of this process is identifying and correcting grammatical errors, and several computational techniques have been developed to automate it (Rozovskaya and Roth, 2014; Junczys-Dowmunt and Grundkiewicz, 2016). For example, given an erroneous sentence "I wanted to goes to the beach", the grammatical error correction task is to output the valid sentence "I wanted to go to the beach". The task can be cast as a two-stage process, detection and correction, which can either be performed sequentially (Yannakoudakis et al., 2017), or jointly (Napoles and Callison-Burch, 2017).
Automated error correction performance is arguably still too low for practical consideration, perhaps limited by the amount of training data. High quality annotations are expensive to procure, and foreign language learners and commercial entities may feel uncomfortable granting access to their data. Instead, one could attempt to supplement existing manual annotations with synthetic instances. Such artificial samples are beneficial only when they share structure with the true distribution from which human errors are generated. Generative Adversarial Networks (Goodfellow et al., 2014) could be used for this purpose, but they are difficult to train, and require a large collection of sentences that are incorrect. One might attempt self-training (McClosky et al., 2006), where new instances are generated by applying a trained model to unannotated data, using high-confidence predictions as ground truth labels. However, in such a scheme, the expectation is that the unlabelled text already contains errors, which is not usually the case for most freely available text such as Wikipedia articles, as they strive towards correctness.
In place of using machine translation (MT) to correct grammatical mistakes (Yuan and Felice, 2013; Junczys-Dowmunt and Grundkiewicz, 2014; Yuan and Briscoe, 2016), one might consider swapping the input and output streams, and instead learn to induce errors into error-free text, for the purpose of creating a synthetic training dataset (Felice and Yuan, 2014). Recently,  used a statistical MT (SMT) system to induce errors into error-free text. Building on this work, and leveraging recent advances in neural MT (NMT), we used an off-the-shelf attentive sequence-to-sequence model (Britz et al., 2017), eliminating the need for specialised software such as a phrase-table generator, decoder, and part-of-speech tagger. We created multiple synthetic datasets from in-domain and out-of-domain sources, and found that stochastic token sampling, and pruning redundant and low-likelihood sentences, were helpful in generating meaningful corruptions. Using the artificial samples thus generated, we improved upon detection results with simply a vanilla bi-directional LSTM (Hochreiter and Schmidhuber, 1997). Using a more powerful model, we established new state-of-the-art results that improve on previously published F0.5 scores by over 5%. Additionally, we confirm that our generated instances are human-like, as an annotator identifying generated sentences achieved a maximum F1 score of 39.39.

Related work
In computer vision, images are blurred, rotated, or otherwise deformed inexpensively to create new training instances (Wang and Perez, 2017), because such manipulation does not significantly alter the image semantics. Similar coarse processes do not work in NLP since mutating even a single letter or a word can change a sentence's meaning, or render it nonsensical. Nonetheless, Vinyals et al. (2015) employed a kind of self-training where they use noisy predictions for unlabelled instances output by existing state-of-the-art parsers as ground-truth labels, and improved syntactic parsing performance. Sennrich et al. (2016) synthesised training instances by round-trip-translating a monolingual corpus with weaker versions of an NMT learner, and used them to improve the translation. Bouchard et al. (2016) developed an efficient algorithm to blend generated and true data for improving generalisation.
Grammar correction is a well-studied task in NLP, and early systems were rule-based pattern recognisers (Macdonald, 1983) and dictionary-based linguistic analysis engines (Richardson and Braden-Harder, 1988). Later systems used statistical approaches, addressing specific kinds of errors such as article insertion (Knight et al., 1994) and spelling correction (Golding and Roth, 1996). Most recently, architectural innovations in neural sequence labelling (Rei, 2017) raised error detection performance through improved ability to process unknown words and jointly learning a language model. Early efforts for artificial error generation included generating specific types of errors, such as mass noun errors (Brockett et al., 2006) and article errors (Rozovskaya and Roth, 2010), and leveraging linguistic information to identify error patterns and transfer them onto grammatically correct text (Foster and Andersen, 2009; Yuan and Felice, 2013). Imamura et al. (2012) investigated methods to generate pseudo-erroneous sentences for error correction in Japanese. Recently,  corrupted error-free text using SMT to create training instances for error detection.

Neural error generation
To learn to introduce errors, we use an off-the-shelf attentive sequence-to-sequence neural network (Bahdanau et al., 2014). Given an input sequence, the encoder generates context vectors for each token. Then, the attention mechanism and the decoder work in tandem to emit a distribution over the target vocabulary. At every decoder time-step, the encoder context vectors are scored by the attention mechanism, and a weighted sum is supplied to the decoder, along with its propagated internal state and last output symbol.
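The attention step described above can be illustrated with a minimal sketch. It assumes a plain dot-product scoring function for brevity, which is a simplification and not necessarily the scoring function of the model used here (Bahdanau et al. score with a small feed-forward network). The function name `attend` and its arguments are hypothetical.

```python
import math


def attend(decoder_state, encoder_contexts):
    """Score each encoder context vector and return (weights, weighted sum).

    Assumption: scores are dot products between the decoder state and each
    context vector; real attentive models typically learn the scoring function.
    """
    scores = [
        sum(s * c for s, c in zip(decoder_state, context))
        for context in encoder_contexts
    ]
    # Softmax over the scores, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the context vectors, one value per dimension.
    dim = len(encoder_contexts[0])
    summary = [
        sum(w * context[d] for w, context in zip(weights, encoder_contexts))
        for d in range(dim)
    ]
    return weights, summary
```

The returned summary vector plays the role of the weighted sum that is passed to the decoder at each time-step.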
Corruption: Tokens from this distribution are sampled at every decoder time-step, either by argmax (AM), which emits the most likely word, or by a stochastic alternative such as temperature sampling (TS), as argmax cannot be relied on to generate rare words. A temperature parameter τ > 0 sharpens or softens the distribution: $\tilde{p}_i = p_i^{1/\tau} / \sum_j p_j^{1/\tau}$, where $\tilde{p}_i$ are the components of the adjusted probability distribution corresponding to words in the vocabulary. As one interpolates τ from 0 to 1, the behaviour of $\tilde{p}$ transitions from argmax to p, controlling the diversity of the generated tokens. The sentence generated by TS might be a low probability sequence from the joint conditional distribution P(v|u), where u is the input sentence and v is the output sentence. One way around this is to use beam search (BS), which checks the likelihood of every possible continuation of a sentence fragment, and maintains a list of the n best translations generated up to the current time-step. AM, TS, and BS are indicative of the trade-off between increasing levels of model flexibility at the cost of computation; we compare them to assess whether the additional computations were helpful in creating high-quality synthetic instances.
Post-processing: Original and corrupted sentences are aligned at a word-level using Levenshtein distance. Using the minimal alignment, words in the corrupted sentence are labelled correct, 'c', or incorrect, 'i', as follows: If the word is not aligned with itself, then 'i'. Else, if following a gap, then 'i', as at this point a human reader would notice that there is a word missing in the sentence. Else, if it is the last word, but it is not aligned to the last word of the source sentence, then 'i', as a human would realise that this sentence ends abruptly. Else, 'c'.
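As an illustration of temperature sampling, the sketch below assumes the adjusted distribution is obtained by raising each probability to the power 1/τ and renormalising, which matches the limiting behaviour described above (argmax as τ approaches 0, the unmodified p at τ = 1). The function names are hypothetical and this is not the code used in the experiments.

```python
import math
import random


def apply_temperature(p, tau):
    """Return the distribution with components p_i^(1/tau) / sum_j p_j^(1/tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Work in log space for stability; zero-probability words stay at zero.
    logs = [math.log(x) / tau if x > 0 else float("-inf") for x in p]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(p, tau, rng=random):
    """Draw one vocabulary index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(p, tau)
    return rng.choices(range(len(p)), weights=adjusted, k=1)[0]
```

With a small τ the adjusted distribution concentrates almost all of its mass on the most likely word, so sampling behaves nearly like argmax while still occasionally emitting a less likely token.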
These token-labelled corrupted sentences now form an artificial dataset for training an error detector. Duplicate instances and corrupted sentences with more than 5 errors were dropped to remove noise from the downstream training.
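The labelling rules and the pruning step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: ties between equally short alignments are broken arbitrarily (substitution first), an "error" is counted as one token labelled 'i', and the function names `align`, `label_tokens`, and `prune` are hypothetical.

```python
def align(source, corrupted):
    """Word-level Levenshtein alignment; returns ops turning source into corrupted."""
    n, m = len(source), len(corrupted)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == corrupted[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Trace back one minimal path from the bottom-right corner.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and source[i - 1] == corrupted[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append("match" if same else "sub")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("del")  # a source word is missing from the corrupted sentence
            i -= 1
        else:
            ops.append("ins")  # an extra word appears in the corrupted sentence
            j -= 1
    ops.reverse()
    return ops


def label_tokens(source, corrupted):
    """Label each corrupted token 'c' (correct) or 'i' (incorrect)."""
    labels, after_gap = [], False
    for op in align(source, corrupted):
        if op == "del":
            after_gap = True
            continue
        # Only a word aligned with itself and not following a gap is correct.
        labels.append("c" if op == "match" and not after_gap else "i")
        after_gap = False
    if labels and after_gap:
        # The sentence ends abruptly: trailing source words were dropped.
        labels[-1] = "i"
    return labels


def prune(instances, max_errors=5):
    """Drop duplicate sentences and sentences with more than max_errors 'i' labels."""
    seen, kept = set(), []
    for tokens, labels in instances:
        key = tuple(tokens)
        if key in seen or labels.count("i") > max_errors:
            continue
        seen.add(key)
        kept.append((tokens, labels))
    return kept
```

For the example from the introduction, aligning "I wanted to go to the beach" with "I wanted to goes to the beach" marks only "goes" as incorrect.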

Experiments
We evaluated our approach on the First Certificate of English (FCE) error detection dataset (Rei and Yannakoudakis, 2016), as well as on two human-annotated test sets (CoNLL1, CoNLL2) from the CoNLL 2014 shared task (Ng et al., 2014). The CoNLL data sets pose a unique challenge; as they are different in style and domain from FCE, we have no matching training data. We compared the effect of different neural generation procedures (AM, TS, BS) and contrasted the downstream performance of a bidirectional LSTM with an elaborate sequence labeller.

Implementation details
NMT training and corruption: We minimally modified the open source implementation of Britz et al. (2017) to implement TS and BS. We trained our NMT with a single-layered encoder and decoder with cell size 256, on the parallel corpus version of FCE (Yannakoudakis et al., 2011), with early stopping after the FCE development set score dropped consistently for 20 epochs. We introduced errors into three datasets: FCE itself (450K tokens), the English Vocabulary Profile or EVP (270K tokens) and a subset of Simple Wikipedia or SW (8.4M tokens); of these, FCE and EVP were both used in artificial error generation via SMT and pattern extraction (PAT) by , enabling us to make a fair experimental comparison. Ten corrupted versions using each of AM, TS (τ = 0.05) and BS were sampled for FCE and EVP corruptions, while one sufficed for SW. The theoretical time complexity of BS is O(bn) for each sentence, where b is the number of candidates, and n is the maximum length of a sentence. Empirically, BS with b = 11 took a factor of 11.3 more time than AM. Examples of generated errors are provided in Table 1.
Error detection: We compare two error detection models: a vanilla bi-directional LSTM (BiLSTM) (Schuster and Paliwal, 1997), and the state-of-the-art sequence labeller (SL) neural network used by . These models were trained on the binary-labelled FCE training set augmented with the corrupted instances. Wherever no model is explicitly stated, the SL model was used. During training, we alternate between the annotated FCE dataset and the synthetic collection. This alternating protocol prevents overfitting on FCE; once it shifts back, it reinforces connections made from the helpful synthetic corruptions while forgetting about the noisy ones.
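To make the cost of beam search concrete, the following is a generic sketch, not the modified implementation described above. It assumes a hypothetical `step_fn` that maps a prefix to a dictionary of next-token probabilities; the `toy_model` table is invented purely for illustration. Each of at most `max_len` steps expands at most `beam_size` hypotheses, which is where the b and n factors come from.

```python
import math


def beam_search(step_fn, start, end, beam_size, max_len):
    """Return up to beam_size (log-probability, sequence) pairs, best first."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            # Extend every live hypothesis with every possible next token.
            for token, prob in step_fn(prefix).items():
                if prob > 0:
                    candidates.append((score + math.log(prob), prefix + [token]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, sequence in candidates[:beam_size]:
            # Hypotheses that emitted the end symbol are set aside.
            if sequence[-1] == end:
                finished.append((score, sequence))
            else:
                beams.append((score, sequence))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    finished.sort(key=lambda item: item[0], reverse=True)
    return finished[:beam_size]


def toy_model(prefix):
    """Hypothetical next-token distribution that depends only on the last token."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"</s>": 0.5, "b": 0.5},
        "b": {"</s>": 0.9, "a": 0.1},
    }
    return table[prefix[-1]]


best_score, best_sequence = beam_search(toy_model, "<s>", "</s>", beam_size=2, max_len=3)[0]
```

In this toy example, argmax decoding would commit to "a" at the first step, whereas a beam of two recovers the more probable complete sequence that starts with "b".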
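The alternating protocol can be sketched as a simple schedule. The granularity of alternation (whole epochs here) and the choice to start with the annotated data are assumptions made only for illustration; the function name `alternating_epochs` is hypothetical.

```python
def alternating_epochs(annotated, synthetic, num_epochs):
    """Yield (name, dataset) per epoch, switching between the two collections.

    Assumption: alternation happens once per epoch, starting with annotated data.
    """
    for epoch in range(num_epochs):
        if epoch % 2 == 0:
            yield "annotated", annotated
        else:
            yield "synthetic", synthetic
```

A training loop would iterate over this generator and run one pass of the detector over whichever dataset is yielded.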

Results
The results for our baselines and data augmentation strategies can be found in Table 2. Augmented with our NMT generated data, even our vanilla downstream BiLSTM outperforms the SMT+PAT artificial error augmentation approach of , indicating that our process better generalises the error information in the source dataset.
Using the more powerful SL network bests the previous state of the art by over 5% on the FCE test. Most intriguingly, we note a significant improvement for the CoNLL tests using corruptions from out-of-domain SW. Figure 2 illustrates how we gain performance on these tests with increasing amounts of corrupted SW, which does not hold true for corrupted FCE. This shows that we were able to induce useful errors into a corpus with a large unseen vocabulary and different syntactic biases, and this in turn proved valuable for detecting errors in a third domain, suggesting that our method can transfer learned distributions across stylistic genres. Using EVP as a standard source, Figure 1 illustrates the variance of the different sampling methods. All generation methods yield corruptions that significantly improve test performance, with instances sampled by beam-search consistently outperforming the alternatives.

Error distribution
The original FCE dataset was annotated using the error taxonomy specified in Nicholls (2003), and contains 75 unique error codes. We annotated samples of EVP corrupted by all three sampling methods, at a reduced resolution, to compare the distribution of errors across FCE and the synthetic corpora. These are presented in Table 3.
At a high level, NMT generates errors more often among more common parts-of-speech, favouring errors in verbs and nouns, rather than in adverbs and conjunctions. It did not make spelling errors as often as in the source dataset; this is likely because it only observed the specific spelling errors present in FCE, and as the vocabulary is restricted to that dataset, it does not encounter those words as frequently in EVP, and thus rarely makes the same spelling mistakes.
Additionally, the differences in these distributions can partially be attributed to the implicit differences between us and the annotators of FCE.

Comparison with human errors
To check if the synthetic instances passed for human-like, we mixed 50 generated sentences among an equal number of actual ungrammatical instances from FCE-dev and tasked a human evaluator to identify the artificial statements, in a simple Turing-style test. We created three such sets, one for each of our sampling techniques, and the test subject aimed to identify synthetic samples with high confidence. Results of this test are presented in Table 4.
The high precision but low recall scores suggest that while it is still possible to spot some corruptions that are quite clearly artificial, the bulk of our samples do not betray their synthetic nature and are indistinguishable from naturally occurring erroneous sentences. In order to fairly compare our work with earlier results, we intended to conduct such a test for sentences generated by the SMT of . Unfortunately, we were only able to source corruptions of FCE-train via this method; therefore, we decided not to perform this test as its results cannot be compared to ours.
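For reference, the F1 and F0.5 scores used throughout combine precision and recall through the standard F-beta measure. The short sketch below shows how a high-precision, low-recall outcome translates into a modest F1; the function name `f_beta` is hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 precision and recall are weighted equally, while beta = 0.5 weights precision more heavily, which is why F0.5 is higher than F1 whenever precision exceeds recall.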

Conclusions and future work
We presented a novel data augmentation technique for grammatical error detection using neural machine translation to learn the distribution of language-learner errors, and induce such errors into grammatically correct text. We explored several different variants of sampling to improve the quality of our synthetic errors. After creating artificial training instances with an off-the-shelf NMT, we bettered previous state-of-the-art results on the canonical test with even a basic BiLSTM, and established a new state of the art using a stronger model. Additionally, we demonstrated that we were able to leverage corruptions of an out-of-domain dataset to set new benchmarks on separate, also out-of-domain tests, without specifically optimising for either. Our work indicates that neural error generation warrants further investigation with different datasets and architectures, both for error detection and error correction. Among possible future work is using generative adversarial networks as corruption engines, and developing better sequence alignment methods. Some preliminary results with simple corruptions using word substitution and word dropout (Iyyer et al., 2015) appear to be promising, and may feature as components of a future corruption system. Finally, one could use such artificial error-prone corpora as source text for self-training an error detection system.