Language Model Based Grammatical Error Correction without Annotated Training Data

Since the end of the CoNLL-2014 shared task on grammatical error correction (GEC), research into language model (LM) based approaches to GEC has largely stagnated. In this paper, we re-examine LMs in GEC and show that it is entirely possible to build a simple system that not only requires minimal annotated data (∼1000 sentences), but is also fairly competitive with several state-of-the-art systems. This approach should be of particular interest for languages where very little annotated training data exists, although we also hope to use it as a baseline to motivate future research.


Introduction
In the CoNLL-2014 shared task on Grammatical Error Correction (GEC), the top three teams all employed a combination of statistical machine translation (SMT) or classifier-based approaches (Junczys-Dowmunt and Grundkiewicz, 2014; Felice et al., 2014; Rozovskaya et al., 2014). These approaches have since come to dominate the field, and much recent research has focused on fine-tuning SMT systems (Junczys-Dowmunt and Grundkiewicz, 2016), reranking SMT output, combining SMT and classifier systems (Rozovskaya and Roth, 2016), and developing various neural architectures (Xie et al., 2016; Chollampatt and Ng, 2017; Yannakoudakis et al., 2017).
Despite an LM-based system placing a fairly competitive fourth in the shared task (Lee and Lee, 2014), however, research into language model (LM) based approaches to GEC has largely stagnated. The main aim of this paper is hence to re-examine language modelling in the context of GEC and show that it is still possible to achieve competitive results even with very simple systems. In fact, a notable strength of LM-based approaches is that they rely on very little annotated data (purely for tuning purposes), and so it is entirely possible to build a reasonable correction system for any language given enough native text. In contrast, this is simply not possible with SMT and other popular approaches, which invariably require (large amounts of) labelled data.

Methodology
The core idea behind language modelling in GEC is that low probability sequences are more likely to contain grammatical errors than high probability sequences. For example, *discuss about the problem is expected to be a low probability sequence because it contains an error, while discuss the problem or talk about the problem are expected to be higher probability sequences because they do not contain errors. The goal of LM-based GEC is hence to determine how to transform the former into the latter based on LM probabilities. With this in mind, our approach is fundamentally a simplification of the algorithm proposed by Dahlmeier and Ng (2012a). It consists of five steps and is illustrated in Table 1:

1. Calculate the normalised log probability of an input sentence.
2. Build a confusion set, if any, for each token in that sentence.
3. Re-score the sentence substituting each candidate in each confusion set.
4. Apply the single best correction that increases the probability above a threshold.
5. Iterate steps 1-4.
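The five steps above amount to a greedy hill-climbing loop. The following is a minimal sketch of that loop, handling substitutions only; `score` (a length-normalised log probability function) and `confusions` (a token-to-confusion-set mapping) are stand-ins for the components described in the rest of this section, not the paper's actual implementation.

```python
def correct(tokens, confusions, score, threshold=0.0):
    """Greedy iterative correction over a tokenised sentence.

    `score` maps a token list to a normalised log probability;
    `confusions` maps a token to its confusion set. Both are
    illustrative assumptions.
    """
    while True:
        base = score(tokens)
        best_gain, best_sent = threshold, None
        # Steps 2-3: re-score the sentence with each candidate substituted.
        for i in range(len(tokens)):
            for cand in confusions.get(tokens[i], []):
                hyp = tokens[:i] + [cand] + tokens[i + 1:]
                gain = score(hyp) - base
                if gain > best_gain:
                    best_gain, best_sent = gain, hyp
        # Step 4: apply only the single best correction, if any exceeds
        # the threshold; step 5: otherwise stop iterating.
        if best_sent is None:
            return tokens
        tokens = best_sent
```

Applying only one correction per iteration, rather than all above-threshold corrections at once, matters when errors interact, as discussed in the Iteration section below.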
One of the main contributions of this paper is hence to re-evaluate the LM approach in relation to the latest state-of-the-art systems on several benchmark datasets.

Table 1: A step-by-step example of our approach as described in Section 2. All scores are log probabilities.

Sequence Probabilities
We evaluate hypothesis corrections in terms of normalised log probabilities at the sentence level. Normalisation by sentence length is necessary to overcome the tendency for shorter sequences to have higher probabilities than longer sequences. Dahlmeier and Ng (2012a) similarly used normalised log probabilities to evaluate hypotheses, but did so as part of a more complex combination of other features. In contrast, Lee and Lee (2014) evaluated hypotheses in terms of sliding five-word windows (5-grams).
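To make the need for normalisation concrete, consider two sequences scored by per-token log probabilities (the numbers below are made up for illustration, not real LM scores): the shorter one wins on total log probability simply because it accumulates fewer negative terms, while the length-normalised score correctly prefers the uniformly likely longer one.

```python
def normalised_logprob(token_logprobs):
    # Mean per-token log probability: total log probability
    # divided by sequence length.
    return sum(token_logprobs) / len(token_logprobs)

# Illustrative (made-up) per-token log probabilities:
short = [-2.0, -2.0]                  # total -4.0, normalised -2.0
long_ = [-1.5, -1.5, -1.5, -1.5]      # total -6.0, normalised -1.5
```

Without normalisation the short sequence scores higher (-4.0 > -6.0); with it, the longer sequence does (-1.5 > -2.0).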

Confusion Sets
One of the defining characteristics of LM-based GEC is that the approach does not necessarily require annotated training data. For example, spellcheckers and rules both formed key parts of Dahlmeier and Ng's and Lee and Lee's systems. While Lee and Lee ultimately did make use of annotated training data, however, Dahlmeier and Ng instead employed separate classifiers for articles, prepositions and noun number errors trained only on native text. In this work, we focus on correcting the following error types in English: non-words, morphology, and articles and prepositions.

Non-words: We use CyHunspell v1.2.1 with the latest British English Hunspell dictionaries to generate correction candidates for non-word errors. Non-words include genuine misspellings, such as [freind → friend], and inflectional errors, such as [advices → advice]. Although CyHunspell is not a context sensitive spell checker, the proposed corrections are evaluated in a context sensitive manner by the language model.

Morphology: To generate correction candidates for morphological errors, we use an Automatically Generated Inflection Database (AGID), which contains all the morphological forms of many English words. The confusion set for a word is hence derived from this database.
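The morphological confusion set can be derived by inverting an AGID-style lemma-to-forms mapping, so that each surface form maps to its sibling forms. The tiny `INFLECTIONS` dictionary below is an illustrative stand-in for the real database, not an excerpt from AGID or CyHunspell.

```python
# Toy stand-in for an AGID-style inflection database: the real
# resource maps a lemma to all of its morphological forms.
INFLECTIONS = {
    "advice": ["advice", "advices"],
    "see": ["see", "sees", "seeing", "saw", "seen"],
}

# Invert lemma -> forms into form -> confusion set of sibling forms.
FORM_TO_SET = {}
for forms in INFLECTIONS.values():
    for form in forms:
        FORM_TO_SET[form] = [f for f in forms if f != form]

def morph_confusion_set(token):
    """Confusion set for a token: its other morphological forms, if known."""
    return FORM_TO_SET.get(token, [])
```

Any token absent from the database simply gets an empty confusion set and is left alone by the corrector.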
Articles and Prepositions: Since articles and prepositions are closed class words, we defined confusion sets for these error types manually. Specifically, the article confusion set consists of {∅, a, an, the}, while the preposition confusion set consists of the top ten most frequent prepositions: {∅, about, at, by, for, from, in, of, on, to, with}. Both sets also contain a null element (∅) which represents a deletion. Unlike Dahlmeier and Ng and Lee and Lee, we do not yet handle missing words (∼20% of all errors) because it is often difficult to know where to insert them.
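Substituting a confusion set member at one position then yields a set of hypothesis sentences, with the null element realised as a deletion. A minimal sketch (the function name and list encoding of the null element as an empty string are my own, for illustration):

```python
ARTICLES = ["", "a", "an", "the"]
PREPOSITIONS = ["", "about", "at", "by", "for", "from",
                "in", "of", "on", "to", "with"]

def substitution_hypotheses(tokens, i, confusion_set):
    """All hypothesis sentences from substituting position i with each
    candidate; the empty-string candidate becomes a deletion."""
    hyps = []
    for cand in confusion_set:
        if cand == tokens[i]:
            continue  # skip the no-op substitution
        hyps.append(tokens[:i] + ([cand] if cand else []) + tokens[i + 1:])
    return hyps
```

For *discuss about the problem, the hypotheses at position 1 include the deletion discuss the problem, which the language model would then score against the original.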

Iteration
The main reason to iteratively correct only one word at a time is that errors sometimes interact. For example, correcting [see → seeing] in Table 1 initially reduces the log probability of the input sentence from -2.71 to -3.09. After correcting [foray → forward], however, [see → seeing] subsequently increases the probability of the sentence from -1.80 to -1.65 in the second iteration. Consequently, correcting the most serious errors first, in terms of language model probability increase, often helps facilitate the correction of less serious errors later. Dahlmeier and Ng and Lee and Lee both also used iterative correction strategies in their systems, but did so as part of a beam search or pipeline approach respectively.

Corpus       Tokenizer   Sents  Refs  Edits
CoNLL-2013   NLTK         1381     1   3404
CoNLL-2014   NLTK         1312     2   6104
FCE-dev      spaCy        2371     1   4419
FCE-test     spaCy        2805     1   5556
JFLEG-dev    NLTK          754     4  10576
JFLEG-test   NLTK          747     4  10082

Table 2: Various stats about the learner corpora we use.

Data and Resources
In all our experiments, we used a 5-gram language model trained on the One Billion Word Benchmark dataset (Chelba et al., 2014) with KenLM (Heafield, 2011). While a neural model would likely result in better performance, efficient training on such a large amount of data is still an active area of research (Grave et al., 2017). Although LM-based GEC does not require annotated training data, a small amount of annotated data is still required for development and testing. We hence make use of several popular GEC corpora, including: CoNLL-2013 and CoNLL-2014 (Ng et al., 2013, 2014), the public First Certificate in English (FCE) (Yannakoudakis et al., 2011), and JFLEG (Napoles et al., 2017).
Since the FCE was not originally released with an official development set, we use the same split as Rei and Yannakoudakis (2016) (https://ilexir.co.uk/datasets/index), which we tokenize with spaCy v1.9.0. We also reprocess all the datasets with the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017) in an effort to standardise them. This standardisation is especially important for JFLEG, which is not explicitly annotated and so otherwise cannot be evaluated in terms of F-score. Note that results on CoNLL-2014 and JFLEG are typically higher than on other datasets because they contain more than one reference. See Table 2 for more information about each of the development and test sets.

Tuning
The goal of tuning in our LM-based approach is to determine a probability threshold that optimises F0.5. For example, although the edit [am → was] in Table 1 increases the normalised sentence log probability from -2.71 to -2.67, this is such a small improvement that it is likely to be a false positive. In order to minimise false positives, we hence set a threshold such that a candidate correction must improve the average token probability of the sentence by at least X% before it is applied. Although it may be unusual to use percentages in log space, this is just one way to compare the difference between two sentences which we found worked well in practice. The results of this tuning are shown in Figure 1, where we tried thresholds in the range 0-10% on three different development sets. It is notable that the optimum threshold for CoNLL-2013 (2%) is very different from that of FCE-dev (4%) and JFLEG-dev (5%), which we suspect is because each dataset has a different error type distribution. For example, spelling errors make up just 0.3% of all errors in CoNLL-2013, but closer to 10% in FCE-dev and JFLEG-dev.
Finally, it should be noted that this threshold is an approximation and it is certainly possible to optimise further. For example, in future, thresholds could be set based on error types rather than globally.
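One plausible reading of the percentage criterion (my interpretation, not necessarily the paper's exact formulation) is to require the gain in normalised log probability to be at least X% of the old score's magnitude:

```python
def accept(old_nlp, new_nlp, threshold=0.02):
    # Accept a candidate only if the gain in normalised log
    # probability is at least `threshold` (e.g. 2%) of the old
    # score's magnitude. This is one way to express a percentage
    # comparison in log space.
    return (new_nlp - old_nlp) >= threshold * abs(old_nlp)
```

Under this reading, the [am → was] edit from Table 1 (-2.71 → -2.67) is correctly rejected at a 2% threshold, since its gain of 0.04 falls short of the required 0.054, while a large gain such as -2.71 → -1.80 is accepted.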

Results and Discussion
Before evaluating performance on the test sets, a final post-processing step changed the first alphabetical character of every sentence to upper case if necessary. This improved the scores by about 0.3 F0.5 on CoNLL-2014 and FCE-test, but by over 5 F0.5 on JFLEG-test. This surprising result once again shows that different test sets have very different error type distributions and that even the simplest of correction strategies can significantly affect results.
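This post-processing step amounts to a few lines of code; the sketch below (the function name is ours) upper-cases the first alphabetical character while leaving any leading punctuation or quotes untouched.

```python
def capitalise_first_alpha(sentence):
    """Upper-case the first alphabetical character of a sentence,
    skipping over any leading punctuation or digits."""
    for i, ch in enumerate(sentence):
        if ch.isalpha():
            return sentence[:i] + ch.upper() + sentence[i + 1:]
    return sentence  # no alphabetical character to change
```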
Our final scores are shown in Table 3, where they are compared with several state-of-the-art systems. Unfortunately, we cannot compare results with Dahlmeier and Ng (2012a) because their system is neither publicly available nor has it previously been evaluated on these test sets. Results are reported in terms of M2 F0.5 (Dahlmeier and Ng, 2012b), the de facto standard of GEC evaluation; ERRANT F0.5 (Bryant et al., 2017), an improved version of M2 which we used to develop our system; and GLEU (Napoles et al., 2015), an n-gram-based metric designed to correlate with human judgements. ERRANT results are not available in all cases because not all system outputs are publicly available.
At this point, it is worth reiterating that our main intention was not necessarily to improve upon the state-of-the-art, but rather to quantify the extent to which a simple LM-based approach with minimal annotated data could compete against a much more sophisticated model trained on millions of words of annotated text. This is especially relevant for languages where annotated training data may not be available.
With this in mind, we were firstly pleased to improve upon the previous best LM-based approach, that of Lee and Lee (2014) in the CoNLL-2014 shared task. This is especially significant given that, unlike them, we did so without any annotated training data. Although our system would still have placed fourth overall, the gap between third and fourth decreased from 3 F0.5 to less than 1 F0.5. We were also surprised by the high performance on JFLEG-test, where we not only outperformed two state-of-the-art systems, but also came within 2 F0.5 of the top system. This is especially surprising given that our system only corrects a limited number of error types (roughly 14 of the 55 in ERRANT), and so can maximally correct only 40-60% of all errors in each test set. One possible explanation is that, unlike CoNLL-2014 and FCE-test, which were only corrected with minimal edits, JFLEG was corrected for fluency (Sakaguchi et al., 2016), and so it intuitively makes sense that LM-based approaches perform better against fluent references.
Although we did not perform as well on CoNLL-2014 or FCE-test, most likely for the same reason, we also note a large discrepancy between state-of-the-art systems tuned on different datasets.
For example, while AMU16 (SMT+LSTM) tuned for CoNLL achieves the highest result on CoNLL-2014 (49.66 F0.5), its performance on FCE-test (32.06 F0.5) is only marginally better than our own (31.22 F0.5). We observe a similar effect with CAMB16 (SMT+LSTM) tuned for the FCE, and so are wary of approaches that may be overfitting to their training corpora.
We make all our code and system output available online.

Conclusion
In this paper, we have shown that a simple language model approach to grammatical error correction with minimal annotated data can still be competitive with the latest neural and machine translation approaches that rely on large quantities of annotated training data. This is especially significant given that our system is also limited by the range of error types it can correct. In the future, we hope to improve our system by adding the capability to correct other error types, such as missing words, and also make use of neural language modelling techniques.
We have demonstrated that LM-based GEC is not only still a promising area of research, but one that may be of particular interest to researchers working on languages where annotated training corpora are not yet available. We released all our code and system output with this paper.