Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses

Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which is however only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, and we show that this presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work will facilitate the development of open-domain GEC models that generalize to different topics and genres.


Introduction
Grammatical error correction (GEC) is the task of automatically editing text to remove grammatical errors; for example: [A link to registration can also be found at → on the same page.]. GEC systems so far have primarily focused on correcting essays produced by English-as-a-second-language (ESL) learners, providing fast and inexpensive feedback to facilitate language learning. However, this is only one target domain in the full spectrum of GEC applications. GEC models can also help to improve written communication outside of the formal education setting. Today the largest medium of written communication is the internet, with approximately 380 new websites created every minute. Ensuring grammatical correctness of websites helps facilitate clear communication and a professional commercial presentation. It is therefore important that GEC models perform well in the open-domain setting and generalize not only to writing produced in the educational context, but also to language production "in the wild". Website data specifically represent a broad and diverse range of writing and constitute a major part of what people read and write on an everyday basis.

This work highlights two major prevailing challenges of current approaches to GEC: domain adaptation and low precision on texts with low error density. Previous work has primarily targeted essay-style text with high error density (see Figure 1); however, this lack of diversity means that it is not clear how systems perform on other domains and under different error distributions. Current publicly available datasets are restricted to non-native English essays [e.g., FCE (Yannakoudakis et al., 2011); CoNLL14 (Ng et al., 2014)], student essays [W&I+LOCNESS (Bryant et al., 2019; Granger, 1998)], or a specific domain [scientific writing; AESW (Daudaravicius et al., 2016)]. Supervised systems trained on specific domains are less likely to be as effective at correcting distinctive errors from other domains, as is the case for systems trained on learner data with different native languages (Chollampatt et al., 2016; Nadejde and Tetreault, 2019). The recent BEA 2019 shared task (Bryant et al., 2019) encouraged research into low-resource and unsupervised approaches; however, evaluation primarily targeted the restricted domain of student essays. We show that, when applied to data outside of the language learning domain, current state-of-the-art systems exhibit low precision due to a tendency to over-predict errors. Recent work has tackled the domain adaptation problem and released GEC benchmarks built from Wikipedia data and online comments [GMEG Wiki+Yahoo (Napoles et al., 2019)]. However, these datasets present a high density of errors and represent a limited subset of the full distribution of errors in online writing.

Error type | Example sentence
VERB:SVA | They develop positive relationships with swimmers and members, and promotes → promote programs in order to generate more participation.
MORPH / ORTH | In a small agriculture → agricultural town on the east side of Washington state → State called Yakima.
PREP | [...] the distance between the two should be on → of the order of 50 microns.

Table 1: Examples of erroneous sentences from our data with their error types (original → correction).
Contributions: We (i) release a new dataset, CWEB (Corrected Websites), of website data that is corrected for grammatical errors; (ii) systematically compare it to previously released GEC corpora; (iii) benchmark current state-of-the-art GEC approaches on this data and demonstrate that they are heavily biased towards existing datasets with high error density, even after fine-tuning on our target domain; and (iv) perform an analysis showing that a factor behind the performance drop is the inability of systems to rely on a strong internal language model in low error density domains. We hope that the new dataset will contribute towards the development of robust GEC models in the open-domain setting.

CWEB Dataset
We create a new dataset of English texts from randomly sampled websites, and annotate it for grammatical errors. The source texts are randomly selected from the first 18 dumps of the Common Crawl dataset and represent a wide range of data seen online, such as blogs, magazines, and corporate or educational websites. These include texts written by both native and non-native English speakers, and by professional as well as amateur online writers.
Text Extraction To ensure English content, we exclude websites with country-code top-level domains, e.g., .fr or .de. We use the jusText tool to retrieve the content from HTML pages, removing boilerplate elements and splitting the content into paragraphs. We heavily filter the data by removing paragraphs which contain non-English or incomplete sentences. To ensure diversity of the data, we also remove duplicate sentences. From the million sentences gathered, we randomly sample paragraphs.
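For concreteness, a minimal sketch of this kind of extraction pipeline is shown below. The jusText calls (justext, get_stoplist, is_boilerplate) are its real API; the language-identification check and sentence splitter are passed in as assumed dependencies, since the exact filters used for CWEB are not specified beyond the description above.

```python
# Illustrative extraction/filtering pipeline (not the exact CWEB scripts).
import justext   # pip install justext
import requests

def extract_paragraphs(url):
    """Return non-boilerplate paragraphs from a web page via jusText."""
    html = requests.get(url, timeout=10).content
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    return [p.text for p in paragraphs if not p.is_boilerplate]

seen = set()  # global sentence-level dedup across the corpus

def keep_paragraph(paragraph, is_english, split_sentences):
    """Apply the filters described above: drop paragraphs containing
    non-English or previously seen sentences. is_english and
    split_sentences are injected helpers (e.g. a language-ID model
    and a SpaCy sentencizer); both are assumptions here."""
    sentences = split_sentences(paragraph)
    if any(not is_english(s) or s in seen for s in sentences):
        return False
    seen.update(sentences)
    return True
```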
We split the data with respect to where they come from: sponsored (CWEB-S) or generic (CWEB-G) websites. The sponsored data represent a more focused domain (professional writing) than the generic data, which include writing from various proficiency levels.

Table 3: Statistics on GEC corpora; type-token is the average ratio of vocabulary size to the total number of tokens (calculated as an average over a sliding window of 1,000 tokens); ratio of edits per sentence is calculated on erroneous sentences; sent-K is sentence-level Cohen's Kappa (†: calculated for datasets with more than one annotator); NEs stands for Named Entities (extracted using SpaCy).
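The type-token statistic described in the Table 3 caption is straightforward to compute; a small sketch follows. The window size of 1,000 tokens comes from the caption, while the non-overlapping stride is our assumption.

```python
def windowed_type_token_ratio(tokens, window=1000, stride=1000):
    """Average type-token ratio over fixed-size windows, as in the
    Table 3 caption (the stride is an assumption; the paper only
    specifies a sliding window of 1,000 tokens)."""
    ratios = []
    for start in range(0, len(tokens) - window + 1, stride):
        chunk = tokens[start:start + window]
        ratios.append(len(set(chunk)) / window)
    if not ratios:  # text shorter than one window
        return len(set(tokens)) / max(len(tokens), 1)
    return sum(ratios) / len(ratios)
```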
Annotation The data is corrected for errors by two expert annotators, trained to correct grammatical errors in English text. Annotators were instructed not to rewrite the text or make fluency edits, but to make minimal edits: the minimum number of edits required to make the text grammatical. During error annotation, the annotators have access to the entire paragraph to which a sentence belongs, and can therefore use the surrounding context to guide their corrections. Examples of erroneous sentences from our data are shown in Table 1. Annotator agreement is calculated at the sentence level using Cohen's Kappa, i.e., we calculate whether annotators agree on which sentences are erroneous. This approach is preferable to relying on exact matching of error corrections, as there are often many different ways to correct a sentence (Bryant and Ng, 2015). Kappa is 0.39 and 0.44 for the sponsored (CWEB-S) and generic (CWEB-G) website data respectively; Table 3 reports sentence-level Kappa for all corpora with more than one annotator. The texts are tokenized using SpaCy and automatically labeled for error types (and converted into the M2 format) using the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017).
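ERRANT exposes a Python API that produces exactly this kind of typed edit from an original/corrected sentence pair; a minimal sketch, using a shortened form of the VERB:SVA example from Table 1:

```python
import errant  # pip install errant; requires a SpaCy English model

annotator = errant.load("en")

# Inputs here are whitespace-tokenized, matching the M2 convention
orig = annotator.parse("They develop relationships and promotes programs .")
cor = annotator.parse("They develop relationships and promote programs .")

# Align the two sentences and classify each edit with an error type
for edit in annotator.annotate(orig, cor):
    # Prints e.g.: 4 5 'promotes' -> 'promote' (R:VERB:SVA)
    print(edit.o_start, edit.o_end, repr(edit.o_str), "->",
          repr(edit.c_str), f"({edit.type})")
```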
Release For each dataset, we release a development and a test set: we propose a roughly equal division of the data into the two splits, which provides a fair number of errors to evaluate on (see Table 2).
To avoid copyright restrictions, we split the collected paragraphs into sentences and shuffle all sentences in order to break the original coherent structure that would be needed to reproduce the copyrighted material. This approach has been used successfully in previous work to construct web-based corpora (Schäfer, 2015; Biemann et al., 2007). The data is available at https:

GEC Corpora
We compare our data with existing GEC corpora, which cover a range of domains and proficiency levels. Table 3 presents a number of different statistics, and Table 4 shows the distribution of error types across datasets. Note that sentences that have been edited are more likely to contain grammatical errors, and grammatical errors will therefore be over-represented in corpora built from such sentences; this is reflected in the 82.3% erroneous sentence rate (see Table 3). For AESW, we exclude sentences that use its normalization scheme (e.g., citations replaced with CITE), as the models we use are not trained with these special tokens.

We benchmark two state-of-the-art GEC systems, PIE and GEC-PSEUDODATA. Both are trained on learner corpora including Lang-8 (Mizumoto et al., 2011) and NUCLE (Dahlmeier et al., 2013), with GEC-PSEUDODATA additionally trained on the W&I train split (Bryant et al., 2019).
Performance is evaluated using the F0.5 metric calculated by ERRANT (Bryant et al., 2017). However, the more annotators a dataset has, the higher the score a system will get on this data (Bryant and Ng, 2015). In order to perform a fair comparison of systems across datasets with different numbers of annotators, we calculate the ERRANT score against each individual annotator and then take the average as the final score.
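A small sketch of this scoring scheme; the per-annotator true positive, false positive, and false negative counts are assumed to come from ERRANT's span-level comparison, and the convention of perfect precision/recall under zero proposed or gold edits is a common one rather than a claim about ERRANT's internals.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Span-based F-beta; beta = 0.5 weights precision twice as much
    as recall, as in standard GEC evaluation."""
    p = tp / (tp + fp) if tp + fp else 1.0  # no edits proposed
    r = tp / (tp + fn) if tp + fn else 1.0  # no gold edits
    denom = (beta ** 2 * p) + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

def averaged_score(per_annotator_counts):
    """Score the system against each annotator separately and average,
    so corpora with different numbers of annotators stay comparable."""
    scores = [f_beta(tp, fp, fn) for (tp, fp, fn) in per_annotator_counts]
    return sum(scores) / len(scores)

# e.g. two annotators: averaged_score([(120, 80, 300), (130, 70, 310)])
```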
Evaluation results are presented in Table 5. Across all datasets, we observe lower scores with the PIE system (−6.05 F0.5 on average), while GEC-PSEUDODATA is consistently better. Overall, F0.5 ranges from around 30 to 52 for most datasets; however, when the models are evaluated on CWEB and AESW, we observe a substantial drop in performance, with the lowest F0.5 score being that of the PIE system on CWEB-S (6.15). Precision, in particular, suffers because the systems over-correct sentences that should remain unchanged.
Using the GEC-PSEUDODATA system, we find, on average, a higher F0.5 on ESL corpora (JFLEG, FCE, CoNLL, W&I) than on non-ESL ones (47.4 vs. 29.0). This demonstrates that GEC systems trained on language learning data do not perform as well on other domains, and further work is needed to improve their generalization.

Fine-tuning
We investigate the extent to which the GEC-PSEUDODATA system can be adapted to our domain by fine-tuning it on our development sets. We take 1,000 sentences from each of the development sets of CWEB-G and CWEB-S and use them as a development set for this experiment. The remaining 4,729 sentences of our development sets are used as training data for fine-tuning the GEC system. In Table 6, we can see that fine-tuning substantially improves performance (around +10.0 F0.5 across all CWEB sets). In particular, precision improves (+20.8/+18.6 on CWEB-G/S) at the expense of recall (−6.4/−2.8 on CWEB-G/S). However, performance is still low compared to the language learning domain (F0.5 of at least 41), further indicating that there is scope for developing more robust, general-purpose, open-domain GEC systems. For the purpose of future benchmarking, Appendix B lists the system's ERRANT scores calculated against both annotators, as opposed to the average of individual annotator scores reported in Table 6.

Analysis
In order to assess the impact our new dataset can have on the GEC field, we carry out analyses to show 1) to what degree the domain of our data is different from existing GEC corpora, and how existing GEC systems are affected by the domain shift; and 2) that a factor behind the performance drop on CWEB data is the inability of systems to rely on a strong internal language model in low error density domains.

Domain Shift
Moving from error correction in learner texts to error correction in diverse, online texts, many of which are written by professional writers, amounts to a drift in data distribution. In general, distributional drift comes in different flavors; given two distributions P(X, Y) and Q(X, Y): Covariate shift concerns change in the marginal distribution of the independent variable, i.e., P(X) ≠ Q(X). In the context of grammatical errors, this refers to the degree to which the type of sentences written varies between domains. Table 3 clearly shows covariate shift effects: see, for example, differences in vocabulary variation (measured by the type-token ratio) and the frequency of named entities.
Label bias describes the change in the distribution of the dependent variable, i.e., P(Y) ≠ Q(Y). In terms of GEC, this refers to the difference in error distributions across domains. In Table 3, we can see that errors in the CWEB data are substantially more sparse than in other domains: a smaller proportion of sentences are erroneous, and these erroneous sentences also contain fewer edits than in other domains. Additionally, looking at Table 4, we can see that almost all error types are substantially less frequent in our data than in existing benchmarks; for example, spelling errors are 38 times more prevalent in GMEG Wiki than in CWEB-S.
Moving from learner text to web data involves both forms of drift: covariate shift and label bias. We further analyze the effects of these shifts on system performance.

Impact of Error Density
To demonstrate that the error density of corpora has a substantial impact on the performance of GEC systems, we vary the proportion of erroneous sentences in each dataset by either removing correct sentences or adding correct sentences of the same domain. By fixing the frequency of errors across datasets, we can observe, in isolation, how the systems are affected by covariate shift across domains. Precision as a function of the proportion of erroneous sentences for selected datasets is presented in Figure 2 (recall is unchanged).
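A sketch of this resampling protocol, assuming the dataset has already been split into pools of erroneous and correct sentences; the exact sampling procedure used in the experiment may differ.

```python
import random

def resample_to_error_rate(erroneous, correct, target_rate, seed=0):
    """Build an evaluation set with a fixed proportion of erroneous
    sentences by keeping all erroneous sentences and sampling the pool
    of correct, in-domain sentences to match the target rate."""
    rng = random.Random(seed)
    # target_rate = |erroneous| / (|erroneous| + n_correct)
    n_correct = round(len(erroneous) * (1 - target_rate) / target_rate)
    if n_correct <= len(correct):
        sampled = rng.sample(correct, n_correct)
    else:  # not enough correct sentences: sample with replacement
        sampled = rng.choices(correct, k=n_correct)
    return erroneous + sampled
```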
For each domain, we observe that precision is highly sensitive to the proportion of errors. This indicates that differences in error distribution across domains (i.e., label bias) are likely to be a large contributor to the performance drop. We also observe the effect of covariate shift across the datasets: when the percentage of erroneous sentences is held constant, precision still differs between datasets, which suggests that covariate shift across domains also has an impact on system performance.

Analysis of Gold Edits
In addition to error density, the type of errors present in a dataset also has an impact on the performance of GEC systems. We investigate how errors and their corresponding corrections differ across domains. In particular, we look at how gold edits in different domains change a sentence in terms of two factors: 1) how much edits change the semantics of the sentence, and 2) to what degree edits improve the sentence. We limit our analysis to sentences containing exactly one edit, as we are interested in how individual edits change a sentence, regardless of how domains differ in the proportion of erroneous sentences and in the number of edits per sentence (Table 3).
Regarding 1), to measure the semantic change of a sentence after an edit is introduced, we use sentence embeddings generated by Sentence-BERT (Reimers and Gurevych, 2019) and calculate the cosine similarity between the original sentence and its corrected counterpart. Regarding 2), the degree of sentence improvement is calculated as the ratio of the perplexity (PPL) of GPT-2 (Radford et al., 2019) on a sentence after and before it has been edited:

$$\Delta P = \frac{\mathrm{PPL}(\text{edited sentence})}{\mathrm{PPL}(\text{original sentence})}$$
A lower ratio suggests that the edited sentence is an improvement, since its perplexity is lower than that of the original sentence.
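Both measures are easy to reproduce; a sketch follows. The specific checkpoints (all-MiniLM-L6-v2 and the base gpt2 model) are assumptions, as the text only names Sentence-BERT and GPT-2.

```python
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def semantic_similarity(original, edited):
    """Cosine similarity between Sentence-BERT embeddings."""
    a, b = sbert.encode([original, edited], convert_to_tensor=True)
    return util.cos_sim(a, b).item()

@torch.no_grad()
def perplexity(sentence):
    """GPT-2 perplexity: exp of the mean token negative log-likelihood."""
    ids = gpt2_tok(sentence, return_tensors="pt").input_ids
    loss = gpt2(ids, labels=ids).loss
    return torch.exp(loss).item()

def delta_p(original, edited):
    """Perplexity ratio: values below 1 mean the edit lowered perplexity."""
    return perplexity(edited) / perplexity(original)
```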
Using the outputs of machine learning models as a proxy for semantic change and sentence improvement inevitably introduces biases, but these measures nevertheless provide valuable insights into domain differences.

Corpus level In Figure 3, the average semantic similarity and perplexity ratio are plotted for each dataset. It is evident that ESL datasets consist of edits with a higher degree of semantic change and sentence improvement than datasets from more advanced speakers. CWEB and AESW in particular stand out, with edits that largely retain the semantics of a sentence and that result in more subtle improvements.
Error type level In order to gain further insight into what is driving the differences between datasets, we look separately at how edits of each error type change the sentence. We compare FCE and CWEB-S, which lie at opposite ends in Figure 3. For each dataset, we obtain an average of semantic similarity, S, and perplexity ratio, P, separately for sentences of each error type. Then, for each error type, the difference, ∆, between scores in the two datasets is calculated. Figure 4 plots these differences for the most common error types. We can observe that, for all error types, edits in CWEB-S result in both a lower degree of semantic change and a lower degree of sentence improvement than edits in FCE. This is particularly evident for the error types R:OTHER, R:SPELL and R:VERB. These are open-class errors, where the error and correction can be quite different. It is therefore reasonable that differences in the edits' degree of semantic change and perplexity improvement across domains are particularly observed in these cases. (Score differences for the R:SPELL error type seem to be driven by a different propensity of spelling errors to be typographical vs. phonetic in nature in the two datasets.)

Language Model Importance
We also investigate the degree to which systems can rely on a strong internal language model representation when evaluated against different domains. We examine this by evaluating a purely language-model-based GEC system on the different datasets.
We build on the approach of Bryant and Briscoe (2018), using confusion sets to generate alternative versions of an input sentence and then deciding whether any of the alternatives is preferable to the original version, based on language model probabilities.
The authors use an n-gram language model, which we replace with GPT-2 (Radford et al., 2019) to see how a strong neural language model performs; this approach is similar to Alikaniotis and Raheja (2019). Hyperparameters are tuned for each dataset (see Appendix C for details).

Table 7 displays the results on the different datasets. Recall and, in particular, precision are substantially lower on CWEB and AESW compared to the other datasets. In general, scores are higher in domains with a higher proportion of errors and in those containing edits which result in large perplexity improvements. In these cases, systems can rely on a rough heuristic of replacing low-probability sequences with high-probability ones. However, in CWEB, where errors are fewer and more subtle, this leads to low precision, as perplexity alone cannot differentiate an erroneous sequence from a sequence that is rare but correct. Table 8 displays several examples of this, where false positive corrections suggested by the language-model-based GEC system yield large perplexity improvements.

False positive example | Perplexity ratio
All types of work are callings → called to individuals. | 0.34
Get started at → with ACC | 0.51
That is → was actually kind of fun! | 0.69

Table 8: Examples of false positives on the CWEB dataset that improve perplexity substantially, even more than the average gold edit in CWEB (0.86 perplexity ratio).

This analysis suggests that the inability to rely on a strong internal language model representation can negatively impact state-of-the-art system performance on CWEB, and on low error density domains in general. This would mean that having large amounts of error examples for training is especially important in such high-proficiency domains.
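A minimal sketch of such a confusion-set corrector, reusing the perplexity function from the earlier sketch. The confusion sets shown are illustrative, not those of Bryant and Briscoe (2018), and only single-token substitutions are handled.

```python
# Illustrative confusion sets; real systems use much larger inventories
CONFUSION_SETS = [
    {"their", "there", "they're"},
    {"is", "are", "was", "were"},
    {"a", "an", "the"},
]

def correct(sentence, tau=0.96):
    """Return the single in-confusion-set substitution with the lowest
    perplexity, provided it beats the current best by the tuned
    threshold tau (see Appendix C). Bryant and Briscoe (2018) iterate
    over multiple edits; a single pass is shown here for brevity."""
    tokens = sentence.split()
    best, best_ppl = sentence, perplexity(sentence)
    for i, tok in enumerate(tokens):
        for cset in CONFUSION_SETS:
            if tok.lower() not in cset:
                continue
            for alt in cset - {tok.lower()}:
                candidate = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
                ppl = perplexity(candidate)
                if ppl / best_ppl < tau:  # require a clear improvement
                    best, best_ppl = candidate, ppl
    return best
```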

Conclusion
We release a new GEC benchmark, CWEB, consisting of website text generated by English speakers at varying levels of proficiency. Comparisons against existing benchmarks demonstrate that CWEB differs in many respects: 1) in the distribution of sentences (higher vocabulary variation and named entity frequency); 2) in error density (lower); and 3) in the nature of its gold edits (more subtle and more semantics-preserving). We benchmark state-of-the-art systems on this data and show that a factor behind their performance drop is the inability to rely on a strong internal language model in low error density domains. We hope that the new dataset will contribute towards the development of open-domain GEC models that generalize across topics and genres.

B Fine-tuning Scores

Table 9: Scores of the GEC-PSEUDODATA system fine-tuned on CWEB data, calculated against both annotators.

C Language Model GEC Hyperparameter Tuning
A threshold, τ, determines the degree of probability improvement needed before an alternative sentence is preferred over the original. For each dataset, we select the τ in the 0.9 to 1.0 range that results in the best development set F0.5. For CoNLL14, we tune on CoNLL13; for W&I, we use the dedicated training sets; for LOCNESS, there is no training set available, so we tune on the W&I subset of advanced texts (W&I-C).
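A sketch of this sweep, assuming the correct() system from the earlier sketch and a hypothetical evaluate_f05 helper that runs ERRANT-style scoring against the development annotations.

```python
def tune_tau(dev_sentences, evaluate_f05, taus=None):
    """Grid-search tau in [0.9, 1.0] and return the value with the best
    development set F0.5. evaluate_f05 is an assumed helper mapping a
    list of system outputs to an F0.5 score against the gold edits."""
    taus = taus or [0.90 + 0.01 * i for i in range(11)]
    scores = {tau: evaluate_f05([correct(s, tau) for s in dev_sentences])
              for tau in taus}
    return max(scores, key=scores.get)
```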