JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction

We present a new parallel corpus, the JHU FLuency-Extended GUG corpus (JFLEG), for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding. We describe the types of corrections made and benchmark four leading GEC systems on this corpus, identifying specific areas in which they do well and how they can improve. JFLEG fulfills the need for a new gold standard to properly assess the current state of GEC.


Introduction
Automatic grammatical error correction (GEC) progress is limited by the corpora available for developing and evaluating systems. Following the release of the test set of the CoNLL-2014 Shared Task on GEC (Ng et al., 2014), systems have been compared and new evaluation techniques proposed on this single dataset. This corpus has enabled substantial advancement in GEC beyond the shared tasks, but we are concerned that the field is over-developing on this dataset. This is problematic for two reasons: 1) it represents one specific population of language learners; and 2) the corpus only contains minimal edits, which correct the grammaticality of a sentence but do not necessarily make it fluent or native-sounding.
To illustrate the need for fluency edits, consider the example in Table 1. The correction with only minimal edits is grammatical but sounds awkward (unnatural to native speakers). The fluency correction has more extensive changes beyond addressing grammaticality, and the resulting sentence sounds more natural and its intended meaning is clearer. It is not unrealistic to expect these changes from automatic GEC: the current best systems use machine translation (MT) and are therefore capable of making broader sentential rewrites but, until now, there has not been a gold standard against which they could be evaluated.

Table 1: An original sentence with its minimal-edit and fluency-edit corrections.
Original: they just creat impression such well that people are drag to buy it .
Minimal edit: They just create an impression so well that people are dragged to buy it .
Fluency edit: They just create such a good impression that people are compelled to buy it.

Following the recommendations of Sakaguchi et al. (2016), we release a new corpus for GEC, the JHU FLuency-Extended GUG corpus (JFLEG), which adds a layer of annotation to the GUG corpus (Heilman et al., 2014). GUG represents a cross-section of ungrammatical data, containing sentences written by English language learners with different L1s and proficiency levels. For each of 1,511 GUG sentences, we have collected four human-written corrections which contain holistic fluency rewrites instead of just minimal edits. This corpus represents the diversity of edits that GEC needs to handle and sets a gold standard to which the field should aim. We overview the current state of GEC by evaluating the performance of four leading systems on this new dataset. We analyze the edits made in JFLEG and summarize which types of changes the systems successfully make, and which they need to address. JFLEG will enable the field to move beyond minimal error corrections.

GEC corpora
There are four publicly available corpora of nonnative English annotated with corrections, to our knowledge. The NUS Corpus of Learner English (NUCLE) contains essays written by students at the National University of Singapore, corrected by two annotators using 27 error codes (Dahlmeier et al., 2013). The CoNLL Shared Tasks used this data (Ng et al., 2014), and the 1,312-sentence test set from the 2014 task has become de rigueur for benchmarking GEC. This test set has been augmented with ten additional annotations from Bryant et al. (2015) and eight from Sakaguchi et al. (2016). The Cambridge Learner Corpus First Certificate in English (FCE) has essays coded by one rater using about 80 error types, alongside the score and demographic information (Yannakoudakis et al., 2011). The Lang-8 corpus of learner English is the largest, with text from the social platform lang-8.com automatically aligned to user-provided corrections (Tajiri et al., 2012). Unlimited annotations are allowed per sentence, but 87% were corrected once and 12% twice. The AESW 2016 Shared Task corpus contains text from scientific journals corrected by a single editor. To our knowledge, AESW is the only corpus that has not been used to develop a GEC system.
We consider NUCLE and FCE to contain minimal edits, since the edits were constrained by error codes, and the others to contain fluency edits, since there were no such restrictions. English proficiency levels vary across corpora: FCE and NUCLE texts were written by English language learners with relatively high proficiency, but Lang-8 is open to any internet user. AESW has technical writing by the most highly proficient English writers. Roughly the same percentage of sentences from each corpus is corrected, except for FCE, which has significantly more. This may be due to the rigor of the annotators rather than the writing quality.
The following section introduces the JFLEG corpus, which represents a diversity of potential corrections with four corrections of each sentence. Unlike NUCLE and FCE, JFLEG does not restrict corrections to minimal error spans, nor are the errors coded. Instead, it contains holistic sentence rewrites, similar to Lang-8 and AESW, but contains more reliable corrections than Lang-8 due to perfect alignments and screened editors, and more extensive corrections than AESW, which contains fewer edits than the other corpora with a mean Levenshtein distance (LD) of 3 characters. Table 2 provides descriptive statistics for the available corpora. JFLEG is also the only corpus that provides corrections alongside sentence-level grammaticality scores of the uncorrected text.

The JFLEG corpus
Our goal in this work is to create a corpus of fluency edits, following the recommendations of Sakaguchi et al. (2016), who identify the shortfalls of minimal edits: they artificially restrict the types of changes that can be made to a sentence and do not reflect the types of changes required for native speakers to find sentences fluent, or natural sounding. We collected annotations on a public corpus of ungrammatical text, the GUG (Grammatical/Ungrammatical) corpus (Heilman et al., 2014). GUG contains 3.1k sentences written by English language learners for the TOEFL® exam, covering a range of topics. The original GUG corpus is annotated with grammaticality judgments for each sentence, ranging from 1 to 4, where 4 is perfect or native sounding and 1 is incomprehensible. The sentences were coded by five crowdsourced workers and one expert. We refer to the mean grammaticality judgment of each sentence from the original corpus as the GUG score.
In our extension, JFLEG, the 1,511 sentences which comprise the GUG development and test sets were corrected four times each on Amazon Mechanical Turk. Annotation instructions are included in Table 3. Fifty participants from the United States passed a qualifying task of correcting five sentences, which was reviewed by the authors (two native speakers and one proficient non-native speaker of American English). Annotators also rated how difficult it was for them to correct each sentence on a 5-level Likert scale (5 being very easy and 1 very difficult). On average, the sentences were relatively easy to correct (mean difficulty of 3.5 ± 1.3), and the difficulty rating moderately correlates with the GUG score (Pearson's r = 0.47), indicating that less grammatical sentences were generally more difficult to correct. To create a blind test set for the community, we withhold half (747) of the sentences from the analysis and evaluation herein.
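The correlation reported above can be reproduced in principle with a few lines of code. The following is a minimal sketch of Pearson's r, assuming one mean GUG score and one mean difficulty rating per sentence; the function name `pearson_r` and the toy values are illustrative only and are not taken from the corpus.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: GUG scores (1-4) and difficulty ratings (1-5, 5 = very easy).
gug_scores = [1.5, 2.0, 3.0, 3.5, 4.0]
difficulty = [2.0, 2.5, 3.0, 4.5, 5.0]
r = pearson_r(gug_scores, difficulty)
print(f"Pearson's r = {r:.2f}")
```

A positive r here means that sentences with higher grammaticality scores tend to receive higher (easier) difficulty ratings, which is the direction of the relationship described above.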
Table 3: Annotation instructions.
Please correct the following sentence to make it sound natural and fluent to a native speaker of (American) English. The sentence is written by a second language learner of English. You should fix grammatical mistakes, awkward phrases, spelling errors, etc. following standard written usage conventions, but your edits must be conservative. Please keep the original sentence (words, phrases, and structure) as much as possible. The ultimate goal of this task is to make the given sentence sound natural to native speakers of English without making unnecessary changes. Please do not split the original sentence into two or more. Edits are not required when the sentence is already grammatical and sounds natural.

The mean LD between the original and corrected sentences is more than twice that of existing corpora (Table 2). LD negatively correlates with the GUG score (r = −0.41) and the annotation difficulty score (r = −0.37), supporting the intuition that less grammatical sentences require more extensive changes and that corrections involving more substantive edits are harder to make. Because there is no clear way to quantify agreement between annotators, we compare the annotations of each sentence to each other. The mean LD between all pairs of annotations is greater than the mean LD between the original and corrected sentences (15 characters); however, 36% of the sentences were corrected identically by at least two participants.
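The character-level comparisons described above can be sketched as follows. This is an illustrative implementation, assuming unit costs for insertion, deletion, and substitution; the helper names `mean_pairwise_ld` and `has_identical_pair` are hypothetical and do not come from any released JFLEG tooling.

```python
from itertools import combinations


def levenshtein(a, b):
    """Character-level Levenshtein distance with unit edit costs."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def mean_pairwise_ld(corrections):
    """Mean LD over all unordered pairs of corrections of one sentence."""
    pairs = list(combinations(corrections, 2))
    return sum(levenshtein(x, y) for x, y in pairs) / len(pairs)


def has_identical_pair(corrections):
    """True if at least two annotators produced exactly the same correction."""
    return len(set(corrections)) < len(corrections)


# The original sentence and the two corrections shown in Table 1.
original = "they just creat impression such well that people are drag to buy it ."
minimal = "They just create an impression so well that people are dragged to buy it ."
fluency = "They just create such a good impression that people are compelled to buy it."
print(levenshtein(original, minimal), levenshtein(original, fluency))
```

Averaging `levenshtein(original, correction)` over a corpus gives the kind of mean LD reported in Table 2, and the share of sentences for which `has_identical_pair` is true corresponds to the 36% figure, assuming exact string identity is the criterion.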
Next, the English L1 authors examined 100 randomly selected original and human-corrected sentence pairs and labeled them with the type of error present in the sentence and the type of edit(s) applied in the correction. The three error types are: sounds awkward, has an orthographic error, or has a grammatical error. The majority of the original sentences have at least one error (81%), and, for 68% of these sentences, the annotations are error free. Few annotated sentences have orthographic (4%) or grammatical (10%) errors, but awkward errors are more frequent (23% of annotations were labeled awkward), which is not very surprising given how garbled some original sentences are and the dialectal variation of what sounds awkward.
The corrected sentences were also labeled with the type of changes made (minimal and/or fluency edits). Minimal edits reflect a minor change to a small span (1-2 tokens) addressing an immediate grammatical error, such as number agreement, tense, or spelling. Fluency edits are more holistic and include reordering or rewriting a clause, and other changes that involve more than two contiguous tokens. 69% of annotations contain at least one minimal edit, 25% a fluency edit, and 17% both fluency and minimal edits. The distribution of edit types is fairly uniform across the error type present in the original sentence (Table 4). Notably, fewer than half of awkward sentences were corrected with fluency edits, which may explain why so many of the corrections were still awkward.
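The edit-type labels above were assigned by hand, but the span-based part of the definition (one or two tokens versus more than two contiguous tokens) can be approximated automatically. The sketch below is only a heuristic under that assumption: it aligns whitespace tokens with Python's `difflib.SequenceMatcher` and labels each changed span by its length. The function name `edit_types` and the threshold parameter are illustrative, and the heuristic cannot judge whether an edit addresses a grammatical error.

```python
from difflib import SequenceMatcher


def edit_types(original, corrected, max_minimal_span=2):
    """Return the set of edit types ('minimal', 'fluency') found in a correction.

    A changed span covering at most `max_minimal_span` tokens on both sides
    is labeled 'minimal'; any longer contiguous span is labeled 'fluency'.
    """
    src, tgt = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    found = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Span size is the larger of the source and target token counts.
        span = max(i2 - i1, j2 - j1)
        found.add("minimal" if span <= max_minimal_span else "fluency")
    return found


# The sentences from Table 1.
original = "they just creat impression such well that people are drag to buy it ."
minimal = "They just create an impression so well that people are dragged to buy it ."
print(edit_types(original, minimal))
```

Because a correction can contain several changed spans, the function returns a set, mirroring the observation that some annotations contain both minimal and fluency edits.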

Evaluation
To assess the current state of GEC, we collected automated corrections of JFLEG from four leading GEC systems with no modifications. They take different approaches but all use some form of MT. The best system from the CoNLL-2014 Shared Task is a hybrid approach, combining a rule-based system with MT and language-model reranking.
We evaluate system output against the four sets of JFLEG corrections with GLEU, an automatic fluency metric specifically designed for this task (Napoles et al., 2015), and the Max-Match metric (M²) (Dahlmeier and Ng, 2012). GLEU is based on the MT metric BLEU, and represents the n-gram overlap of the output with the human-corrected sentences, penalizing n-grams that were changed in the human corrections but left unchanged by a system. It was developed to score fluency in addition to minimal edits since it does not require an alignment between the original and corrected sentences. M² was designed to score minimal edits and was used in the CoNLL 2013 and 2014 shared tasks on GEC (Ng et al., 2013; Ng et al., 2014). Its score is the F0.5 measure of word and phrase-level changes calculated over a lattice of changes made between the aligned original and corrected sentences.
Since both GLEU and M² have only been evaluated on the CoNLL-2014 test set, we additionally collected human rankings of the outputs to determine whether human judgments of relative grammaticality agree with the metric scores when the reference sentences have fluency edits. The two native English-speaking authors ranked six versions of each of 150 JFLEG sentences: the four system outputs, one randomly selected human correction, and the original sentence. The absolute human ranking of systems was inferred using TrueSkill, which computes a relative score from pairwise comparisons, and we cluster systems with overlapping ranges into equivalence classes by bootstrap resampling (Sakaguchi et al., 2014; Herbrich et al., 2006).
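To make the F0.5 measure used by M² concrete, the following sketch computes it from edit counts. It deliberately omits the edit lattice and the alignment step of the actual M² scorer: edits are assumed to be given as already-extracted (start, end, replacement) tuples, and the convention of treating precision or recall as 1.0 when its denominator is zero is an assumption of this sketch. The names `f_beta` and `score_edits` are illustrative.

```python
def f_beta(true_positives, n_proposed, n_gold, beta=0.5):
    """F-beta score from edit counts; beta = 0.5 weights precision over recall."""
    precision = true_positives / n_proposed if n_proposed else 1.0
    recall = true_positives / n_gold if n_gold else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_edits(system_edits, gold_edits, beta=0.5):
    """Score system edits against gold edits by exact match of edit tuples."""
    system_edits, gold_edits = set(system_edits), set(gold_edits)
    matched = len(system_edits & gold_edits)
    return f_beta(matched, len(system_edits), len(gold_edits), beta)


# Hypothetical edits as (start token, end token, replacement) tuples.
gold = {(1, 2, "goes"), (4, 5, "the"), (7, 8, "school")}
system = {(1, 2, "goes"), (3, 4, "at")}
print(f"F0.5 = {score_edits(system, gold):.3f}")
```

With beta = 0.5, a system that proposes few but correct edits is rewarded more than one that proposes many edits with low precision, which is one reason the measure suits minimal-edit references.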
The two best-ranked systems as judged by humans correspond to the two best GLEU systems, but GLEU switches the order of the bottom two. The M² ranking does not perform as well, reversing the order of the top two systems and the bottom two (Table 5).³ The upper bound is GLEU = 55.3 and M² = 63.2, the mean metric scores of each human correction compared to the other three. CAMB16 and NUS are halfway to the gold-standard performance measured by GLEU and, according to M², they achieve approximately 80% of the human performance. The neural methods (CAMB16 and NUS) are substantially better than the other two according to both metrics. This ranking is in contrast to the ranking of systems on the CoNLL-14 shared task test set against minimally edited references. On these sentences, AMU, which was tuned to M², achieves the highest M² score with 49.5, and CAMB16, which was the best on the fluency corpus, ranks third with 39.9.
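The human upper bound described above follows a leave-one-out scheme: each set of human corrections is scored against the remaining three, and the scores are averaged. The sketch below illustrates that scheme under two assumptions: references are stored per sentence in a consistent annotator order, and the metric is a corpus-level function passed in by the caller. The exact-match metric shown is only a toy stand-in for GLEU or M², and both function names are hypothetical.

```python
def human_upper_bound(references, corpus_metric):
    """Mean score of each annotator's corrections against the other annotators.

    `references` is a list with one entry per sentence; each entry is a list
    of corrections in a fixed annotator order.
    """
    n_refs = len(references[0])
    scores = []
    for k in range(n_refs):
        # Treat annotator k as the "system" and the rest as references.
        hypotheses = [refs[k] for refs in references]
        remaining = [refs[:k] + refs[k + 1:] for refs in references]
        scores.append(corpus_metric(hypotheses, remaining))
    return sum(scores) / n_refs


def exact_match_rate(hypotheses, reference_sets):
    """Toy stand-in metric: share of hypotheses identical to some reference."""
    hits = sum(h in refs for h, refs in zip(hypotheses, reference_sets))
    return hits / len(hypotheses)


# Hypothetical data: two sentences with four corrections each.
references = [
    ["a", "a", "b", "c"],
    ["x", "y", "z", "w"],
]
print(human_upper_bound(references, exact_match_rate))
```

Note that each held-out annotator is scored against three references while a system is scored against four, so the comparison with system scores is approximate.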
We find that the extent of changes made in the system output is negatively correlated with the quality as measured by GLEU (Figure 1). The neural systems have the highest scores for nearly all edit distances, and generate the most sentences with higher LDs. CAMB14 has the most consistent GLEU scores. The AMU scores of sentences with LD > 6 are erratic due to the small number of sentences it outputs with that extent of change.
³ No conclusive recommendation about the best-suited metric for evaluating fluency corrections can be drawn from these results. With only four systems, there is no significant difference between the metric rankings, and even the human rank has no significant difference between three systems.

Qualitative analysis
We examine the system outputs of the 100 sentences analyzed in Section 3, and label them by the type of errors they contain (Figure 2) and edit types made (Table 6). The system rankings in Table 5 correspond to the rank of systems by the percent of output sentences with errors and the percent of error-ful sentences changed. Humans make significantly more fluency and minimal edits than any of the systems. The models with neural components, CAMB16 followed by NUS, make the most changes and produce fewer sentences with errors. Systems often change only one or two errors in a sentence but fail to address others. Minimal edits are the primary type of edits made by all systems (AMU and CAMB14 made one fluency correction each, NUS two, and CAMB16 five) while humans use fluency edits to correct nearly 30% of the sentences.

Table 7: Two original sentences with their human and system corrections.
Original: First , advertissment make me to buy some thing unplanly .
Human: First , an advertisement made me buy something unplanned .
AMU: First , advertissment makes me to buy some thing unplanly .
CAMB14: First , advertisement makes me to buy some things unplanly .
CAMB16: First , please let me buy something bad .
NUS: First , advertissment make me to buy some thing unplanly .

Original: For example , in 2 0 0 6 world cup form Germany , as many conch wanna term work .
Human: For example , in the 2006 World Cup in Germany, many coaches wanted teamwork .
AMU: For example , in the 2 0 0 6 world cup from Germany , as many conch wanna term work .
CAMB14: For example , in 2006 the world cup from Germany , as many conch wanna term work .
CAMB16: For example , in 2006 the world cup from Germany , as many conch , ' work .
NUS: For example , in 2 0 0 6 World Cup from Germany , as many conch wanna term work .

Spelling mistakes are often ignored: AMU corrects very few spelling errors, and even CAMB16, which makes the most corrections, still ignores misspellings in 30% of sentences. Robust spelling correction would make a noticeable difference to output quality. Most systems produce corrections that are meaning preserving; however, CAMB16 changed the meaning of 15 sentences. This is a downside of neural models that should be considered, even though neural MT generates the best output by all other measures.
The examples in Table 7 illustrate some of these successes and shortcomings. The first sentence can be corrected with minimal edits, and both AMU and CAMB14 correct the number agreement but leave the incorrect "unplanly" and the infinitival "to". In addition, AMU does not correct the spelling of "advertissment" or "some thing". CAMB16 changes the meaning of the sentence altogether, even though the output is fluent, and NUS makes no changes. The next set of sentences contains many errors and requires inference and fluency rewrites to correct. The human annotator deduces that the last clause is about coaches, not mollusks, and rewrites it grammatically given the context of the rest of the sentence. Systems handle the second clause moderately well but are unable to correct the final clause: only CAMB16 attempts to correct it, but the result is nonsensical.

Conclusions
This paper presents JFLEG, a new corpus for developing and evaluating GEC systems with respect to fluency as well as grammaticality. Our hope is that this corpus will serve as a starting point for advancing GEC beyond minimal error corrections. We described qualitative and quantitative analyses of JFLEG, and benchmarked four leading systems on this data. The relative performance of these systems varies considerably when evaluated on a fluency corpus compared to a minimal-edit corpus, underlining the need for a new dataset for evaluating GEC. Overall, current systems can successfully correct closed-class targets such as number agreement and preposition errors (with incomplete coverage), but ignore many spelling mistakes and long-range context-dependent errors. Neural methods are better than other systems at making fluency edits, but this may be at the expense of maintaining the meaning of the input. As there is still a long way to go in approaching the performance of a human proofreader, these results and benchmark analyses help identify specific issues that GEC systems can improve in future research.