Human Evaluation of Grammatical Error Correction Systems

The paper presents the results of the first large-scale human evaluation of automatic grammatical error correction (GEC) systems. Twelve participating systems and the unchanged input of the CoNLL-2014 shared task have been reassessed in a WMT-inspired human evaluation procedure. Methods introduced for the Workshop on Machine Translation evaluation campaigns have been adapted to GEC and extended where necessary. The resulting rankings are used to evaluate standard metrics for grammatical error correction in terms of correlation with human judgment.


Introduction
The field of automatic grammatical error correction (GEC) has seen a number of shared tasks of different scope and for different languages. The most impactful were the CoNLL-2013 and CoNLL-2014 (Ng et al., 2013; Ng et al., 2014) shared tasks on Grammatical Error Correction for ESL (English as a second language) learners. They were preceded by the HOO shared tasks (Dale and Kilgarriff, 2011; Dale et al., 2012). Shared tasks for other languages took place as well, including the QALB workshops for Arabic (Mohit et al., 2014) and NLP-TEA competitions for Chinese. These tasks use automatic metrics to determine the quality of the participating systems.
However, these efforts pale in comparison to competitions organized in other fields, e.g. during the annual Workshops for Machine Translation (WMT). It is a central idea of the WMTs that automatic measures of machine translation quality are an imperfect substitute for human assessments. Therefore, manual evaluations of the system outputs are conducted and their results are reported as the final rankings of the workshops. These human evaluation campaigns are an important driving factor for the advancement of MT and produce insightful "by-products", such as a huge number of human assessments of machine translation outputs that have been used to evaluate automatic metrics.
We believe that the unavailability of this kind of quality assessment may stall the development of GEC, as all the shared tasks and the entire field have to cope with an inherent uncertainty of their methods and metrics. We hope to make a step towards alleviating this lack of confidence by presenting the results of the first large-scale human evaluation of automatic grammatical error correction systems submitted to the CoNLL-2014 shared task. Most of our inspiration is drawn from the recent WMT edition and its metrics task (Macháček and Bojar, 2014).
We also provide an analysis of the correlation between the standard metrics in GEC and human judgment and show that the commonly used parameters for standard metrics in the shared task may not be optimal. The uncertainty about metric quality leads to proposals of new metrics, with Felice and Briscoe (2015) being a recent example. Based on human judgments we can show that this proposed metric may be less useful than hoped.

Evaluation of GEC systems
Madnani et al. (2011) address two problems of GEC evaluation: 1) a lack of informative metrics and 2) an inability to directly compare the performance of systems developed by different researchers. Two evaluation methodologies are presented, both based on crowdsourcing, which are used to grade types of errors rather than system performance as presented in this work. Chodorow et al. (2012) draw attention to the many evaluation issues in error detection which make it hard to compare different approaches. The lack of consensus is due to the nature of the error detection task. The authors argue that the choice of the metric should take into account factors such as the skew of the data and the application that the system is used for.
The most recent addition is Felice and Briscoe (2015), who present a novel evaluation method for grammatical error correction that scores systems in terms of improvement on the original text.

The CoNLL-2014 shared task
The goal of the CoNLL-2014 shared task (Ng et al., 2014) was to evaluate algorithms and systems for automatically correcting grammatical errors in English essays written by second language learners of English. Training and test data were annotated with 28 error types. Participating teams were given training data with manually annotated corrections of grammatical errors and were allowed to use publicly available resources for training.
Twenty-five student non-native speakers of English were recruited to write essays to be used as test data. Each student wrote two essays. The 50 test essays were error-annotated by two English native speakers. The essays and error annotations were made available after the task. The MaxMatch (M²) scorer (Dahlmeier and Ng, 2012) was used as the official shared task evaluation metric.

Data collection

Sampling sentences for evaluation
The system outputs of the CoNLL-2014 shared task serve as evaluation data. The test set consists of 1312 sentences, and twelve system outputs are available. The thirteenth participant, NARA, is missing from this set. However, in GEC evaluation there is also the input to consider. System outputs are often equal to the unmodified input, which is the most desirable outcome if the input in fact contains no errors. We therefore include INPUT as the thirteenth system.
Due to the small number of modifications that GEC systems apply to the input, there is not only a large overlap with the input, but also among all systems (Figure 1). If we sample systems uniformly, we lose easily obtainable pairwise judgments for systems with the same output, and if we collapse before sampling we introduce a strong bias towards ties. To counter that bias, we abandon uniform sampling of test set sentences and instead use a parametrized distribution that favors diverse sets of outputs.
The probability $p_i$ of sampling a set of outputs $O_i$ is calculated as follows. Let $N$ be the number of systems to be evaluated and $M$ the maximum number of sentences presented to the evaluator in a single ranking (we use $M = 5$). The set of system outputs to be evaluated, $E = \{O_1, \ldots, O_n\}$ with $|O_i| = N$ for all $1 \le i \le n$, consists of $n$ ($= 1312$) sets $O_i$ of $N$ output sentences each. Every sentence in $O_i$ can overlap with other sentences multiple times, so for each set $O_i$ we define the corresponding multiset of multiplicities $U_i$, such that $\sum_{u \in U_i} u = N$.
We define $c_i(j)$ as the number of possible ways to choose at most $M$ different sentences that cover $j$ systems for the $i$-th set of outputs:
$$c_i(j) = \left|\left\{ S \subseteq U_i : |S| \le M,\ \sum_{u \in S} u = j \right\}\right|.$$
Then the expected number $C_i$ of systems covered by choosing at most $M$ sentences is
$$C_i = \frac{\sum_{j=1}^{N} j \, c_i(j)}{\sum_{j=1}^{N} c_i(j)}.$$
The pseudo-probability $\hat{p}_i$ of sampling the $i$-th sentence is defined as
$$\hat{p}_i = \frac{M(M-1)}{C_i(C_i-1)},$$
which is the ratio of pairwise comparisons of $M$ versus $C_i$ different systems. By normalizing over the entire set of output sets we obtain the probability $p_i$ of sampling the $i$-th set of outputs as
$$p_i = \frac{\hat{p}_i}{\sum_{k=1}^{n} \hat{p}_k}.$$
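The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the experiments: it assumes that every choice of at most M distinct outputs is equally likely, and that the pseudo-probability is the ratio M(M-1) / (C(C-1)) of pairwise comparisons. The function names and the toy multiplicity lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def expected_coverage(multiplicities, max_shown=5):
    """Expected number of systems covered by at most `max_shown` distinct outputs.

    `multiplicities` lists, for one test sentence, how many systems share each
    distinct output (the values sum to the number of systems N).
    Assumption: every non-empty choice of outputs is equally likely.
    """
    ways = Counter()  # ways[j] = number of choices that cover exactly j systems
    for size in range(1, min(max_shown, len(multiplicities)) + 1):
        # combinations() works on positions, so equal multiplicities
        # still count as different outputs.
        for subset in combinations(multiplicities, size):
            ways[sum(subset)] += 1
    return sum(j * c for j, c in ways.items()) / sum(ways.values())


def pseudo_probability(multiplicities, max_shown=5):
    """Ratio of pairwise comparisons among M versus C covered systems (needs N >= 2)."""
    covered = expected_coverage(multiplicities, max_shown)
    return max_shown * (max_shown - 1) / (covered * (covered - 1))


def sampling_distribution(all_multiplicities, max_shown=5):
    """Normalize the pseudo-probabilities over all test sentences."""
    scores = [pseudo_probability(u, max_shown) for u in all_multiplicities]
    total = sum(scores)
    return [s / total for s in scores]


# Toy example with N = 13: all outputs identical, all different, two groups.
probabilities = sampling_distribution([[13], [1] * 13, [6, 7]])
```

In this toy example the sentence with thirteen different outputs receives the largest sampling probability and the sentence where all systems agree receives the smallest, which is the intended bias towards diverse sets of outputs.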

Collecting system rankings
The sets of outputs sampled with the described method were prepared for Appraise (Federmann, 2010) and presented to the judges. Judges were asked to rank sentences from best to worst. Ties were allowed. Judges were aware that the absolute ranks bear no relevance, as ranks are later turned into relative pairwise judgments. The authors imposed no notion of "better" or "worse"; we relied on the judges to develop their own intuition. All eight judges are native speakers of English and have extensive backgrounds in linguistics. Figure 2a displays a screenshot of Appraise with a judged sentence. Several modifications to the Appraise framework were implemented to account for the specific nature of GEC: only the input sentence is displayed (top, bold) and no reference correction is given; the input sentence is surrounded by one preceding and one following sentence; identical corrections are collapsed into one output, and the names of systems with the same output are recorded internally; edited fragments are highlighted, blue for insertions and substitutions, pale blue and crossed-out for deletions.

Pairwise judgments
As in the WMT campaigns, we turn rankings into sets of relative judgments of the form A>B, A=B, A<B, where the system with the numerically lower (better) rank scores a win. Absolute ranks and differences are lost. As mentioned above, due to the collapsing of identical outputs we obtain significantly more data than the usual 10 pairs from one ranking with five sentences. Figure 2a contains a ranking with overlapping outputs as displayed in the top graph of Figure 2b. Pairs from within overlaps result in ties, pairs between overlaps are expanded as products, and $\binom{6}{2} = 15$ pairwise judgments can be extracted. Greater overlap leads to more pairwise judgments (bottom, $\binom{13}{2} = 78$). Table 1 lists the full statistics for collected rankings by individual annotators. Unexpanded pairs are WMT-style pairwise judgments before an output A gets split into overlapping systems $A_1$, $A_2$, $A_3$, etc. The large number of ties for expanded pairs is to be expected due to the high overlap between systems (on average there are only 5.7 unique outputs among 13 systems).
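The expansion of one collapsed ranking into pairwise judgments can be illustrated with a short Python sketch. The data layout (a list of rank and system-name pairs), the function name, and the convention that ">" means "the first system wins" are assumptions made for this example only.

```python
from itertools import combinations


def expand_ranking(ranking):
    """Turn one collapsed ranking into pairwise judgments between systems.

    `ranking` is a list of (rank, [system names]) entries, one per distinct
    output shown to the judge; rank 1 is best. Systems that share an output
    inherit its rank, so pairs within one output come out as ties.
    """
    systems = [(name, rank) for rank, names in ranking for name in names]
    judgments = []
    for (a, rank_a), (b, rank_b) in combinations(systems, 2):
        if rank_a < rank_b:
            relation = ">"  # a has the better rank and scores a win
        elif rank_a > rank_b:
            relation = "<"  # b has the better rank and scores a win
        else:
            relation = "="  # same output or same rank: a tie
        judgments.append((a, b, relation))
    return judgments


# Hypothetical ranking of three distinct outputs that cover six systems.
example = [(1, ["A"]), (2, ["B", "C"]), (3, ["D", "E", "F"])]
pairs = expand_ranking(example)
```

With six covered systems the sketch yields 15 judgments, four of which are ties that come from systems sharing an output.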

Inter- and intra-annotator agreement
Again inspired by the WMT evaluation campaigns, we compute annotator agreement as a measure of reliability of the pairwise judgments with Cohen's kappa coefficient (Cohen, 1960):
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)},$$
where $P(A)$ is the proportion of times that annotators agree, and $P(E)$ is the proportion of times that they would agree by chance. κ assumes values from 0 (no agreement) to 1 (perfect agreement). All probabilities are computed as ratios of empirically counted pairwise judgments. As the judges worked on collapsed outputs, we calculate agreement scores for unexpanded pairs; otherwise, the high overlap would unfairly increase agreement.
$P(A)$ is calculated by examining all pairs of outputs which have been judged by two or more judges, and counting the proportion of times that they agreed that A<B, A=B, or A>B.
$P(E) = P(A{<}B)^2 + P(A{=}B)^2 + P(A{>}B)^2$ is the probability that two judges agree randomly. Intra-annotator agreement as a measure of consistency is calculated for output sets that have been judged more than once by the same annotator.
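As an illustration, the agreement computation can be sketched as follows. The input format is an assumption of this sketch: each judgment is a triple of a pair identifier (with a fixed orientation of A and B), a judge identifier, and one of the labels "<", "=", ">". The sketch also assumes that at least one pair was judged twice and that more than one label occurs.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(judgments):
    """Kappa over pairwise judgments given as (pair_id, judge_id, label)."""
    labels_by_pair = {}
    for pair_id, _judge, label in judgments:
        labels_by_pair.setdefault(pair_id, []).append(label)

    # P(A): share of doubly judged pairs on which the two labels coincide.
    agreements = comparisons = 0
    for labels in labels_by_pair.values():
        for first, second in combinations(labels, 2):
            agreements += first == second
            comparisons += 1
    p_agree = agreements / comparisons

    # P(E): chance agreement from the empirical label distribution.
    counts = Counter(label for _, _, label in judgments)
    total = sum(counts.values())
    p_chance = sum((count / total) ** 2 for count in counts.values())

    return (p_agree - p_chance) / (1 - p_chance)


# Hypothetical judgments of four output pairs by two judges.
toy = [
    (1, "j1", "<"), (1, "j2", "<"),
    (2, "j1", "="), (2, "j2", "="),
    (3, "j1", ">"), (3, "j2", "<"),
    (4, "j1", ">"), (4, "j2", ">"),
]
kappa = cohen_kappa(toy)
```

For intra-annotator agreement the same function could be applied to the repeated judgments of a single judge, with a repetition index taking the place of the judge identifier.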
The agreement numbers in Table 2 are in the lower range of values reported during WMT. However, it should be noted that judges never saw the repeated outputs within one ranking, which probably decreases agreement compared to the MT-specific task.

Computing ranks
In this section, our aim is to produce a system ranking from best to worst by computing the average number of times each system was judged better than other systems based on the collected pairwise rankings. While previously introduced methods for producing rankings (total orderings as well as partial orderings at chosen confidence levels) can be directly applied to our data, determining which ranking is more accurate turns out to be methodologically and computationally more involved due to the specific nature of GEC outputs.

Ranking methods
We adapt two ranking methods applied during WMT13 and WMT14 to GEC evaluation: the Expected Wins method and a version of TrueSkill.
Expected Wins. Expected Wins (EW) was introduced for WMT13 and is based on an underlying model of "relative ability" proposed in Koehn (2012). One advantage of this method is its intuitiveness: the scores reflect the probability that a system $S_i$ will be ranked better than another system that has been randomly chosen from a pool of opponents $\{S_j : j \neq i\}$. Defining the function $\mathrm{win}(A, B)$ as the number of times system $A$ is ranked better than system $B$, Bojar et al. calculate the EW score as
$$\mathrm{score}_{EW}(S_i) = \frac{1}{|\{S_j\}|} \sum_{j, j \neq i} \frac{\mathrm{win}(S_i, S_j)}{\mathrm{win}(S_i, S_j) + \mathrm{win}(S_j, S_i)}.$$
TrueSkill. The TrueSkill (TS) ranking system (Herbrich et al., 2007) is a skill-based ranking system for Xbox Live developed at Microsoft Research. It is used to identify and model the ability of players (GEC systems in our case) in a game in order to assign players to competitive matches. TrueSkill models each player $S_i$ by two parameters: the average relative ability $\mu_{S_i}$ and the degree of uncertainty in the player's ability $\sigma^2_{S_i}$. Maintaining uncertainty allows TS to make greater changes to the ability estimates at the beginning and smaller changes after a number of consistent matches have been played. As a result, TS can identify the ability of individual players from a smaller number of pairwise comparisons.
A modification of this approach to the WMT manual evaluation procedure by Sakaguchi et al. (2014) was adopted as the official ranking method during WMT14, replacing EW. The TrueSkill scores are calculated as inferred means:
$$\mathrm{score}_{TS}(S_i) = \mu_{S_i}.$$
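To make the Expected Wins idea concrete, the following Python sketch computes EW-style scores from a list of pairwise judgments. It is a simplified illustration, not the WMT or the authors' implementation. Two assumptions are made here: ties are ignored when counting wins, and opponents without any decided comparison are skipped rather than counted as zero. TrueSkill is not sketched, since it requires a full Bayesian update procedure.

```python
from collections import defaultdict


def expected_wins(judgments):
    """EW-style scores from (a, b, relation) judgments.

    relation is ">" if a was ranked better than b, "<" if b was ranked
    better than a, and "=" for a tie (ties do not count as wins).
    """
    wins = defaultdict(int)  # wins[(x, y)] = times x was ranked better than y
    systems = set()
    for a, b, relation in judgments:
        systems.update((a, b))
        if relation == ">":
            wins[(a, b)] += 1
        elif relation == "<":
            wins[(b, a)] += 1

    scores = {}
    for system in systems:
        ratios = []
        for opponent in systems - {system}:
            decided = wins[(system, opponent)] + wins[(opponent, system)]
            if decided:  # assumption: skip opponents with no decided comparison
                ratios.append(wins[(system, opponent)] / decided)
        scores[system] = sum(ratios) / len(ratios) if ratios else 0.0
    return scores


# Hypothetical judgments between three systems.
toy = [("A", "B", ">"), ("A", "B", ">"), ("A", "B", "<"),
       ("A", "C", ">"), ("B", "C", "=")]
ew = expected_wins(toy)
```

Each score can be read as the estimated probability that the system wins a decided comparison against an opponent drawn at random from the other systems.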

Rank clusters
Both ranking methods produce total orderings without information on the statistical significance of the obtained ranks. The WMT organizers notice that the similarity of the participants in terms of methods and training data causes some of them to be very similar, and group systems into equivalence classes as proposed by Koehn (2012).
Although the methods and training data among the systems examined in this paper are quite diverse, a great similarity of produced outputs is an inherent property of GEC. Therefore, in this section, for each system $S_j$ placed on rank $r_j$ we also try to determine the system's true rank range $[r'_j, \ldots, r''_j]$ at a confidence level of 95% and clusters of equivalent systems by following the procedure outlined by Koehn (2012). This is accomplished by applying bootstrap resampling. Pairwise rankings are drawn from the set of judgments with multiple drawings, i.e. with replacement. Based on this sample a new ranking is produced. After repeating this process 1000 times, the obtained 1000 ranks for $S_j$ are sorted, with the top 25 and bottom 25 ranks being discarded. The interval of the remaining ranks serves as the final rank range. Next, these rank ranges are used to produce clusters of overlapping rank ranges. This is the last step required to produce the rankings in Tables 3b and 3c for both methods, EW and TS, respectively.
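The bootstrap procedure can be sketched as follows. This is a minimal illustration under assumptions: `win_rate` is a simple stand-in scorer, not EW or TrueSkill, the judgment format (a, b, relation) is hypothetical, and clusters are formed by merging systems whose rank ranges overlap.

```python
import random
from collections import defaultdict


def win_rate(judgments):
    """Stand-in scorer: share of decided comparisons that a system wins."""
    won, played = defaultdict(int), defaultdict(int)
    for a, b, relation in judgments:
        if relation == "=":
            continue
        won[a if relation == ">" else b] += 1
        played[a] += 1
        played[b] += 1
    return {system: won[system] / played[system] for system in played}


def rank_ranges(judgments, score_fn=win_rate, samples=1000, alpha=0.05, seed=0):
    """Bootstrap rank ranges: resample judgments, re-rank, trim both tails."""
    rng = random.Random(seed)
    systems = sorted({s for a, b, _ in judgments for s in (a, b)})
    observed = {system: [] for system in systems}
    for _ in range(samples):
        resample = [rng.choice(judgments) for _ in judgments]  # with replacement
        scores = score_fn(resample)
        ordered = sorted(systems, key=lambda s: -scores.get(s, 0.0))
        for rank, system in enumerate(ordered, start=1):
            observed[system].append(rank)
    cut = int(samples * alpha / 2)  # 25 ranks per tail for 1000 samples at 95%
    ranges = {}
    for system, ranks in observed.items():
        kept = sorted(ranks)[cut:samples - cut]
        ranges[system] = (kept[0], kept[-1])
    return ranges


def rank_clusters(ranges):
    """Group systems whose rank ranges overlap into one cluster."""
    clusters, upper = [], None
    for system, (low, high) in sorted(ranges.items(), key=lambda item: item[1]):
        if clusters and low <= upper:
            clusters[-1].append(system)
            upper = max(upper, high)
        else:
            clusters.append([system])
            upper = high
    return clusters


clusters = rank_clusters({"A": (1, 1), "B": (2, 3), "C": (2, 4), "D": (5, 5)})
```

Any scoring function that maps a list of judgments to a dictionary of scores can be passed as `score_fn`, so the same resampling loop would serve both ranking methods.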

Choosing the final ranking
We now face the question of which ranking should be presented as the final result of the human evaluation task. Again, we turn to the WMT organizers, who choose their rankings based on the ranking model's ability to predict pairwise rankings. Accuracy is computed by 100-fold cross-validation. For each fold a new ranking is trained from 99 parts, with the left-over part serving as test data.
In a first step, we calculate the accuracy of the unclustered total orderings, discarding ties. A ranking based on model scores alone cannot predict ties; this requires equivalence classes. The WMT organizers define a draw radius r such that systems whose scores differ by less than r are assigned to one cluster; r is tuned to maximize accuracy.
In our case, due to the large number of ties, their method of tuning r is trapped in local maxima and assigns all systems to a single cluster. As an alternative, we propose to calculate clusters according to the method described in the previous section.
Table 4: Accuracy for ranking-based prediction of pairwise judgments.
By fixing p ≤ 0.05 we directly evaluate rankings of the form given in Table 3. The absolute values of scores and their different interpretations between methods become irrelevant, which makes it unnecessary to tune a parameter like r. The main drawback of this approach is its computational cost. For each of the 100 folds we bootstrap another 100 rankings with EW and TS, fix p ≤ 0.05 and calculate rank clusters. The single clustered ranking for each fold is then used to calculate accuracy for the held-out test data. For our data, contrary to the MT-specific results from WMT, EW beats TS in both cases (Table 4). We therefore present the Expected Wins-based ranking (Table 3b) as the final result of the human evaluation effort described in this work and refer to it in the remainder of the paper when the human ranking is mentioned.

Analysis
The final human-created ranking (Table 3b) consists of four non-overlapping rank clusters. Rank ranges have been calculated at a confidence level of 95%. Comparing the official CoNLL-2014 ranking (Table 3a) with the manually created Expected Wins ranking shows interesting differences.
The AMU system is judged to be a clear leader by human judges and forms its own rank cluster. For six out of eight judges, AMU has the highest score (Table 7). The officially winning system CAMB occupies third place in terms of EW scores and is placed in the second cluster with four systems. Only one judge put CAMB in first place. RAC, a middling system, is elevated to second place, occupying a rank cluster with three other systems. NTHU, another middling system that, based on M², should be similar to RAC, is put in the second-to-last position. Two systems are judged to be worse than INPUT. The rank cluster that includes INPUT is the largest among the four clusters.
We also include pairwise comparisons between all systems according to EW in Table 3d. Each cell contains the percentage of times the system in that column was judged to be better than the system in that row. Bold values mark the winner. We applied the Sign Test to measure statistically significant differences: ⋆ indicates statistical significance at p ≤ 0.10, † at p ≤ 0.05, and ‡ at p ≤ 0.01.

Correlation with GEC metrics
Since WMT08 (Callison-Burch et al., 2008) the "metrics task" has been part of the WMT. The aim of the metrics task is to assess the quality of automatic evaluation metrics for MT in terms of correlation with the collected human judgments. We attempt the same in the context of GEC.

Measures of correlation
Based on Macháček and Bojar (2013), we use Spearman's rank correlation ρ and Pearson's r to compare the similarity of rankings produced by various metrics to the manual ranking from the previous section.
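Spearman's rank correlation ρ. Spearman's ρ for rankings with no ties is defined as
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$
where $d_i$ is the distance between human and metric rank for system $i$, and $n$ is the number of systems.
Pearson's r. Macháček and Bojar (2013) find that Spearman's ρ is too harsh and propose to also use Pearson's r, calculated as
$$r = \frac{\sum_{i=1}^{n} (H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (H_i - \bar{H})^2}\,\sqrt{\sum_{i=1}^{n} (M_i - \bar{M})^2}},$$
where $H$ and $M$ are the vectors of human and metric scores, and $\bar{H}$ and $\bar{M}$ are the corresponding means.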
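Both correlation measures are easy to compute directly; the following Python sketch uses the textbook formulas for Spearman's ρ without ties and for Pearson's r. The function names and the toy inputs are illustrative assumptions, not data from the evaluation.

```python
from math import sqrt


def spearman_rho(human_ranks, metric_ranks):
    """Spearman's rho for two rankings of the same systems, no ties assumed."""
    n = len(human_ranks)
    squared = sum((h - m) ** 2 for h, m in zip(human_ranks, metric_ranks))
    return 1 - 6 * squared / (n * (n ** 2 - 1))


def pearson_r(human_scores, metric_scores):
    """Pearson's r between human and metric scores of the same systems."""
    n = len(human_scores)
    mean_h = sum(human_scores) / n
    mean_m = sum(metric_scores) / n
    covariance = sum((h - mean_h) * (m - mean_m)
                     for h, m in zip(human_scores, metric_scores))
    spread_h = sqrt(sum((h - mean_h) ** 2 for h in human_scores))
    spread_m = sqrt(sum((m - mean_m) ** 2 for m in metric_scores))
    return covariance / (spread_h * spread_m)


# A metric that reverses the human ranking of four systems.
rho_reversed = spearman_rho([1, 2, 3, 4], [4, 3, 2, 1])
```

Spearman's ρ only looks at rank positions, whereas Pearson's r also rewards a metric whose score differences are proportional to the human score differences, which is the sense in which the former is considered harsher.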
I-measure/Weighted Accuracy (I-WAcc). The recently proposed I-WAcc metric (Felice and Briscoe, 2015) tries to address the shortcomings of M². The inclusion of true negatives into the formula makes this a very conservative metric, quite similar to the MT metrics described below. The metric assigns negative weights to systems that are harmful with regard to the input text; values from the range [−1, 1] are possible. The reported correlation values have been calculated for the ranking presented in Felice and Briscoe (2015).
Machine translation evaluation metrics. Basing most of our results on findings from MT, we also take a look at two machine translation evaluation metrics, BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2011). In order to use the CoNLL-2014 gold standard with these metrics, the edit-based annotation has been converted into two plain text files, one per annotator.

Analysis
The correlation results are collected in Table 5. The M² metric is generally moderately correlated with human judgment and is on the brink of high correlation for values of β closer to 0.2. Compared to M², the other metrics are weakly or moderately inversely correlated with human judgment. Inverse correlation with human judgments for metrics that all assign higher scores to better systems seems problematic. In the case of I-WAcc, we would go as far as to state an absence of correlation. It seems the conservative approach adopted for I-WAcc does not correspond to the notion of quality that our judges worked out for themselves.
The switch to β = 0.5 from β = 1.0 for the CoNLL-2014 shared task was a good choice, but a higher correlation can be achieved for β = 0.25; the maximum is reached for β = 0.18. Correlation drops sharply for β = 0.1. The lack of positive correlation for the MT metrics is interesting in light of the improvement that results from shifting M² towards precision, as BLEU is itself based on precision. Figure 3 contains detailed plots of ρ and r with regard to β within the [0, 1] range. As the CoNLL-2014 test data included edits from two annotators, we plot curves for both annotators separately and for the combined gold standard. In the case of Spearman's ρ, having alternative error annotations leads to higher correlation values. Based on the plots we would recommend setting 0.2 ≤ β ≤ 0.3 instead of 0.5 or even 1.0.
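The role of β can be illustrated with the standard F-measure, which weights recall β times as much as precision. The sketch below is not the M² scorer itself, which additionally has to extract and match edits; the precision and recall values of the two systems are invented for illustration.

```python
def f_beta(precision, recall, beta):
    """Standard F-measure; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Two hypothetical systems: one cautious and precise, one with broader coverage.
precise = (0.6, 0.1)  # (precision, recall), invented values
broad = (0.4, 0.3)

for beta in (1.0, 0.5, 0.2):
    print(beta, round(f_beta(*precise, beta), 3), round(f_beta(*broad, beta), 3))
```

With these invented values the broader system scores higher at β = 1.0 and β = 0.5, while the more precise system comes out ahead at β = 0.2, which shows how the choice of β alone can reorder systems.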
Inter-annotator correlations of rankings computed for individual judges (Table 6) can be treated as human-level upper bounds for metric correlation. The penultimate column and row contain correlations of rankings for individual judges with rankings computed from all judges minus the respective judge. The last column and row contain the respective weighted (w.r.t. judgments per judge) average of these correlations.

Conclusions and future work
We have successfully adapted methods from the WMT human evaluation campaigns to automatic grammatical error correction. The collected and produced data have been made available and should be useful for other researchers. Although we set out to provide answers, we probably ended up with more questions. The following (and more) might be investigated in the future: What makes the winning system special, and why do the standard metrics fail at identifying this system? Can we come up with better system-level metrics? Can meaningful sentence-level metrics be developed?
Outside the scope of the particular data, we need to wonder if our results generalize to other shared tasks and other languages. The CoNLL-2014 data concerns ESL learners only and may not be transferable to systems for native speakers. This would be in line with the ideas developed by Chodorow et al. (2012). We would hope to see similar endeavors for the other shared tasks as this would enable the field to draw more general conclusions.

Obtaining the data
The presented data and tools are available from: https://github.com/grammatical/evaluation