Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction

Until now, error type performance for Grammatical Error Correction (GEC) systems could only be measured in terms of recall because system output is not annotated. To overcome this problem, we introduce ERRANT, a grammatical ERRor ANnotation Toolkit designed to automatically extract edits from parallel original and corrected sentences and classify them according to a new, dataset-agnostic, rule-based framework. This not only facilitates error type evaluation at different levels of granularity, but can also be used to reduce annotator workload and standardise existing GEC datasets. Human experts rated the automatic edits as “Good” or “Acceptable” in at least 95% of cases, so we applied ERRANT to the system output of the CoNLL-2014 shared task to carry out a detailed error type analysis for the first time.


Introduction
Grammatical Error Correction (GEC) systems are often only evaluated in terms of overall performance because system hypotheses are not annotated. This can be misleading however, and a system that performs poorly overall may in fact outperform others at specific error types. This is significant because a robust specialised system is actually more desirable than a mediocre general system. Without an error type analysis however, this information is completely unknown.
The main aim of this paper is hence to rectify this situation and provide a method by which parallel error correction data can be automatically annotated with error type information. This not only facilitates error type evaluation, but can also be used to provide detailed error type feedback to non-native learners. Given that different corpora are also annotated according to different standards, we also attempted to standardise existing datasets under a common error type framework.
Our approach consists of two main steps. First, we automatically extract the edits between parallel original and corrected sentences by means of a linguistically-enhanced alignment algorithm (Felice et al., 2016) and second, we classify them according to a new, rule-based framework that relies solely on dataset-agnostic information such as lemma and part-of-speech. We demonstrate the value of our approach, which we call the ERRor ANnotation Toolkit (ERRANT), by carrying out a detailed error type analysis of each system in the CoNLL-2014 shared task on grammatical error correction (Ng et al., 2014).
It is worth mentioning that despite an increased interest in GEC evaluation in recent years (Dahlmeier and Ng, 2012; Felice and Briscoe, 2015; Bryant and Ng, 2015; Napoles et al., 2015; Grundkiewicz et al., 2015; Sakaguchi et al., 2016), ERRANT is the only toolkit currently capable of producing error type scores.

Edit Extraction
The first stage of automatic annotation is edit extraction. Specifically, given an original and corrected sentence pair, we need to determine the start and end boundaries of any edits. This is fundamentally an alignment problem (Table 1).
Original: We took a guide tour on center city .
Corrected: We took a guided tour of the city center .
Table 1: A sample alignment between an original and corrected sentence (Felice et al., 2016).
The first attempt at automatic edit extraction was made by Swanson and Yamangil (2012), who simply used the Levenshtein distance to align parallel original and corrected sentences. As the Levenshtein distance only aligns individual tokens however, they also merged all adjacent nonmatches in an effort to capture multi-token edits. Xue and Hwa (2014) subsequently improved on Swanson and Yamangil's work by training a maximum entropy classifier to predict whether edits should be merged or not.
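The baseline strategy described above can be sketched as follows. This is a minimal illustration of token-level Levenshtein alignment followed by merging of adjacent non-matches, not the code used by Swanson and Yamangil (2012); the function names and the edit representation (original start, original end, corrected tokens) are assumptions made for this example.

```python
def levenshtein_align(orig, cor):
    """Align two token lists; return ops: M(atch), S(ubstitute), D(elete), I(nsert)."""
    n, m = len(orig), len(cor)
    # cost[i][j] = edit distance between orig[:i] and cor[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if orig[i - 1] == cor[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitute
                             cost[i - 1][j] + 1,        # delete
                             cost[i][j - 1] + 1)        # insert
    # Backtrace from the bottom-right cell to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and orig[i - 1] == cor[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
            ops.append("M" if same else "S")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return ops[::-1]


def merge_adjacent_edits(cor, ops):
    """Merge each run of adjacent non-matches into one (o_start, o_end, tokens) edit."""
    edits, start, i, j = [], None, 0, 0
    for op in ops + ["M"]:  # the trailing sentinel flushes a final open run
        if op == "M":
            if start is not None:
                edits.append((start[0], i, cor[start[1]:j]))
                start = None
            i, j = i + 1, j + 1
        else:
            if start is None:
                start = (i, j)
            i += op in ("S", "D")  # substitutions and deletions consume an original token
            j += op in ("S", "I")  # substitutions and insertions consume a corrected token
    return edits


orig = "We took a guide tour on center city .".split()
cor = "We took a guided tour of the city center .".split()
ops = levenshtein_align(orig, cor)
edits = merge_adjacent_edits(cor, ops)
```

On the sentence pair from Table 1, this plain alignment pairs "on center" with "of the" and treats "center" as a separate insertion, which illustrates why an alignment based only on token identity can produce edits that a human annotator would not choose.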
Most recently, Felice et al. (2016) proposed a new method of edit extraction using a linguistically-enhanced alignment algorithm supported by a set of merging rules. More specifically, they incorporated various linguistic information, such as part-of-speech and lemma, into the cost function of the Damerau-Levenshtein algorithm to make it more likely that tokens with similar linguistic properties aligned. This approach ultimately proved most effective at approximating human edits in several datasets (80-85% F1), and so we use it in the present study.
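To make the idea of a linguistically-informed cost function concrete, the sketch below shows one way a substitution cost could combine lemma, part-of-speech and character similarity. The weights, the `Token` type and the set of content tags are hypothetical values chosen for illustration; they are not claimed to be the ones used by Felice et al. (2016).

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    lemma: str
    pos: str  # coarse part-of-speech tag, assigned by hand in this example


CONTENT_POS = {"ADJ", "ADV", "NOUN", "VERB"}  # assumed content-word tags


def substitution_cost(o: Token, c: Token) -> float:
    """Cheaper substitutions for tokens that are linguistically similar."""
    if o.text == c.text:
        return 0.0
    # Hypothetical weights: sharing a lemma or a POS tag lowers the cost.
    lemma_cost = 0.0 if o.lemma == c.lemma else 0.5
    if o.pos == c.pos:
        pos_cost = 0.0
    elif o.pos in CONTENT_POS and c.pos in CONTENT_POS:
        pos_cost = 0.25
    else:
        pos_cost = 0.5
    # Character-level dissimilarity in [0, 1].
    char_cost = 1.0 - SequenceMatcher(None, o.text, c.text).ratio()
    return lemma_cost + pos_cost + char_cost


guide = Token("guide", "guide", "NOUN")
guided = Token("guided", "guide", "VERB")
tour = Token("tour", "tour", "NOUN")
```

Under these assumed weights, substituting "guide" with "guided" costs less than substituting it with "tour", so an aligner that minimises total cost would prefer to pair the two forms of the same lemma.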

Automatic Error Typing
Having extracted the edits, the next step is to assign them error types. While Swanson and Yamangil (2012) did this by means of maximum entropy classifiers, one disadvantage of this approach is that such classifiers are biased towards their particular training corpora. For example, a classifier trained on the First Certificate in English (FCE) corpus (Yannakoudakis et al., 2011) is unlikely to perform as well on the National University of Singapore Corpus of Learner English (NUCLE) (Dahlmeier and Ng, 2012) or vice versa, because both corpora have been annotated according to different standards (cf. Xue and Hwa (2014)). Instead, a dataset-agnostic error type classifier is much more desirable.

A Rule-Based Error Type Framework
To solve this problem, we took inspiration from Swanson and Yamangil's (2012) observation that most error types are based on part-of-speech (POS) categories, and wrote a rule to classify an edit based only on its automatic POS tags. We then added another rule to similarly differentiate between Missing, Unnecessary and Replacement errors depending on whether tokens were inserted, deleted or substituted. Finally, we extended our approach to classify errors that are not well-characterised by POS, such as Spelling or Word Order, and ultimately assigned all error types based solely on automatically-obtained, objective properties of the data.
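The first two rules described above can be sketched as follows. This is a simplified illustration rather than the ERRANT rule set: it only combines the edit operation with a single shared POS tag, and the fallback label `OTHER` is an assumption made for this example.

```python
def edit_operation(orig_pos, cor_pos):
    """M(issing) if tokens were inserted, U(nnecessary) if deleted, else R(eplacement)."""
    if not orig_pos:
        return "M"
    if not cor_pos:
        return "U"
    return "R"


def classify(orig_pos, cor_pos):
    """Classify an edit from the POS tags of its original and corrected tokens."""
    operation = edit_operation(orig_pos, cor_pos)
    tags = set(orig_pos) | set(cor_pos)
    # If every token on both sides shares one POS tag, use it as the category;
    # otherwise fall back to a generic label (hypothetical in this sketch).
    category = next(iter(tags)) if len(tags) == 1 else "OTHER"
    return f"{operation}:{category}"
```

For instance, an inserted determiner would be labelled `M:DET` and a noun replaced by another noun `R:NOUN`; edits such as Spelling or Word Order would need the additional rules mentioned above.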
In total, we wrote roughly 50 rules. While many of them are very straightforward, significant attention was paid to discriminating between different kinds of verb errors. For example, despite all having the same correction, the following sentences contain different types of common learner errors:
(a) He IS asleep now.
While the final three rules could certainly be reordered, we informally found the above sequence performed best during development. It is also worth mentioning that this is a somewhat simplified example and that there are additional rules to discriminate between auxiliary verbs, main verbs and multi-verb expressions. Nevertheless, the above case exemplifies our approach, and a more complete description of all rules is provided with the software.

A Dataset-Agnostic Classifier
One of the key strengths of a rule-based approach is that by being dependent only on automatic mark-up information, our classifier is entirely dataset independent and does not require labelled training data. This is in contrast with machine learning approaches which not only learn dataset specific biases, but also presuppose the existence of sufficient quantities of training data. A second significant advantage of our approach is that it is also always possible to determine precisely why an edit was assigned a particular error category. In contrast, human and machine learning classification decisions are often much less transparent.
Finally, by being fully deterministic, our approach bypasses bias effects altogether and should hence be more consistent.

Automatic Markup
The prerequisites for our rule-based classifier are that each token in both the original and corrected sentence is POS tagged, lemmatized, stemmed and dependency parsed. We use spaCy v1.7.3 for all but the stemming, which is performed by the Lancaster Stemmer in NLTK. Since fine-grained POS tags are often too detailed for the purposes of error evaluation, we also map spaCy's Penn Treebank style tags to the coarser set of Universal Dependency tags. We use the latest Hunspell GB-large word list to help classify non-word errors. The marked-up tokens in an edit span are then input to the classifier and an error type is returned.

Error Categories
The complete list of 25 error types in our new framework is shown in Table 2. Note that most of them can be prefixed with 'M:', 'R:' or 'U:', depending on whether they describe a Missing, Replacement, or Unnecessary edit, to enable evaluation at different levels of granularity (see Appendix A for all valid combinations). This means we can choose to evaluate, for example, only replacement errors (anything prefixed by 'R:'), only noun errors (anything suffixed with 'NOUN') or only replacement noun errors ('R:NOUN'). This flexibility allows us to make more detailed observations about different aspects of system performance.
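The prefix and suffix conventions described above make it straightforward to select subsets of error types. The helper below is a hypothetical illustration, not part of the released toolkit, and assumes that type labels are plain strings such as `R:NOUN`.

```python
def select(error_types, operation=None, category=None):
    """Keep labels with the given operation prefix and/or category suffix."""
    selected = []
    for label in error_types:
        if operation is not None and not label.startswith(operation + ":"):
            continue
        if category is not None and not label.endswith(":" + category):
            continue
        selected.append(label)
    return selected


labels = ["R:NOUN", "M:DET", "U:NOUN", "R:VERB:SVA", "R:NOUN:NUM"]
replacements = select(labels, operation="R")                    # anything prefixed by 'R:'
nouns = select(labels, category="NOUN")                         # anything suffixed with 'NOUN'
replacement_nouns = select(labels, operation="R", category="NOUN")
```

Matching on the suffix means that `R:NOUN:NUM` is not counted as a plain noun error here, which follows the wording above; a different grouping would need a different matching rule.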
One caveat concerning error scheme design is that it is always possible to add new categories for increasingly detailed error types; for instance, we currently label [could → should] a tense error, when it might otherwise be considered a modal error. The reason we do not call it a modal error, however, is because it would then become less clear how to handle other cases such as [can → should] and [has eaten → should eat], which might be considered a more complex combination of modal and tense error. As it is impractical to create new categories and rules to differentiate between such narrow distinctions however, our final framework aims to be a compromise between informativeness and practicality.

Classifier Evaluation
As our new error scheme is based solely on automatically obtained properties of the data, there are no gold standard labels against which to evaluate classifier performance. For this reason, we instead carried out a small-scale manual evaluation, where we simply asked 5 GEC researchers to rate the appropriateness of the predicted error types for 200 randomly chosen edits in context (100 from FCE-test and 100 from CoNLL-2014) as "Good", "Acceptable" or "Bad". "Good" meant the chosen type was the most appropriate for the given edit, "Acceptable" meant the chosen type was appropriate, but probably not optimal, while "Bad" meant the chosen type was not appropriate for the edit. Raters were warned that the edit boundaries had been determined automatically and hence might be unusual, but that they should focus on the appropriateness of the error type regardless of whether they agreed with the boundary or not.
It is worth stating that the main purpose of this evaluation was not to evaluate the specific strengths and weaknesses of the classifier, but rather ascertain how well humans believed the predicted error types characterised each edit. GEC is known to be a highly subjective task (Bryant and Ng, 2015) and so we were more interested in overall judgements than specific disagreements. The results from this evaluation are shown in Table 3. Significantly, all 5 raters considered at least 95% of the predicted error types to be either "Good" or "Acceptable", despite the degree of noise introduced by automatic edit extraction. Furthermore, whenever raters judged an edit as "Bad", this could usually be traced back to a POS or parse error; e.g. [ring → rings] might be considered a NOUN:NUM or VERB:SVA error depending on whether the POS tagger considered both sides of the edit nouns or verbs. Interannotator agreement was also good at 0.724 κ_free (Randolph, 2005).
Table 3: The percent distribution for how each expert rated the appropriateness of the predicted error types. E.g. Rater 3 considered 83% of all predicted types to be "Good".
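For readers unfamiliar with the agreement statistic, free-marginal multirater kappa compares the observed proportion of agreeing rater pairs with the agreement expected if each of the k categories were equally likely (1/k). The sketch below shows the computation on toy counts; the function name and input layout are assumptions made for this example, and the numbers are not the ratings collected in this study.

```python
def free_marginal_kappa(counts, num_categories):
    """counts[i][j] = number of raters who assigned item i to category j.

    Every item is assumed to be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Proportion of rater pairs that agree, averaged over all items.
    agreeing = sum(c * c for row in counts for c in row) - n_items * n_raters
    p_observed = agreeing / (n_items * n_raters * (n_raters - 1))
    # With free marginals, chance agreement is uniform over the categories.
    p_expected = 1.0 / num_categories
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: 2 items, 5 raters, 3 categories ("Good", "Acceptable", "Bad").
toy_counts = [[4, 1, 0],
              [5, 0, 0]]
toy_kappa = free_marginal_kappa(toy_counts, num_categories=3)
```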
In contrast, although incomparable on account of the different metric and error scheme, the best results using machine learning were between 50-70% F1 (Felice et al., 2016). Ultimately however, we believe the high scores awarded by the raters validate the efficacy of our rule-based approach.

Error Type Scoring
Having described how to automatically annotate parallel sentences with ERRANT, we now also have a method to annotate system hypotheses; this is the first step towards an error type evaluation. Since no scorer is currently capable of calculating error type performance however (Dahlmeier and Ng, 2012; Felice and Briscoe, 2015; Napoles et al., 2015), we instead built our own.
Fortunately, one benefit of explicitly annotating system hypotheses is that it makes evaluation much more straightforward. In particular, for each sentence, we only need to compare the edits in the hypothesis against the edits in each respective reference and measure the overlap. Any edit with the same span and correction in both files is hence a true positive (TP), while unmatched edits in the hypothesis and references are false positives (FP) and false negatives (FN) respectively. These results can then be grouped by error type for the purposes of error type evaluation.
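The comparison described above can be sketched in a few lines. This is an illustrative simplification rather than the released scorer: it assumes a single reference per sentence, represents each edit as a hypothetical (start, end, correction, type) tuple, and returns 0.0 when a denominator is empty, which is a choice made for this example only.

```python
from collections import defaultdict


def score_by_type(hyp_edits, ref_edits):
    """Count TP, FP and FN per error type for one sentence."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    hyp_keys = {(s, e, c) for s, e, c, _ in hyp_edits}
    ref_keys = {(s, e, c) for s, e, c, _ in ref_edits}
    for s, e, c, etype in hyp_edits:
        # A hypothesis edit is a TP only if span and correction both match.
        counts[etype]["tp" if (s, e, c) in ref_keys else "fp"] += 1
    for s, e, c, etype in ref_edits:
        if (s, e, c) not in hyp_keys:
            counts[etype]["fn"] += 1
    return dict(counts)


def precision_recall_f(tp, fp, fn, beta=0.5):
    """Precision, recall and F_beta from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p + r == 0.0:
        return p, r, 0.0
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f


# Illustrative edits; the type labels are hand-assigned for this example.
hyp = [(3, 4, "guided", "R:VERB:FORM"), (5, 6, "at", "R:PREP")]
ref = [(3, 4, "guided", "R:VERB:FORM"), (5, 6, "of", "R:PREP"), (6, 6, "the", "M:DET")]
per_type = score_by_type(hyp, ref)
overall = precision_recall_f(tp=1, fp=1, fn=2)
```

Because the counts are stored per type, they can be summed across sentences and regrouped (for example by edit operation) before precision, recall and F0.5 are computed.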
Finally, it is worth noting that this scorer is much simpler than other scorers in GEC which typically incorporate edit extraction or alignment directly into their algorithms. Our approach, on the other hand, treats edit extraction and evaluation as separate tasks.

Gold Reference vs. Auto Reference
Before evaluating an automatically annotated hypothesis against its reference, we must also address another mismatch: namely that hypothesis edits must be extracted and classified automatically, while reference edits are typically extracted and classified manually using a different framework. Since evaluation is now reduced to a straightforward comparison between two files however, it is especially important that the hypothesis and references are both processed in the same way. For instance, a hypothesis edit [have eating → has eaten] will not match the reference edits [have → has] and [eating → eaten] because the former is one edit while the latter is two edits, even though they equate to the same thing.
To solve this problem, we can reprocess the references in the same way as the hypotheses. In other words, we can apply ERRANT to the references such that each reference edit is subject to the same automatic extraction and classification criteria as each hypothesis edit. While it may seem unorthodox to discard gold reference information in favour of automatic reference information, this is necessary to minimise the difference between hypothesis and reference edits and also standardise error type annotations.
To show that automatic references are feasible alternatives to gold references, we evaluated each team in the CoNLL-2014 shared task using both types of reference with the M2 scorer (Dahlmeier and Ng, 2012), the de facto standard of GEC evaluation, and our own scorer. Table 4 hence shows that there is little difference between the overall scores for each team, and we formally validated this hypothesis for precision, recall and F0.5 by means of bootstrap significance testing (Efron and Tibshirani, 1993). Ultimately, we found no statistically significant difference between automatic and gold references (1,000 iterations, p > .05), which leads us to conclude that our automatic references are qualitatively as good as human references.
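The exact resampling procedure is not detailed here, so the following is only one plausible sketch of a paired bootstrap test in the spirit of Efron and Tibshirani (1993). It assumes one score per team under each reference type, resamples the paired differences after centring them to impose the null hypothesis, and reports a two-sided p-value; the function name, seed and input layout are assumptions made for this example.

```python
import random


def paired_bootstrap_p(scores_a, scores_b, iterations=1000, seed=0):
    """Two-sided bootstrap p-value for a mean difference between paired scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    observed = abs(mean_diff)
    # Centre the differences so that the null hypothesis (mean of zero) holds.
    centred = [d - mean_diff for d in diffs]
    extreme = 0
    for _ in range(iterations):
        sample = [rng.choice(centred) for _ in range(n)]
        if abs(sum(sample) / n) >= observed:
            extreme += 1
    return extreme / iterations
```

A p-value above .05 from such a test would be read as no statistically significant difference between the two sets of scores.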

Comparison with the M2 Scorer
Despite using the same metric, Table 4 also shows that the M2 scorer tends to produce slightly higher F0.5 scores than our own. This initially led us to believe that our scorer was underestimating performance, but we subsequently found that instead the M2 scorer tends to overestimate performance (cf. Felice and Briscoe (2015) and Napoles et al. (2015)).
In particular, given a choice between matching [have eating → has eaten] from Annotator 1 or [have → has] and [eating → eaten] from Annotator 2, the M2 scorer will always choose Annotator 2 because two true positives (TP) are worth more than one. Similarly, whenever the scorer encounters two false positives (FP) within a certain distance of each other, it merges them and treats them as one false positive; e.g. [is a cat → are a cats] is selected over [is → are] and [cat → cats] even though these edits are best handled separately. In other words, the M2 scorer exploits its dynamic edit boundary prediction to artificially maximise true positives and minimise false positives and hence produce slightly inflated scores.
A dash indicates the team's system did not attempt to correct the given error type (TP+FP = 0).

CoNLL-2014 Shared Task Analysis
To demonstrate the value of ERRANT, we applied it to the data produced in the CoNLL-2014 shared task (Ng et al., 2014). Specifically, we automatically annotated all the system hypotheses and official reference files. 8 Although ERRANT can be applied to any dataset of parallel sentences, we chose to evaluate on CoNLL-2014 because it represents the largest collection of publicly available GEC system output. For more information about the systems in CoNLL-2014, we refer the reader to the shared task paper.

Edit Operation
In our first category experiment, we simply investigated the performance of each system in terms of Missing, Replacement and Unnecessary edits. The results are shown in Table 5 with additional information in Appendix B, Table 10. The most surprising result is that five teams (AMU, IPN, PKU, RAC, UFC) failed to correct any unnecessary token errors at all. This is noteworthy because unnecessary token errors account for roughly 25% of all errors in the CoNLL-2014 test data and so failing to address them significantly limits a system's maximum performance. While the reason for this is clear in some cases, e.g. UFC's rule-based system was never designed to tackle unnecessary tokens (Gupta, 2014), it is less clear in others, e.g. there is no obvious reason why AMU's SMT system failed to learn when to delete tokens (Junczys-Dowmunt and Grundkiewicz, 2014). AMU's result is especially remarkable given that their system still came 3rd overall despite this limitation.
8 http://www.comp.nus.edu.sg/~nlp/conll14st.html
In contrast, CUUI's classifier approach (Rozovskaya et al., 2014) was the most successful at correcting not only unnecessary token errors, but also replacement token errors, while CAMB's hybrid MT approach (Felice et al., 2014) significantly outperformed all others in terms of missing token errors. It would hence make sense to combine these two approaches, and indeed recent research has shown this improves overall performance (Rozovskaya and Roth, 2016).

General Error Types
Table 6 shows precision, recall and F0.5 for each of the error types in our proposed framework for each team in CoNLL-2014. As some error types are more common than others, we also provide the TP, FP and FN counts used to make this table in Appendix B, Table 11.
Overall, CAMB was the most successful team in terms of error types, achieving the highest F-score in 10 (out of 24) error categories, followed by AMU, who scored highest in 6 categories. All but 3 teams (IITB, IPN and POST) achieved the best score in at least 1 category, which suggests that different approaches to GEC complement different error types. Only CAMB attempted to correct at least 1 error from every category.
Table 6: Precision, recall and F0.5 for each team and error type. A dash indicates the team's system did not attempt to correct the given error type (TP+FP = 0). The highest F-score for each type is highlighted.
Table 7: Detailed breakdown of Determiner errors for two teams.
Other interesting observations we can make from Table 6 include:
• Despite the prevalence of spell checkers nowadays, many teams did not seem to employ them; this would have been an easy way to boost overall performance.
• Although several teams built specialised classifiers for DET and PREP errors, CAMB's hybrid MT approach still outperformed them. This might be because the classifiers were trained using a different error type framework however.
• CUUI's classifiers significantly outperformed all other approaches at ORTH and VERB:FORM errors. This suggests classifiers are well-suited to these error types.
• Although UFC's rule-based approach was the best at VERB:SVA errors, CUUI's classifier was not very far behind.
• Only AMU managed to correct any CONJ errors.
• Content word errors (i.e. ADJ, ADV, NOUN and VERB) were unsurprisingly very difficult for all teams.

Detailed Error Types
In addition to analysing general error types, the modular design of our framework also allows us to evaluate error type performance at an even greater level of detail. For example, Table 7 shows the breakdown of Determiner errors for two teams using different approaches in terms of edit operation. Note that this is a representative example of detailed error type performance, as an analysis of all error type combinations for all teams would take up too much space. While CAMB's hybrid MT approach achieved a higher score than CUUI's classifier overall, our more detailed evaluation reveals that CUUI actually outperformed CAMB at Replacement Determiner errors. We also learn that CAMB scored twice as highly on M:DET and U:DET as it did on R:DET and that CUUI's significantly higher U:DET recall was offset by a lower precision. Ultimately, this shows that even though one approach might be better than another overall, different approaches may still have complementary strengths.

Multi Token Errors
Another benefit of explicitly annotating all hypothesis edits is that edit spans become fixed; this means we can evaluate system performance in terms of edit size. Table 8 hence shows the overall performance for each team at correcting multi-token edits, where a multi-token edit is an edit that has at least two tokens on either side. In the CoNLL-2014 test set, there are roughly 220 such edits (about 10% of all edits).
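With fixed edit spans, selecting multi-token edits reduces to a length check. The helper below is a hypothetical illustration that reads "on either side" as "on the original side or on the corrected side"; that reading is an assumption of this sketch.

```python
def is_multi_token(orig_tokens, cor_tokens):
    """True if the edit has at least two tokens on the original or corrected side."""
    return len(orig_tokens) >= 2 or len(cor_tokens) >= 2


examples = [
    (["have", "eating"], ["has", "eaten"]),  # two tokens on both sides
    (["on"], ["of", "the"]),                 # two tokens on the corrected side only
    (["guide"], ["guided"]),                 # single-token replacement
    ([], ["the"]),                           # single-token insertion
]
flags = [is_multi_token(o, c) for o, c in examples]
```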
In general, teams did not do well at multi-token edits. In fact only three teams achieved scores greater than 10% F0.5 and all of them used MT (AMU, CAMB, UMC). This is significant because recent work has suggested that the main goal of GEC should be to produce fluent-sounding, rather than just grammatical sentences, even though this often requires complex multi-token edits (Sakaguchi et al., 2016). If no system is particularly adept at correcting multi-token errors however, robust fluency correction will likely require more sophisticated methods than are currently available.

Detection vs. Correction
Another important aspect of GEC that is seldom reported in the literature is that of error detection; i.e. the extent to which a system can identify erroneous tokens in text. This can be calculated by comparing the edit overlap between the hypothesis and reference files regardless of the proposed correction in a manner similar to Recognition evaluation in the HOO shared tasks for GEC (Dale and Kilgarriff, 2011). Figure 1 hence shows how each team's score for detection differed in relation to their score for correction. While CAMB scored highest for detection overall, it is interesting to note that CUUI ultimately performed slightly better than CAMB at correction. This suggests CUUI was more successful at correcting the errors they detected than CAMB. In contrast, IPN and PKU are notable for detecting significantly more errors than they were able to correct. Nevertheless, a system's ability to detect errors, even if it is unable to correct them, is still likely to be valuable information to a learner (Rei and Yannakoudakis, 2016).
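The difference between the two evaluation modes can be illustrated with a small sketch. It assumes that detection requires an exact span match while ignoring the correction, and that edits are hypothetical (start, end, correction) tuples; partial span overlap is not handled here.

```python
def count_matches(hyp_edits, ref_edits, detection=False):
    """Return (TP, FP, FN); in detection mode the proposed correction is ignored."""
    def key(edit):
        start, end, correction = edit
        return (start, end) if detection else (start, end, correction)

    hyp_keys = {key(e) for e in hyp_edits}
    ref_keys = {key(e) for e in ref_edits}
    tp = len(hyp_keys & ref_keys)
    fp = len(hyp_keys - ref_keys)
    fn = len(ref_keys - hyp_keys)
    return tp, fp, fn


hyp = [(3, 4, "guided"), (5, 6, "at")]
ref = [(3, 4, "guided"), (5, 6, "of")]
correction_counts = count_matches(hyp, ref)
detection_counts = count_matches(hyp, ref, detection=True)
```

In this toy case the second hypothesis edit targets the right span with the wrong replacement, so it counts as a false positive for correction but as a true positive for detection.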
Finally, although we do not do so here, our scorer is also capable of providing a detailed error type breakdown for detection.

Conclusion
In this paper, we described ERRANT, a grammatical ERRor ANnotation Toolkit designed to automatically annotate parallel error correction data with explicit edit spans and error type information. ERRANT can be used to not only facilitate a detailed error type evaluation in GEC, but also to standardise existing error correction corpora and reduce annotator workload. We release ERRANT with this paper.
Our approach makes use of previous work to align sentences based on linguistic intuition and then introduces a new rule-based framework to classify edits. This framework is entirely dataset independent, and relies only on automatically obtained information such as POS tags and lemmas. A small-scale evaluation of our classifier found that each rater considered >95% of the predicted error types as either "Good" (85%) or "Acceptable" (10%).