Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction

In this work, we study parameter tuning towards the M$^2$ metric, the standard metric for automatic grammar error correction (GEC) tasks. After implementing M$^2$ as a scorer in the Moses tuning framework, we investigate interactions of dense and sparse features, different optimizers, and tuning strategies for the CoNLL-2014 shared task. We notice erratic behavior when optimizing sparse feature weights with M$^2$ and offer partial solutions. To our surprise, we find that a bare-bones phrase-based SMT setup with task-specific parameter-tuning outperforms all previously published results for the CoNLL-2014 test set by a large margin (46.37% M$^2$ over previously 40.56%, by a neural encoder-decoder model) while being trained on the same data. Our newly introduced dense and sparse features widen that gap, and we improve the state-of-the-art to 49.49% M$^2$.


Introduction
Statistical machine translation (SMT), especially the phrase-based variant, is well established in the field of automatic grammatical error correction (GEC), and systems that are either pure SMT or incorporate SMT as a system component have occupied top positions in GEC shared tasks for different languages.
With the recent paradigm shift in machine translation towards neural translation models, neural encoder-decoder models are expected to appear in the field of GEC as well, and the first published results (Xie et al., 2016) already mark the new state of the art for GEC. As is the case in classical bilingual machine translation research, these models should be compared against strong SMT baselines. In this paper we attempt to provide these baselines.
During our first experiments, we find, to our surprise, that a bare-bones phrase-based system outperforms the best published results on the CoNLL-2014 test set by a significant margin, solely due to a task-specific parameter tuning scheme, while being trained on the same data as these previous systems. When we further investigate the influence of well-known SMT-specific features and introduce new features adapted to the problem of GEC, our final systems outperform the best reported results by 9% M$^2$, moving the state-of-the-art results for the CoNLL-2014 test set from 40.56% M$^2$ to 49.49%.
The paper is organized as follows: Section 2 describes previous work, especially the CoNLL-2014 shared task on GEC and relevant follow-up papers. Our main contributions are presented in Sections 3 and 4, where we investigate the interaction of parameter tuning towards the M$^2$ metric with task-specific dense and sparse features. Tuning for sparse features in particular is more challenging than initially expected, but it seems that we found optimizer hyperparameters that make sparse feature weight tuning with M$^2$ feasible. Section 5 reports on the effects of adding a web-scale n-gram language model to our models, as has been done in previous work.

The CoNLL-2014 Shared Task
While machine translation has been used for GEC in works as early as Brockett et al. (2006), we start our discussion with the CoNLL-2014 shared task (Ng et al., 2014), where for the first time an unrestricted set of errors had to be fully corrected. Previous work, most notably during the CoNLL-2013 shared task (Ng et al., 2013), concentrated on only five selected error types, but machine translation approaches (Yoshimoto et al., 2013; Yuan and Felice, 2013) were used as well.
arXiv:1605.06353v1 [cs.CL] 20 May 2016
The goal of the CoNLL-2014 shared task was to evaluate algorithms and systems for automatically correcting grammatical errors in essays written by second language learners of English. Grammatical errors of 28 types were targeted. Participating teams were given training data with manually annotated corrections of grammatical errors and were allowed to use additional publicly available data.
The corrected system outputs were evaluated blindly using the MaxMatch (M$^2$) metric (Dahlmeier and Ng, 2012). Thirteen system submissions took part in the shared task. Among the top three systems, two submissions, CAMB (Felice et al., 2014) and AMU (Junczys-Dowmunt and Grundkiewicz, 2014), were partially or fully based on SMT. The second-ranked system, CUUI (Rozovskaya et al., 2014), was a classifier-based approach, another popular paradigm in GEC.

Aftermath
Shortly after the shared task, Susanto et al. (2014) published work on GEC system combinations. They combined the output from a classification-based system and an SMT-based system using MEMT (Heafield and Lavie, 2010), reporting new state-of-the-art results for the CoNLL-2014 test set.
Xie et al. (2016) present a neural network-based approach to GEC. Their method relies on a character-level encoder-decoder recurrent neural network with an attention mechanism. They use data from the public Lang-8 corpus and combine their model with an n-gram language model trained on web-scale Common Crawl data. Adding synthesized erroneous data, they achieve the best published results for the CoNLL-2014 test set so far.
In Figure 1, systems marked as unrestricted (u) make use of web-scale data; this corresponds to the data used in Xie et al. (2016). We marked the participants of the CoNLL-2014 shared task as unrestricted as some participants made use of Common Crawl data or Google n-grams.

Dense feature optimization
Moses comes with tools that can tune parameter vectors according to different MT tuning metrics. Prior work used Moses with default settings: minimum error rate training (Och, 2003) towards BLEU (Papineni et al., 2002). BLEU was never designed for grammatical error correction; we find that directly optimizing for M$^2$ works far better.

Tuning towards M$^2$
The M$^2$ metric (Dahlmeier and Ng, 2012) is an F-score based on the edits extracted from a Levenshtein distance matrix. For the CoNLL-2014 shared task, the β parameter was set to 0.5, putting two times more weight on precision than on recall.
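To make the role of β concrete, the following is a minimal sketch of the generic F-beta computation over edit counts. It is not the MaxMatch scorer itself: the search for the system edit sequence that best matches the gold annotation is not reproduced, and the function name, argument names, and the handling of empty edit sets are assumptions made for illustration.

```python
def f_beta(correct: int, proposed: int, gold: int, beta: float = 0.5) -> float:
    """Generic F-beta score computed from edit counts.

    correct  -- proposed edits that also appear in the gold annotation
    proposed -- all edits proposed by the system
    gold     -- all edits in the gold annotation
    """
    # Treating an empty edit set as a perfect score is an assumption of this sketch.
    precision = correct / proposed if proposed else 1.0
    recall = correct / gold if gold else 1.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# With beta = 0.5, a precise but cautious system (P = 0.5, R = 0.25)
# scores higher than a system with the two values swapped.
cautious = f_beta(correct=1, proposed=2, gold=4)
eager = f_beta(correct=1, proposed=4, gold=2)
print(round(cautious, 4), round(eager, 4))
```

With β = 1 both systems in the example would receive the same score, which illustrates why the choice of β = 0.5 rewards precision.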
Junczys-Dowmunt and Grundkiewicz (2014) have shown that tuning with BLEU is counterproductive in a setting where M$^2$ is the evaluation metric. For inherently weak systems this can result in all correction attempts being disabled: MERT learns to disallow all changes since they lower the similarity to the reference as determined by BLEU. Systems with better training data can be tuned with BLEU without suffering this "disabling" effect, but will reach non-optimal performance. However, Susanto et al. (2014) tune the feature weights of their two SMT-based systems with BLEU on the CoNLL-2013 test set and report state-of-the-art results.
We re-implemented the M$^2$ metric in C++ and added it as a scorer to the Moses parameter optimization framework. Due to this integration we can now tune parameter weights with MERT, PRO or Batch Mira. The inclusion of the latter two enables us to experiment with sparse features.
Based on the findings of Clark et al. (2011) concerning the effects of optimizer instability, we report results averaged over five tuning runs. Additionally, we compute parameter weight vector centroids as suggested by Cettolo et al. (2011). They showed that parameter vector centroids averaged over several tuning runs yield results similar to or better than the average and reduce variance. We generally confirm this for M$^2$-based tuning.
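The centroid computation can be sketched as a plain arithmetic mean over the weight vectors of several tuning runs. This is an illustration only: the feature names `LM0` and `TM0` and the weights are made up, and any normalisation that the actual procedure may apply before averaging is left out.

```python
from typing import Dict, List


def centroid(weight_vectors: List[Dict[str, float]]) -> Dict[str, float]:
    """Average feature weights over several tuning runs.

    A feature that is missing from a run is treated as having weight 0.0,
    which is an assumption of this sketch.
    """
    if not weight_vectors:
        raise ValueError("need at least one weight vector")
    names = set().union(*weight_vectors)  # union of all feature names
    n = len(weight_vectors)
    return {
        name: sum(w.get(name, 0.0) for w in weight_vectors) / n
        for name in sorted(names)
    }


# Two hypothetical tuning runs with illustrative feature names and weights.
runs = [
    {"LM0": 0.5, "TM0": 0.2},
    {"LM0": 0.3, "TM0": 0.4},
]
print(centroid(runs))
```

The averaged vector is then used for decoding in place of the weights from any single run.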

Dense features
The standard features in SMT have been chosen to help guide the translation process. In a GEC setting the most natural units seem to be minimal edit operations that can be either counted or modeled in context with varying degrees of generalization. That way, the decoder can be informed on several levels of abstraction how the output differs from the input.$^1$ In this section we implement several features that try to capture these operations in isolation and in context.
$^1$ We believe this is important information that currently has not yet been mastered in neural encoder-decoder approaches.

Stateless features
Our stateless features are computed during translation option generation before decoding, modeling relations between source and target phrases. They are meant to extend the standard SMT-specific MLE-based phrase and word translation probabilities with meaningful phrase-level information about the correction process.
Edit operation counts. We further refine the Levenshtein distance feature with edit operation counts. Based on the Levenshtein distance matrix, the numbers of deletions, insertions, and substitutions that transform the source phrase into the target phrase are computed; the sum of these counts is equal to the original Levenshtein distance (see Table 1 for examples).
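A minimal sketch of how such counts can be read off a word-level Levenshtein matrix is given below. It is an illustrative Python version, not the feature function implemented in Moses; the function name and the tie-breaking order during the backtrace are assumptions.

```python
from typing import List, Tuple


def edit_operation_counts(source: List[str], target: List[str]) -> Tuple[int, int, int]:
    """Return (deletions, insertions, substitutions) along one minimal edit path."""
    n, m = len(source), len(target)
    # dist[i][j] = Levenshtein distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right cell, preferring the diagonal move.
    deletions = insertions = substitutions = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                substitutions += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            deletions += 1
            i -= 1
        else:
            insertions += 1
            j -= 1
    return deletions, insertions, substitutions


src = "Then a new problem comes out .".split()
tgt = "Hence , a new problem surfaces .".split()
d, ins, s = edit_operation_counts(src, tgt)
print(d, ins, s, d + ins + s)
```

Because the backtrace only follows transitions that are consistent with the distance matrix, the three counts always sum to the Levenshtein distance of the pair.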

Stateful features
Contrary to stateless features, stateful features can look at translation hypotheses outside their own span and take advantage of the constructed target context. The most typical stateful features are language models. In this section, we discuss LM-like features over edit operations.
Operation Sequence Model. Durrani et al. (2013) introduce Operation Sequence Models in Moses. These models are Markov translation models that in our setting can be interpreted as Markov edit models. Translations between identical words are matches, translations that have different words on source and target sides are substitutions; insertions and deletions are interpreted in the same way as for SMT. Gaps, jumps, and other operations typical for OSMs do not appear as we disabled reordering.
Word-class language model. The monolingual Wikipedia data has been used to create a 9-gram word-class language model with 200 word classes produced by word2vec (Mikolov et al., 2013). This feature makes it possible to capture long-distance dependencies and semantic aspects.

Training and Test Data
The training data provided in both shared tasks is the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013). NUCLE consists of 1,414 essays written by Singaporean students who are non-native speakers of English. The essays cover a wide range of topics, such as environmental pollution, health care, etc. The grammatical errors in these essays have been hand-corrected by professional English teachers and annotated with one of the 28 predefined error types. Another 50 essays, collected and annotated similarly to NUCLE, were used in both CoNLL GEC shared tasks as blind test data. The CoNLL-2014 test set has been annotated by two human annotators, the CoNLL-2013 test set by one annotator. Many participants of the CoNLL-2014 shared task used the test set from 2013 as a development set for their systems.
As mentioned before, in order to make our results comparable to previous work, we report main results using similar training data as Susanto et al. (2014). We refer to this setting as the "restricted-data setting" (r). Parallel data for translation model training is adapted from the above-mentioned NUCLE corpus and the publicly available Lang-8 corpus (Mizumoto et al., 2012). Uncorrected sentences serve as source language data, corrected counterparts as target language data. For language modeling, the target language sentences of both parallel resources are used; additionally, we extract all text from the English version of Wikipedia. Table 2 lists all data sources and sizes.
Phrase-based SMT makes it easy to scale up in terms of training data, especially in the case of n-gram language models. The CoNLL shared tasks did not impose any restrictions concerning training data (apart from having to be freely available) on their participants. As mentioned above, to demonstrate the ease of data integration we propose an "unrestricted setting" (u) based on the data used in Junczys-Dowmunt and Grundkiewicz (2014), one of the shared task submissions, and later in Xie et al. (2016). We use Common Crawl data made available by Buck et al. (2014).

Experiments
Our system is based on the phrase-based part of the statistical machine translation system Moses (Koehn et al., 2007). Only plain text data is used for language model and translation model training. External linguistic knowledge is introduced during parameter tuning as the tuning metric relies on the error annotation present in NUCLE. The translation model is built with the standard Moses training script; word-alignment models are produced with MGIZA++ (Gao and Vogel, 2008), and we restrict the word alignment training to 5 iterations of Model 1 and 5 iterations of the HMM model. No reordering models are used; the distortion limit is set to 0, effectively prohibiting any reordering. All systems use one 5-gram language model that has been estimated from the target side of the parallel data available for translation model training. Another 5-gram language model is trained on Wikipedia in the restricted setting or on Common Crawl data in the unrestricted case. We combine the two language models as features in the log-linear model of Moses.
Systems are retuned when new features of any type are added. We first successfully reproduce results from Susanto et al. (2014) for BLEU-based tuning on the CoNLL-2013 test set as the development set (Fig. 2a) using similar training data. Repeated tuning places the scores reported by Susanto et al. (2014) for their
Tuning directly with M$^2$ (Fig. 2b) and averaging weights across five iterations yields between 40.66% M$^2$ for a vanilla Moses system and 42.32% for a system with all described dense features. Results seem to be more stable. Averaging weight vectors across runs to produce the final vector seems like a fair bet. Performance with the averaged weight vectors is either similar to or better than the average number for five runs.
To emphasize: bare-bones Moses without any specialized features and with restricted data rivals the best reported systems on the CoNLL-2014 test set (40.56%) trained on unrestricted data. This is achieved solely due to M$^2$-based tuning. Better tuning sets (next section), task-specific features, and more data (Section 5) only increase that performance advantage.

Larger development sets
No less important than choosing the correct tuning metric is a good choice of the development set. Among MT researchers, there are a number of more or less well-known truths about suitable development sets for translation-focused settings: usually they consist of between 2000 and 3000 sentences, they should be a good representation of the testing data, sparse features require more sentences or more references, etc. Until now, we followed the seemingly obvious approach from Susanto et al. (2014) to tune on the CoNLL-2013 test set. The CoNLL-2013 test set consists of 1380 sentences, which might be barely enough for a translation task, and it is unclear how to quantify it in the context of grammar correction. Furthermore, calculating the error rate in this set reveals that only 14.97% of the tokens are part of an erroneous fragment; for the rest, input and reference data are identical. Intuitively, this seems to be very little significant data for tuning an SMT system.
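One way such a token-level error rate could be computed is sketched below. How the figure above was actually obtained is not specified in the text, so the input format (token-offset spans with exclusive ends) and the treatment of pure insertions as covering no source token are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def erroneous_token_rate(sentences: Sequence[Tuple[List[str], List[Span]]]) -> float:
    """Fraction of source tokens covered by at least one annotated error span."""
    total = 0
    covered = 0
    for tokens, spans in sentences:
        flagged = set()
        for start, end in spans:
            # Empty spans (pure insertions) add no token in this sketch.
            flagged.update(range(start, end))
        total += len(tokens)
        covered += len(flagged)
    return covered / total if total else 0.0


# Toy example: "Then" and "comes out" are marked as erroneous fragments.
data = [("Then a new problem comes out .".split(), [(0, 1), (4, 6)])]
print(erroneous_token_rate(data))
```

Overlapping spans are counted once per token, so the rate never exceeds 1.0 as long as the spans stay within the sentence.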
We therefore decide to take advantage of the entire NUCLE data as a development set, which so far has only been used as translation model train-
As can be seen in Fig. 2c, this procedure significantly improves performance, also for the bare-bones set-up (41.63%). The lower variance between iterations is an effect of averaging across folds.
It turns out that what was meant to be a strong baseline is actually among the strongest systems reported for this task, outperformed only by the further improvements over this baseline presented in this work.

Sparse Features
We saw that introducing finer-grained edit operations improved performance. The natural evolution of that idea is features that describe specific correction operations with and without context. This can be accomplished with sparse features, but tuning sparse features according to the M$^2$ metric poses unexpected problems.

Optimizing for M$^2$ with PRO and Mira
The MERT tool included in Moses cannot handle parameter tuning with sparse feature weights, and one of the other optimizers available in Moses has to be used. We first experimented with both PRO (Hopkins and May, 2011) and Batch Mira (Cherry and Foster, 2012) for the dense features only, and found PRO and Batch Mira with standard settings to either severely underperform in comparison to MERT or to suffer from instability with regard to different test sets (Table 3).
Experiments with Mira hyper-parameters allowed us to counter these effects. We first change the background BLEU approximation method in Batch Mira to use model-best hypotheses (--model-bg), which seems to produce more satisfactory results. Inspecting the tuning process, however, reveals problems with this setting, too. Figure 3 reveals how unstable the tuning process with Mira is across iterations. The best result is reached after only three iterations. In a setting with sparse features this would
After consulting with one of the authors of Batch Mira, we set the background corpus decay rate to 0.001 (-D 0.001), resulting in a sentence-level approximation of M$^2$. Mira's behavior seems to stabilize across iterations. At this point it is not quite clear why this is required. While PRO's behavior is more sane during tuning, results on the test sets are subpar. It seems that no comparable hyperparameter settings exist for PRO.

Sparse edit operations
Our sparse edit operations are again based on the Levenshtein distance matrix and count specific edits that are annotated with the source and target tokens that took part in the edit. For the following erroneous/corrected sentence pair
    Err: Then a new problem comes out .
    Cor: Hence , a new problem surfaces .
we generate sparse features that model contextless edits (matches are omitted):
    subst(Then,Hence)=1
    insert(,)=1
    subst(comes, surfaces)=1
    del(out)=1
and sparse features with one-sided left or right or two-sided context:
    <s>_subst(Then,Hence)=1
    subst(Then,Hence)_a=1
    Hence_insert(,)=1
    insert(,)_a=1
    problem_subst(comes, surfaces)=1
    subst(comes, surfaces)_out=1
    comes_del(out)=1
    del(out)_.=1
    <s>_subst(Then,Hence)_a=1
    Hence_insert(,)_a=1
    problem_subst(comes, surfaces)_out=1
    comes_del(out)_.=1
All sparse feature types are added on top of our best dense-features system. When using sparse features with context, the contextless features are included. All the presented features are stateless; the context annotation comes from the erroneous source sentence, not from the corrected target sentence. We further investigate different source factors: elements taking part in the edit operation or appearing in the context can either be word forms (factor 0) or word classes (factor 1). As before for dense features, we average sparse feature weights across folds and multiple tuning runs.
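The naming scheme of these features can be sketched as follows. This is an illustration of how the feature strings are composed, not the Moses feature function: the `Edit` record and both function names are hypothetical, and the left and right context tokens are passed in by the caller (copied here from the listing above) rather than extracted from the sentence.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Edit(NamedTuple):
    op: str     # "subst", "insert" or "del"
    src: str    # source token ("" for insertions)
    tgt: str    # target token ("" for deletions)
    left: str   # left context token, supplied by the caller
    right: str  # right context token, supplied by the caller


def core_name(edit: Edit) -> str:
    """Contextless feature name for a single edit."""
    if edit.op == "subst":
        return f"subst({edit.src},{edit.tgt})"
    if edit.op == "insert":
        return f"insert({edit.tgt})"
    if edit.op == "del":
        return f"del({edit.src})"
    raise ValueError(f"unknown edit operation: {edit.op}")


def sparse_features(edits: List[Edit], with_context: bool = True) -> Dict[str, int]:
    """Count contextless features and, optionally, their context variants."""
    feats: Counter = Counter()
    for edit in edits:
        core = core_name(edit)
        feats[core] += 1  # contextless features are always included
        if with_context:
            feats[f"{edit.left}_{core}"] += 1               # left context
            feats[f"{core}_{edit.right}"] += 1              # right context
            feats[f"{edit.left}_{core}_{edit.right}"] += 1  # two-sided context
    return dict(feats)


edits = [
    Edit("subst", "Then", "Hence", "<s>", "a"),
    Edit("insert", "", ",", "Hence", "a"),
    Edit("subst", "comes", "surfaces", "problem", "out"),
    Edit("del", "out", "", "comes", "."),
]
for name, count in sparse_features(edits).items():
    print(f"{name}={count}")
```

Each of the four edits yields one contextless feature and three context variants, which matches the sixteen features listed for the example pair.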
Figure 4 summarizes the results for our sparse feature experiments. On both test sets we can see significant improvements when including edit-based sparse features; the performance increases even more when source context is added. The CoNLL-2013 test set contains annotations from only one annotator and is strongly biased towards high precision, which might explain the greater instability. It appears that sparse features with context where surface forms and word classes are mixed allow for the best fine-tuning.

Adding a web-scale language model
Until now we restricted our experiments to data used by Susanto et al. (2014). However, systems from the CoNLL-2014 shared task were free to use any publicly available data; for instance, Junczys-Dowmunt and Grundkiewicz (2014) make use of an n-gram language model trained on Common Crawl. Xie et al. (2016) reach the best published result for the task (before this work) by integrating a similar n-gram language model with their neural approach.
Table 4 summarizes the best results reported in this paper for the CoNLL-2014 test set (column 2014) before and after adding the Common Crawl n-gram language model. The vanilla Moses baseline with the Common Crawl model can be seen as a new simple baseline for unrestricted settings and is ahead of any previously published result. The combination of sparse features and web-scale monolingual data marks our best result, outperforming previously published results by 9% M$^2$ (a relative improvement of about 22%) using similar training data. While our sparse features cause a respectable gain when used with the smaller language model, the web-scale language model seems to cancel out part of the effect.
Bryant and Ng (2015) extended the CoNLL-2014 test set with additional annotations from two to ten annotators. We report results for this valuable resource (column 2014-10) as well. According to Bryant and Ng (2015), human annotators seem to reach on average 72.58% M$^2$, which can be seen as an upper bound for the task. In this work, we made a large step towards this upper bound.

Conclusions
Despite the fact that statistical machine translation approaches are among the most popular methods in automatic grammatical error correction, few papers that report results for the CoNLL-2014 test set seem to have exploited their full potential. An important aspect when training SMT systems, namely that one needs to tune model parameters towards the task evaluation metric, seems to have been underexplored.
We have shown that a bare-bones SMT system actually outperforms the best reported results for any paradigm in GEC if correct parameter tuning is performed. With this tuning mechanism available, task-specific features have been explored that bring further significant improvements, putting phrase-based SMT ahead of other approaches by a large margin. None of the explored features require complicated pipelines or reranking mechanisms. Instead they are a natural part of the log-linear model in phrase-based SMT. It is therefore quite easy to reproduce our results, and the presented systems may serve as new baselines for automatic grammatical error correction.

Figure 1: Comparison with previous results on the CoNLL-2014 test set. The two horizontal dashed lines mark performances for our bare-bones Moses system with restricted (r) and unrestricted (u) data.
Table 1: Word-based Levenshtein distance (LD) feature and separated edit operations (D = deletions, I = insertions, S = substitutions)
MLE-based phrase and word translation probabilities with meaningful phrase-level information about the correction process.
Levenshtein distance. Junczys-Dowmunt and Grundkiewicz (2014) use word-based Levenshtein distance between source and target phrases as a translation model feature; Felice et al. (2014) independently experiment with a character-based version.
Figure 3: Results per iteration on development set (4-th NUCLE fold)
Figure 4: Results on the CoNLL-2014 test set for different sparse features sets
Err: Then a new problem comes out .
Cor: Hence , a new problem surfaces .
within the range of possible values for a purely Moses-based system without any spe-
Figure 2: Results on the CoNLL-2014 test set for different optimization settings (5 runs for each system) and different feature sets (the "All dense" entry includes OSM, the word class language model, and edit operations). The small circle marks results for averaged weights vectors and is chosen as the final result.

Table 4: Best results in restricted setting with added unrestricted language model for original (2014) and extended (2014-10) CoNLL test set
source (column 2014-10) as well. 2 According to the