Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation

We combine two of the most popular approaches to automated Grammatical Error Correction (GEC): GEC based on Statistical Machine Translation (SMT) and GEC based on Neural Machine Translation (NMT). The hybrid system achieves new state-of-the-art results on the CoNLL-2014 and JFLEG benchmarks. This GEC system preserves the accuracy of SMT output and, at the same time, generates more fluent sentences, as is typical for NMT. Our analysis shows that the created systems are closer to reaching human-level performance than any other GEC system reported so far.


Introduction
Currently, the most effective GEC systems are based on phrase-based statistical machine translation (Rozovskaya and Roth, 2016; Junczys-Dowmunt and Grundkiewicz, 2016; Chollampatt and Ng, 2017). Systems that rely on neural machine translation (Yuan and Briscoe, 2016; Xie et al., 2016; Schmaltz et al., 2017; Ji et al., 2017) do not yet perform as well as SMT systems according to automatic evaluation metrics (see Table 1 for comparison on the CoNLL-2014 test set). However, it has been shown that the neural approach can produce more fluent output, which human evaluators might prefer (Napoles et al., 2017). In this work, we combine both MT flavors within a hybrid GEC system. Such a GEC system preserves the accuracy of SMT output and at the same time generates more fluent sentences, achieving new state-of-the-art results on two different benchmarks: the annotation-based CoNLL-2014 and the fluency-based JFLEG benchmark. Moreover, comparison with human gold standards shows that the created systems are closer to reaching human-level performance than any other GEC system described in the literature so far. Using consistent training data and preprocessing (§ 2), we first create strong SMT (§ 3) and NMT (§ 4) baseline systems. Then, we experiment with system combinations through pipelining and reranking (§ 5). Finally, we compare the performance with human annotations and identify issues with current state-of-the-art systems (§ 6).

Data and preprocessing
Our main training data is NUCLE (Dahlmeier et al., 2013). English sentences from the publicly available Lang-8 Corpora (Mizumoto et al., 2012) serve as additional training data.
We use the official test sets from the two CoNLL shared tasks from 2013 and 2014 (Ng et al., 2013, 2014) as development and test data, and evaluate using M 2 (Dahlmeier and Ng, 2012). We also report results on JFLEG (Napoles et al., 2017) with the GLEU metric (Napoles et al., 2015). The data set is provided with a development and test set split. All data sets are listed in Table 1.
We preprocess Lang-8 with the NLTK tokenizer (Bird and Loper, 2004) and preserve the original tokenization in NUCLE and JFLEG. Sentences are truecased with scripts from Moses (Koehn et al., 2007). For dealing with out-of-vocabulary words, we split tokens into 50k subword units using Byte Pair Encoding (BPE) by Sennrich et al. (2016b). BPE codes are extracted only from correct sentences from Lang-8 and NUCLE.
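To make the subword step concrete, the following is a minimal sketch of how BPE merge operations can be learned from word frequencies. It is a toy illustration of the general algorithm, not the tool or settings used for the experiments; the function name `learn_bpe`, the end-of-word marker `</w>`, the tie-breaking rule, and the example words are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merge operations from word frequencies."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one merged symbol.
        new_vocab: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


# Toy frequencies (made up); in the setup above, codes come from corrected sentences only.
toy_merges = learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=1)
```

In this toy input the pair `u`, `g` is the most frequent adjacent pair, so it is merged first; rare or misspelled words end up split into several known subword units instead of becoming out-of-vocabulary tokens.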

SMT systems
For our SMT-based systems, we follow recipes proposed by Junczys-Dowmunt and Grundkiewicz (2016), and use a phrase-based SMT system with a log-linear combination of task-specific features. We use word-level Levenshtein distance and edit operation counts as dense features (Dense), and correction patterns on words with one word left/right context on Word Classes (WC) as sparse features (Sparse). We also experiment with additional character-level dense features (Char. ops). All systems use a 5-gram Language Model (LM) and OSM (Durrani et al., 2011) both estimated from the target side of the training data, and a 5-gram LM and 9-gram WCLM trained on Common Crawl data (Buck et al., 2014).
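As an illustration of the dense features, the sketch below computes a word-level Levenshtein distance between a source and a target token sequence together with counts of the individual edit operations. It is a simplified stand-in under our own assumptions: the function name `edit_features` and the feature names are hypothetical, and the actual feature definitions in the Moses-based systems may differ.

```python
from typing import Dict, List


def edit_features(source: List[str], target: List[str]) -> Dict[str, int]:
    """Word-level Levenshtein distance plus counts of each edit operation."""
    n, m = len(source), len(target)
    # dist[i][j] is the edit distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a source word
                dist[i][j - 1] + 1,         # insert a target word
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )
    # Walk back through the table along one optimal path to count operations.
    counts = {"insertions": 0, "deletions": 0, "substitutions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitutions"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    counts["levenshtein"] = dist[n][m]
    return counts


features = edit_features(
    "the life becom more dificullty".split(),
    "life becomes more difficult".split(),
)
```

For this pair the only optimal alignment deletes "the" and substitutes the two misspelled words, so the distance is 3. Character-level variants of the same features can be obtained by passing lists of characters instead of lists of words.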
Experiment settings Translation models are trained with Moses (Koehn et al., 2007), word-alignment models are produced with MGIZA++ (Gao and Vogel, 2008), and no reordering models are used. Language models are built using KenLM (Heafield, 2011), while word classes are trained with word2vec.
We tune the systems separately for the M 2 and GLEU metrics using MERT (Och, 2003). For M 2 tuning, we follow the 4-fold cross-validation on NUCLE with adapted error rate recommended by Junczys-Dowmunt and Grundkiewicz (2016). Models evaluated on GLEU are optimized on JFLEG Dev using the GLEU scorer, which we added to Moses. We report results for models using feature weights averaged over 4 tuning runs.
Results Other things being equal, using the original tokenization, applying subword units, and extending edit-based features result in a system similar to that of Junczys-Dowmunt and Grundkiewicz (2016): 49.82 vs 49.49 M 2 (Table 2).
The phrase-based SMT systems do not deal well with orthographic errors (Napoles et al., 2017): if a source word has not been seen in the training corpus, it is likely to be copied as a target word. Subword units can partially solve this problem. Adding features based on character-level edit counts improves the results on both test sets.
A result of 55.79 GLEU on JFLEG Test is already 2 points better than the GLEU-tuned NMT system of  and only 1 point worse than the best reported result by Chollampatt and Ng (2017) with their M 2 -tuned SMT system, even though no additional spelling correction has been used at this point. We experiment with specialized spell-checking methods in later sections.

NMT systems
The model architecture we choose for our NMT-based systems is an attentional encoder-decoder model with a bidirectional single-layer encoder and decoder, both using GRUs as their RNN variants (Sennrich et al., 2017). A similar architecture has already been tested for the GEC task by , but we use different hyperparameters.
To improve the performance of our NMT models, similarly to Xie et al. (2016) and Ji et al. (2017), we combine them with an additional large-scale language model. In contrast to previous studies, which use an n-gram probabilistic LM, we build a 2-layer Recurrent Neural Network Language Model (RNN LM) with GRU cells which we train again on English Common Crawl data (Buck et al., 2014).
Table 3: Results for NMT systems on the CoNLL-2014 (M 2 ) and JFLEG Test (GLEU) sets.
Experimental settings We train with the Marian toolkit (Junczys-Dowmunt et al., 2018) on the same data we used for the SMT baselines, i.e. NUCLE and Lang-8. The RNN hidden state size is set to 1024, embedding size to 512. Source and target vocabularies as well as subword units are the same.
Optimization is performed with Adam (Kingma and Ba, 2014), with the mini-batch size chosen to fit into 4GB of GPU memory. We regularize the model with scaling dropout (Gal and Ghahramani, 2016) with a dropout probability of 0.2 on all RNN inputs and states. In addition, we drop out entire source and target words with probabilities of 0.2 and 0.1, respectively. We use early stopping with a patience of 10 based on the cross-entropy cost on the CoNLL-2013 test set. Models are validated and saved every 10,000 mini-batches. As the final model, we choose the one with the best performance on the development set among the last ten model checkpoints, based on the M 2 or GLEU metric.
Size of RNN hidden state and embeddings, target vocabulary, and optimization options for the RNN LM are identical to those used for our NMT models. Decoding is done by beam search with a beam size of 12. We normalize scores for each hypothesis by sentence length.
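The effect of length normalization can be illustrated with a small sketch. We assume here that the score of a hypothesis is the sum of its token log-probabilities and that normalization divides this sum by the number of tokens; the exact normalization used by the toolkit may differ, and the names and numbers below are made up for illustration.

```python
from typing import List, Tuple

# A hypothesis is a (text, per-token log-probabilities) pair; both are toy values.
Hypothesis = Tuple[str, List[float]]


def raw_score(token_log_probs: List[float]) -> float:
    """Unnormalized score: total log-probability of the hypothesis."""
    return sum(token_log_probs)


def length_normalized_score(token_log_probs: List[float]) -> float:
    """Average log-probability per token (assumes a non-empty hypothesis)."""
    return sum(token_log_probs) / len(token_log_probs)


def best_hypothesis(beam: List[Hypothesis], normalize: bool = True) -> str:
    """Return the text of the highest-scoring hypothesis in the beam."""
    score = length_normalized_score if normalize else raw_score
    return max(beam, key=lambda hyp: score(hyp[1]))[0]


beam = [
    ("short output", [-0.6, -0.6]),
    ("a longer output", [-0.5, -0.5, -0.5]),
]
```

Without normalization the shorter hypothesis wins simply because it accumulates fewer negative terms; with normalization the longer hypothesis, whose tokens are individually more probable, is preferred.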
Results A single NMT model achieves lower performance than the SMT baselines (Table 3). However, the M 2 score of 42.76 for CoNLL-2014 is already higher than the best published result of 41.53 M 2 for a strictly neural GEC system of Ji et al. (2017) that has not been enhanced by an additional language model. Our RNN LM is integrated with NMT models through ensemble decoding (Sennrich et al., 2016a). Similarly to Ji et al. (2017), we choose the weight of the language model using grid search on the development set. This strongly improves recall, and thus boosts the results significantly on both test sets (+5.8 M 2 and +5.96 GLEU). An ensemble of four independently trained models (NMT×4), on the other hand, increases precision at the expense of recall, which may even lead to a performance drop. Adding the RNN LM to that ensemble balances this negative effect, resulting in 50.19 M 2 . These are by far the highest results reported on both benchmarks for pure neural GEC systems.
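The grid search over the language model weight can be sketched as follows. This is a deliberately simplified illustration: real ensemble decoding combines model scores at every decoding step, whereas this toy version scores complete candidates, and the `quality` field stands in for the M 2 or GLEU value that a real development-set evaluation would provide. All names and numbers are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    nmt_log_prob: float  # log-probability under the NMT model(s)
    lm_log_prob: float   # log-probability under the language model
    quality: float       # toy stand-in for the dev-set metric if this candidate is chosen


def combined_score(candidate: Candidate, lm_weight: float) -> float:
    """Log-linear combination of the NMT score and the weighted LM score."""
    return candidate.nmt_log_prob + lm_weight * candidate.lm_log_prob


def dev_quality(dev_set: List[List[Candidate]], lm_weight: float) -> float:
    """Total quality obtained when the top-scoring candidate is picked per sentence."""
    return sum(
        max(candidates, key=lambda c: combined_score(c, lm_weight)).quality
        for candidates in dev_set
    )


def grid_search_lm_weight(dev_set: List[List[Candidate]], grid: Sequence[float]) -> float:
    """Return the first weight in the grid that maximizes quality on the dev set."""
    return max(grid, key=lambda w: dev_quality(dev_set, w))


# Two toy development sentences with two candidates each (made-up numbers).
DEV_SET = [
    [Candidate(-1.0, -5.0, 0.0), Candidate(-1.2, -2.0, 1.0)],
    [Candidate(-0.5, -4.0, 1.0), Candidate(-2.5, -1.0, 0.0)],
]
GRID = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
best_weight = grid_search_lm_weight(DEV_SET, GRID)
```

In this toy setting a weight of zero ignores a correction that the language model clearly prefers, while a very large weight lets the language model override a good NMT candidate, so an intermediate weight gives the best development-set result.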
Comparison to SMT systems With model ensembling, the neural systems achieve performance similar to SMT baselines (Figure 2). A stripped-down SMT system without CCLM, quite surprisingly, gives better results on JFLEG than the NMT system, and the opposite is true for CoNLL-2014. The reason for the lower performance on JFLEG might be the large number of spelling errors, which are more efficiently corrected by the SMT system using subword units.
If both systems are enhanced by a large-scale language model, the neural system outperforms the SMT system on JFLEG and is competitive with SMT systems on CoNLL-2014. However, it is not known whether these results would hold if the NMT model were instead combined with a probabilistic n-gram LM, as has been proposed in previous work (Xie et al., 2016; Ji et al., 2017).

Hybrid SMT-NMT systems
We experiment with pipelining and rescoring methods in order to combine our best SMT and NMT GEC systems 4 .

SMT-NMT pipelines
The output corrected by an SMT system is passed as input to the NMT ensemble with or without the RNN LM 5 . In this case the NMT system serves as an automatic post-editing system. Pipelining improves the results on both test sets by increasing recall (Table 4). As the performance of the NMT system without an RNN LM is much lower than the performance of the SMT system alone, this implies that both approaches produce complementary corrections.
Rescoring with NMT Rescoring of an n-best list obtained from one system by another is a commonly used technique in GEC, which makes it possible to combine multiple different systems or even different approaches (Hoang et al., 2016; Yannakoudakis et al., 2017; Chollampatt and Ng, 2017; Ji et al., 2017). In our experiments, we generate a 1000-best list with the SMT system and add separate scores from each neural component. Scores of NMT models and the RNN LM are added in the form of probabilities in negative log space. The rescoring weights are obtained from a single run of the Batch Mira algorithm (Cherry and Foster, 2012) on the development set.
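A minimal sketch of n-best rescoring is given below. It assumes each hypothesis carries an SMT model score (higher is better) and negative log-probabilities from the neural components (lower is better), and that the final score is a weighted sum of these features. The feature names, the hand-set weights, and all numbers are hypothetical; in the experiments the weights are learned with Batch Mira rather than set by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hypothesis:
    text: str
    features: Dict[str, float]  # feature name -> value (toy values)


def total_score(hyp: Hypothesis, weights: Dict[str, float]) -> float:
    """Weighted sum of features; features without a weight are ignored."""
    return sum(weights.get(name, 0.0) * value for name, value in hyp.features.items())


def rescore(nbest: List[Hypothesis], weights: Dict[str, float]) -> Hypothesis:
    """Return the hypothesis with the highest weighted score."""
    return max(nbest, key=lambda hyp: total_score(hyp, weights))


nbest = [
    Hypothesis(
        "But now everything is changed , the life becom more dificullty .",
        {"smt": -10.0, "nmt_nll": 8.0, "lm_nll": 30.0},
    ),
    Hypothesis(
        "But now everything has changed , the life becom more dificullty .",
        {"smt": -10.5, "nmt_nll": 6.0, "lm_nll": 27.0},
    ),
]

# Negative weights for the neural features because they are negative log-probabilities.
smt_only = rescore(nbest, {"smt": 1.0})
rescored = rescore(nbest, {"smt": 1.0, "nmt_nll": -0.5, "lm_nll": -0.3})
```

With the SMT score alone the first hypothesis wins; once the neural scores are added, the second hypothesis moves to the top. Rescoring can only select among hypotheses that the SMT system already produced, so corrections absent from the n-best list remain out of reach.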
As opposed to pipelining, rescoring improves precision at the expense of recall and is more effective for the CoNLL data, resulting in up to 54.95 M 2 . On JFLEG, rescoring only with the RNN LM produces results similar to rescoring with the NMT ensemble. However, the best result for rescoring is lower than for pipelining on that test set. It seems the SMT system is not able to produce corrections in an n-best list that are as diverse as those generated by the NMT ensemble.
Footnote 4: The best system combinations are chosen again based on the development sets, i.e. CoNLL-2013 and JFLEG Dev. We omit these results as they are highly overestimated.
Footnote 5: We did not observe any improvements if the order of the systems is reversed.

Spelling correction and final results
Pipelining the NMT-rescored SMT system and the NMT system leads to further improvement. We believe this can be explained by different contributions to precision and recall trade-offs for the two methods, similar to effects observed for the combination of the NMT ensemble and our RNN LM. On top of our final hybrid system we add a spell-checking component, which is run before pipelining. We use a character-level SMT system following Chollampatt and Ng (2017), who deploy it for unknown words in their word-based SMT system. As our BPE-based SMT does not really suffer from unknown words, we run the spell-checking component on words that would have been segmented by the BPE algorithm. This last system achieves the best results reported in this paper: 56.25 M 2 on CoNLL-2014 and 61.50 GLEU on JFLEG Test.

System       Example
Source       but now every thing is change , the life becom more dificullty .
Best SMT     But now everything is changed , the life becom more dificullty .
Best NMT     But now everything is changing , the life becomes more difficult .
Pipeline     But now everything is changed , the life becomes more difficult .
Rescoring    But now everything has changed , the life becom more dificullty .
+ Pipeline   But now everything has changed , the life becomes more difficult .
Reference 1  Now everything has changed , and life becomes more difficult .
Reference 2  Everything has changed now and life has become more difficult .
Reference 3  But now that everything changes , life becomes more difficult .
Reference 4  But now that everything is changing , life becomes more difficult .

Analysis and future work
For both benchmarks our systems are close to automatic evaluation results that have been claimed to correspond to human-level performance on the CoNLL-2014 test set and on JFLEG Test. Table 5 shows system outputs for an example source sentence from the JFLEG Test corpus that illustrate the complementarity of the statistical and neural approaches. The SMT and NMT systems produce different corrections.

Example outputs
Rescoring is able to generate a unique correction (is change→has changed), but it fails to generate some corrections from the neural system, e.g. for misspellings (becom and dificullty). Pipelining, on the other hand, may not improve a local correction made by the SMT system (is changed). The combination of the two methods produces the output that is most similar to the references.
Comparison with human annotations Bryant and Ng (2015) created an extension of the CoNLL-2014 test set with 10 annotators in total; JFLEG already incorporates corrections from 4 annotators. Human-level results for M 2 and GLEU were calculated by averaging the scores for each annotator with regard to the remaining 9 (CoNLL) or 3 (JFLEG) annotators, respectively. Figure 3 contains human-level scores, our results, and the previously best reported results by Chollampatt and Ng (2017). Our best system reaches nearly 100% of the average human score according to M 2 and nearly 99% for GLEU, which is much closer to that bound than previous work 6 . Further inspection reveals, however, that the precision/recall trade-off for the automatic system indicates lower coverage compared to human corrections: lower recall is compensated with high precision. Automatic systems might, for example, miss some obvious error corrections and therefore be easily distinguishable from human references. Future work would require a human evaluation effort to draw more conclusions.
Footnote 6: During the camera-ready preparation, Chollampatt and Ng (2018) published a GEC system based on a multilayer convolutional encoder-decoder neural network with a character-based spell-checking module, improving the previous best result to 54.79 M 2 on CoNLL-2014 and 57.47 GLEU on JFLEG Test.
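The leave-one-out averaging used for the human-level scores can be sketched as follows. Each annotator is scored against the remaining annotators and the scores are averaged. The metric here is a toy exact-match rate that merely stands in for M 2 or GLEU, and all names and data are assumptions made for this example.

```python
from typing import Callable, List

Annotation = List[str]  # one corrected sentence per source sentence


def exact_match_rate(hypothesis: Annotation, references: List[Annotation]) -> float:
    """Toy metric: share of sentences that match at least one reference exactly."""
    hits = sum(
        any(sentence == ref[i] for ref in references)
        for i, sentence in enumerate(hypothesis)
    )
    return hits / len(hypothesis)


def human_level_score(
    annotations: List[Annotation],
    metric: Callable[[Annotation, List[Annotation]], float],
) -> float:
    """Average the score of each annotator against all remaining annotators."""
    scores = []
    for i, held_out in enumerate(annotations):
        references = annotations[:i] + annotations[i + 1:]
        scores.append(metric(held_out, references))
    return sum(scores) / len(scores)


def relative_to_human(system_score: float, human_score: float) -> float:
    """System score as a fraction of the average human score."""
    return system_score / human_score


# Three toy annotators correcting two sentences (placeholder strings).
annotators = [["x1", "y1"], ["x1", "y2"], ["x2", "y2"]]
human = human_level_score(annotators, exact_match_rate)
```

In this toy data the three annotators score 0.5, 1.0, and 0.5 against the others, giving an average of 2/3. A system score divided by such an average yields the percentage of human-level performance quoted above.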