The CUED’s Grammatical Error Correction Systems for BEA-2019

We describe two entries from the Cambridge University Engineering Department to the BEA 2019 Shared Task on grammatical error correction. Our submission to the low-resource track is based on prior work on using finite state transducers together with strong neural language models. Our system for the restricted track is a purely neural system consisting of neural language models and neural machine translation models trained with back-translation, checkpoint averaging, and fine-tuning, without the help of any additional tools such as spell checkers. The latter system has been used inside a separate system combination entry in cooperation with the Cambridge University Computer Lab.


Introduction
The automatic correction of errors in text [In a such situaction → In such a situation] is receiving more and more attention from the natural language processing community. A series of competitions has been devoted to grammatical error correction (GEC): the CoNLL-2013 shared task (Ng et al., 2013), the CoNLL-2014 shared task (Ng et al., 2014), and finally the BEA 2019 shared task (Bryant et al., 2019). This paper presents the contributions from the Cambridge University Engineering Department to the latest GEC competition at the BEA 2019 workshop.
We submitted systems to two different tracks. The low-resource track did not permit the use of parallel training data except for a small development set with around 4K sentence pairs. For our low-resource system we extended our prior work on finite state transducer based GEC (Stahlberg et al., 2019) to handle new error types such as punctuation errors as well as insertions and deletions of a small number of frequent words. For the restricted track, the organizers provided 1.2M pairs (560K without identity mappings) of corrected and uncorrected sentences. Our goal on the restricted track was to explore the potential of purely neural models for grammatical error correction. We confirm the results of Kasewa et al. (2018) and report substantial gains by applying back-translation (Sennrich et al., 2016b) to GEC, a data augmentation technique common in machine translation. Furthermore, we noticed that large parts of the training data do not match the target domain. We mitigated the domain gap by over-sampling the in-domain training corpus, and by fine-tuning through continued training. Our final model is an ensemble of four neural machine translation (NMT) models and two neural language models (LMs) with Transformer architecture (Vaswani et al., 2017). Our purely neural system was also part of the joint submission with the Cambridge University Computer Lab described by Yuan et al. (2019).

FST-based Grammatical Error Correction
Stahlberg et al. (2019) investigated the use of finite state transducers (FSTs) for neural grammatical error correction. They proposed a cascade of FST compositions to construct a hypothesis space which is then rescored with a neural language model. We will outline this approach and explain our modifications in this section. For more details we refer to Stahlberg et al. (2019). In a first step, the source sentence is converted to an FST I (Fig. 1). This initial FST is augmented by composition (denoted with the •-operator) with various other FSTs to cover different error types. Composition is a widely used standard operation on FSTs and is supported efficiently by FST toolkits such as OpenFST (Allauzen et al., 2007). (Models will be published at http://ucam-smt.github.io/sgnmt/html/bea19_gec.html.) We construct the hypothesis space as follows:

arXiv:1907.00168v1 [cs.CL] 29 Jun 2019
1. We compose the input I with the deletion transducer D in Fig. 2. D allows tokens on the short list R shown in Tab. 1 to be deleted at a cost λ_del. We selected R by looking up all tokens which were deleted in the dev set more than five times and manually filtered that list slightly. We did not use the full list of dev set deletions to avoid under-estimating λ_del in tuning.
2. In the next step, we compose the transducer from step 1 with the edit transducer E in Fig. 3. This step addresses substitution errors such as spelling or morphology errors.

Figure 4: Insertion FST A for adding the symbols ",", "-", and "'s" at a cost of λ_ins. The σ-label matches any symbol and maps it to itself at no cost.
Like Stahlberg et al. (2019), we use the confusion sets of Bryant and Briscoe (2018) based on CyHunspell for spell checking (Rodriguez and Seal, 2014), the AGID morphology database for morphology errors (Atkinson, 2011), and manually defined corrections for determiner and preposition errors to construct E. Additionally, we extracted all substitution errors from the BEA-2019 dev set which occurred more than five times, and added a small number of manually defined rules that fix tokenization around punctuation symbols.
3. We found it challenging to allow insertions in LM-based GEC because the LM often prefers inserting words with high unigram probability, such as articles and prepositions, before less predictable words like proper names. We therefore restrict insertions to the three tokens ",", "-", and "'s" and allow only one insertion per sentence. We achieve this by adding the transducer A in Fig. 4 to our composition cascade.
4. Finally, we map the word-level FSTs to the subword level by composition with a mapping transducer T that applies byte pair encoding (BPE; Sennrich et al., 2016c) to the full words. Word-to-BPE mapping transducers have been used in prior work to combine word-level models with subword-level neural sequence models (Stahlberg et al., 2017a,b, 2018b, 2019).
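The word-to-subword mapping in step 4 can be illustrated with a toy greedy segmenter: split a word into the longest subword units found in a subword vocabulary, marking word-internal units with a continuation suffix. The vocabulary and the `@@` marker convention here are invented for illustration; the actual BPE merges in the paper are learned from data and applied via the transducer T.

```python
def segment(word, vocab, cont="@@"):
    """Greedy longest-match segmentation of `word` into subword units."""
    pieces = []
    start = 0
    while start < len(word):
        # Find the longest prefix of the remaining word that is in the vocab.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Fall back to a single character if nothing matches.
            pieces.append(word[start])
            start += 1
    # Mark all but the last piece as word-internal.
    return [p + cont for p in pieces[:-1]] + [pieces[-1]]

vocab = {"situ", "ation"}  # hypothetical subword vocabulary
print(segment("situation", vocab))  # ['situ@@', 'ation']
```

The mapping transducer T encodes exactly this kind of word-to-subword relation as FST arcs, so that word-level hypotheses can be rescored by subword-level neural models.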
In a more condensed form, we can describe the final transducer as

I • D • E • A • T (1)

with D for deletions, E for substitutions, A for insertions, and T for converting words to BPE tokens. Path scores in the FST in Eq. 1 are the accumulated penalties λ_del, λ_sub, and λ_ins. The λ-parameters are tuned on the dev set using a variant of Powell search (Powell, 1964). We apply standard FST operations like output projection, ε-removal, determinization, minimization, and weight pushing (Mohri, 1997; Mohri and Riley, 2001) to help downstream decoding. Following Stahlberg et al. (2019) we then use the resulting transducer to constrain a neural LM beam decoder.
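The penalty tuning can be sketched as follows. The paper uses a variant of Powell search; this simplified stand-in does plain coordinate descent over a value grid, which captures the same idea of optimising one penalty at a time against a dev-set metric. The objective `dev_score` is a hypothetical smooth surrogate; in the real system it would run the FST-constrained decoder and return the dev-set score.

```python
def dev_score(lambdas):
    # Hypothetical objective with a known optimum at (1.0, 0.5, 2.0);
    # the real objective decodes the dev set and computes ERRANT F0.5.
    target = (1.0, 0.5, 2.0)
    return -sum((l - t) ** 2 for l, t in zip(lambdas, target))

def coordinate_search(score_fn, init, grid, sweeps=3):
    """Maximise score_fn by sweeping a value grid one coordinate at a time."""
    best = list(init)
    for _ in range(sweeps):
        for i in range(len(best)):
            scored = []
            for v in grid:
                trial = best.copy()
                trial[i] = v
                scored.append((score_fn(trial), v))
            best[i] = max(scored)[1]
    return best

grid = [x / 4 for x in range(17)]  # candidate penalties 0.0, 0.25, ..., 4.0
lambdas = coordinate_search(dev_score, [0.0, 0.0, 0.0], grid)
print(lambdas)  # [1.0, 0.5, 2.0] for this toy objective
```

Powell search additionally constructs new search directions from past moves, but the coordinate-sweep structure above is the core of both methods.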

Results
We report M2 (Dahlmeier and Ng, 2012) scores on the CoNLL-2014 test set (Ng et al., 2014) and span-based ERRANT scores (Bryant et al., 2017) on the BEA-2019 dev set (Bryant et al., 2019). On CoNLL-2014 we compare with the best published results with a comparable amount of parallel training data. We refer to Bryant et al. (2019) for a full comparison of BEA-2019 systems. We tune our systems on BEA-2019 and only report the performance on CoNLL-2014 for comparison to prior work.
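Both M2 and span-based ERRANT report F0.5, which weights precision twice as heavily as recall. A minimal sketch of the formula, with illustrative counts:

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F_beta score from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A system making few but accurate corrections still scores well under F0.5:
# precision 0.8, recall 0.4 -> F0.5 = 2/3.
print(round(f_beta(tp=40, fp=10, fn=60), 4))  # 0.6667
```

This precision bias is why design choices that trade recall for precision (e.g. keeping identity mappings in the training data, see below) can pay off under F0.5.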
Tab. 2 summarizes our low-resource experiments. Our substitution-only system already outperforms the prior work of Stahlberg et al. (2019). Allowing deletions and insertions improves the ERRANT score on BEA-2019 Dev by 2.57 points. We report further gains on both test sets by ensembling two language models and increasing the beam size.

Restricted Track Submission
In contrast to our low-resource submission, our restricted system relies entirely on neural models and does not use any external NLP tools, spell checkers, or hand-crafted confusion sets. For simplicity, we also chose to use standard implementations (Vaswani et al., 2018) of standard Transformer (Vaswani et al., 2017) models with standard hyper-parameters. This makes our final system easy to deploy as it is a simple ensemble of standard neural models with minimal preprocessing (subword segmentation). Our contributions on this track focus on NMT training techniques such as over-sampling, back-translation, and fine-tuning. We show that over-sampling effectively reduces domain mismatch. We found back-translation (Sennrich et al., 2016b) to be a very effective technique to utilize unannotated training data. However, while over-sampling is commonly used in machine translation to balance the number of real and back-translated training sentences, we report that using over-sampling this way for GEC hurts performance. Finally, we propose a combination of checkpoint averaging (Junczys-Dowmunt et al., 2016) and continued training to adapt our NMT models to the target domain.

Experimental Setup
We use neural LMs and neural machine translation (NMT) models in our restricted track entry. Our neural LM is as described in Sec. 2.2. Our LMs and NMT models share the same subword segmentation. We perform exploratory NMT experiments with the BASE setup, but switch to the BIG setup for our final models. Tab. 4 shows the differences between the two setups. Tab. 5 lists some corpus statistics for the BEA-2019 training sets. In our experiments without fine-tuning we decode with the average of the 20 most recent checkpoints (Junczys-Dowmunt et al., 2016). We use the SGNMT decoder (Stahlberg et al., 2017b, 2018b) in all our experiments.
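Checkpoint averaging takes the element-wise mean of the parameters of the N most recent checkpoints. A minimal sketch, with plain Python floats standing in for the parameter tensors of real checkpoints:

```python
def average_checkpoints(checkpoints):
    """Average a list of parameter dicts (parameter name -> list of floats)."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }

# Two toy "checkpoints" with parameters w and b:
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [0.2]},
]
print(average_checkpoints(ckpts))  # {'w': [2.0, 3.0], 'b': [0.1]}
```

Averaging smooths out the noise of late-training updates and typically yields a small but consistent quality gain over the last single checkpoint.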
In-domain corpus over-sampling The BEA-2019 training corpora (Tab. 5) differ significantly not only in size but also in their closeness to the target domain. The W&I+LOCNESS corpus is most similar to the BEA-2019 dev and test sets in terms of domains and the distribution over English language proficiency, but only consists of 34K sentence pairs. To increase the importance of in-domain training samples we over-sampled the W&I+LOCNESS corpus at different rates.
Tab. 6 shows that over-sampling by a factor of 4 (i.e. adding the W&I+LOCNESS corpus four times to the training set) improves the ERRANT F_0.5 score by 2.2 points on the BEA-2019 dev set and does not lead to substantial losses on the CoNLL-2014 test set. We will over-sample the W&I+LOCNESS corpus by a factor of four in all subsequent experiments.
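Over-sampling amounts to simply concatenating the in-domain corpus to the training data multiple times, which raises the probability that a training batch contains in-domain examples. A sketch with invented toy corpus sizes:

```python
def oversample(in_domain, other, factor):
    """Repeat the in-domain corpus `factor` times and append the other data."""
    return in_domain * factor + other

# Hypothetical toy corpora (the real in-domain corpus has 34K pairs):
in_domain = [("src%d" % i, "tgt%d" % i) for i in range(3)]
other = [("o%d" % i, "p%d" % i) for i in range(10)]

train = oversample(in_domain, other, factor=4)
print(len(train))  # 3 * 4 + 10 = 22
```

Each in-domain pair now appears four times in the shuffled training stream, so its effective weight in the training objective is quadrupled.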
Removing identity mappings Previous work often suggested removing unchanged sentences (i.e. pairs in which source and target sentences are equal) from the training corpora (Stahlberg et al., 2019; Zhao et al., 2019; Grundkiewicz and Junczys-Dowmunt, 2018). We note that removing these identity mappings can be seen as a measure to control the balance between precision and recall. As shown in Tab. 7, removing identities encourages the model to make more corrections and thus leads to higher recall but lower precision. Whether this results in an improvement in F_0.5 score depends on the test set. For the subsequent experiments we found that removing identities in the parallel training corpora, but not in the back-translated synthetic data, works well in practice.
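The filtering step itself is a one-line predicate over the corpus, sketched here with invented example pairs:

```python
def remove_identities(pairs):
    """Keep only pairs where the source actually differs from the target."""
    return [(src, tgt) for src, tgt in pairs if src != tgt]

pairs = [
    ("In a such situaction", "In such a situation"),
    ("No errors here .", "No errors here ."),  # identity pair, removed
]
filtered = remove_identities(pairs)
print(len(filtered))  # 1
```

In our setup this filter is applied to the authentic parallel corpora only; the back-translated synthetic data keeps its (rare) identity pairs.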
Back-translation Back-translation (Sennrich et al., 2016b) has become the most widely used technique to exploit monolingual data in neural machine translation. Back-translation extends the existing parallel training set by additional training samples with real English target sentences but synthetic source sentences. Different methods have been proposed to synthesize the source sentence, such as using dummy tokens (Sennrich et al., 2016b), copying the target sentence (Currey et al., 2017), or sampling from or decoding with a reverse sequence-to-sequence model (Sennrich et al., 2016b; Edunov et al., 2018; Kasewa et al., 2018). The most popular approach is to generate the synthetic source sentences with a reverse model, trained to transform target into source sentences, using beam search. In GEC, this means that the reverse model learns to introduce errors into a correct English sentence. Back-translation has been applied successfully to GEC by Kasewa et al. (2018). We confirm the effectiveness of back-translation in GEC and discuss some of the differences between applying this technique to grammatical error correction and machine translation.
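The data-generation step can be sketched as follows: a reverse "corruption" model maps a clean English sentence to a synthetic errorful source, and the clean sentence becomes the target. The deterministic rule below (dropping articles) is an invented stand-in for the reverse NMT model decoded with beam search.

```python
ARTICLES = {"a", "an", "the"}

def corrupt(sentence):
    """Invented stand-in for the reverse model: delete all articles."""
    return " ".join(t for t in sentence.split() if t.lower() not in ARTICLES)

def back_translate(monolingual):
    """Build synthetic (errorful source, clean target) training pairs."""
    return [(corrupt(sent), sent) for sent in monolingual]

mono = ["The cat sat on the mat .", "She bought a new car ."]
for src, tgt in back_translate(mono):
    print(src, "->", tgt)
```

The synthetic pairs are then mixed with the authentic parallel data; the target side is always real, well-formed English, which is what makes the technique effective.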
Our experiments with back-translation are summarized in Tab. 8. Adding 1M synthetic sentences to the training data already yields very substantial gains on both test sets. We achieve our best results with 5M synthetic sentences (+8.44 on BEA-2019 Dev). In machine translation, it is important to maintain a balance between authentic and synthetic data (Sennrich et al., 2016a,b; Poncelas et al., 2018). Over-sampling the real data is a common practice to rectify that ratio if large amounts of synthetic data are available. Interestingly, over-sampling real data in GEC hurts performance (row 3 vs. 5 in Tab. 8), and it is possible to mix real and synthetic sentences at a ratio of 1:7.9 (last three rows in Tab. 8). We will proceed with the 5M setup for the remainder of this paper.
Fine-tuning As explained previously, we over-sample the W&I+LOCNESS corpus by a factor of 4 to mitigate the domain gap between the training set and the BEA-2019 dev and test sets. To further adapt our system to the target domain, we fine-tune our models through continued training on W&I+LOCNESS combined with checkpoint averaging (Tab. 9).

Final system Tab. 10 contains our experiments with the BIG configuration.
In addition to W&I+LOCNESS over-sampling, back-translation with 5M sentences, and fine-tuning with checkpoint averaging, we report further gains by adding the language models from our low-resource system (Sec. 2.2) and ensembling. Our best system (4 NMT models, 2 language models) achieves 58.9 M2 on CoNLL-2014, which is 2.25 points worse than the best published result on that test set (Zhao et al., 2019). However, we note that we have tailored our system towards the BEA-2019 dev set and not the CoNLL-2013 or CoNLL-2014 test sets. As we argued in Sec. 2.4, our results throughout this work strongly suggest that the optimal system parameters for these test sets are very different from each other, and that our final system settings are not optimal for CoNLL-2014. We also note that unlike the system of Zhao et al. (2019), our system for the restricted track does not use spell checkers or other NLP tools but relies solely on neural sequence models.
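One common way to combine the scores of several models at decoding time is to average their per-step log-probabilities over the vocabulary before extending beam hypotheses. The tiny hand-written distributions below are invented for illustration; in our system the scores would come from the 4 NMT models and 2 LMs.

```python
import math

def ensemble_logprobs(model_logprobs):
    """Average per-token log-probabilities across an ensemble of models."""
    n = len(model_logprobs)
    return {
        tok: sum(m[tok] for m in model_logprobs) / n
        for tok in model_logprobs[0]
    }

# Two toy models scoring a two-token vocabulary at one decoding step:
m1 = {"a": math.log(0.7), "b": math.log(0.3)}
m2 = {"a": math.log(0.6), "b": math.log(0.4)}

combined = ensemble_logprobs([m1, m2])
best = max(combined, key=combined.get)
print(best)  # 'a'
```

Averaging in log space corresponds to a geometric mean of the model distributions, which is the standard combination scheme in ensemble beam decoding.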

Conclusion
We participated in the BEA 2019 Shared Task on grammatical error correction with submissions to the low-resource and the restricted track. Our low-resource system is an extension of prior work on FST-based GEC (Stahlberg et al., 2019) to allow insertions and deletions. Our restricted track submission is a purely neural system based on standard NMT and LM architectures. We pointed out the similarity between GEC and machine translation, and demonstrated that several techniques which originate from MT research, such as over-sampling, back-translation, and fine-tuning, are also useful for GEC. Our models have been used in a joint submission with the Cambridge University Computer Lab (Yuan et al., 2019).

Figure 1: Input FST I representing the source sentence 'In a such situaction there is no other way.' We follow standard convention and highlight the start state in bold and the final state with a double circle.

Figure 2: Deletion FST D which can map any token in the list R from Tab. 1 to ε. The σ-label matches any symbol and maps it to itself.

Figure 3: Edit FST E which allows substitutions at a cost of λ_sub. The σ-label matches any symbol and maps it to itself at no cost.

Figure 5: Span-based ERRANT F_0.5 scores on the BEA-2019 dev set over the number of fine-tuning training iterations (single GPU, SGD delay factor (Saunders et al., 2018) of 16).

Table 1: List of tokens R that can be deleted by the deletion transducer D in Fig. 2.

Table 2: Results on the low-resource track. The λ-parameters are tuned on the BEA-2019 dev set.

Table 3: Number of correction types in CoNLL-2014 and BEA-2019 Dev references.

Table 4: NMT setups BASE and BIG used in our experiments for the restricted track.

Table 5: BEA-2019 parallel training data with and without removing pairs where source and target sentences are the same.

Table 6: Over-sampling the BEA-2019 in-domain corpus W&I+LOCNESS under BASE models. The second column contains the ratio of W&I+LOCNESS samples to training samples from the other corpora.

Table 7: Impact of identity removal on BASE models.

Table 8: Using back-translation for GEC (BASE models). The third column contains the ratio between real and synthetic sentence pairs.

Table 9: Fine-tuning through continued training on W&I+LOCNESS and checkpoint averaging with a BASE model with 5M back-translated sentences.

Table 10: Final results on the restricted track with BIG models and back-translation.