Seq2Edits: Sequence Transduction Using Span-level Edit Operations

We propose Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We evaluate our method on five NLP tasks (text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. For grammatical error correction, our method speeds up inference by up to 5.2x compared to full sequence models because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, our approach improves explainability by associating each edit operation with a human-readable tag.


Introduction
Neural models that generate a target sequence conditioned on a source sequence were initially proposed for machine translation (MT) (Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2015; Vaswani et al., 2017), but are now used widely as a central component of a variety of NLP systems (e.g. Tan et al. (2017); Chollampatt and Ng (2018)). Raffel et al. (2019) argue that even problems that are traditionally not viewed from a sequence transduction perspective can benefit from massive pre-training when framed as a text-to-text problem. However, for many NLP tasks such as correcting grammatical errors in a sentence, the input and output sequence may overlap significantly. Employing a full sequence model in these cases is often wasteful as most tokens are simply copied over from the input to the output.
Another disadvantage of a full sequence model is that it does not provide an explanation for why it proposes a particular target sequence.
In this work, inspired by a recent increased interest in text-editing (Dong et al., 2019;Malmi et al., 2019;Mallinson et al., 2020;Awasthi et al., 2019), we propose Seq2Edits, a sequence editing model which is tailored towards problems that require only small changes to the input. Rather than generating the target sentence as a series of tokens, our model predicts a sequence of edit operations that, when applied to the source sentence, yields the target sentence. Each edit operates on a span in the source sentence and either copies, deletes, or replaces it with one or more target tokens. Edits are generated auto-regressively from left to right using a modified Transformer (Vaswani et al., 2017) architecture to facilitate learning of long-range dependencies. We apply our edit operation based model to five NLP tasks: text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction (GEC). Our model is competitive across all of these tasks, and improves the state-of-the-art on text normalization (Sproat and Jaitly, 2016), sentence splitting & rephrasing (Botha et al., 2018), and the JFLEG test set (Napoles et al., 2017) for GEC.
Our model is often much faster than a full sequence model for these tasks because its runtime depends on the number of edits rather than the target sentence length. For instance, we report speed-ups of >5x on GEC for native English in initial experiments. If applicable, we also predict a task-specific edit-type class ("tag") along with each edit which explains why that edit was proposed. For example, in GEC, the correction of a misspelled word would be labelled with a SPELL (spelling error) tag, whereas changing a word from, say, first person to third person would be associated with a tag such as SVA (subject-verb agreement error).

Figure 1: Representing grammatical error correction as a sequence of span-based edit operations. The implicit start position for a source span is the end position of the previous edit operation. SELF indicates spans that are copied over from the source sentence (x). The probability of the first two edits is given by: P(After many years ,|x) = P(t_1 = SELF|x) · P(p_1 = 3|SELF, x) · P(r_1 = SELF|SELF, 3, x) · P(t_2 = PUNCT|SELF, 3, SELF, x) · P(p_2 = 3|SELF, 3, SELF, PUNCT, x) · P(r_2 = ,|SELF, 3, SELF, PUNCT, 3, x).
Edit-based Sequence Transduction

Representation
A vanilla sequence-to-sequence (seq2seq) model generates a plain target sequence y = y_1^J = y_1, y_2, ..., y_J ∈ V^J of length J given a source sequence x = x_1^I = x_1, x_2, ..., x_I ∈ V^I of length I over a vocabulary V of tokens (e.g. subword units (Sennrich et al., 2016)). For example, in our running grammar correction example in Fig. 1, the source sequence is the ungrammatical sentence x = "After many years he still dream to become a super hero ." and the target sequence is the corrected sentence y = "After many years , he still dreams of becoming a super hero .". The probability P(y|x) is factorized using the chain rule:

P(y|x) = ∏_{j=1}^{J} P(y_j | y_1^{j-1}, x).

Instead of predicting the target sequence y directly, the Seq2Edits model predicts a sequence of N edit operations. Each edit operation (t_n, p_n, r_n) ∈ T × N_0 × V is a 3-tuple that represents the action of replacing the span from positions p_{n-1} to p_n in the source sentence with the replacement token r_n, associated with an explainable tag t_n (T is the tag vocabulary). t_n = r_n = SELF indicates that the source span is kept as-is. Insertions are modelled with p_n = p_{n-1}, which corresponds to an empty source span (see the insertion of "," in Fig. 1); deletions are represented with a special token r_n = DEL. The edit operation sequence for our running example is shown in Fig. 1. The target sequence y can be obtained from the edit operation sequence using Algorithm 1:

Algorithm 1: Converting an edit sequence into the target sequence
1: y ← ()
2: p_0 ← 0
3: for n = 1 to N do
4:   if r_n = SELF then
5:     y ← concat(y, x_{p_{n-1}+1}^{p_n})
6:   else if r_n ≠ DEL then
7:     y ← concat(y, r_n)
8:   end if
9: end for
10: return y

Our motivation behind using span-level edits rather than token-level edits is that the representations are much more compact and easier to learn, since local dependencies (within a span) are easier to capture. For some of the tasks it is also more natural to approach the problem at the span level: a grammatical error is often fixed with more than one (sub)word, and span-level edits retain the language modelling aspect within a span.
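To make the representation concrete, here is a minimal Python sketch of applying a span-level edit sequence (the procedure described by Algorithm 1) to the running example from Fig. 1. The function name and tuple layout are our own illustrations rather than the paper's released code, and the tags on the last two substitutions are plausible guesses rather than the exact tags from the figure.

```python
SELF, DEL = "SELF", "DEL"

def apply_edits(source, edits):
    """Reconstruct the target token list from (tag, span_end, replacement) edits.

    The implicit span start is the end position of the previous edit.
    """
    target, prev_end = [], 0
    for tag, span_end, replacement in edits:
        if replacement == SELF:          # copy the source span unchanged
            target.extend(source[prev_end:span_end])
        elif replacement != DEL:         # substitute (an empty span models insertion)
            target.extend(replacement.split())
        prev_end = span_end
    return target

source = "After many years he still dream to become a super hero .".split()
edits = [
    ("SELF", 3, SELF),           # keep "After many years"
    ("PUNCT", 3, ","),           # insert "," (empty span 3..3)
    ("SELF", 5, SELF),           # keep "he still"
    ("SVA", 6, "dreams"),        # "dream" -> "dreams"
    ("VERB", 8, "of becoming"),  # "to become" -> "of becoming" (tag is illustrative)
    ("SELF", 12, SELF),          # keep "a super hero ."
]
corrected = " ".join(apply_edits(source, edits))
```

Applying the six edits reproduces the corrected sentence of the running example.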
Our representation is flexible as it can represent any sequence pair. As an example, a trivial (but not practical) way to construct an edit sequence for any pair (x, y) is to start with a deletion for the entire source sentence x (p 1 = I, r 1 = DEL) and then insert the tokens in y (p j+1 = I, r j+1 = y j for j ∈ [1, J]).
Edit sequences are valid iff spans are in a monotonic left-to-right order and the final span ends at the end of the source sequence, i.e.:

0 ≤ p_1 ≤ p_2 ≤ ... ≤ p_N = I.

None of our models produced invalid sequences at inference time, even though we did not implement any feasibility constraints.
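As a sketch, this validity condition can be checked in a few lines of Python (the helper name and tuple layout are our assumptions):

```python
def is_valid_edit_sequence(edits, source_len):
    """Check that span ends are monotonic and the final span ends at I.

    edits: list of (tag, span_end, replacement) tuples.
    """
    prev_end = 0
    for _, span_end, _ in edits:
        if span_end < prev_end:       # spans must proceed left to right
            return False
        prev_end = span_end
    return prev_end == source_len     # the final span must end at the source end
```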

Inference
The output of the edit operation model is a sequence of 3-tuples rather than a sequence of tokens. The probability of the output is computed as:

P((t, p, r)_1^N | x) = ∏_{n=1}^{N} P(t_n, p_n, r_n | (t, p, r)_1^{n-1}, x).

For inference, we factorize the conditional probabilities further as:

P(t_n, p_n, r_n | (t, p, r)_1^{n-1}, x) = P(t_n | ·) · P(p_n | t_n, ·) · P(r_n | t_n, p_n, ·),

where · abbreviates the history (t, p, r)_1^{n-1}, x. The decoding problem can thus be written as a flat product of interleaved conditional probabilities that correspond to tag, span, and replacement predictions:

arg max_{(t,p,r)_1^N} ∏_{n=1}^{N} P(t_n | ·) · P(p_n | t_n, ·) · P(r_n | t_n, p_n, ·).

At inference time we perform beam decoding over this flat factorization to search for the most likely edit operation sequence. In practice, we scale the scores for the different target features:

∏_{n=1}^{N} P(t_n | ·)^{λ_t} · P(p_n | t_n, ·)^{λ_p} · P(r_n | t_n, p_n, ·)^{λ_r},

where the three scaling factors λ_t, λ_p, λ_r are optimized on the respective development set.
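The scaled beam score amounts to a weighted sum of log-probabilities over the interleaved predictions. The sketch below illustrates this under our own naming (the λ symbols mirror the text; the function name and toy probabilities are illustrative):

```python
import math

def hypothesis_score(steps, lam_t=1.0, lam_p=1.0, lam_r=1.0):
    """Score a hypothesis of edit steps with per-feature scaling factors.

    steps: list of (p_tag, p_span, p_repl) conditional probabilities per edit.
    """
    score = 0.0
    for p_tag, p_span, p_repl in steps:
        score += lam_t * math.log(p_tag)   # tag prediction
        score += lam_p * math.log(p_span)  # span end position prediction
        score += lam_r * math.log(p_repl)  # replacement token prediction
    return score

# Down-weighting the span feature can re-rank two otherwise close hypotheses.
h1 = [(0.9, 0.5, 0.9)]
h2 = [(0.8, 0.8, 0.8)]
```

With the default weights h2 outscores h1, but down-weighting the span term (λ_p = 0.2) flips the ranking, which is what tuning these factors on a development set exploits.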

Neural Architecture
Our neural model (illustrated in Fig. 2) is a generalization of the original Transformer architecture of Vaswani et al. (2017). Similarly to the standard Transformer, we feed back the predictions of the previous time step into the Transformer decoder (A). The feedback loop at time step n is implemented as the concatenation of an embedding of t_{n-1}, the p_{n-1}-th encoder state, and an embedding of r_{n-1}. The Transformer decoder A is followed by a cascade of tag prediction and span end position prediction. We follow the idea of pointer networks to predict the source span end position, using the attention weights over encoder states as probabilities. The inputs to the pointer network are the encoder states (keys and values) and the output of the previous decoder layer (queries). The span end position prediction module is a Transformer-style (Vaswani et al., 2017) single-head attention ("scaled dot-product") layer over the encoder states:

P(p_n | ·) = softmax(QK^T / √d),

where Q is a d-dimensional linear transform of the previous decoder layer output at time step n, K ∈ R^{I×d} is a linear transform of the encoder states, and d is the number of hidden units. A 6-dimensional embedding of the predicted tag t_n and the encoder state corresponding to the source span end position p_n are fed into yet another Transformer decoder (B) that predicts the replacement token r_n. Alternatively, one can view A and B as a single Transformer decoder layer stack in which the tag prediction and the span end position prediction are added as additional layers in the middle of that stack. We connect the decoders A and B with residual connections (He et al., 2016) to facilitate learning.

Figure 2: Seq2Edits consists of a Transformer (Vaswani et al., 2017) encoder and a Transformer decoder that is divided horizontally into two parts (A and B). The tag and span predictions are located in the middle of the decoder layer stack between both parts. A single step of prediction is shown.
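The pointer layer can be sketched with NumPy: the softmax-normalized scaled dot-product scores over the I encoder states serve directly as the distribution P(p_n). The random projection matrices below are stand-ins for the learned linear transforms, and all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
I, d = 7, 16                                  # source length, number of hidden units
encoder_states = rng.normal(size=(I, d))
decoder_out = rng.normal(size=(d,))           # previous decoder layer output at step n
W_q = rng.normal(size=(d, d))                 # stand-in for the learned query transform
W_k = rng.normal(size=(d, d))                 # stand-in for the learned key transform

q = decoder_out @ W_q                         # query Q (one vector per decoding step)
K = encoder_states @ W_k                      # keys, one row per encoder state
logits = K @ q / np.sqrt(d)                   # scaled dot-product scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: distribution over span end positions
span_end = int(np.argmax(probs))              # greedy choice of p_n
```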
The network is trained by optimizing the sum of three cross-entropies that correspond to tag prediction, span prediction, and replacement token prediction, respectively. In our experiments without tags, we leave out the tag prediction layer and the loss computed from it. We experiment with two different model sizes: "Base" and "Big". The hyper-parameters for both configurations are summarized in Table 1. Hyper-parameters not listed here are borrowed from the transformer_clean_base and transformer_clean_big configurations in the Tensor2Tensor toolkit (Vaswani et al., 2018).
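A toy sketch of this training objective (function names and the uniform probability vectors are illustrative, not the actual implementation):

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class under a probability vector."""
    return -math.log(probs[gold_index])

def seq2edits_loss(tag_probs, span_probs, repl_probs, gold, use_tags=True):
    """Sum of the three cross-entropies for one edit step.

    gold: (tag_index, span_index, replacement_index).
    """
    t, p, r = gold
    loss = cross_entropy(span_probs, p) + cross_entropy(repl_probs, r)
    if use_tags:                      # the tag term is dropped in the no-tag ablation
        loss += cross_entropy(tag_probs, t)
    return loss

uniform4 = [0.25] * 4
loss_with_tags = seq2edits_loss(uniform4, uniform4, uniform4, (0, 1, 2))
loss_without = seq2edits_loss(uniform4, uniform4, uniform4, (0, 1, 2), use_tags=False)
```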

Experiments
We evaluate our edit model on five NLP tasks:

• Text normalization for speech applications (Sproat and Jaitly, 2016) - converting number expressions such as "123" to their verbalizations (e.g. "one two three" or "one hundred twenty three") depending on the context.

• Sentence fusion (Geva et al., 2019) - merging two independent sentences into a single coherent one, e.g. "I need his spirit to be free. I can leave my body." → "I need his spirit to be free so that I can leave my body."

• Sentence splitting & rephrasing (Botha et al., 2018) - splitting a long sentence into two fluent sentences, e.g. "Bo Saris was born in Venlo, Netherlands, and now resides in London, England." → "Bo Saris was born in Venlo, Netherlands. He currently resides in London, England."

• Text simplification (Zhang and Lapata, 2017) - reducing the linguistic complexity of text, e.g. "The family name is derived from the genus Vitis." → "The family name comes from the genus Vitis."

• Grammatical error correction (Ng et al., 2014; Napoles et al., 2017; Bryant et al., 2019) - correcting grammatical errors in written text, e.g. "In a such situaction" → "In such a situation".
Our models are trained on packed examples (maximum length: 256) with Adafactor (Shazeer and Stern, 2018) using the Tensor2Tensor (Vaswani et al., 2018) library. We report results both with and without pre-training. Our pre-trained models for all tasks are trained for 1M iterations on 170M sentences extracted from English Wikipedia revisions and 176M sentences from English Wikipedia round-trip translated via German (Lichtarge et al., 2019). For grammatical error correction we use ERRANT (Bryant et al., 2017;Felice et al., 2016) to derive span-level edits from the parallel text. On other tasks we use a minimum edit distance heuristic to find a token-level edit sequence and convert it to span-level edits by merging neighboring edits.
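For the non-GEC tasks, the edit-distance-plus-merging heuristic can be approximated in a few lines. The sketch below uses Python's difflib (whose opcodes already group contiguous replace/insert/delete runs into spans) rather than the paper's exact alignment heuristic, and the tag names come from the trivial tag set:

```python
import difflib

def span_edits(source, target):
    """Derive (tag, span_end, replacement) edits aligning two token lists."""
    edits = []
    sm = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            edits.append(("SELF", i2, "SELF"))
        else:  # replace / delete / insert all become one span-level substitution
            replacement = " ".join(target[j1:j2]) if j2 > j1 else "DEL"
            edits.append(("NON_SELF", i2, replacement))
    return edits

def apply_edits(source, edits):
    """Inverse operation: rebuild the target from the edit sequence."""
    out, prev = [], 0
    for _, end, repl in edits:
        if repl == "SELF":
            out.extend(source[prev:end])
        elif repl != "DEL":
            out.extend(repl.split())
        prev = end
    return out

src = "After many years he still dream to become a super hero .".split()
tgt = "After many years , he still dreams of becoming a super hero .".split()
edits = span_edits(src, tgt)
```

Round-tripping through apply_edits recovers the target, which is the property the training data extraction relies on.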
The task-specific data is described in Table 2. The number of iterations in task-specific training is set empirically based on the performance on the development set (between 20K-75K for fine-tuning, between 100K-300K for training from scratch). Fine-tuning is performed with a reduced learning rate of 3 × 10^-5. For each task we tune a different set of decoding parameters (Table 3), such as the λ-parameters from Sec. 2.2, on the respective development sets. For text simplification and grammatical error correction, we perform multiple beam search passes (between two and four) by feeding back the output of beam search as input to the next beam search pass. This is very similar to previously used iterative decoding strategies, with the difference that we pass n-best lists between beam search passes rather than only the single best hypothesis. During iterative refinement we follow Lichtarge et al. (2019) and multiply the score of the identity mapping by a tunable identity penalty to better control the trade-off between precision and recall. We use a beam size of 12 in our rescoring experiments in Table 10 and a beam size of 4 otherwise.

Table 4: Single model results. For metrics marked with "↑" (SARI, P(recision), R(ecall), F_0.5) high scores are favorable, whereas the sentence error rate (SER) is marked with "↓" to indicate the preference for low values. Tuning refers to optimizing the decoding parameters listed in Table 3 on the development sets.

Table 4 gives an overview of our results on all tasks. The tag set consists of semiotic class labels (Sproat and Jaitly, 2016) for text normalization, discourse type (Geva et al., 2019) for sentence fusion, and ERRANT (Bryant et al., 2017; Felice et al., 2016) error tags for grammatical error correction. (Appendix A lists the full task-specific tag vocabularies.) For the other tasks we use a trivial tag set: SELF, NON_SELF, and EOS (end of sequence). We report sentence error rates (SER↓) for text normalization; SARI↑ scores (Xu et al., 2016) for sentence fusion, splitting, and simplification; and ERRANT (Bryant et al., 2017) span-based P(recision)↑, R(ecall)↑, and F_0.5↑ scores on the BEA development set for grammar correction. Text normalization is not amenable to our form of pre-training as it does not use subword units and aims to generate verbalizations rather than text like that in our pre-training data. All other tasks, however, benefit greatly from pre-training (compare rows a & b with rows c & d in Table 4).
Pre-training yields large gains for grammar correction as the pre-training data was specifically collected for this task (Lichtarge et al., 2019). Tuning the decoding parameters (listed in Table 3) gives improvements for tasks like simplification, but is less crucial for sentence fusion or splitting (compare rows c & d with rows e & f in Table 4). Using tags is especially effective if non-trivial tags are available (compare rows a with b, c with d, and e with f for text normalization and grammar correction).
We next situate our best results (big models with pre-training in rows e and f of Table 4) in the context of related work.

Text Normalization
Natural sounding speech synthesis requires the correct pronunciation of numbers based on context, e.g. whether the string 123 should be spoken as one hundred twenty three or one two three. Text normalization converts the textual representation of numbers or other semiotic classes such as abbreviations to their spoken form while conveying both meaning and morphology (Zhang et al., 2019). The problem is highly context-dependent as abbreviations and numbers often have different feasible vocalizations. Context-dependence is even more pronounced in languages like Russian in which the number words need to be inflected to preserve agreement with context words. We trained our models on the English and Russian data provided by Sproat and Jaitly (2016). Similarly to others (Zhang et al., 2019; Sproat and Jaitly, 2016) we use characters on the input but full words on the output side. Table 5 shows that our models perform favourably when compared to other systems from the literature on both English and Russian. Note that most existing neural text normalization models (Zhang et al., 2019; Sproat and Jaitly, 2016) require pre-tokenized input, whereas our edit model operates on the untokenized input character sequence. This makes our method attractive for low-resource languages where high-quality tokenizers may not be available.

Sentence Fusion
Sentence fusion is the task of merging two independent sentences into a single coherent text and has applications in several NLP areas such as dialogue systems and question answering (Geva et al., 2019). Our model is on par with the FELIX tagger (Mallinson et al., 2020).

Sentence Splitting & Rephrasing
Sentence splitting is the inverse task of sentence fusion: Split a long sentence into two fluent shorter sentences. Our models are trained on the WikiSplit dataset (Botha et al., 2018) extracted from the edit history of Wikipedia articles. In addition to SARI scores we report the number of exact matches in Table 7. Our edit-based model achieves a higher number of exact matches and a higher SARI score compared to prior work on sentence splitting.

Text Simplification
Our text simplification training set (the WikiLarge corpus (Zhang and Lapata, 2017)) consists of 296K examples, the smallest amongst all our training corpora. Table 8 shows that our model is competitive, demonstrating that it can benefit from even limited quantities of training data. However, it does not improve the state of the art on this test set.

Grammatical Error Correction
For grammatical error correction we follow a multi-stage fine-tuning setup. After training on the common pre-training data described above, we fine-tune for 30K steps on public GEC training data. However, the training data used in the literature varies vastly from system to system, which limits comparability, as (synthetic) data significantly impacts system performance (Grundkiewicz et al., 2019). Table 9 compares our approach with the best single model results reported in the literature. Our model tends to have lower precision but higher recall than other systems. We are able to achieve the highest GLEU score on the JFLEG test set (Napoles et al., 2017).
To further improve performance, we applied two techniques commonly used for grammatical error correction (Grundkiewicz et al., 2019, inter alia): ensembling and rescoring. Table 10 compares our 5-ensemble with other ensembles in the literature. Rescoring the n-best list from the edit model with a big full sequence Transformer model yields significant gains, outperforming all other systems in Table 10 on BEA-test and JFLEG. This resembles the inverse setup of Chollampatt and Ng (2018), who used edit features to rescore a full sequence model.

One of our initial goals was to avoid the wasteful computation of full sequence models when applied to tasks like grammatical error correction with a high degree of copying. Table 11 summarizes CPU decoding times on an Intel Xeon W-2135 processor (12-core, 3.7 GHz). We break down the measurements by English proficiency level. The full sequence baseline slows down for higher proficiency levels as sentences tend to be longer (second column of Table 11). In contrast, our edit operation based approach is faster because it does not depend on the target sequence length but instead on the number of edits, which is usually small for advanced and native English speakers. We report speed-ups of 4.7-4.8x in these cases without using tags. When using tags, we implemented the following simple heuristics to improve the runtime ("shortcuts"): 1) If the tag t_n = SELF is predicted, directly set r_n = SELF and skip the replacement token prediction. 2) If the tag t_n = EOS is predicted, set p_n = I and r_n = EOS and skip both the span end position and replacement token predictions. These shortcuts do not affect the results in practice but provide speed-ups of 5.2x for advanced and native English compared to a full sequence model.
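The two shortcuts amount to an early exit in the per-step decoding loop. A minimal sketch, where the three predictor callables stand in for the model's tag, span, and replacement heads (names are ours):

```python
def decode_step(predict_tag, predict_span, predict_repl, source_len):
    """One edit prediction with the SELF/EOS shortcuts applied."""
    tag = predict_tag()
    if tag == "EOS":                    # shortcut 2: skip span and replacement heads
        return tag, source_len, "EOS"
    span_end = predict_span()
    if tag == "SELF":                   # shortcut 1: skip the replacement head
        return tag, span_end, "SELF"
    return tag, span_end, predict_repl()

# Count how often the skippable heads actually run.
calls = {"span": 0, "repl": 0}
def span_head():
    calls["span"] += 1
    return 3
def repl_head():
    calls["repl"] += 1
    return "dreams"

self_step = decode_step(lambda: "SELF", span_head, repl_head, source_len=12)
eos_step = decode_step(lambda: "EOS", span_head, repl_head, source_len=12)
```

After the two calls above, the replacement head has never run, which is where the extra speed-up for copy-heavy (advanced/native) input comes from.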
The speed-ups of our approach are mainly due to the shorter target sequence length compared to a full sequence model. However, our inference scheme in Sec. 2.2 still needs three times as many time steps as there are edits, i.e. around 4.0 × 3 = 12 for native English (last row in Table 11). The observed speed-ups of 4.0x-5.2x are even larger than we would expect based on an average source length of I = 26.6. One reason for the larger speed-ups is that the decoding runtime under the Transformer is quadratic in the output length, not linear. Another reason is that although the three elements are predicted sequentially, not every prediction step is equally expensive: the softmax for the tag and span predictions is computed over only a handful of elements, not over the full 32K subword vocabulary. Furthermore, efficient decoder implementations could reuse most of the computation done for the tag prediction for the span position prediction.

Oracle Experiments
Our model can be viewed from a multi-task perspective since it tries to predict three different features (tag, span position, and replacement token).
To better understand the contributions of these different components we partially constrained them using the gold references for both the text normalization and grammatical error correction tasks. We avoid constraining the number of edits (N) in these oracle experiments by giving the constrained decoder the option to repeat labels in the reference. Table 12 shows that having access to the gold tags and/or span positions greatly improves performance. We hypothesize that these gains can be largely attributed to the resolution of confusion between self and non-self. An interesting outlier is text normalization on Russian, which benefits less from oracle constraints. This suggests that the challenges for Russian text normalization lie largely in predicting the replacement tokens, possibly due to the morphological complexity of Russian.
Since the motivation for using tags was to improve explainability, we also evaluated the accuracy of the tag prediction on grammatical error correction. For comparison, we trained a baseline Lasertagger (Malmi et al., 2019) on a subset of the BEA training set (30.4K examples) to predict the ERRANT tags. Insertions are represented as composite tags together with the subsequent tag, such that the total Lasertagger vocabulary size is 213. The model was initialized from a pre-trained BERT (Devlin et al., 2019) checkpoint. Decoding was performed using an auto-regressive strategy with a Transformer decoder. We used the default hyper-parameters without any task-specific optimization. Table 13 shows that the tag prediction of our unconstrained model is more accurate than the Lasertagger baseline. Errors in this unconstrained setup are due to predicting either the wrong tag or the wrong span. To tease apart these error sources we also report the accuracy under oracle span constraints. Our span-constrained model achieves a recall of 52.4%, i.e. more than half of the (28) non-self tags are classified correctly.
Related Work

A popular way to tackle NLP problems with overlap between input and output is to equip seq2seq models with a copying mechanism (Jia and Liang, 2016; Zhao et al., 2019; Chen and Bansal, 2018; Gulcehre et al., 2016; See et al., 2017; Gu et al., 2016), usually borrowing ideas from pointer networks to point to single tokens in the source sequence. In contrast, we use pointer networks to identify entire spans that are to be copied, which results in a much more compact representation and faster decoding. The idea of using span-level edits has also been explored for morphological learning in the form of symbolic span-level rules, but not in combination with neural models (Elsner et al., 2019).
Our work is related to neural multi-task learning for NLP (Collobert and Weston, 2008;Dong et al., 2015;Luong et al., 2015;Søgaard and Goldberg, 2016). Unlike multi-task learning which typically solves separate problems (e.g. POS tagging and named entity recognition (Collobert and Weston, 2008) or translation into different languages (Luong et al., 2015;Dong et al., 2015)) with the same model, our three output features (tag, source span, and replacement) represent the same output sequence (Algorithm 1). Thus, it resembles the stack-propagation approach of Zhang and Weiss (2016) who use POS tags to improve parsing performance.
A more recent line of research frames sequence editing as a labelling problem using labels such as ADD, KEEP, and DELETE (Ribeiro et al., 2018; Dong et al., 2019; Mallinson et al., 2020; Malmi et al., 2019; Awasthi et al., 2019), often relying heavily on BERT (Devlin et al., 2019) pre-training. Similar operations such as insertions and deletions have also been used for machine translation (Gu et al., 2019b; Stern et al., 2019; Gu et al., 2019a; Ostling and Tiedemann, 2017; Stahlberg et al., 2018). We showed in Sec. 3 that our model often performs similarly to or better than those approaches, with the added advantage of providing explanations for its predictions.

Discussion
We have presented a neural model that represents sequence transduction using span-based edit operations. We reported competitive results on five different NLP problems, improving the state of the art on text normalization, sentence splitting, and the JFLEG test set for grammatical error correction. We showed that our approach is 2.0-5.2 times faster than a full sequence model for grammatical error correction. Our model can predict labels that explain each edit to improve interpretability for the end-user. However, we do not claim that Seq2Edits provides insights into the internal mechanics of the neural model: the underlying neural model is as much a black box as a regular full sequence model. While our model is advantageous in terms of speed and explainability, it does have some weaknesses. First, the model uses a tailored architecture (Figure 2) that would require some engineering effort to implement efficiently. Second, the output of the model tends to be less fluent than that of a regular full sequence model, as can be seen from the examples in Table 19. This is not an issue for localized edit tasks such as text normalization but may be a drawback for tasks involving substantial rewrites (e.g. GEC for non-native speakers).
Even though our approach is open-vocabulary, future work will explore task specific restrictions. For example, in a model for dialog applications, we may want to restrict the set of response tokens to a predefined list. Alternatively, it may be useful to explore generation in a non left-to-right order to improve the efficiency of inference.
Another line of future work is to extend our model to sequence rewriting tasks, such as machine translation post-editing, that do not have existing error-tag dictionaries. This research would require inducing error tag inventories using either linguistic insights or unsupervised methods.

A Task-specific Tag Sets

Tables 14 to 16 list the non-trivial tag sets for text normalization, sentence fusion, and grammatical error correction, respectively. In addition to the tags listed in the tables, we use the tags SELF and EOS (end of sequence). For sentence splitting and simplification we use the trivial tag set consisting of SELF, NON_SELF, and EOS.

B Example Outputs
Tables 17 to 19 provide example outputs from the Seq2Edits model. For clarity, we use word-level rather than subword- or character-level source positions and collapse multi-word replacements into a single operation in the edit representation examples. Example outputs of our sentence fusion system are shown in Table 17. The predicted tags capture the variety of strategies for sentence fusion, such as simple connector particles (SINGLE_S_COORD), cataphoras (SINGLE_CATAPHORA), verb phrases (SINGLE_VP_COORD), relative clauses (SINGLE_RELATIVE), and appositions (SINGLE_APPOSITION). The last example in Table 17 demonstrates that our model is able to produce even major rewrites.
Table 18 compares our model with a full sequence baseline on English text normalization. Correctly predicting non-trivial tags helps our model choose the right verbalizations. In the first example in Table 18, our model predicts the CARDINAL tag rather than ORDINAL and thus produces the correct verbalization for '93'. In the second example, our model generates a time expression for '1030' and '1230' as it predicted the DATE tag for these spans. The third example demonstrates that the edit model can avoid some of the 'unrecoverable' errors (Sproat and Jaitly, 2016) of the full sequence model, such as mapping '168' to 'thousand one hundred sixty eight'.

Finally, the grammatical error correction examples in Table 19 demonstrate the practical advantage of predicting tags along with the edits, as they provide useful feedback to the user. The second example in Table 19 shows that our model is able to handle more complex operations such as word reordering (e.g. "only can" → "can only", marked with the WO word order tag). However, our model fails to inflect "give" correctly in the last example, suggesting that one weakness of our edit model compared to a full sequence model is a weaker target-side language model, resulting in less fluent output. This issue can be mitigated by using stronger models; this particular issue, for example, is fixed in our ensemble.

[The full example outputs of Tables 17 to 19 are not reproduced here.]