Parallel Iterative Edit Models for Local Sequence Transduction

We present a Parallel Iterative Edit (PIE) model for the problem of local sequence transduction arising in tasks like Grammatical error correction (GEC). Recent approaches are based on the popular encoder-decoder (ED) model for sequence to sequence learning. The ED model auto-regressively captures full dependency among output tokens but is slow due to sequential decoding. The PIE model does parallel decoding, giving up the advantage of modeling full dependency in the output, yet it achieves accuracy competitive with the ED model for four reasons: 1. predicting edits instead of tokens, 2. labeling sequences instead of generating sequences, 3. iteratively refining predictions to capture dependencies, and 4. factorizing logits over edits and their token argument to harness pre-trained language models like BERT. Experiments on tasks spanning GEC, OCR correction and spell correction demonstrate that the PIE model is an accurate and significantly faster alternative for local sequence transduction.


Introduction
In local sequence transduction (LST) an input sequence x_1, ..., x_n needs to be mapped to an output sequence y_1, ..., y_m where the x and y sequences differ only in a few positions, m is close to n, and x_i, y_j come from the same vocabulary Σ. An important application of local sequence transduction that we focus on in this paper is Grammatical error correction (GEC). We contrast local transduction with more general sequence transduction tasks like translation and paraphrasing, which might entail a different input-output vocabulary and non-local alignments. The general sequence transduction task is cast as sequence to sequence (seq2seq) learning and modeled popularly using an attentional encoder-decoder (ED) model. The ED model auto-regressively produces each token y_t in the output sequence conditioned on all previous tokens y_1, ..., y_{t−1}. Owing to the remarkable success of this model in challenging tasks like translation, almost all state-of-the-art neural models for GEC use it (Zhao et al., 2019; Lichtarge et al., 2019; Ge et al., 2018b; Chollampatt and Ng, 2018b). We take a fresh look at local sequence transduction tasks and present a new parallel-iterative-edit (PIE) architecture. Unlike the prevalent ED model that is constrained to sequentially generating the tokens in the output, the PIE model generates the output in parallel, thereby substantially reducing the latency of sequential decoding on long inputs. However, matching the accuracy of existing ED models without the luxury of conditional generation is highly challenging. Recently, parallel models have also been explored in tasks like translation (Stern et al., 2018; Gu et al., 2018; Kaiser et al., 2018) and speech synthesis (van den Oord et al., 2018), but their accuracy is significantly lower than that of the corresponding ED models. (Correspondence to: awasthi@cse.iitb.ac.in)
The PIE model incorporates the following four ideas to achieve comparable accuracy on tasks like GEC in spite of parallel decoding. 1. Output edits instead of tokens: First, instead of outputting tokens from a large vocabulary, we output edits such as copies, appends, deletes, replacements, and case-changes, which generalize better across tokens and yield a much smaller vocabulary. Suppose in GEC we have an input sentence: fowler fed dog. Existing seq2seq learning approaches would need to output the four tokens Fowler, fed, the, dog from a word vocabulary, whereas we would predict the edits {Capitalize token 1, Append(the) to token 2, Copy token 3}. 2. Sequence labeling instead of sequence generation: Second, we perform in-place edits on the source tokens and formulate local sequence transduction as labeling the input tokens with edit commands, rather than solving the much harder whole-sequence generation task involving a separate decoder and attention module. Since input and output lengths differ in general, such a formulation is non-trivial, particularly due to edits that insert words. We create special compound edits that merge token inserts with preceding edits, which yield higher accuracy than earlier methods of independently predicting inserts (Ribeiro et al., 2018). 3. Iterative refinement: Third, we increase the inference capacity of the parallel model by iteratively inputting the model's own output for further refinement. This handles dependencies implicitly, in a way reminiscent of Iterated Conditional Modes (ICM) fitting in graphical model inference (Koller and Friedman, 2009). Lichtarge et al. (2019) and Ge et al. (2018b) also refine iteratively, but with ED models. 4. Factorize pre-trained bidirectional LMs: Finally, we adapt recent pre-trained bidirectional models like BERT (Devlin et al., 2018) by factorizing the logit layer over edit commands and their token argument.
Existing GEC systems typically rely on a conventional forward-directional LM to pre-train their decoder, whereas we show how to use a bi-directional LM in the encoder, and that too to predict edits.
Novel contributions of our work are as follows: • Recognizing GEC as a local sequence transduction (LST) problem, rather than machine translation. We then cast LST as fast non-autoregressive sequence labeling, as against the existing auto-regressive encoder-decoder models.
• Our method of reducing LST to non-autoregressive sequence labeling has many novel elements: outputting edit operations instead of tokens, append operations instead of insertions in the edit space, and replacements along with custom transformations.
• We show how to effectively harness a pre-trained language model like BERT using our factorized logit architecture with edit-specific attention masks.
• The parallel inference in PIE is 5 to 15 times faster than a competitive ED-based GEC model like that of Lichtarge et al. (2019), which performs sequential decoding using beam search. PIE also attains close to state-of-the-art performance on standard GEC datasets. On two other local transduction tasks, viz., OCR correction and spell correction, the PIE model is fast and accurate relative to other existing models developed specifically for local sequence transduction.

Our Method
We assume a fully supervised training setup where we are given a parallel dataset of (incorrect, correct) sequence pairs D = {(x_i, y_i) : i = 1 ... N}, and an optional large corpus of unpaired correct sequences L = {ỹ_1, ..., ỹ_U}. In GEC, this could be a corpus of grammatically correct sentences used to pre-train a language model.
Background: existing ED model. Existing seq2seq ED models factorize Pr(y|x) to capture the full dependency between a token y_t and all previous tokens y_{<t} = y_1, ..., y_{t−1} as Pr(y|x) = ∏_{t=1}^{m} Pr(y_t | y_{<t}, x). An encoder converts input tokens x_1, ..., x_n to contextual states h_1, ..., h_n and a decoder summarizes y_{<t} to a state s_t. An attention distribution over the contextual states, computed from s_t, determines the relevant input context c_t, and the output token distribution is calculated as Pr(y_t | y_{<t}, x) = Pr(y_t | c_t, s_t). Decoding is done sequentially using beam search. When a correct sequence corpus L is available, the decoder is pre-trained on a next-token prediction loss and/or a trained LM is used to re-rank the beam-search outputs (Zhao et al., 2019; Chollampatt and Ng, 2018a).

Overview of the PIE model
We move from generating tokens in the output sequence y using a separate decoder to labelling the input sequence x_1, ..., x_n with edits e_1, ..., e_n. For this we need to design a function Seq2Edits that takes as input an (x, y) pair in D and outputs a sequence e of edits from an edit space E, where e is of the same length as x in spite of x and y being of different lengths. In Section 2.2 we show how we design such a function.
Training: We invoke Seq2Edits on D and learn the parameters of a probabilistic model Pr(e|x, θ) to assign a distribution over edit labels on the tokens of the input sequence. In Section 2.3 we describe the PIE architecture in more detail. The correct corpus L, when available, is used to pre-train the encoder to predict an arbitrarily masked token y_t of a sequence y in L, much like in BERT. Unlike in existing seq2seq systems, where L is used to pre-train a decoder that captures only forward dependencies, in our pre-training the predicted token y_t depends on both forward and backward contexts. This is particularly useful for GEC-type tasks where the future context y_{t+1}, ..., y_m can be approximated by x_{t+1}, ..., x_n.
Inference: Given an input x, the trained model predicts the edit distribution for each input token independently of the others, that is, Pr(e|x, θ) = ∏_{t=1}^{n} Pr(e_t | x, t, θ), and thus does not entail the latency of sequential token generation of the ED model. We output the most probable edits ê = argmax_e Pr(e|x, θ). Edits are designed so that we can easily get the edited sequence ŷ after applying ê on x. ŷ is further refined by iteratively applying the model on the generated output until we get a sequence identical to one of the previous sequences, up to a maximum number of iterations I.
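The inference loop above can be sketched as follows; the edit predictor, the edit applier, and the stopping test are simplified stand-ins for the trained model (the names are ours):

```python
# Sketch of PIE inference (our naming): edits for all positions are predicted
# in one parallel pass, applied in place, and the result is fed back in until
# the output repeats a previously seen sequence or I iterations are reached.
def iterative_refine(predict_edits, apply_edits, x, max_iters=4):
    seen = {tuple(x)}
    y = x
    for _ in range(max_iters):
        e = predict_edits(y)        # one parallel labeling pass over y
        y_new = apply_edits(y, e)   # in-place application of predicted edits
        if tuple(y_new) in seen:    # fixed point reached: stop refining
            return y_new
        seen.add(tuple(y_new))
        y = y_new
    return y

# Toy stand-ins: capitalize a lone "i"; copy everything else.
def toy_predict(tokens):
    return ["CAP" if t == "i" else "COPY" for t in tokens]

def toy_apply(tokens, edits):
    return [t.capitalize() if e == "CAP" else t for t, e in zip(tokens, edits)]

print(iterative_refine(toy_predict, toy_apply, ["i", "am", "here"]))
# -> ['I', 'am', 'here']
```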

The Seq2Edits Function
Given x = x_1, ..., x_n and y = y_1, ..., y_m where m may not be equal to n, our goal is to obtain a sequence of edit operations e = (e_1, ..., e_n) : e_i ∈ E such that applying edit e_i on the input token x_i at each position i reconstructs the output sequence y. Invoking an off-the-shelf edit distance algorithm between x and y can give us a sequence of copy, delete, replace, and insert operations of arbitrary length. The main difficulty is converting the insert operations into in-place edits at each x_i. Other parallel models (Ribeiro et al., 2018) have used methods like predicting insertion slots in a pre-processing step, or predicting zero or more tokens in-between any two tokens in x. We will see in Section 3.1.4 that these options do not perform well. Hence we design an alternative edit space E that merges inserts with preceding edit operations, creating compound append or replace operations. Further, we create a dictionary Σ_a of common q-gram insertions or replacements observed in the training data. Our edit space E comprises copy (C), delete (D), append (A(w)), and replace (R(w)) edits, where the argument w is drawn from Σ_a.
Figure 1: The Seq2Edits function used to convert a sequence pair (x, y) into in-place edits.
  Require: x = (x_1, ..., x_n), y = (y_1, ..., y_m); T: list of transformations
  diffs ← LEVENSHTEIN-DIST(x, y) with modified cost
  diffs ← break substitutions in diffs; merge consecutive inserts into q-grams
  Σ_a ← M most common inserts in the training data
  e ← EMPTYARRAY(n)
For GEC, we additionally use transformation edits denoted T_1, ..., T_k which perform word inflection (e.g., arrive to arrival). The space of all edits is thus E = {C, D} ∪ {A(w), R(w) : w ∈ Σ_a} ∪ {T_1, ..., T_k}. We present our algorithm for converting sequences x and y into in-place edits on x using the above edit space in Figure 1, and Table 1 gives examples of converting (x, y) pairs to edit sequences. We first invoke the Levenshtein distance algorithm (Levenshtein, 1966) to obtain a diff between x and y, with the usual delete and insert cost of 1 but a modified substitution cost that favors the matching of related words. We detail this modified cost in the Appendix and show an example of how this modification leads to more sensible edits. The diff is post-processed to convert substitutions into deletes followed by inserts, and consecutive inserts are merged into a q-gram. We then create a dictionary Σ_a of the M most frequent q-gram inserts in the training set. Thereafter, we scan the diff left to right: a copy at x_i makes e_i = C; a delete at x_i makes e_i = D; an insert w at x_i with e_i = C flips e_i into A(w) if w is in Σ_a, else the insert is dropped; an insert w at x_i with e_i = D flips e_i into T(w) if a matching transformation is found, else into a replace R(w) if w is in Σ_a, else the insert is dropped.
The above algorithm does not guarantee that when e is applied on x we will recover y for all sequences in the training data. This is because we limit Σ_a to include only the M most frequently inserted q-grams. For local sequence transduction tasks, we expect a long chain of consecutive inserts to be rare, hence our experiments were performed with q = 2. For example, in the NUCLE dataset (Ng et al., 2014), which has roughly 57.1K sentences, fewer than 3% of sentences have three or more consecutive inserts.
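A minimal Seq2Edits sketch is given below. It uses Python's difflib as a stand-in for the modified-cost Levenshtein diff described above, and it omits the transformation edits and the M-most-frequent cutoff for Σ_a; inserts are merged into the preceding edit as compound append or replace edits:

```python
# A simplified Seq2Edits sketch (ours). difflib stands in for the paper's
# modified-cost Levenshtein diff; transformations and the q-gram insert
# dictionary cutoff are omitted for brevity.
import difflib

def seq2edits(x, y):
    e = ["C"] * len(x)                       # default: copy each source token
    sm = difflib.SequenceMatcher(a=x, b=y, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "delete":
            for i in range(i1, i2):
                e[i] = "D"
        elif tag == "replace":
            w = " ".join(y[j1:j2])
            e[i1] = f"R({w})"                # delete + insert -> replace
            for i in range(i1 + 1, i2):
                e[i] = "D"
        elif tag == "insert" and i1 > 0 and e[i1 - 1] == "C":
            # copy + insert -> compound append on the preceding token;
            # inserts before the first token are dropped in this sketch.
            w = " ".join(y[j1:j2])
            e[i1 - 1] = f"A({w})"
    return e

print(seq2edits("fowler fed dog".split(), "fowler fed the dog".split()))
# -> ['C', 'A(the)', 'C']
```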

The Parallel Edit Prediction Model
We next describe our model for predicting edits e = e_1, ..., e_n on an input sequence x = x_1, ..., x_n. We use a bidirectional encoder to provide a contextual encoding of each x_i. This can be multiple layers of bidirectional RNNs, CNNs, or deep bidirectional transformers. We adopt the deep bidirectional transformer architecture since it encodes the input in parallel. We pre-train the model using L, much like in the BERT pre-training recently proposed for language modeling (Devlin et al., 2018). We first give an overview of BERT and then describe our model.
Background: BERT. The input is a token embedding x_i and a positional embedding p_i for each token x_i in the sequence x; call these jointly h^0_i = [x_i, p_i]. Each layer ℓ produces an h^ℓ_i at each position i as a function of h^{ℓ−1}_i and self-attention over h^{ℓ−1}_1, ..., h^{ℓ−1}_n. The BERT model is pre-trained by masking out a small fraction of the input tokens with a special MASK token, and predicting each masked word from its bidirectional context captured in the last layer outputs h_1, ..., h_n.

A Default use of BERT
Since we have cast our task as a sequence labeling task, a default output layer would compute Pr(e_i|x) as a softmax over the space of edits E from each h_i. If W_e denotes the softmax parameter for edit e, we get: Pr(e_i = e | x) = softmax_e(W_e · h_i) (2). The softmax parameters W_e in Equation 2 have to be trained from scratch. We propose a method to exploit the token embeddings of the pre-trained language model to warm-start the training of edits like appends and replaces which are associated with a token argument. Furthermore, for appends and replaces, we provide a new method of computing the hidden layer output via alternative input positional embeddings and self-attention. We do so without introducing any new parameters in the hidden layers of BERT.
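The default labeling head can be sketched as a per-position softmax over the edit space, as in Equation 2; shapes and names here are illustrative:

```python
# Default head (Section 2.3.1) as a sketch: one learned vector W_e per edit,
# scored against each contextual encoding h_i; shapes are illustrative.
import numpy as np

def edit_softmax(H, W):
    """H: (n, d) contextual encodings; W: (|E|, d) per-edit parameters."""
    logits = H @ W.T                               # (n, |E|) edit scores
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)        # Pr(e_i | x) per position

rng = np.random.default_rng(0)
P = edit_softmax(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
assert P.shape == (5, 3) and np.allclose(P.sum(axis=1), 1.0)
```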

An Edit-factorized BERT Architecture
We adapt a pre-trained BERT-like bidirectional language model to learn to predict edits as follows.
To suitably capture the contexts for replace edits, for each position i we create an additional input comprising r^0_i = [M, p_i], where M is the embedding of the MASK token in the LM. Likewise, for a potential insert between positions i and i+1 we create an additional input a^0_i = [M, (p_i + p_{i+1})/2], where the second component is the average of the position embeddings of the ith and (i+1)th positions. As shown in Figure 2, at each layer we compute self-attention for the ith replace unit r_i over h_j for all j ≠ i, and over itself. Likewise, for the append unit a_i the self-attention is over h_j for all j, and over itself. At the last layer we have h_1, ..., h_n, r_1, ..., r_n, a_1, ..., a_n. Using these we compute logits factorized over edits and their token argument. For an edit e ∈ E at position i, let w denote the argument of the edit (if any). As mentioned earlier, w can be a q-gram for append and replace edits. The embedding of w, denoted φ(w), is obtained by summing the individual output embeddings of the tokens in w. Additionally, in the outer layer we allocate edit-specific parameters θ corresponding to each distinct command in E. Using these, the edit logits are computed as follows:
  s(C, i) = θ_C · h_i + φ(x_i) · h_i
  s(D, i) = θ_D · h_i
  s(A(w), i) = θ_A · a_i + φ(x_i) · h_i + φ(w) · a_i
  s(R(w), i) = θ_R · r_i + φ(w) · r_i − φ(x_i) · r_i
  s(T_k, i) = θ_{T_k} · h_i + φ(x_i) · h_i
The first term on the RHS of each equation captures an edit-specific score. The second term captures the score for copying the current word x_i to the output. The third term models the influence of a new incoming token in the output produced by a replace or append edit. For replace edits, the score of the replaced word is subtracted from the score of the incoming word. For transformations we add the copy score because they typically modify only the word form, hence we do not expect the meaning of the transformed word to change significantly.
The above equations provide insight into why predicting independent edits is easier than predicting independent tokens. Consider the append edit A(w). Instead of independently predicting x_i at position i and w at position i+1, we jointly predict the tokens in these two slots and contrast that against not inserting any new w after x_i, all within a single softmax. We will show empirically (Sec 3.1.4) that such selective joint prediction is key to obtaining high accuracy in spite of parallel decoding.
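The factorized scoring can be sketched as below. Variable names are ours; the term structure follows the prose description: an edit-specific score, a copy score for x_i, and an incoming-token score for the append/replace argument w (with the replaced word's score subtracted for replaces):

```python
# Sketch (our names) of factorized edit logits at one position i:
# theta[...] are edit-specific parameters; h, r, a are the token, replace,
# and append hidden states; phi_x, phi_w are output embeddings of x_i and
# the candidate argument w.
import numpy as np

def edit_logits(h, r, a, phi_x, phi_w, theta):
    return {
        "C":    theta["C"] @ h + phi_x @ h,              # keep x_i
        "D":    theta["D"] @ h,                          # drop x_i (no copy)
        "A(w)": theta["A"] @ a + phi_x @ h + phi_w @ a,  # keep x_i, add w
        "R(w)": theta["R"] @ r + phi_w @ r - phi_x @ r,  # swap x_i for w
    }

d = 4
v = np.ones(d)
s = edit_logits(v, v, v, v, v, {k: v for k in "CDAR"})
print(s)  # appends jointly score the copy term and the new-token term
```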
Finally, the loss for a training example (e, x) is obtained by summing the cross-entropy associated with predicting edit e_i at each token x_i.

Experiments
We compare our parallel iterative edit (PIE) model with state-of-the-art GEC models, all of which are based on attentional encoder-decoder architectures. In Section 3.2 we show that the PIE model is also effective on two other local sequence transduction tasks: spell correction and OCR correction. Hyperparameters for all experiments are provided in Tables 12 and 13 of the Appendix.

Grammatical error correction (GEC)
We use Lang-8 (Mizumoto et al., 2011) as the parallel GEC training corpus. We first pre-train the encoder on a large corpus of correct text (Zhu et al., 2015) (800M words) to predict 15% randomly masked words using its deep bidirectional context. Next, we perform 2 epochs of training on a synthetically perturbed version of the One-Billion-word corpus (Chelba et al., 2013). We refer to this as synthetic training. Details of how we create the synthetic corpus appear in Section A.3 of the appendix. Finally, we fine-tune on the real GEC training corpus for 2 epochs. We use a batch size of 64 and a learning rate of 2e-5. The edit space consists of copy, delete, 1000 appends, 1000 replaces, and 29 transformations and their inverses. Arguments of append and replace operations mostly comprise punctuation, articles, pronouns, prepositions, conjunctions, and verbs. Transformations perform inflections like adding the suffix s, d, es, ing, or ed, or replacing suffix s with ing, d with s, etc. These transformations were chosen from common replaces in the training data such that many replace edits map to only a few transformation edits, in order to help the model better generalize replaces across different words. In Section 3.1.4 we see that transformations increase the model's recall. The complete list of transformations appears in Table 10 of the Appendix. We evaluate the F_0.5 score over span-level corrections from the MaxMatch (M2) scorer (Dahlmeier and Ng, 2012) on the CoNLL-2014 test set. Like most existing GEC systems, we invoke a spell-checker on the test sentences before applying our model. We also report GLEU+ (Napoles et al., 2016) scores on the JFLEG corpus (Napoles et al., 2017) to evaluate fluency.
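The suffix-style transformation edits can be sketched as follows; this is our minimal illustration, not the paper's implementation (its full list of 29 transformations is in Table 10 of its Appendix):

```python
# Minimal sketch (ours) of suffix transformation edits: add a suffix, or
# replace one suffix with another, applied to a single word in place.
def apply_transform(word, t):
    kind = t[0]
    if kind == "add":                  # e.g. ("add", "d"): arrive -> arrived
        return word + t[1]
    if kind == "replace":              # e.g. ("replace", "d", "s")
        old, new = t[1], t[2]
        return word[: -len(old)] + new if word.endswith(old) else word
    return word                        # unknown kinds leave the word unchanged

print(apply_transform("arrive", ("add", "d")))            # -> arrived
print(apply_transform("advised", ("replace", "d", "s")))  # -> advises
```

Because one transformation edit covers many surface replace pairs (arrive/arrived, use/used, ...), the model can generalize a correction pattern across words it never saw replaced in training.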

Overall Results
Table 3 compares ensemble results of PIE and other state-of-the-art models, all of which happen to be seq2seq ED models and also use ensemble decoding. For PIE, we simply average the probability distributions over edits from 5 independently trained models. In Table 2 we compare non-ensemble numbers of PIE with the best available non-ensemble numbers of competing methods. On the CoNLL-14 test set our results are very close to the highest reported by Zhao et al. (2019). These results show that our parallel prediction model is competitive without incurring the overheads of beam search and slow decoding of sequential models. The GLEU+ score, which rewards fluency, is somewhat lower for our model on the JFLEG test set because of parallel predictions. We do not fine-tune our model on the JFLEG dev set, and we expect these scores to improve with LM-based re-ranking. All subsequent ablations and timing measurements are reported for non-ensemble models.

Running Time Comparison
Parallel decoding enables PIE models to be considerably faster than ED models. In Figure 3 we compare the wall-clock decoding time of PIE with 24 encoder layers (PIE-LARGE, F_0.5 = 59.7), PIE with 12 encoder layers (PIE-BASE, F_0.5 = 56.6), and the competitive ED architecture of Lichtarge et al. (2019) with 6 encoder and 6 decoder layers (T2T, F_0.5 = 56.8) on the CoNLL-14 test set. All decoding experiments were run and measured on an n1-standard-2 VM instance with a single TPU shard (v2-8). We observe that even PIE-LARGE is a factor of 5 to 15 faster than an equivalent transformer-based ED model (T2T) with beam size 4. The running time of PIE-LARGE increases sub-linearly with sentence length, whereas the ED model's decoding time increases linearly.

Impact of Iterative Refinement
We next evaluate the impact of iterative refinement on accuracy in Table 4. Out of 1312 sentences in the test set, only 832 sentences changed in the first round; these were then fed to the second round, where only 156 sentences changed, and so on. The average number of refinement rounds per example was 2.7. In contrast, a sequential model on this dataset would require 23.2 steps, corresponding to the average number of tokens in a sentence. The F_0.5 score increases from 57.9 to 59.5 at the end of the second iteration. Table 5 presents some sentences corrected by PIE. We see that PIE makes multiple parallel edits in a round if needed. We also see how refinement over successive iterations captures output-space dependency: for example, in the second sentence interact gets converted to interacted, followed by insertion of have in the next round.

Ablation study on the PIE Architecture
In this section we perform ablation studies to understand the importance of individual features of the PIE model. Synthetic Training: We evaluate the impact of training on the artificially generated GEC corpus in row 2 of Table 6. We find that without it the F_0.5 score is 3.4 points lower. Factorized Logits: We evaluate the gains due to our edit-factorized BERT model (Section 2.3.2) over the default BERT model (Section 2.3.1). Table 6 (row 3) shows a 1.2-point drop in F_0.5 score relative to the factorized model (row 2) when factorization is absent.
Inserts as Appends on the preceding word was another important design choice. The alternative of predicting each insert independently at each gap, with a null token added to Σ_a, performs 2.7 F_0.5 points worse (Table 6, row 4 vs. row 2).
Transformation edits are significant, as we observe a 6.3-point drop in recall without them (row 5).

Impact of Language Model
We evaluate the benefit of starting from BERT's pre-trained LM by reporting accuracy from an uninitialized network (row 6). We observe a 20-point drop in F_0.5, establishing the importance of LMs in GEC.

Impact of Network Size
We train the BERT-Base model with one-third fewer parameters than BERT-LARGE.

More Sequence Transduction Tasks
We demonstrate the effectiveness of the PIE model on two additional local sequence transduction tasks recently used in Ribeiro et al. (2018).

Spell Correction
We use the Twitter spell correction dataset (Aramaki, 2010), which consists of 39,172 pairs of original and corrected words obtained from Twitter. We use the same train-dev-test split as Ribeiro et al. (2018) (31172/4000/4000). We tokenize on characters, and our vocabulary Σ and dictionary Σ_a comprise the 26 lower-cased letters of English.
Correcting OCR errors: We use the Finnish OCR dataset of Silfverberg et al. (2016), comprising words extracted from the Early Modern Finnish corpus of OCR-processed newspaper text. We use the same train-dev-test splits as provided by Silfverberg et al. (2016). We tokenize on the characters in each word. For a particular split, our vocabulary Σ and dictionary Σ_a comprise all characters seen in the training data of that split.
Architecture: For all the tasks in this section, PIE is a 4-layer self-attention transformer with 200 hidden units, 400 intermediate units, and 4 attention heads. No pre-initialization from L is done, and the number of refinement iterations is set to 1.
Results: Table 7 presents whole-word 0/1 accuracy for these tasks on PIE and the following methods: Ribeiro et al. (2018)'s local transduction model (described in Section 4), and LSTM-based ED models with hard monotonic attention (Aharoni and Goldberg, 2017) and soft attention (Bahdanau et al., 2015), as reported in (Ribeiro et al., 2018).

Related Work

Grammatical error correction: Approaches attempted so far include rules (Felice et al., 2014), classifiers (Rozovskaya and Roth, 2016), statistical machine translation (SMT) (Junczys-Dowmunt and Grundkiewicz, 2016), neural ED models (Chollampatt and Ng, 2018a; Ge et al., 2018a), and hybrids. All recent neural approaches are sequential ED models that predict either word sequences (Zhao et al., 2019; Lichtarge et al., 2019) or character sequences (Xie et al., 2016), using multi-layer RNNs (Ji et al., 2017), CNNs (Chollampatt and Ng, 2018a; Ge et al., 2018a), or Transformers (Lichtarge et al., 2019). Our sequence labeling formulation is similar to Yannakoudakis et al. (2017) and Kaili et al. (2018), but the former uses it only to detect errors and the latter corrects only five error types using separate classifiers. Edits have been exploited in earlier GEC systems too, but very unlike our method of re-architecting the core model to label the input sequence with edits. Schmaltz et al. (2017) interleave edit tags in target tokens but use seq2seq learning to predict the output sequence. Chollampatt and Ng (2018a) use edits as features for re-scoring seq2seq predictions. Others use an edit-weighted MLE objective to emphasize corrective edits during seq2seq learning. Stahlberg et al. (2019) use finite state transducers, whose state transitions denote possible edits, built from an unlabeled corpus to constrain the output of a neural beam decoder to a small GEC-feasible space.
Parallel decoding in neural machine translation: Kaiser et al. (2018) achieve partial parallelism by first generating latent variables sequentially to model dependency. Stern et al. (2018) use a parallel generate-and-test method with modest speed-up. Gu et al. (2018) generate all tokens in parallel but initialize decoder states using latent fertility variables that determine the number of replicas of an encoder state; we achieve the effect of fertility using delete and append edits. Other recent work generates target sequences iteratively but requires the target sequence length to be predicted at the start. In contrast, our in-place edit model allows the target sequence length to change via appends.
Local sequence transduction is handled in Ribeiro et al. (2018) by first predicting insert slots in x using learned insertion patterns and then using a sequence labeling task to output tokens of x or a special delete token. Instead, we output edit operations, including word transformations. Their pattern-based insert pre-slotting is unlikely to work for more challenging tasks like GEC. Koide et al. (2018) design a special edit-invariant neural network to be robust to small edit changes in input biological sequences; this is a different task from our edit prediction. Yin et al. (2019) study neural representations of edits, specifically for structured objects like source code; this again is a different problem from ours.

Conclusion
We presented a parallel iterative edit (PIE) model for local sequence transduction with a focus on the GEC task. Compared to the popular encoder-decoder models that perform sequential decoding, parallel decoding in the PIE model yields a factor of 5 to 15 reduction in decoding time. The PIE model employs a number of ideas to match the accuracy of sequential models in spite of parallel decoding: it predicts in-place edits using a carefully designed edit space, iteratively refines its own predictions, and effectively reuses state-of-the-art pre-trained bidirectional language models. In the future we plan to apply the PIE model to more ambitious transduction tasks like translation.