FELIX: Flexible Text Editing Through Tagging and Insertion

We present FELIX – a flexible text-editing approach for generation, designed to derive maximum benefit from the ideas of decoding with bi-directional contexts and self-supervised pretraining. In contrast to conventional sequence-to-sequence (seq2seq) models, FELIX is efficient in low-resource settings and fast at inference time, while being capable of modeling flexible input-output transformations. We achieve this by decomposing the text-editing task into two sub-tasks: tagging, which decides on the subset of input tokens and their order in the output text, and insertion, which in-fills tokens that appear in the output but not in the input. The tagging model employs a novel Pointer mechanism, while the insertion model is based on a Masked Language Model (MLM). Both of these models are chosen to be non-autoregressive to guarantee faster inference. FELIX performs favourably when compared to recent text-editing methods and strong seq2seq baselines when evaluated on four NLG tasks: Sentence Fusion, Machine Translation Automatic Post-Editing, Summarization, and Text Simplification.


Introduction
The ideas of text in-filling coupled with self-supervised pre-training of deep Transformer networks on large text corpora have dramatically changed the landscape in Natural Language Understanding. BERT (Devlin et al., 2019) and its successive refinements RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019) implement this recipe and have significantly pushed the state-of-the-art on multiple NLU benchmarks such as GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016). More recently, masked or in-filling style objectives for model pretraining have been applied to seq2seq tasks, significantly pushing the state-of-the-art on a number of text generation tasks, e.g., KERMIT (Chan et al., 2019), MASS (Song et al., 2019), Bert2Bert (Rothe et al., 2020), BART (Lewis et al., 2020) and T5 (Raffel et al., 2019).
While seq2seq frameworks offer a generic tool for modeling almost any kind of text-to-text transduction, there are still many real-world tasks where generating target texts completely from scratch, as is done with seq2seq approaches, can be unnecessary. This is especially true for monolingual settings where input and output texts have relatively high degrees of overlap. In such cases a natural approach is to cast conditional text generation as a text-editing task, where the model learns to reconstruct target texts by applying a set of edit operations to the inputs. Typically, the set of edit operations is fixed and pre-defined ahead of time, which on one hand limits the flexibility of the model to reconstruct arbitrary output texts from their inputs, but on the other hand leads to higher sample efficiency, as the limited set of allowed operations significantly reduces the search space. Based on this observation, text-editing approaches have recently regained significant interest (Gu et al., 2019; Dong et al., 2019; Awasthi et al., 2019; Malmi et al., 2019).

In this paper we present a novel text-editing framework, FELIX, which is heavily inspired by the ideas of bi-directional decoding (slot in-filling) and self-supervised pre-training. In particular, we have designed FELIX with the following requirements in mind:

Sample efficiency. Training a high-precision text generation model typically requires large amounts of high-quality supervised data. Self-supervised techniques based on text in-filling have been shown to provide a crucial advantage in low-resource settings. Hence, we focus on approaches able to benefit from existing pre-trained language models such as BERT, where the final model is directly fine-tuned on the downstream task. We show that this allows us to train on as few as 450 datapoints.
Fast inference time. Achieving low latencies when serving text-generation models typically requires specialized hardware and finding a trade-off between model size and accuracy. One major reason for slow inference times is that text-generation models typically employ an autoregressive decoder, i.e., output texts are generated in a sequential non-parallel fashion. To ensure faster inference times we opt for keeping FELIX fully non-autoregressive, resulting in two orders of magnitude speedups.
Flexible text editing. While simplifying the learning task, text-editing models are not as powerful as general-purpose sequence-to-sequence approaches when it comes to modeling arbitrary input-output text transductions. Hence, we strive to strike a balance between the complexity of learned edit operations and the percentage of input-output transformations the model can capture.
We propose to tackle text editing by decomposing it into two sub-problems: tagging and insertion (see Fig. 1). Our tagger is a Transformer-based network that implements a novel Pointing mechanism (Vinyals et al., 2015). It decides which source tokens to preserve and in which order they appear in the output, thus allowing for arbitrary word reordering.
Target words not present in the source are represented by generic slot predictions, which are in-filled by the insertion model. To benefit from self-supervised pre-training, we chose our insertion model to be fully compatible with the BERT architecture, such that we can easily re-use a publicly available pre-trained checkpoint.
By decomposing text-editing tasks in this way we redistribute the complexity of generating an output text between the two models: the source text already provides most of the building blocks required to reconstruct the target, which is handled by the tagging model. The missing pieces are then in-filled by the insertion model, whose job becomes much easier as most of the output text is already in place. Moreover, such a two-step approach is key to being able to use completely non-autoregressive decoding for both models and still achieve competitive results compared to fully autoregressive approaches.
We evaluate FELIX on four distinct text generation tasks: Sentence Fusion, Text Simplification, Summarization, and Automatic Post-Editing for Machine Translation, and compare it to recent text-editing and seq2seq approaches. Each task is unique in the editing operations required and the amount of training data available, which helps to better quantify the value of the solutions we have integrated into FELIX.

Model description
FELIX decomposes the conditional probability of generating an output sequence y from an input x as follows:

$p(y \mid x) = p_{\text{tag}}(y^t, \pi \mid x)\; p_{\text{ins}}(y \mid y^m),$

where the two terms correspond to the tagging and the insertion model. The term $y^t$ is the output of the tagging model and consists of a sequence of tags assigned to each input token of x together with a permutation π, which reorders the input tokens. The term $y^m$ denotes an intermediate sequence with masked spans, constructed deterministically from x, $y^t$ and π, which is fed into the insertion model. Given this factorization, both models can be trained independently.
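To make the factorization concrete, here is a toy walk-through of the intermediate representations (our own illustrative example and simplified tag encoding, not taken from the paper):

```python
# Toy illustration (our own example) of the FELIX factorization for the
# hypothetical edit "The big cat ." -> "The very big cat .".
source = ["The", "big", "cat", "."]

# Tagging model, p(y^t, pi | x): every token is kept in order, and an insertion
# slot is predicted in front of "big" (the tag encoding here is simplified).
tags = ["KEEP", "INSERT", "KEEP", "KEEP"]   # INSERT: keep "big" and insert before it
permutation = [0, 1, 2, 3]                  # identity permutation: no reordering

# y^m is built deterministically from x, y^t and pi: kept tokens in output
# order, with MASK placeholders standing in for the predicted insertion slots.
y_m = ["The", "[MASK]", "big", "cat", "."]

# Insertion model, p(y | y^m): a masked language model in-fills the slots.
y = ["The", "very", "big", "cat", "."]
```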

Tagging Model
The tagging model is composed of three steps: (1) Encoding: the source sentence is first encoded using a 12-layer BERT-base model. (2) Tagging: a tagger applied on top of the encoder assigns a tag to each source token. (3) Pointing: a pointer network, using attention over the encoder's hidden states, re-orders the source tokens. FELIX is trained to optimize both the tagging and the pointing loss:

$\mathcal{L} = \mathcal{L}_{\text{tag}} + \lambda\, \mathcal{L}_{\text{pointer}},$

where λ is a hyperparameter.
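A minimal PyTorch sketch of this objective is shown below (written for this description, not the released implementation; module names, tensor shapes, and the label encoding are our assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class TaggingLoss(nn.Module):
    """Combines the tag classification loss and the pointing loss as
    L = L_tag + lambda * L_pointer."""

    def __init__(self, hidden_size: int, num_tags: int, lam: float = 1.0):
        super().__init__()
        self.tag_proj = nn.Linear(hidden_size, num_tags)  # single feed-forward layer f
        self.lam = lam

    def forward(self, encoder_states, tag_labels, pointer_logits, pointer_labels):
        # encoder_states: [batch, seq_len, hidden] from the 12-layer BERT encoder
        # tag_labels / pointer_labels: [batch, seq_len] gold tag ids / next-token indices
        tag_logits = self.tag_proj(encoder_states)                      # [B, T, num_tags]
        loss_tag = F.cross_entropy(tag_logits.transpose(1, 2), tag_labels)
        # pointer_logits: [batch, seq_len, seq_len] scores over next-token positions
        loss_ptr = F.cross_entropy(pointer_logits.transpose(1, 2), pointer_labels)
        return loss_tag + self.lam * loss_ptr
```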
Tagging. The tag sequence y^t is constructed as follows: source tokens that must be copied are assigned the KEEP tag, tokens not present in the output are marked by the DELETE tag, and token spans present in the output but missing from the input are modeled by the INSERT (INS) tag. The INS tag is then converted into masked token spans that are in-filled by the insertion model. Tags are predicted by applying a single feed-forward layer f to the output of the encoder $h^L$. We define $p(y^t \mid x) = \prod_i p(y^t_i \mid x)$, where i indexes the source tokens. The model is trained to minimize the cross-entropy loss, and during decoding we use argmax to determine the tags.

Pointing. FELIX explicitly models word reordering to allow for larger global edits as well as smaller local changes, such as swapping nearby words, e.g., John and Mary → Mary and John. Without this word reordering step, a vanilla editing model based on tagging alone, such as (Dong et al., 2019), would first need to delete a span (and Mary) and then insert Mary and before John.
FELIX is able to model this without the need for deletions or insertions. Given a sequence x and the predicted tags y^t, the re-ordering model generates a permutation π so that from π and y^t we can reconstruct the insertion model input y^m. Thus we have:

$p(\pi \mid x, y^t) = \prod_i p(\pi(i) \mid x, y^t),$

where π(i) gives the position of the token that follows source token i in the output.

Figure 3: Pointing mechanism to transform "the big very loud cat" into "the very big cat".
We highlight that each π(i) is predicted independently and non-autoregressively. The output of this model is a series of predicted pointers (source token → next target token); y^m can easily be constructed by daisy-chaining the pointers together, as seen in Fig. 3. As highlighted by this figure, FELIX's reordering process is similar to non-projective dependency parsing (Dozat and Manning, 2017), where head relationships are non-autoregressively predicted to form a tree. FELIX similarly predicts next-word relationships, but these form a sequence rather than a tree.
Our implementation is based on a pointer network (Vinyals et al., 2015), where an attention mechanism points to the next token. Unlike previous approaches where a decoder state attends over an encoder sequence, our setup applies intra-attention, where source tokens attend to all other source tokens.
The input to the Pointer layer at position i is a combination of the encoder hidden state $h^L_i$, the embedding of the predicted tag $e(y^t_i)$, and the positional embedding $e(p_i)$:

$h^{L+1}_i = f([h^L_i; e(y^t_i); e(p_i)]).$

The pointer network attends over all hidden states:

$p(\pi(i) = j \mid x, y^t) = \mathrm{softmax}_j\big(\mathrm{att}(h^{L+1}_i, h^{L+1}_j)\big).$

Attention between hidden states is calculated using a query-key network with a scaled dot-product:

$\mathrm{att}(h^{L+1}_i, h^{L+1}_j) = \frac{(Q\, h^{L+1}_i)^\top (K\, h^{L+1}_j)}{\sqrt{d_k}},$

where K and Q are linear projections of $h^{L+1}$ and $d_k$ is the hidden dimension. We found that optionally including an additional Transformer layer prior to the query projection increased performance on movement-heavy datasets. The model is trained to minimize the cross-entropy loss of the pointer network.
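The pointing layer can be sketched as follows (our own PyTorch rendition of the description above; how the tag and positional embeddings are combined is an assumption):

```python
import math
import torch
import torch.nn as nn

class PointerLayer(nn.Module):
    """Intra-attention pointer: every source position attends over all source
    positions; a softmax over the scores gives p(pi(i) = j | x, y^t)."""

    def __init__(self, hidden_size: int, num_tags: int, max_len: int = 512):
        super().__init__()
        self.tag_emb = nn.Embedding(num_tags, hidden_size)
        self.pos_emb = nn.Embedding(max_len, hidden_size)
        self.combine = nn.Linear(3 * hidden_size, hidden_size)  # forms h^{L+1}
        self.query = nn.Linear(hidden_size, hidden_size)        # Q projection
        self.key = nn.Linear(hidden_size, hidden_size)          # K projection

    def forward(self, h, tags):
        # h: [batch, seq_len, hidden] encoder states; tags: [batch, seq_len] tag ids
        positions = torch.arange(h.size(1), device=h.device).expand(h.size(0), -1)
        h1 = self.combine(torch.cat([h, self.tag_emb(tags), self.pos_emb(positions)], dim=-1))
        q, k = self.query(h1), self.key(h1)
        scores = q @ k.transpose(1, 2) / math.sqrt(k.size(-1))  # [batch, seq, seq]
        return scores.log_softmax(dim=-1)                       # log p(pi(i) = j)
```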
To realize the pointers, we use constrained beam search (Post and Vilar, 2018). As in Figure 3, we create the output by daisy-chaining pointers, starting with [CLS] and finding the most probable pointer path one token at a time. We ensure that no loops are formed by preventing a source token from being pointed to twice, and ensure that all source tokens not tagged with DELETE are pointed to. We note that when using argmax instead of beam search, loops form in fewer than 3% of cases.
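A simplified greedy realization is sketched below (the paper uses constrained beam search; this argmax variant, written only for illustration, enforces the same constraints so that loops cannot form):

```python
import numpy as np

def realize_pointers(scores: np.ndarray, keep_mask: np.ndarray) -> list:
    """scores[i, j]: pointer score for "token j follows token i"; keep_mask[j]:
    True for tokens not tagged DELETE. Position 0 is [CLS], the starting point.
    Returns the output order of source positions."""
    visited = {0}
    order, current = [], 0
    while True:
        allowed = keep_mask.copy()
        allowed[list(visited)] = False          # never point to a token twice
        if not allowed.any():                   # all kept tokens have been placed
            break
        current = int(np.where(allowed, scores[current], -np.inf).argmax())
        visited.add(current)
        order.append(current)
    return order
```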
Dataset construction. When constructing the training dataset, there are many possible combinations of π and y^t that could produce y. For instance, all source tokens could be replaced by MASK tokens. However, we wish to minimize the number of edits, and in particular the number of inserted tokens. To do so we greedily apply the following rules, iterating through the target tokens (a minimal sketch of the procedure follows the list): 1. If the target token appears within the source sentence, point to it and tag it with KEEP. If the target token appears multiple times in the source sentence, point to the occurrence nearest to the previously pointed-to source token.
2. If a source token is already pointed to, it cannot be pointed to again.
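A minimal sketch of this greedy alignment (our own reading of the two rules; the released preprocessing may differ in details such as tie-breaking):

```python
def align_target(source, target):
    """For each target token, either point to an unused source occurrence (KEEP)
    or mark it as a token the insertion model must produce (MASK)."""
    used, prev, plan = set(), 0, []
    for tok in target:
        candidates = [i for i, s in enumerate(source) if s == tok and i not in used]
        if candidates:
            # Rule 1: choose the occurrence nearest to the previously pointed-to token.
            i = min(candidates, key=lambda j: abs(j - prev))
            used.add(i)                 # Rule 2: a source token is pointed to at most once
            prev = i
            plan.append(("KEEP", i))
        else:
            plan.append(("MASK", tok))  # to be in-filled by the insertion model
    return plan

# align_target(["John", "and", "Mary"], ["Mary", "and", "John"])
# -> [("KEEP", 2), ("KEEP", 1), ("KEEP", 0)]   (pure reordering, no insertions)
```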

Insertion Model
The input to the insertion model, y^m, contains a subset of the input tokens in the order determined by the tagging model, as well as masked token spans that it needs to in-fill.
To represent masked token spans we consider two options: masking and infilling (see Fig. 2). In the former case the tagging model predicts how many tokens need to be inserted by specializing the INSERT tag into INS_k, which is realized as k MASK tokens.
For the infilling case the tagging model predicts a generic INS tag, which signals the insertion model to in-fill it with a span of tokens of arbitrary length. If we were to use an autoregressive insertion model, the natural way to model this would be to run the decoder until it produces a special stop symbol. Since by design we opted for a non-autoregressive model, we represent variable-length insertions by padding all insertions to a fixed-length sequence of MASK tokens using a PAD symbol.
Note that we preserve the deleted span in the input to the insertion model by enclosing it between [REPL] and [/REPL] tags. Even though this introduces an undesired discrepancy between the pretraining and fine-tuning data that the insertion model observes, we found that making the model aware of the text it needs to replace significantly boosts the accuracy of the insertion model.
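Putting these pieces together, the insertion-model input can be constructed roughly as follows (an illustrative sketch; the exact special-token conventions and the fixed insertion length are our assumptions):

```python
MAX_INSERT_LENGTH = 8   # assumed fixed span length for the infilling variant

def build_insertion_input(edits, infilling=False):
    """edits: (operation, value) pairs from the tagger, where operation is KEEP
    (value = token), DELETE (value = token), or INSERT (value = number of
    tokens to insert, used by the masking variant)."""
    y_m = []
    for op, value in edits:
        if op == "KEEP":
            y_m.append(value)
        elif op == "DELETE":
            y_m += ["[REPL]", value, "[/REPL]"]  # keep deleted text visible
        elif op == "INSERT":
            # Masking: exactly k MASK tokens; infilling: a fixed-length span,
            # whose unused positions the insertion model predicts as PAD.
            k = MAX_INSERT_LENGTH if infilling else value
            y_m += ["[MASK]"] * k
    return y_m

# build_insertion_input([("KEEP", "The"), ("INSERT", 1), ("KEEP", "cat"),
#                        ("DELETE", "dog"), ("KEEP", ".")])
# -> ['The', '[MASK]', 'cat', '[REPL]', 'dog', '[/REPL]', '.']
```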
FELIX as Insertion Transformer. Another intuitive way to picture how FELIX works is to draw a connection with the Insertion Transformer (Stern et al., 2019). In that model, the decoder starts with a blank output text (canvas) and iteratively in-fills it by deciding which token should appear in which position of the output. Multiple tokens can be inserted at a time, thus achieving sub-linear decoding times. In contrast, FELIX trains a separate tagger model to pre-fill the output canvas with the input tokens in a single step. As the second and final step, FELIX performs the insertion into the slots predicted by the tagger, which is equivalent to a single decoding step of the Insertion Transformer. Hence, FELIX requires significantly fewer (namely, two) decoding steps than the Insertion Transformer, and through the tagging/insertion decomposition of the task it can directly take advantage of existing pre-trained MLMs.
Similar to the tagger, our insertion model is also based on a 12-layer BERT-base and is initialized from a public pretrained checkpoint.
When using the masking approach, the insertion model is solving a masked language modeling task and, hence, we can directly take advantage of BERT-style pretrained checkpoints. This is a considerable advantage, especially in low-resource settings, as we do not waste training data on learning the language model component of the text-editing model. With the task decomposition, where tagging and insertion are trained disjointly, this component essentially comes for free.
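As a quick illustration of this point, an off-the-shelf pretrained MLM can already in-fill all masked slots of a y^m-style input in a single parallel pass (shown here with the Hugging Face transformers library purely for demonstration; this is not the authors' code or checkpoint):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

y_m = "The [MASK] [MASK] cat sat on the mat ."
inputs = tokenizer(y_m, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                      # [1, seq_len, vocab]
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = logits[mask_positions].argmax(dim=-1)    # one prediction per slot
print(tokenizer.decode(predicted_ids))                   # filled in parallel, no decoder loop
```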
Switching from masking to infilling shifts the complexity of modeling the length of inserted token spans from the tagging model to the insertion model. Depending on the amount of training data available it provides interesting trade-offs between the accuracy of the tagging and insertion models. We compare these approaches in Sec. 3.4; for all other tasks we use the masking approach.

Experiments
We evaluate FELIX on four distinct text editing tasks: Sentence Fusion, Text Simplification, Summarization, and Automatic Post-Editing for Machine Translation. In addition to reporting previously published results for each task, we also compare to a recent text-editing approach, LASERTAGGER (Malmi et al., 2019), which combines editing operations with a fixed vocabulary of additional phrases that can be inserted. We follow their setup, set the phrase vocabulary size to 500, and run all experiments using their most accurate autoregressive model. To decode a batch of 32 on an Nvidia Tesla P100, LASERTAGGER takes 1,300ms, FELIX takes 300ms, and a similarly sized seq2seq model takes 27,000ms.
For all tasks we run an ablation study, examining the effect of an open vocabulary with no reordering (FELIXINSERT) and a fixed vocabulary with a reordering model (FELIXPOINT).
Task analysis. The chosen tasks cover a diverse set of edit operations and a wide range of dataset sizes. Table 1 provides dataset statistics including the size, sentence length, and the translation error rate (TER) (Snover et al., 2006) between the source and target sentences. We use TER to highlight unique properties of each task. The summarization dataset is deletion-heavy, with the highest number of deletion edits and the largest reduction in sentence length. It contains moderate amounts of substitutions and a large number of shift edits, caused by sentence re-ordering. Both the simplification and post-editing datasets contain a large number of insertions and substitutions, while simplification contains a greater number of deletion edits. Post-editing, however, is a much larger dataset covering multiple languages. Sentence fusion has the lowest TER, indicating that obtaining the fused targets requires only a limited number of local edits. However, these edits require modeling the discourse relation between the two input sentences, since a common edit type is predicting the correct discourse connective (Geva et al., 2019).

Additionally, in Table 2 we provide coverage statistics (the percentage of training instances for which an editing model can fully reconstruct the output) and MASK percentages (the percentage of output tokens which the insertion model must predict). As both FELIX and FELIXINSERT use an open vocabulary, they cover 100% of the data, whereas FELIXPOINT and LASERTAGGER often cover less than half. For every dataset FELIXPOINT covers a significantly higher percentage than LASERTAGGER, the most notable case being summarization, where there is a 3x increase in coverage. This can be explained by the high number of shift edits within summarization (Table 1), something FELIXPOINT is explicitly designed to model. We found that the difference in coverage between FELIXPOINT and LASERTAGGER correlates strongly (correlation 0.99, p<0.001) with the number of shift edits. Comparing MASK percentages, we see that FELIX consistently inserts about 50% fewer MASKs than FELIXINSERT.

Summarization
Summarization is the task of shortening texts in a meaning-preserving way.
Data. We use the dataset from Toutanova et al. (2016), which contains 6,168 short input texts (one or two sentences) and one or more human-written summaries, resulting in 26,000 total training pairs. The human experts were not restricted to just deleting words when generating a summary, but were also allowed to insert new words and reorder parts of the sentence.
Metrics. We report SARI (Xu et al., 2016), which computes the average F1 scores of the added, kept, and deleted n-grams, as well as breaking it down into each component KEEP, DELETE, and ADD, as we found the scores were uneven across these metrics. We also include ROUGE-L and BLEU-4, as these metrics are commonly used in the summarization literature.
Results. In Table 3 we compare against LASERTAGGER and SEQ2SEQ BERT, a seq2seq model initialized with BERT. The results show that FELIX achieves the highest SARI, ROUGE and BLEU scores. Moreover, all ablated variants achieve higher SARI scores than the baseline models.

Simplification
Sentence simplification is the problem of rewriting sentences so that they are easier to understand. Simplification can be lexical, replacing or deleting complex words, or syntactic, replacing complex syntactic constructions.
Data. Training is performed on WikiLarge (Zhang and Lapata, 2017a), a large simplification corpus which consists of a mixture of three Wikipedia simplification datasets collected by Kauchak (2013), Woodsend and Lapata (2011), and Zhu et al. (2010). The test set was created by Xu et al. (2016) and consists of 359 source sentences taken from Wikipedia, which were then simplified by Amazon Mechanical Turk workers to create eight references per source sentence.
Metrics. We report SARI, the Flesch-Kincaid grade level (FKGL) readability metric, and the percentage of unchanged source sentences (copy).
Results. In Table 4, FELIX achieves the highest overall SARI score and the highest SARI-KEEP score. In addition, all ablated models achieve higher SARI scores than LASERTAGGER. While FELIXINSERT achieves a higher SARI score than EditNTS, FELIXPOINT does not; this can in part be explained by the large number of substitutions and insertions within this dataset, with FELIXPOINT achieving a low SARI-ADD score.

Post-Editing
Automatic Post-Editing (APE) is the task of automatically correcting common and repetitive errors found in machine translation (MT) outputs.
Data. APE approaches are trained on triples: the source sentence, the machine translation output, and the target translation. We experiment on the WMT17 EN-DE IT post-editing task (http://statmt.org/wmt17/ape-task.html), where the goal is to improve the output of an MT system that translates from English to German and is applied to documents from the IT domain. We follow the procedure introduced in Junczys-Dowmunt and Grundkiewicz (2016) and train our models using two synthetic corpora of 4M and 500K examples, merged with a corpus of 11K real examples oversampled 10 times. The models that we study expect a single input string. To obtain this, and to give the models the possibility to attend to the English source text, we append the source text to the German translation. Since the model input consists of two different languages, we use the multilingual cased BERT checkpoint for FELIX and LASERTAGGER.
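For concreteness, the single input string can be assembled as sketched below (the separator between the German MT output and the English source is our assumption; the paper only states that the source is appended to the translation):

```python
def build_ape_input(mt_output_de: str, source_en: str, sep: str = " [SEP] ") -> str:
    # German MT output first (the text to be post-edited), English source appended
    # so the model can attend to it; both are then encoded with multilingual BERT.
    return mt_output_de + sep + source_en

example = build_ape_input(
    "Klicken Sie auf die Schaltfläche OK .",   # hypothetical MT output
    "Click the OK button .",                   # corresponding English source
)
```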
Metrics. We follow the evaluation procedure of the WMT17 APE task and use TER as the primary metric and BLEU as a secondary metric.

Results. We consider the following baselines: COPY, which is a competitive baseline given that the required edits are typically very limited; LASERTAGGER (Malmi et al., 2019); LEVENSHTEIN TRANSFORMER (LEVT) (Gu et al., 2019), a partially autoregressive model that also employs deletion and insertion mechanisms; a standard TRANSFORMER as evaluated by Gu et al. (2019); and a state-of-the-art method tailored specifically to the APE task. Unlike the other methods, this last baseline encodes the source separately and conditions the MT output encoding on the source encoding. Results are shown in Table 5. First, we can see that using a task-specific method brings significant improvements over generic text transduction methods. Second, FELIX performs very competitively, yielding results comparable to LEVT, a partially autoregressive model, and outperforming the other generic models in terms of TER. Third, FELIXINSERT performs considerably worse than FELIX and FELIXPOINT, suggesting that the pointing mechanism is important for the APE task. This observation is further supported by Table 2, which shows that without the pointing mechanism the average proportion of masked tokens in a target is 42.39%, whereas with pointing it is only 17.30%. This suggests that removing the pointing mechanism shifts the responsibility too heavily from the tagging model to the insertion model.

Sentence Fusion
Sentence Fusion is the problem of fusing independent sentences into a coherent output sentence (or sentences).

Data. We use the DiscoFuse dataset (Geva et al., 2019), training both on the full dataset and on subsampled low-resource settings with as few as 450 datapoints.
Metrics. Following Geva et al. (2019), we report two metrics: Exact score, the percentage of exactly correct predicted fusions, and SARI.
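Exact score is straightforward to compute (a small sketch; whether any normalization is applied before comparison is our assumption):

```python
def exact_score(predictions, references):
    """Percentage of predicted fusions that exactly match the reference."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```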
Results. Table 6 includes additional BERT-based seq2seq baselines: SEQ2SEQ BERT and BERT2BERT from Rothe et al. (2020). For all FELIX variants we further break down the scores based on how insertion is modelled: via token masking (Mask) or infilling (Infill). Additionally, to better understand the contribution of the tagging and insertion models to the final accuracy, we report scores assuming oracle insertion and tagging predictions, respectively (highlighted rows).
The results show that FELIX and its variants significantly outperform the baselines LASERTAGGER and SEQ2SEQ BERT across all data conditions. Under the 100% condition, BERT2BERT achieves the highest SARI and Exact scores; however, for all other data conditions FELIX outperforms BERT2BERT. Both seq2seq models perform poorly with fewer than 4,500 (0.1%) datapoints, whereas all editing models achieve relatively good performance.
When comparing FELIX variants, we see that on the full dataset FELIXINSERT outperforms FELIX; however, we note that FELIXINSERT uses an additional sentence re-ordering tag, a hand-crafted feature tailored to DiscoFuse which swaps the sentence order, and including this tag resulted in a significant (6% Exact score) increase. In the low-resource setting, however, FELIX outperforms FELIXINSERT, suggesting that FELIX is more data-efficient than FELIXINSERT.
Ablation. We first contrast the impact of the insertion model and the tagging model, noticing that for all models Infill achieves better tagging scores and worse insertion scores than Mask. Secondly, FELIX achieves worse tagging scores but better insertion scores than FELIXINSERT. This highlights how the workload is shared between the two models: making the tagging task harder, for example by including reordering, makes the insertion task easier. Finally, the insertion models achieve impressive performance even under very low data conditions. This suggests that under low data conditions most of the workload should be shifted to the insertion model.

Related work
Seq2seq models (Sutskever et al., 2014) have been applied to many text generation tasks that can be cast as monolingual translation, but they suffer from well-known drawbacks (Wiseman et al., 2018): they require large amounts of training data, and their outputs are difficult to control. Whenever input and output sequences have a large overlap, it is reasonable to cast the problem as a text-editing task rather than full-fledged sequence-to-sequence generation. Ribeiro et al. (2018) argued that the general problem of string transduction can be reduced to sequence labeling; their approach applied only to character deletion and insertion and was based on simple patterns. LaserTagger (Malmi et al., 2019) is a general approach that has been shown to perform well on a number of text-editing tasks, but it has two limitations: it does not allow for arbitrary reordering of the input tokens, and insertions are restricted to a fixed phrase vocabulary derived from the training data. Similarly, EditNTS (Dong et al., 2019) and PIE (Awasthi et al., 2019) are two other text-editing models developed specifically for simplification and grammatical error correction, respectively.
Pointer networks have previously been proposed as a way to copy parts of the input in hybrid seq2seq models. Gulcehre et al. (2016), among others, trained a pointer network to specifically deal with out-of-vocabulary words or named entities. Chen and Bansal (2018) proposed a summarization model that first selects salient sentences and then rewrites them abstractively, using a pointer mechanism to directly copy some out-of-vocabulary words.
Previous approaches have proposed alternatives to autoregressive decoding (Gu et al., 2018; Lee et al., 2018; Wang and Cho, 2019). Instead of left-to-right autoregressive decoding, Insertion Transformer (Stern et al., 2019) and BLM (Shen et al., 2020) generate the output sequence through insertion operations, whereas LEVT (Gu et al., 2019) additionally incorporates a deletion operation. These methods produce the output iteratively, while FELIX requires only two steps: tagging and insertion.
The differences between the proposed model FELIX, its ablated variants, and a selection of related works are summarized in Table 7.

Conclusions and Future Work
We have introduced FELIX, a novel approach to text editing that decomposes the task into tagging and insertion, which are trained independently. This separation allows us to take maximum advantage of existing pretrained masked language models. FELIX works extremely well in low-resource settings, and it is fully non-autoregressive, which enables fast inference. Our empirical results demonstrate that it delivers highly competitive performance when compared to strong seq2seq baselines and other recent text-editing approaches.
Table 7: Model comparison along five dimensions: model type; whether the model is non-autoregressive (LEVT is partially autoregressive); whether it uses a pretrained checkpoint; whether it uses a word reordering mechanism (T5 uses a reordering pretraining task but does not have a copying mechanism); and whether it is able to generate any possible output (open vocab).

In the future we plan to investigate the following ideas: (i) how to effectively share representations between the tagging and insertion models using a single shared encoder, (ii) how to perform joint training of the insertion and tagging models instead of training them separately, (iii) strategies for unsupervised pre-training of the tagging model, which appears to be the bottleneck in highly low-resource settings, and (iv) distillation recipes.