Encode, Tag, Realize: High-Precision Text Editing

We propose LaserTagger - a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.


Introduction
Neural sequence-to-sequence (seq2seq) models provide a powerful framework for learning to translate source texts into target texts. Since their first application to machine translation (MT) (Sutskever et al., 2014) they have become the de facto approach for virtually every text generation task, including summarization (Tan et al., 2017), image captioning (Xu et al., 2015), text style transfer (Rao and Tetreault, 2018; Nikolov and Hahnloser, 2018; Jin et al., 2019), and grammatical error correction (Chollampatt and Ng, 2018; Grundkiewicz et al., 2019).
We observe that in some text generation tasks, such as the recently introduced sentence splitting and sentence fusion tasks, output texts highly overlap with inputs. In this setting, learning a seq2seq model to generate the output text from scratch seems intuitively wasteful. Copy mechanisms (Gu et al., 2016; See et al., 2017) allow for choosing between copying source tokens and generating arbitrary tokens, but although such hybrid models help with out-of-vocabulary words, they still require large training sets as they depend on output vocabularies as large as those used by the standard seq2seq approaches. In contrast, we propose learning a text editing model that applies a set of edit operations on the input sequence to reconstruct the output. We show that it is often enough to use a relatively small set of output tags representing text deletion, rephrasing and word reordering to be able to reproduce a large percentage of the targets in the training data. This results in a learning problem with a much smaller vocabulary size, and the output length fixed to the number of words in the source text. This, in turn, greatly reduces the number of training examples required to train accurate models, which is particularly important in applications where only a small amount of human-labeled data is available.
Our tagging approach, LASERTAGGER, consists of three steps (Fig. 1): (i) Encode builds a representation of the input sequence, (ii) Tag assigns edit tags from a pre-computed output vocabulary to the input tokens, and (iii) Realize applies a simple set of rules to convert tags into the output text tokens.
An experimental evaluation of LASERTAGGER on four different text generation tasks shows that it yields comparable results to seq2seq models when we have tens of thousands of training examples and clearly outperforms them when the number of examples is smaller. Our contributions are the following:
1) We demonstrate that many text generation tasks with overlapping inputs and outputs can be effectively treated as text editing tasks.
2) We propose LASERTAGGER, a sequence tagging-based model for text editing, together with a method for generating the tag vocabulary from the training data.
3) We describe two versions of the tagging model: (i) LASERTAGGER FF, a tagger based on BERT (Devlin et al., 2019), and (ii) LASERTAGGER AR, a novel tagging model combining the BERT encoder with an autoregressive Transformer decoder, which further improves the results over the BERT tagger.
4) We evaluate LASERTAGGER against strong seq2seq baseline models based on the BERT architecture. Our baseline models outperform previously reported state-of-the-art results on two tasks.
5) We demonstrate that a) LASERTAGGER AR achieves state-of-the-art or comparable results on 3 out of 4 examined tasks, b) LASERTAGGER FF is up to 100x faster at inference time with performance comparable to the state-of-the-art seq2seq models. Furthermore, both models: c) require much less training data compared to the seq2seq models, d) are more controllable and interpretable than seq2seq models due to the small vocabulary of edit operations, e) are less prone to typical seq2seq model errors, such as hallucination.
The code will be available at: lasertagger.page.link/code

Related Work
Recent work discusses some of the difficulties of learning neural decoders for text generation (Wiseman et al., 2018; Prabhakaran et al., 2018). Conventional seq2seq approaches require large amounts of training data and are hard to control and to constrain to desirable outputs. At the same time, many NLP tasks that appear to be full-fledged text generation tasks are natural testbeds for simpler methods. In this section we briefly review some of these tasks.
Text Simplification is a paraphrasing task that is known to benefit from modeling edit operations. A simple instance of this type are sentence compression systems that apply a drop operation at the token/phrase level (Filippova and Strube, 2008; Filippova et al., 2015), while more intricate systems also apply splitting, reordering, and lexical substitution (Zhu et al., 2010). Simplification has also been attempted with systems developed for phrase-based MT (Xu et al., 2016a), as well as with neural encoder-decoder models (Zhang and Lapata, 2017).
Independent of this work, Dong et al. (2019) recently proposed a text-editing model, similar to ours, for text simplification. The main differences to our work are: (i) They introduce an interpreter module which acts as a language model for the so-far-realized text, and (ii) they generate added tokens one-by-one from a full vocabulary rather than from an optimized set of frequently added phrases. The latter allows their model to generate more diverse output, but it may negatively affect the inference time, precision, and the data efficiency of their model. Another recent model similar to ours is the Levenshtein Transformer (Gu et al., 2019), which does text editing by performing a sequence of deletion and insertion actions.
Single-document summarization is a task that requires systems to shorten texts in a meaning-preserving way. It has been approached with deletion-based methods on the token level (Filippova et al., 2015) and the sentence level (Narayan et al., 2018; Liu, 2019). Other papers have used neural encoder-decoder methods (Tan et al., 2017; Rush et al., 2015; Paulus et al., 2017) to do abstractive summarization, which allows edits beyond mere deletion. This can be motivated by the work of Jing and McKeown (2000), who identified a small number of fundamental high-level editing operations that are useful for producing summaries (reduction, combination, syntactic transformation, lexical paraphrasing, generalization/specification, and reordering). See et al. (2017) extended a neural encoder-decoder model with a copy mechanism to allow the model to more easily reproduce input tokens during generation.
Out of available summarization datasets (Dernoncourt et al., 2018), we find the one by Toutanova et al. (2016) particularly interesting because (1) it specifically targets abstractive summarization systems, (2) the lengths of texts in this dataset (short paragraphs) seem well-suited for text editing, and (3) an analysis showed that the dataset covers many different summarization operations.
In Grammatical Error Correction (Ng et al., 2013, 2014), a system is presented with input texts, usually written by a language learner, and is tasked with detecting and fixing grammatical (and other) mistakes. Approaches to this task often incorporate task-specific knowledge, e.g., by designing classifiers for specific error types (Knight and Chander, 1994; Rozovskaya et al., 2014) that can be trained without manually labeled data, or by adapting statistical machine-translation methods (Junczys-Dowmunt and Grundkiewicz, 2014). Methods for the sub-problem of error detection are similar in spirit to sentence compression systems, in that they are implemented as word-based neural sequence labelers (Rei, 2017). Neural encoder-decoder methods are also commonly applied to the error correction task (Ge et al., 2018; Chollampatt and Ng, 2018), but suffer from a lack of training data, which is why task-specific tricks need to be applied (Kasewa et al., 2018; Junczys-Dowmunt et al., 2018).

Text Editing as a Tagging Problem
Our approach to text editing is to cast it into a tagging problem. Here we describe its main components: (1) the tagging operations, (2) how to convert plain-text training targets into a tagging format, as well as (3) the realization step to convert tags into the final output text.

Tagging Operations
Our tagger assigns a tag to each input token. A tag is composed of two parts: a base tag and an added phrase. The base tag is either KEEP or DELETE, which indicates whether to retain the token in the output. The added phrase P, which can be empty, enforces that P is added before the corresponding token. P belongs to a vocabulary V that defines a set of words and phrases that can be inserted into the input sequence to transform it into the output.
The combination of the base tag B and the added phrase P is treated as a single tag and denoted by $^{P}B$. The total number of unique tags is equal to the number of base tags times the size of the phrase vocabulary, hence there are ≈ 2|V| unique tags.
Additional task-specific tags can be employed too. For sentence fusion (Section 5.1), the input consists of two sentences, which sometimes need to be swapped. Therefore, we introduce a custom tag, SWAP, which can only be applied to the last period of the first sentence (see Fig. 2). This tag instructs the Realize step to swap the order of the input sentences before realizing the rest of the tags.
For other tasks, different supplementary tags may be useful. E.g., to allow for replacing entity mentions with the appropriate pronouns, we could introduce a PRONOMINALIZE tag. Given access to a knowledge base that includes entity gender information, we could then look up the correct pronoun during the realization step, instead of having to rely on the model predicting the correct tag ($^{\text{she}}$DELETE, $^{\text{he}}$DELETE, $^{\text{they}}$DELETE, etc.).
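As an illustration of the tag inventory described in this section, the following minimal sketch enumerates the tags for a given phrase vocabulary. The `Tag` type and the `build_tag_vocabulary` helper are hypothetical names introduced here, not part of any released implementation, and task-specific tags such as SWAP are left out. Because the empty phrase is counted as well, the sketch yields 2(|V| + 1) tags, which is consistent with the ≈ 2|V| estimate.

```python
from typing import List, NamedTuple


class Tag(NamedTuple):
    """Assumed representation of an edit tag: a base tag plus an added phrase."""
    base: str          # "KEEP" or "DELETE"
    phrase: str = ""   # phrase added before the token; empty means nothing is added


def build_tag_vocabulary(phrases: List[str]) -> List[Tag]:
    """Enumerate every combination of base tag and (possibly empty) added phrase."""
    tags = []
    for base in ("KEEP", "DELETE"):
        tags.append(Tag(base))                           # base tag without a phrase
        tags.extend(Tag(base, phrase) for phrase in phrases)
    return tags
```

For a vocabulary of 500 phrases this gives 1,002 tags, which is the size of the output layer the tagger would need under these assumptions.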

Optimizing Phrase Vocabulary
The phrase vocabulary consists of phrases that can be added between the source words. On the one hand, we wish to minimize the number of phrases to keep the output tag vocabulary small. On the other hand, we would like to maximize the percentage of target texts that can be reconstructed from the source using the available tagging operations. This leads to the following combinatorial optimization problem.

Problem 1 Given a collection of phrase sets
This problem is closely related to the minimum k-union problem, which is NP-hard (Vinterbo, 2002). The latter problem asks for a set of k phrase sets such that the cardinality of their union is the minimum. If we were able to solve Problem 1 in polynomial time, we could also solve the minimum k-union problem in polynomial time simply by finding the smallest phrase vocabulary size such that the number of covered phrase sets is at least k. This reduction from the minimum k-union problem gives us the following result: Problem 1 is NP-hard.
To identify candidate phrases to be included in the vocabulary, we first align each source text s from the training data with its target text t. This is achieved by computing the longest common subsequence (LCS) between the two word sequences, which can be done using dynamic programming in time O(|s| × |t|). The n-grams in the target text that are not part of the LCS are the phrases that would need to be included in the phrase vocabulary to be able to construct t from s.
In practice, the phrase vocabulary is expected to consist of phrases that are frequently added to the target. Thus we adopt the following simple approach to construct the phrase vocabulary: sort the phrases by the number of phrase sets in which they occur and pick the most frequent phrases. This was found to produce meaningful phrase vocabularies based on manual inspection as shown in Section 5. E.g., the top phrases for sentence fusion include many discourse connectives.
We also considered a greedy approach that constructs the vocabulary one phrase at a time, always selecting the phrase that has the largest incremental coverage. This approach is not, however, ideal for our use case, since some frequent phrases, such as "(" and ")", are strongly coupled. Selecting "(" alone has close to zero incremental coverage, but together with ")", they can cover many examples.
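The frequency-based construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: texts are tokenized by whitespace, an added phrase is taken to be a maximal run of target tokens outside one particular LCS, and the function names (`lcs_target_mask`, `added_phrases`, `build_vocabulary`) are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def lcs_target_mask(source: List[str], target: List[str]) -> List[bool]:
    """Mark which target tokens belong to one LCS of source and target."""
    n, m = len(source), len(target)
    # dp[i][j] holds the LCS length of source[i:] and target[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    mask = [False] * m
    i = j = 0
    while i < n and j < m:  # walk the table to recover one alignment
        if source[i] == target[j]:
            mask[j] = True
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return mask


def added_phrases(source: List[str], target: List[str]) -> Set[str]:
    """Collect maximal runs of target tokens that are not part of the LCS."""
    phrases, current = set(), []
    for token, in_lcs in zip(target, lcs_target_mask(source, target)):
        if in_lcs:
            if current:
                phrases.add(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.add(" ".join(current))
    return phrases


def build_vocabulary(pairs: Iterable[Tuple[str, str]], size: int) -> List[str]:
    """Rank phrases by the number of phrase sets containing them; keep the top ones."""
    counts = Counter()
    for source, target in pairs:
        counts.update(added_phrases(source.split(), target.split()))
    return [phrase for phrase, _ in counts.most_common(size)]
```

Counting each phrase once per example, rather than once per occurrence, mirrors the "number of phrase sets in which they occur" criterion.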

Converting Training Targets into Tags
Once the phrase vocabulary is determined, we can convert the target texts in our training data into tag sequences. Given the phrase vocabulary, we do not need to compute the LCS, but can leverage a more efficient approach, which iterates over words in the input and greedily attempts to match them (1) against the words in the target, and in case there is no match, (2) against the phrases in the vocabulary V. This can be done in $O(|s| \times n_p)$ time, where $n_p$ is the length of the longest phrase in V, as shown in Algorithm 1.
Training targets that would require adding a phrase that is not in our vocabulary V are not converted into tag sequences but are filtered out. While this makes the training dataset smaller, it may also effectively filter out low-quality targets. The percentage of converted examples for different datasets is reported later in Section 5. Note that even when the target cannot be reconstructed from the inputs using our output tag vocabulary, our approach might still produce reasonable outputs with the available phrases. E.g., a target may require the use of the infrequent ";" token, which is not in our vocabulary, but a model could instead choose to predict a more common "," token.
Algorithm 1 Converting a target string to tags.
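The greedy conversion described in the text can be sketched as follows. This is one possible reading of the prose description, not a reproduction of Algorithm 1: the `(base, phrase)` tuple format and the name `convert_to_tags` are assumptions, phrases are only attached to kept tokens, and a target is rejected (returning `None`) whenever it cannot be fully consumed.

```python
from typing import List, Optional, Set, Tuple

Tag = Tuple[str, str]  # (base tag, added phrase); an assumed representation


def convert_to_tags(source: List[str], target: List[str],
                    vocabulary: Set[str], max_phrase_len: int) -> Optional[List[Tag]]:
    """Greedily map a (source, target) pair to one tag per source token."""
    tags: List[Tag] = [("DELETE", "")] * len(source)
    t = 0  # index of the next target token that still has to be produced
    for s, token in enumerate(source):
        if t >= len(target):
            break  # target fully produced; remaining source tokens stay DELETE
        if token == target[t]:
            tags[s] = ("KEEP", "")
            t += 1
            continue
        # Try to add target[t:t+k] before this source token and then keep the token.
        for k in range(1, max_phrase_len + 1):
            if t + k >= len(target):
                break
            phrase = " ".join(target[t:t + k])
            if target[t + k] == token and phrase in vocabulary:
                tags[s] = ("KEEP", phrase)
                t += k + 1
                break
    # The pair is filtered out if some target tokens could not be produced.
    return tags if t == len(target) else None
```

The inner loop runs at most `max_phrase_len` times per source token, which matches the stated $O(|s| \times n_p)$ bound up to the cost of joining the phrase. For the source "Turing was born in 1912 . Turing died in 1954 ." and the target "Turing was born in 1912 and died in 1954 .", the sketch deletes the first period and the second "Turing" and attaches the phrase "and" to "died".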

Realization
After obtaining a predicted tag sequence, we convert it to text ("realization" step). While classic works on text generation make a distinction between planning and realization, end-to-end neural approaches typically ignore this distinction, with the exception of a few works (Moryossef et al., 2019; Puduppully et al., 2018). For the basic tagging operations of keeping, deleting, and adding, realization is a straightforward process. Additionally, we adjust capitalization at sentence boundaries. Realization becomes more involved if we introduce special tags, such as PRONOMINALIZE mentioned in Section 3.1. For this tag, we would need to look up the gender of the tagged entity from a knowledge base. Having a separate realization step is beneficial, since we can decide to pronominalize only when confident about the appropriate pronoun and can, otherwise, leave the entity mention untouched.
Another advantage of having a separate realization step is that specific loss patterns can be addressed by adding specialized realization rules. For instance, one could have a rule that when applying tag $^{\text{his}}$DELETE to an entity mention followed by 's, the realizer must always DELETE the possessive 's regardless of its predicted tag.
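For the basic operations, realization amounts to a single pass over the tagged input. The sketch below assumes tags are `(base, phrase)` tuples and uses a hypothetical function name, `realize`; capitalization adjustment and special tags such as SWAP or PRONOMINALIZE are deliberately omitted.

```python
from typing import List, Tuple

Tag = Tuple[str, str]  # (base tag, added phrase); an assumed representation


def realize(source: List[str], tags: List[Tag]) -> str:
    """Apply KEEP/DELETE tags and added phrases to the source tokens."""
    output: List[str] = []
    for token, (base, phrase) in zip(source, tags):
        if phrase:
            output.extend(phrase.split())  # the phrase is added before the token
        if base == "KEEP":
            output.append(token)
    return " ".join(output)
```

Specialized rules such as the possessive example above could be added as a post-processing pass over `tags` before this function is called.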

Tagging Model Architecture
Our tagger is composed of two components: an encoder, which generates activation vectors for each element in the input sequence, and a decoder, which converts encoder activations into tag labels.
Encoder. We choose the BERT Transformer model (Devlin et al., 2019) as our encoder, as it demonstrated state-of-the-art results on a number of sentence encoding tasks. We use the BERT-base architecture, which consists of 12 self-attention layers. We refer the reader to Devlin et al. (2019) for a detailed description of the model architecture and its input representation. We initialize the encoder with a publicly available checkpoint of the pretrained case-sensitive BERT-base model (github.com/google-research/bert).
Decoder. In the original BERT paper a simple decoding mechanism is used for sequence tagging: the output tags are generated in a single feed-forward pass by applying an argmax over the encoder logits. In this way, each output tag is predicted independently, without modelling the dependencies between the tags in the sequence. Such a simple decoder demonstrated state-of-the-art results on the Named Entity Recognition task, when applied on top of the BERT encoder.
To better model the dependencies between the output tag labels, we propose a more powerful autoregressive decoder. Specifically, we run a single-layer Transformer decoder on top of the BERT encoder (see Fig. 3). At each step, the decoder consumes the embedding of the previously predicted label and the activations from the encoder.
There are several ways in which the decoder can communicate with the encoder: (i) through a full attention over the sequence of encoder activations (similar to conventional seq2seq architectures); and (ii) by directly consuming the encoder activation at the current step. In our preliminary experiments, we found the latter option to perform better and converge faster, as it does not require learning additional encoder-decoder attention weights.
We experiment with both decoder variants (feedforward and autoregressive) and find that the autoregressive decoder outperforms the previously used feedforward decoder. In the rest of this paper, the tagging model with an autoregressive decoder is referred to as LASERTAGGER AR and the model with feedforward decoder as LASERTAGGER FF .
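The difference between the two decoding schemes can be illustrated with a toy sketch. It only shows the dependency structure and uses no neural network: `score_fn` is a hypothetical stand-in for the single-layer Transformer decoder, taking the encoder activation at the current step and the previously predicted label, and all names here are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first one in case of ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode_feedforward(logits: List[List[float]]) -> List[int]:
    """Label every position independently from its own logits."""
    return [argmax(row) for row in logits]


def decode_autoregressive(
    encoder_activations: List[List[float]],
    score_fn: Callable[[List[float], int], List[float]],
    start_label: int = 0,
) -> List[int]:
    """Greedy left-to-right decoding conditioned on the previous label."""
    labels: List[int] = []
    previous = start_label
    for activation in encoder_activations:
        # Only the activation at the current step is consumed (no full attention).
        previous = argmax(score_fn(activation, previous))
        labels.append(previous)
    return labels
```

With a `score_fn` that rewards repeating the previous label, the autoregressive variant can output a different sequence than the per-position argmax on the same inputs, which is the kind of tag dependency the feedforward decoder cannot express.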

Experiments
We evaluate our method by conducting experiments on four different text editing tasks: Sentence Fusion, Split and Rephrase, Abstractive Summarization, and Grammatical Error Correction.
Baselines. In addition to reporting previously published results for each task, we also train a set of strong baselines based on the Transformer, where both the encoder and decoder replicate the BERT-base architecture (Devlin et al., 2019). To have a fair comparison, similar to how we initialize a tagger encoder with a pretrained BERT checkpoint, we use the same initialization for the Transformer encoder. This produces a very strong seq2seq baseline (SEQ2SEQ BERT), which already results in new state-of-the-art metrics on two out of four tasks.

Sentence Fusion
Sentence Fusion is the problem of fusing sentences into a single coherent sentence.
Data. We use the "balanced Wikipedia" portion of the dataset of Geva et al. (2019), referred to as DfWiki below. We report two metrics: Exact score, which is the percentage of exactly correctly predicted fusions, and SARI (Xu et al., 2016b), which computes the average F1 scores of the added, kept, and deleted n-grams.
Vocabulary Size. To understand the impact of the number of phrases we include in the vocabulary, we trained models for different vocabulary sizes (only LASERTAGGER AR). The results are shown in Figure 4. After increasing the vocabulary size to 500 phrases, Exact score reaches a plateau, so we set the vocabulary size to 500 in all the remaining experiments of this paper. The Gold curve in Fig. 4 shows that this vocabulary size is sufficient to cover 85% of the training examples, which gives us an upper bound for the Exact score.
Comparison against Baselines. Table 2 lists the results for the DfWiki dataset. We obtain new SOTA results with LASERTAGGER AR, outperforming the previous SOTA 7-layer Transformer model from Geva et al. (2019) by 2.7% Exact score and 1.0% SARI score. We also find that the pretrained SEQ2SEQ BERT model yields nearly as good performance, demonstrating the effectiveness of unsupervised pretraining for generation tasks. The performance of the tagger is impaired significantly when leaving out the SWAP tag due to the model's inability to reconstruct 10.5% of the training set.
Impact of Dataset Size. We also study the effect of the training data size by creating four increasingly smaller subsets of DfWiki (see Fig. 5a). When the data size drops to 450 or 4,500 examples, LASERTAGGER still performs surprisingly well, clearly outperforming the SEQ2SEQ BERT baseline.

Split and Rephrase
The reverse task of sentence fusion is the split-and-rephrase task, which requires rewriting a long sentence into two or more coherent short sentences.
Data. We use the WikiSplit dataset (Botha et al., 2018), which consists of 1M human-editor created examples of sentence splits, and follow the dataset split suggested by the authors. Using the phrase vocabulary of size 500 yields a 31% coverage of the targets from the training set (top phrases shown in Table 1). The lower coverage compared to DfWiki suggests a higher amount of noise (due to Wikipedia-author edits unrelated to splitting).
Results. ... et al., 2017). The results are shown in Table 3. SEQ2SEQ BERT and LASERTAGGER AR perform similarly, and both outperform the seq2seq model with a copying mechanism from Botha et al. (2018).
We again studied the impact of training-data size by subsampling the training set (see Figure 5b). Similar to the previous experiment, the LASERTAGGER methods degrade more gracefully when the training-data size is reduced, and start to outperform the seq2seq baseline below circa 10k examples. The smallest training set for LASERTAGGER AR contains merely 29 examples. Remarkably, the model is still able to learn something useful that generalizes to unseen test examples, reaching a SARI score of 53.6% and predicting 5.2% of the targets exactly correctly. The following is an example prediction by the model: Here the model has picked the right comma to replace with a period and a sentence separator.

Abstractive Summarization
The task of summarization is to reduce the length of a text while preserving its meaning.
Dataset. We use the dataset from Toutanova et al. (2016), which contains 6,168 short input texts (one or two sentences) and one or more human-written summaries. The human experts were not restricted to just deleting words when generating a summary, but were allowed to also insert new words and reorder parts of the sentence, which makes this dataset particularly suited for abstractive summarization models.
We set the size of the phrase vocabulary to 500, as for the other tasks, and extract the phrases from the training partition. With a size of 500, we are able to cover 89% of the training data.
In addition to the metrics from the previous sections, we report ROUGE-L (Lin, 2004), as this is a metric that is commonly used in the summarization literature. ROUGE-L is a recall-oriented measure computed from the longest common subsequence between a reference summary and a candidate summary.
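To make the LCS-based definition concrete, the following sketch computes only the recall component, i.e., the LCS length divided by the reference length. It is a simplified illustration with whitespace tokenization and hypothetical function names; the full ROUGE-L metric of Lin (2004) also involves precision and an F-measure, and this is not the scoring script used in the experiments.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one row of the DP table."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens covered by the LCS with the candidate."""
    ref, cand = reference.split(), candidate.split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)
```

Because the LCS respects word order but allows gaps, a candidate that deletes words from the reference still scores highly, while a reordered candidate is penalized.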
Results. Table 4 compares our taggers against seq2seq baselines and systems from the literature. Filippova et al. (2015) and Clarke and Lapata (2008) proposed deletion-based approaches; the former uses a seq2seq network, the latter formulates summarization as an optimization problem that is solved via integer-linear programming. Cohn and Lapata (2008) proposed an early approach to abstractive summarization via a parse-tree transducer. Rush et al. (2015) developed a neural seq2seq model for abstractive summarization.
In line with the results on the subsampled fusion/splitting datasets (Figure 5), the tagger significantly outperforms all baselines. This shows that even though a text-editing approach is not well-suited for extreme summarization examples (a complete paraphrase with zero lexical overlap), in practice, even a limited paraphrasing capability is enough to reach good empirical performance.
Note that the low absolute values for the Exact metric are expected, since there is a very large number of acceptable summaries.

Grammatical Error Correction (GEC)
GEC requires systems to identify and fix grammatical errors in a given input text.
Data. We use a recent benchmark from a shared task of the 2019 Building Educational Applications workshop, specifically from the Low Resource track (Bryant et al., 2019). The publicly available set has 4,384 ill-formed sentences together with gold error corrections, which we split 9:1 into a training and validation partition. We again create the phrase vocabulary from the 500 most frequently added phrases in the training partition, which gives us a coverage of 40% of the training data.
Evaluation Metrics and Results. We report precision and recall, and the task's main metric $F_{0.5}$, which gives more weight to the precision of the corrections than to their recall. Table 5 compares our taggers against two baselines. Again, the tagging approach clearly outperforms the BERT-based seq2seq model, here by being more than seven times as accurate in the prediction of corrections. This can be attributed to the seq2seq model's much richer generation capacity, which the model cannot properly tune to the task at hand given the small amount of training data. The tagging approach, on the other hand, is naturally suited to this kind of problem.
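The weighting of precision over recall follows from the standard F-beta formula, F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), with beta = 0.5. The sketch below is a generic implementation of that formula with a hypothetical function name; it is not the shared task's scorer, which also has to extract and match the edits.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall; beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

With beta = 0.5, a system with precision 0.8 and recall 0.4 scores about 0.67, whereas swapping the two values gives about 0.44, so over-correcting is penalized more heavily than missing an error.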
We also report the best-performing method by Grundkiewicz et al. (2019)

Inference time
Getting state-of-the-art results often requires using larger and more complex models. When running a model in production, one cares not only about the accuracy but also the inference time. Table 6 reports latency numbers for LASERTAGGER models and our most accurate seq2seq baseline. As one can see, the SEQ2SEQ BERT baseline is impractical to run in production even for the smallest batch size. On the other hand, for a batch size of 8, LASERTAGGER AR is already 10x faster than the comparable-in-accuracy SEQ2SEQ BERT baseline. This difference is due to the former model using a 1-layer decoder (instead of 12 layers) and no encoder-decoder cross attention. We also tried training SEQ2SEQ BERT with a 1-layer decoder, but it performed very poorly in terms of accuracy. Finally, LASERTAGGER FF is more than 100x faster while being only a few accuracy points below our best reported results.

Qualitative evaluation
To assess the qualitative difference between the outputs of LASERTAGGER and SEQ2SEQ BERT, we analyzed the texts generated by the models on the test sets of the four tasks. We inspected the respective worst predictions from each model according to BLEU and identified seven main error patterns, two of which are specific to the seq2seq model, and one of which is specific to LASERTAGGER.
This illustrates that LASERTAGGER is less prone to errors compared to the standard seq2seq approach, due to the restricted flexibility of its model. Certain types of errors, namely imaginary words and repeated phrases, are virtually impossible for the tagger to make. The likelihood of others, such as hallucination and abrupt sentence ending, is at least greatly reduced.
In Table 7, we list the error classes and refer to Appendix A for more details on our observations.

Hallucinations (LASERTAGGER: less affected; SEQ2SEQ BERT: affected)
In: Tobacco smokers may also experience . . .
Out: anthropology smokers may also experience . . .

Coreference issues (LASERTAGGER: affected; SEQ2SEQ BERT: affected)
In: She is the daughter of Alistair Crane . . . who secretly built . . .
Out: She is the daughter of Alistair Crane . . . She secretly built . . .

Misleading rephrasing (LASERTAGGER: affected; SEQ2SEQ BERT: affected)
In: . . . postal service was in no way responsible . . .
Out: . . . postal service was responsible . . .

Lazy sentence splitting (LASERTAGGER: affected; SEQ2SEQ BERT: not affected)
In: Home world of the Marglotta located in the Sagittarius Arm.
Out: Home world of the Marglotta . Located in the Sagittarius Arm.

Conclusions
We proposed a text-editing approach to text-generation tasks with high overlap between input and output texts. Compared to the seq2seq models typically applied in this setting, our approach results in a simpler sequence-tagging problem with a much smaller output tag vocabulary. We demonstrated that this approach has comparable performance when trained on medium-to-large datasets, and clearly outperforms a strong seq2seq baseline when the number of training examples is limited. Qualitative analysis of the model outputs suggests that our tagging approach is less affected by the common errors of the seq2seq models, such as hallucination and abrupt sentence ending. We further demonstrated that tagging can speed up inference by more than two orders of magnitude, making it more attractive for production applications.
Limitations. Arbitrary word reordering is not feasible with our approach, although limited reordering can be achieved with deletion and insertion operations, as well as custom tags, such as SWAP (see Section 3.1). To enable more flexible reordering, it might be possible to apply techniques developed for phrase-based machine translation. Another limitation is that our approach may not be straightforward to apply to languages that are morphologically richer than English, where a more sophisticated realizer might be needed to adjust, e.g., the cases of the words.
In future work, we would like to experiment with more light-weight tagging architectures (Andor et al., 2016) to better understand the trade-off between inference time and model accuracy.