Learning To Split and Rephrase From Wikipedia Edit History

Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia’s edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.


Introduction
A complex sentence can typically be rewritten into multiple simpler ones that together retain the same meaning. Performing this split-and-rephrase task is one of the main operations in text simplification, alongside paraphrasing and dropping less salient content (Siddharthan, 2006; Zhu et al., 2010; Woodsend and Lapata, 2011, i.a.). The area of automatic text simplification has received a lot of attention (Siddharthan, 2014; Shardlow, 2014), yet still holds many open challenges (Xu et al., 2015). Splitting sentences in this way could also benefit systems where predictive quality degrades with sentence length, as observed in, e.g., relation extraction (Zhang et al., 2017) and translation (Koehn and Knowles, 2017). And the schema-free nature of the task may allow for future supervision in the form of crowd-sourced rather than expensive expert annotation (He et al., 2015).
Narayan et al. (2017) introduce the WebSplit corpus for the split-and-rephrase task and report results for several models on it. Aharoni and Goldberg (2018) improve WebSplit by reducing overlap in the data splits, and demonstrate that neural encoder-decoder models (Bahdanau et al., 2014) perform poorly, even when enhanced with a copy mechanism (Gu et al., 2016; See et al., 2017).
A classic leaf symptom is water-soaked lesions between the veins which appear as angular leaf-spots where the lesion edge and vein meet.
A classic leaf symptom is the appearance of angular, water-soaked lesions between the veins. The angular appearance results where the lesion edge and vein meet.
Figure 1: A split-and-rephrase example extracted from a Wikipedia edit, where the top sentence had been edited into two new sentences by removing some words (yellow) and adding others (blue).
* Both authors contributed equally.
One limitation of the WebSplit examples themselves is that they contain fairly unnatural linguistic expression using a small vocabulary. We introduce new training data mined from Wikipedia edit histories that have some noise, but which have a rich and varied vocabulary over naturally expressed sentences and their extracted splits. Figure 1 gives an example of how a Wikipedia editor rewrote a single sentence into two simpler ones. We create WikiSplit, a set of one million such examples mined from English Wikipedia, and show that models trained with this resource produce dramatically better output for split and rephrase.
Our primary contributions are:
• A scalable, language-agnostic method for extracting split-and-rephrase rewrites from Wikipedia edits.
• Public release of the English WikiSplit dataset, containing one million rewrites: http://goo.gl/language/wiki-split
• By incorporating WikiSplit into training, we more than double (30.5 to 62.4) the BLEU score obtained on WebSplit by Aharoni and Goldberg (2018).

Correct
Complex: Street Rod is the first in a series of two games released for the PC and Commodore 64 in 1989.
Split: Street Rod is the first in a series of two games. It was released for the PC and Commodore 64 in 1989.
Complex: He played all 60 minutes in the game and rushed for 114 yards, more yardage than all the Four Horsemen combined.
Split: He played all 60 minutes in the game. He rushed for 114 yards, more yardage than all the Four Horsemen combined.

Unsupported
Complex: When the police see Torco's injuries, they send Ace to a clinic to be euthanized, but he escapes and the clinic worker covers up his incompetence.
Split: When the police see Torco's injuries to his neck, they believe it is a result of Ace biting him. They send Ace to a clinic to be euthanized, but he escapes and the clinic worker covers up his incompetence.

Missing
Complex: The avenue was extended to Gyldenløvesgade by Copenhagen Municipality in 1927-28 and its name was changed to Rosenørns Allé after Ernst Emil Rosenørn (1810-1894).
Split: The avenue was extended to Gyldenløvesgade by Copenhagen Municipality in 1927-28. The street was named after Ernst Emil Rosenørn (1810-1894).

Table 1: Examples of correct and noisy sentence splits extracted from Wikipedia edits. Noise from unsupported or missing statements is visualized with the same coloring as in Figure 1.

The WikiSplit Corpus
WebSplit provides a basis for measuring progress on splitting and rephrasing sentences. However, its small size, inherent repetitiveness, and synthetic nature limit its broader applicability. In particular, we see it as a viable benchmark for evaluating models, but not for training them. To that end, we introduce the WikiSplit corpus and detail its construction next.

Mining Wikipedia Edits
Wikipedia maintains snapshots of entire documents at different timestamps, which makes it possible to reconstruct edit histories for documents. This has been exploited for many NLP tasks, including sentence compression (Yamangil and Nelken, 2008), text simplification (Yatskar et al., 2010; Woodsend and Lapata, 2011; Tonelli et al., 2016) and modeling semantic edit intentions (Yang et al., 2017).
To construct the WikiSplit corpus, we identify edits that involve sentences being split. A list of sentences for each snapshot is obtained by stripping HTML tags and Wikipedia markup and running a sentence break detector (Gillick, 2009). Temporally adjacent snapshots of a Wikipedia page are then compared to check for sentences that have undergone a split like that shown in Figure 1. We search for splits in both temporal directions.
Given all candidate examples extracted this way, we use a high-precision heuristic to retain only high quality splits. To extract a full sentence C and its candidate split into S = (S_1, S_2), we require that C and S_1 have the same trigram prefix, C and S_2 have the same trigram suffix, and S_1 and S_2 have different trigram suffixes. To filter out misaligned pairs, we use BLEU scores (Papineni et al., 2002) to ensure similarity between the original and the split versions. Specifically, we discard pairs where BLEU(C, S_1) or BLEU(C, S_2) is less than δ (an empirically chosen threshold). If multiple candidates remain for a given sentence C, we retain argmax_S (BLEU(C, S_1) + BLEU(C, S_2)).
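As a rough illustration of the filtering steps above, the following Python sketch applies the trigram prefix/suffix conditions, the BLEU threshold, and the final argmax selection. It is not the extraction code used to build the corpus. The `bleu` function is a simplified, add-one-smoothed sentence-level stand-in for the real metric, and the whitespace tokenization, the choice of C as the reference side, and the helper names `is_candidate` and `best_split` are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU in (0, 1] with add-one smoothing.

    Assumption: this stands in for the BLEU variant used in the paper.
    """
    if not hypothesis:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        total = sum(hyp.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty in log space: zero when the hypothesis is long enough.
    brevity = min(0.0, 1.0 - len(reference) / len(hypothesis))
    return math.exp(brevity + log_precision)


def is_candidate(c, s1, s2):
    """Check the trigram prefix/suffix conditions on token lists."""
    if min(len(c), len(s1), len(s2)) < 3:
        return False
    return c[:3] == s1[:3] and c[-3:] == s2[-3:] and s1[-3:] != s2[-3:]


def best_split(c, candidates, delta=0.2):
    """Return the (s1, s2) pair maximizing BLEU(c, s1) + BLEU(c, s2), or None."""
    best, best_score = None, float("-inf")
    for s1, s2 in candidates:
        if not is_candidate(c, s1, s2):
            continue
        b1, b2 = bleu(c, s1), bleu(c, s2)
        if b1 < delta or b2 < delta:
            continue  # discard misaligned pairs
        if b1 + b2 > best_score:
            best, best_score = (s1, s2), b1 + b2
    return best
```

Here `candidates` would hold every (S_1, S_2) pair proposed for one sentence C when comparing two adjacent snapshots; the default `delta=0.2` mirrors the threshold discussed in the quality assessment, although the scores of this simplified metric are not directly comparable to those of the original setup.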

Corpus Statistics and Quality
Our extraction heuristic is imperfect, so we manually assess corpus quality using the same categorization schema proposed by Aharoni and Goldberg (2018); see Table 1 for examples of correct, unsupported and missing sentences in splits extracted from Wikipedia. We do this for 100 randomly selected examples using three different thresholds of δ. As shown in Table 2, δ=0.2 provides the best trade-off between quality and size. Out of the 100 complex sentences in the sample, only 4 contained information that was not completely covered by the simple sentences. In our corpus, every complex sentence is split into two simpler sentences, so the sample contains 200 simple sentences. Out of these we found 168 (84%) to be correct, while 35 (18%) contained unsupported facts. Thus, for the overall sample of 100 split-and-rephrase examples, 68% are perfect while 32% contain some noise (either unsupported facts or missing information). We stress that our main goal is to use data extracted this way as training data and accept that its use for evaluation is an imperfect signal with some inherent noise and bias (by construction).
Table 3: Training corpus statistics in terms of complex sentences (C), simple sentences (S = ∪_i S_i) and tokens (t, appearing across unique complex sentences). WikiSplit provides much greater diversity and scale.
After extraction and filtering, we obtain over one million examples of sentence splits from around 18 million English documents. We randomly reserved 5000 examples each for tuning, validation and testing, producing 989,944 unique complex training sentences, compared to the 16,938 of WebSplit (cf. Table 3).

Comparison to WebSplit
Narayan et al. (2017) derived the WebSplit corpus by matching up sentences in the WebNLG corpus according to partitions of their underlying meaning representations (RDF triples). The WebNLG corpus itself was created by having crowd workers write sentential realizations of one or more RDF triples. The resulting language is often unnatural, for example, "Akeem Dent once played for the Houston Texans team which is based in Houston in Texas." Repetition arises because the same sentence fragment may appear in many different examples. This is to be expected given that WebSplit's small vocabulary of 7k words must account for the 344k tokens that make up the distinct complex sentences themselves. This is compounded in that each sentence contains a named entity by construction. In contrast, our large new WikiSplit dataset offers more natural and diverse text (see examples in Table 1), having a vocabulary of 633k items covering the 33m tokens in its distinct complex sentences.
The task represented by our WikiSplit dataset is a priori both harder and easier than that of the WebSplit dataset: harder because of the greater diversity and sparsity, but potentially easier due to the uniform use of a single split.
Of the two datasets, WebSplit is better suited for evaluation: its construction method guarantees cleaner data than is achieved by our extraction heuristic, and it provides multiple reference decompositions for each complex sentence, which tends to improve the correlation of automatic metrics with human judgment in related text generation tasks (Toutanova et al., 2016).

Experiments
In order to understand how WikiSplit can inform the split-and-rephrase task, we vary the composition of the training set when training a fixed model architecture. We compare three training configurations: WEBSPLIT only, WIKISPLIT only, and BOTH, which is simply their concatenation.
Text-to-text training instances are defined as all the unique pairs of (C, S), where C is a complex sentence and S is its simplification into multiple simple sentences (Aharoni and Goldberg, 2018). For training, we delimit the simple sentences with a special symbol. We depart from the prior work by only using a subset of the WebSplit training set: we take a fixed sub-sample such that each distinct C is paired with a single S, randomly selected from the multiple possibilities in the dataset. This scheme produced superior performance in preliminary experiments.
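The instance construction described above can be sketched in a few lines of Python. This is an illustrative sketch only: the delimiter string `SEP`, the function name `build_instances`, and the fixed random seed are assumptions, since the text names neither the special symbol nor the sampling procedure.

```python
import random

SEP = "<::::>"  # hypothetical delimiter; the text only says "a special symbol"


def build_instances(pairs, one_per_source=False, seed=0):
    """Build (source, target) text pairs from (C, [S_1, S_2, ...]) examples.

    With one_per_source=True, each distinct complex sentence C keeps a single
    randomly chosen simplification, as in the WebSplit sub-sampling scheme.
    """
    # Keep only unique (C, S) pairs; sorting makes the sampling reproducible.
    unique = sorted({(c, tuple(s)) for c, s in pairs})
    if one_per_source:
        rng = random.Random(seed)
        by_source = {}
        for c, s in unique:
            by_source.setdefault(c, []).append(s)
        unique = [(c, rng.choice(options)) for c, options in by_source.items()]
    # Join the simple sentences into one target string using the delimiter.
    return [(c, f" {SEP} ".join(s)) for c, s in unique]
```

Under these assumptions, the BOTH configuration would amount to concatenating `build_instances(websplit_pairs, one_per_source=True)` with `build_instances(wikisplit_pairs)`, where both argument names are placeholders for the loaded corpora.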
As a quality measure, we report multi-reference corpus-level BLEU (Papineni et al., 2002), but include sentence-level BLEU (sBLEU) for direct comparison to past work. We also report length-based statistics to quantify splitting.
We use the same sequence-to-sequence architecture that produced the top result for Aharoni and Goldberg (2018), "Copy512", which is a one-layer, bi-directional LSTM (cell size 512) with attention (Bahdanau et al., 2014) and a copying mechanism (See et al., 2017) that dynamically interpolates the standard word distribution with a distribution over the words in the input sentence. Training details are as described in the Appendix of Aharoni and Goldberg (2018) using the OpenNMT-py framework (Klein et al., 2017).
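To make the interpolation performed by the copying mechanism concrete, here is a minimal Python sketch of a single decoding step in the style of See et al. (2017). It is not the OpenNMT-py implementation: the function name `copy_interpolate` is hypothetical, and the mixing weight `p_gen` is passed in as a plain number, whereas in the actual model it is predicted by the network at every step.

```python
def copy_interpolate(p_vocab, attention, source_tokens, p_gen):
    """Mix a generation distribution with a copy distribution.

    p_vocab:       dict mapping vocabulary words to probabilities (sums to 1)
    attention:     attention weights over source positions (sums to 1)
    source_tokens: the input sentence tokens, aligned with `attention`
    p_gen:         weight in [0, 1] placed on generating from the vocabulary
    """
    # Scale the standard word distribution by the generation weight.
    mixed = {word: p_gen * prob for word, prob in p_vocab.items()}
    # Add copy mass: each source token receives its share of the attention,
    # which also lets out-of-vocabulary input words be produced.
    for token, weight in zip(source_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed
```

Because both inputs are probability distributions and the weights `p_gen` and `1 - p_gen` sum to one, the result is again a distribution, now covering the vocabulary plus any unseen words in the input sentence.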

Results
We compare to the SOURCE baseline, which is the previously reported method of taking the unmodified input sentence as prediction, and we add SPLITHALF, the natural baseline of deterministically splitting a complex sentence into two equal-length token sequences and appending a period to the first one. Table 4 compares our three training configurations on the validation sets of both WebSplit and WikiSplit. The WEBSPLIT model scores 35.3 BLEU on the WebSplit validation set but fails to generalize beyond its narrow domain, as evidenced by reaching only 4.2 BLEU on the WikiSplit validation set.
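Both deterministic baselines are simple enough to write down directly. The sketch below is one plausible reading of the description; the function names and the handling of odd-length inputs (the first half gets the smaller share) are assumptions.

```python
def source_baseline(tokens):
    """SOURCE: predict the unmodified input sentence."""
    return [list(tokens)]


def split_half(tokens):
    """SPLITHALF: cut the token sequence in the middle and end the first
    half with a period. Returns the two resulting simple sentences."""
    mid = len(tokens) // 2  # assumption: floor division for odd lengths
    first = list(tokens[:mid]) + ["."]
    second = list(tokens[mid:])
    return [first, second]
```

For example, `split_half("He played all 60 minutes .".split())` yields the two token lists `["He", "played", "all", "."]` and `["60", "minutes", "."]`.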
The example predictions in Table 7 illustrate how this model tends to drop content ("Alfred Warden", "mouth", "Hamburg"), hallucinate common elements from its training set ("food", "ingredient", "publisher") and generally fails to produce coherent sentences.
Table 5: Results on the WebSplit v1.0 test set when varying the training data while holding model architecture fixed: corpus-level BLEU, sentence-level BLEU (to match past work), simple sentences per complex sentence, and tokens per simple sentence (micro-average). AG18 is the previous best model by Aharoni and Goldberg (2018), which used the full WebSplit training set, whereas we downsampled it.
In contrast, the WIKISPLIT model achieves 59.4 BLEU on the WebSplit validation set, without observing any in-domain data. It also outperforms the two deterministic baselines on both validation sets by a non-trivial BLEU margin. This indicates that the WikiSplit training data enable better generalization than when using WebSplit by itself. Reintroducing the downsampled, in-domain training data (BOTH) further improves performance on the WebSplit evaluation.
These gains in BLEU from using WikiSplit carry over to the blind manual evaluation we performed on a random sample of model predictions on the WebSplit validation set. As shown in Table 6, the BOTH model produced the most accurate output (95% correct simple sentences), with the lowest incidence of missed or unsupported statements. Our manual evaluation includes the corresponding outputs from Aharoni and Goldberg (2018).
The examples in Table 7 demonstrate that the WIKISPLIT and BOTH models produce much more coherent output which faithfully rephrases the input. In Example 1, the combined model (BOTH) produces three fluent sentences, overcoming the strong bias toward two-sentence output inherent in the majority of its training examples.
We relate our approach to prior work on WebSplit v1.0 by reporting scores on its test set in Table 5. Our best performance in BLEU is again obtained by combining the proposed WikiSplit dataset with the downsampled WebSplit, yielding a 32 point improvement over the prior best result.
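The two length-based statistics reported next to BLEU (simple sentences per complex sentence, and tokens per simple sentence as a micro-average) can be computed as in the following sketch. The input format, a list of tokenized simple sentences per complex input, and the function name `length_stats` are assumptions made for illustration.

```python
def length_stats(predictions):
    """Compute splitting statistics over a set of predictions.

    predictions: one entry per complex sentence, each a list of simple
                 sentences, each simple sentence a list of tokens.
    Returns (simple sentences per complex sentence,
             tokens per simple sentence, micro-averaged).
    """
    n_complex = len(predictions)
    n_simple = sum(len(simple_sentences) for simple_sentences in predictions)
    n_tokens = sum(len(sentence)
                   for simple_sentences in predictions
                   for sentence in simple_sentences)
    # Micro-average: pool all tokens and all simple sentences before dividing.
    return n_simple / n_complex, n_tokens / n_simple
```

Micro-averaging pools counts over the whole evaluation set, so long outputs weigh more than they would if per-example averages were averaged again.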

Input: He is a short , dark grayish -green monster with a beak -like mouth and a cap made out of the upper part of a creature 's skull .
WEBSPLIT: He is a food found in grayish . grayish monster is an ingredient in grayish .
WIKISPLIT: He is a short , dark grayish -green monster with a beak -like mouth . A cap made out of the upper part of a creature 's skull .
BOTH: He is a short , dark grayish -green monster with a beak -like mouth . His cap is made out of the upper part of a creature 's skull .

Input: Der beglückte Florindo is an opera composed by Handel in Hamburg in 1708 as part of a double opera , the other part being Die verwandelte Daphne .
WEBSPLIT: Handel opera is the publisher of the opera opera . Handel is the capital of 1708 .
WIKISPLIT: Der beglückte Florindo is an opera composed by Handel in Hamburg in 1708 . It was part of a double opera , the other part being Die verwandelte Daphne .
BOTH: Der beglückte Florindo is an opera composed by Handel in Hamburg in 1708 as part of a double opera . The other part being Die verwandelte Daphne .

Table 7: [...] Aharoni and Goldberg (2018), while the other outputs are from our models trained on the corresponding data.

Conclusion and Outlook
Our results demonstrate a large, positive impact on the split-and-rephrase task when training on large, diverse data that contains some noise. This suggests that future improvements may come from finding other such sources of data as much as from modeling. The new WikiSplit dataset is intended as training data, but for further progress on the split-and-rephrase task, we ideally need evaluation data also derived from naturally occurring sentences, and an evaluation metric that is more sensitive to the particularities of the task.

Acknowledgments
Thanks go to Kristina Toutanova and the anonymous reviewers for helpful feedback on an earlier draft, and to Roee Aharoni for supplying his system's outputs.