Small but Mighty: New Benchmarks for Split and Rephrase

Split and Rephrase is a text simplification task of rewriting a complex sentence into simpler ones. As a relatively new task, it is paramount to ensure the soundness of its evaluation benchmark and metric. We find that the widely used benchmark dataset is pervaded by easily exploitable syntactic cues caused by its automatic generation process. Taking advantage of such cues, we show that even a simple rule-based model can perform on par with the state-of-the-art model. To remedy such limitations, we collect and release two crowdsourced benchmark datasets. We not only make sure that they contain significantly more diverse syntax, but also carefully control for their quality according to a well-defined set of criteria. While no satisfactory automatic metric exists, we apply fine-grained manual evaluation based on these criteria using crowdsourcing, showing that our datasets better represent the task and are significantly more challenging for the models.


Introduction
Split and Rephrase is the task of rewriting a presumably long and complex sentence into shorter and simpler sentences, while maintaining the same meaning. For example, one possible way to split the sentence "Voiced by Aoi Koga, Kaguya is the series' titular character, popular among a wide audience." would result in "Kaguya is voiced by Aoi Koga. Kaguya is the series' titular character. Kaguya is popular among a wide audience." While the split sentences have to be coherent, paraphrasing is not enforced: for example, the word "titular" does not have to be replaced. This type of text simplification is challenging, as its natural language generation process potentially involves multiple sub-processes such as co-reference resolution, named-entity recognition, semantic role labelling, etc. Split and Rephrase has two main real-world uses: first, to benefit systems whose performance improves as sentence length decreases, e.g. entity extraction (Zhang et al., 2017) and machine translation (Koehn and Knowles, 2017), by acting as a pre-processing step; second, to benefit human readers, especially those less proficient with the language, in understanding complex documents such as terms and agreements more easily and accurately (Inui et al., 2003; Siddharthan, 2002).

* Work done during internship at IBM Research.
† Work done during employment at IBM Research.

Datasets of the Split and Rephrase task contain pairs of a complex sentence and a presumably meaning-preserving simplified rewrite consisting of multiple simpler sentences. The task was introduced with the release of the WebSplit corpus. Afterwards, Aharoni and Goldberg (2018) proposed the state-of-the-art model to date, a sequence-to-sequence model (Bahdanau et al., 2015) with a copy mechanism (Gu et al., 2016; See et al., 2017), motivated by the observation that most text is left unchanged during a Split and Rephrase operation. Later, Botha et al. (2018) introduced the WikiSplit corpus to be used as large but noisy training data, which the authors reported to be unsuitable as evaluation data. Also, Sulem et al. (2018) studied the problems of using BLEU as the evaluation metric for this task, while proposing a manually constructed test set called HSplit.
We argue that the widely used benchmark dataset of Split and Rephrase, the WebSplit test set (referred to simply as WebSplit below), is not suitable for evaluation. Apart from the series of limitations already reported, such as a small vocabulary and unnatural expressions (Botha et al., 2018), we further show that its complex sentences systematically follow only 3 syntactical patterns marked by lexical cues (§ 2). To demonstrate the implications of these limitations, we show that a simple, unsupervised rule-based model with only 3 corresponding operations can perform even slightly better than the state-of-the-art neural model (§ 3).
To remedy the limitations of WebSplit, we crowdsource two new benchmarks with significantly more diverse syntax in the Wikipedia and legal contract domains, with hundreds of human-written complex-simple sentence pairs (§ 4). We carefully control for their quality based on 6 well-defined criteria of what constitutes a good Split and Rephrase rewrite. While most related work reports model performance using the widely criticized BLEU score and manual evaluation with no clear rubric, we perform fine-grained model evaluation using these 6 criteria, rated by crowd workers, showing that our benchmarks present models with greater challenges (§ 5). 1

Issues with WebSplit
WebSplit and WikiSplit are two widely used datasets for the Split and Rephrase task. Because WikiSplit is derived from the edit history of Wikipedia, pairs of passage versions are not necessarily produced by Split and Rephrase operations, as the meaning may not be preserved during edits. Hence, WikiSplit is reported by its authors to be noisy and ill-suited as evaluation data for this task (Botha et al., 2018).
WebSplit is used in multiple previous works as the evaluation benchmark. It was created by automatically matching sentences in the WebNLG corpus according to partitions of their meaning representations. The dataset has been shown to have various limitations, such as unnatural expressions and repetition of phrases (Botha et al., 2018).
Furthermore, our preliminary study shows that WebSplit contains several recurring syntactic patterns marked by lexical cues. To demonstrate this, we randomly sample 100 complex sentences from the test set, and are able to categorize all of them with only 3 syntactical patterns marked by lexical cues (underlined), at which some almost trivial Split and Rephrase operations can take place, e.g. relative clause (rc) (48 out of 100): "Scott Adsit voiced Baymax which was created by Duncan Rouleau." It can further be noticed that most complex sentences in WebSplit are short and require only one Split and Rephrase operation. We next show that a rule-based model which only exploits these patterns can perform on par with the state-of-the-art neural model on WebSplit.
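To illustrate how mechanically detectable such lexical cues are, consider a toy cue spotter. The cue lists below are our own illustrative guesses (the annotation in this study was done manually), so this is a sketch, not the scheme actually used:

```python
import re

# Hypothetical lexical cues marking two of the recurring patterns.
# Illustrative only; the study's categorization was manual.
CUES = {
    "rc": re.compile(r"\b(which|who|whose)\b"),  # relative clause
    "conj": re.compile(r"\band\b"),              # coordination
}

def spot_cues(sentence):
    """Return the names of all cue patterns found in the sentence."""
    s = sentence.lower()
    return [name for name, pat in CUES.items() if pat.search(s)]
```

For example, `spot_cues("Scott Adsit voiced Baymax which was created by Duncan Rouleau.")` flags the relative-clause cue, matching the pattern above.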

Rule-Based Model
We design a simple rule-based model to exploit the syntactic cues widely present in WebSplit.

Algorithm
The rule-based model requires no training data and only uses semantic role labeling (He et al., 2017) and dependency parsing (Dozat and Manning, 2016), running on AllenNLP. Given a complex sentence, the model makes 3 kinds of splits when applicable. First, using semantic role labeling, the model identifies a Relational Argument and makes a split with the Relational Argument replaced by the Subject Argument. Second, the model looks for the word "and" and makes a split accordingly. Third, using dependency parsing, the model looks for a modifier node; the clause rooted at that node is extracted, prepended with the subject, and split off as a new simple sentence, while the rest of the original complex sentence becomes another simple sentence. 2
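The conjunction-handling step can be sketched as follows. This is a toy reimplementation over hand-supplied SRL-style BIO tags; the function and the tag layout are our own assumptions, and a real system would obtain the tags from a trained labeller such as the AllenNLP SRL model:

```python
def split_on_and(tokens, tags):
    """Split a complex sentence at 'and' using SRL-style BIO tags.

    tokens: list of word strings; tags: parallel BIO tags such as
    'B-ARG0', 'B-V', 'O'. Returns a list of token lists (the simple
    sentences), or the original sentence if no applicable 'and' exists.
    """
    for i, tok in enumerate(tokens):
        if tok != "and" or i + 1 >= len(tokens):
            continue
        nxt = tags[i + 1]
        if nxt.startswith("B-ARG"):
            # 'and' joins two full clauses: simply cut the sentence.
            return [tokens[:i], tokens[i + 1:]]
        if nxt == "B-V":
            # 'and' joins two verb phrases sharing a subject: copy the
            # ARG0 span preceding the verb in place of 'and'.
            subj = [t for t, g in zip(tokens[:i], tags[:i])
                    if g in ("B-ARG0", "I-ARG0")]
            if subj:
                return [tokens[:i], subj + tokens[i + 1:]]
    return [tokens]
```

For instance, "Kaguya studies hard and excels in class" (with "excels" tagged as a verb) splits into "Kaguya studies hard" and "Kaguya excels in class", the subject being copied over "and".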

Performance
The rule-based model and the state-of-the-art seq2seq model trained on WikiSplit (Aharoni and Goldberg, 2018; Botha et al., 2018) are evaluated using BLEU (Papineni et al., 2002) on WebSplit. The rule-based model achieves a BLEU of 61.3, outperforming the neural model, which achieves a BLEU of 56.0. The two models are also evaluated manually on 100 randomly sampled examples, with an identical accuracy of 64% (the criterion of correctness is described in § 4.2.2). While the rule-based model is imperfect and could likely improve with more and better-defined rules, it serves as a strong baseline that exploits the syntactical cues in WebSplit and potentially other benchmarks generated in a similar fashion. The strong performance of such a simplistic model highlights the need for more difficult and diverse benchmark data to better capture the complexity of the Split and Rephrase task.

New Benchmark Datasets
Considering the limitations of WebSplit, an ideal benchmark must not only be challenging, with diverse patterns, but also ensure that the rewrites are strictly meaning-preserving Split and Rephrase operations. With these two goals, we collect two benchmark datasets, Wiki Benchmark (Wiki-BM) from Wikipedia and Contracts Benchmark (Cont-BM) from legal documents, to be used as gold standards for the evaluation of Split and Rephrase. To systematically control for quality, we define 6 criteria of what constitutes a good Split and Rephrase rewrite, and validate the collected rewrites based on these criteria.

Collecting Complex Sentences
First, we gather complex sentences as the input for the Split and Rephrase operation.

Wiki Benchmark (Wiki-BM)
While the simplified rewrites in the WikiSplit dataset are not guaranteed to be meaning-preserving and thus cannot be used in a benchmark, the original complex sentences are semantically and syntactically diverse, with adequate complexity. From the 5,000 complex sentences in the WikiSplit test set, we randomly select (for budget reasons) 500 that contain only alphanumerical characters, whitespace, commas, and periods, and manually inspect them to ensure that they are well-formed.
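The character filter can be expressed as a one-line regular expression. This is a sketch of the restriction described above (alphanumerics, whitespace, commas, periods); the exact filter used may differ:

```python
import re

# Accept only sentences composed of alphanumeric characters,
# whitespace, commas, and periods (our sketch of the filter above).
ALLOWED = re.compile(r"^[A-Za-z0-9\s,.]+$")

def is_clean(sentence):
    """True if the sentence passes the character filter."""
    return bool(ALLOWED.match(sentence))
```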

Contracts Benchmark (Cont-BM)
We collect sentences from publicly available legal procurement contracts online, as well as from contract templates within IBM containing no confidential information. We randomly sample and inspect 500 sentences in the same manner as above.

Syntactical Diversity
To demonstrate that our complex sentences are syntactically diverse and not plagued by the patterns analyzed before, we randomly sample 100 complex sentences from each benchmark and annotate them with syntactical patterns. In addition to the 3 patterns outlined before, we define several new patterns, e.g. infinitive clause (inf): "Nimfa was forced to take part of a devilish plan to fool the Saavedra family."
The counts from the manual annotation are shown in Table 1. Wiki-BM exhibits more diverse patterns, and more patterns per complex sentence, than WebSplit, while Cont-BM exhibits the most. This gradation of complexity across the 3 benchmarks is beneficial for evaluation.

Collecting Simplified Rewrites
We ask one set of crowd workers to Split and Rephrase the gathered complex sentences on Amazon Mechanical Turk, and another set to ensure their quality. 3 We divide the crowdsourcing workflow into two phases.

Phase 1: Rewrite
For each complex sentence, we ask 3 crowd workers to rewrite it by splitting and rephrasing, with the option to flag the complex sentence as too simple or too problematic to split, which we later discard. We require Master Qualification, and pay $0.2 per HIT for the complex sentences from Wiki-BM and $0.4 per HIT for the more challenging Cont-BM. This Phase costs $1,125 in total.

Phase 2: Rate
For each crowdsourced rewrite submitted in Phase 1, we ask 2 different crowd workers to evaluate its quality according to the 6 criteria. In each benchmark, we now have 500 complex sentences, each with 3 rewrites, each with 2 ratings. For each rating, if the worker answers 5 for the first two criteria and chooses "no" for the last four criteria, we denote the rating as correct. For each rewrite, if both of its ratings are correct, we denote the rewrite as perfect. To ensure the high quality of the gold standard, we keep only the perfect rewrites as the gold standard for their complex sentences in our benchmarks.
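The aggregation rule above can be stated precisely in a few lines. This sketch assumes each rating is stored as a pair of lists (the two 5-point answers and the four yes/no answers), which is our own hypothetical data layout:

```python
def rating_is_correct(scale_answers, yesno_answers):
    """A rating is 'correct' iff both 5-point criteria get the top
    score 5 and all four yes/no criteria are answered 'no'."""
    return (all(a == 5 for a in scale_answers)
            and all(a == "no" for a in yesno_answers))

def keep_perfect(rewrites):
    """Keep only rewrites whose 2 ratings are both correct ('perfect')."""
    return [rw for rw in rewrites
            if all(rating_is_correct(*r) for r in rw["ratings"])]
```

Only rewrites surviving `keep_perfect` enter the benchmark as gold standard.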

Descriptive Statistics
Some descriptive statistics and the comparison with WebSplit are shown in Table 2. While our new benchmarks are smaller than WebSplit, we argue that a small number of human-written, high quality ground-truth simple rewrites are better suited for evaluation than a larger number of automatically generated, noisy ones.
While similar to HSplit (Sulem et al., 2018), our benchmarks add several features, such as much more complex sentences from the legal domain, a clear set of rubrics for evaluation, and crowdsourced human judgements at scale.

4 The pay exceeds the prorated US minimum wage.

Model Performance
Previous work reports model performance on this task using two metrics: BLEU on the entire benchmark, and manual ratings on a small subset. However, BLEU has long been shown to have little correlation with human judgements in text simplification 5 (Sulem et al., 2018). While other alternatives exist, the focus of our work is not metrics, but rather the quality and difficulty of benchmarks, which is best illustrated by human evaluation. Previously, manual evaluation has been done without a well-established rubric for what makes a Split and Rephrase rewrite correct. To address these problems, we use crowdsourcing following the process of Phase 2, asking 3 crowd workers to rate model outputs based on the 6 fine-grained criteria described above. Table 3 shows the average crowd ratings and BLEU score for each combination of a model and a benchmark. We consider the state-of-the-art seq2seq model trained on WikiSplit (Botha et al., 2018) and our rule-based model. To measure human performance, we use all rewrites from Phase 1, including those not included in our benchmarks.
Both the rule-based and seq2seq models leave large room for improvement, as they significantly underperform crowd workers on almost all criteria, with significantly lower performance on our proposed benchmarks than on WebSplit. Even for crowd workers, the percentage of overall-correct rewrites is less than 70% on our new benchmarks, whose complex sentences are much more challenging to Split and Rephrase.

Reliability of the Crowd
Can we use crowdsourcing to evaluate models no less reliably than experts or authors, as done in previous work? As experts on this task, we manually rate a subset of model-output rewrites as the ground truth, and compare it against the crowd's ratings. Since there are 3 benchmarks and 3 models (including human, whose outputs are the crowd rewrites we collected for Wiki-BM and Cont-BM, but not WebSplit), there are 8 combinations in total. From the crowd ratings of these combinations, we assign each complex-output pair to one of 4 buckets, determined by the number of correct ratings among its 3 crowd ratings. For each bucket, we sample 2 complex-output pairs, so 8 × 4 × 2 = 64 complex-output pairs are sampled in total. The expert rates them independently following the same 6 criteria as the crowd workers, giving the proportion of the expert's correct ratings within each bucket.
These statistics allow us to fit a beta distribution for the expert rating conditional on each crowd rating bucket, using Laplace prior smoothing. The results are shown in Figure 1. Each distribution corresponds to a bucket with 0, 1, 2, or 3 out of 3 correct crowd ratings. For example, the right-most curve represents the probability density function for the bucket where all 3 crowd raters agree on a correct rating. According to the figure, with 90% one-sided confidence: when all 3 crowd raters rate a rewrite as correct, the expert also rates it correct in more than roughly 80% of the samples; when none of the 3 crowd raters rates a rewrite as correct, the expert rates it correct in less than roughly 10% of the samples.
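The smoothed fit can be reproduced as follows. With a Laplace (Beta(1, 1)) prior, observing k expert-correct ratings out of n samples in a bucket yields the posterior Beta(k + 1, n - k + 1). The counts in the usage line are hypothetical, and in practice one would use scipy.stats.beta.ppf rather than the toy quantile routine here:

```python
import math

def beta_posterior(k, n):
    """Laplace-smoothed Beta posterior for k 'correct' expert ratings
    out of n samples in a bucket: Beta(k + 1, n - k + 1)."""
    return k + 1, n - k + 1

def beta_quantile(a, b, q, steps=100000):
    """Numerically invert the Beta(a, b) CDF by trapezoidal integration.
    Assumes a, b >= 1 (true for any Laplace-smoothed posterior); good
    enough for illustration, use scipy.stats.beta.ppf in practice."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    cdf, prev = 0.0, (0.0 if a > 1 else 1.0 / norm)  # pdf at x = 0
    for i in range(1, steps + 1):
        x = i / steps
        pdf = x ** (a - 1) * (1 - x) ** (b - 1) / norm
        cdf += (pdf + prev) / (2 * steps)
        prev = pdf
        if cdf >= q:
            return x
    return 1.0

a, b = beta_posterior(7, 8)        # hypothetical: 7 of 8 expert-correct
lower = beta_quantile(a, b, 0.10)  # 90% one-sided lower bound
```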
This shows that crowdsourcing can be a reliable way to evaluate models for this task, with variable reliability depending on the number of raters per sample and their agreement.

Conclusion
After showing the flaws of the current benchmark in Split and Rephrase, we release two crowdsourced benchmarks containing significantly more diverse syntax. Using fine-grained crowdsourcing evaluation on 6 well-defined criteria, we show that they provide a greater challenge to models.

A Algorithm of the Rule-Based Model
Given a complex sentence, the model runs the following processes once each.

Wh Handling: Using semantic role labeling, the model looks for a Relational Argument (R-ARG) and the Subject Argument (asserted to be the ARG preceding the R-ARG). A split is then made with the Relational Argument replaced by the Subject Argument.

Conjunction Handling: The model looks for the word "and". Using semantic role labeling, if the word following "and" is an argument (ARG), the model asserts that "and" is followed by a full sentence, and a split is made there. Otherwise, if the word following "and" is a verb (V), the model asserts the Subject Argument to be the ARG preceding the V, and a split is made with "and" replaced by the Subject Argument.

Insertion Handling: Using dependency parsing, the model looks for a node of type participle modifier, relative clause modifier, prepositional modifier, adjective modifier, or appositional modifier. The clause with that node as its root is extracted, prepended with the subject, and split off as a new simple sentence.

Table 4: Spearman's correlation between sentence-level BLEU and human judgement on 6 criteria by combinations of benchmarks and models. †: the correlation coefficient is not statistically significant with α = .05.
B Rating Questions

The rating questions cover, among others, whether the Rewritten text is sensical and understandable, whether it is split at the wrong place or unnecessarily, and (question 6) whether the Rewritten text has one or more sentences that should be further split. Each question is accompanied by a positive and a negative example, the same as in the previous section. The crowd workers answer the first two questions by dragging a bar between "Strongly Disagree" and "Strongly Agree", and the last four questions by choosing "yes/no" radio boxes.

C Correlation Between BLEU and Crowd Workers
Does BLEU correlate with human judgement at scale? To answer this, we collect crowdsourced ratings of model outputs. With 3 benchmark datasets (WebSplit, Wiki-BM, and Cont-BM) and two models (seq2seq and rule-based), we sample 100 complex-sentence and output-rewrite pairs from each combination, resulting in 600 in total. 6 We then run the same crowdsourcing project as Phase 2 (Sec. 5.2.2) with these 600 pairs, collecting ratings from 3 crowd raters for each. The crowd raters rate based on the same 6 criteria as before (Sec. 3.1). As defined before, if a rating includes 5 for the first two criteria and "no" for the other four, it is considered correct. The Spearman's correlation coefficients between sentence-level BLEU and crowd ratings for each of the 6 criteria are shown in Table 4. While BLEU correlates somewhat better with crowd raters on whether the rewrite is sensical or grammatical, most correlation coefficients are below .5, and many do not imply a positive correlation at all.
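Correlations like those in Table 4 are routinely computed with scipy.stats.spearmanr; for illustration, a self-contained version (Pearson correlation of ranks, with ties assigned average ranks) is:

```python
import math

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks,
    with tied values receiving their average rank."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
                j += 1                      # extend the tie group
            avg = (i + j) / 2 + 1           # average 1-based rank
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Here `xs` would hold sentence-level BLEU scores and `ys` the crowd ratings for one criterion.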
This reinforces the claim that BLEU is not a suitable evaluation metric for the Split and Rephrase task, because it has little correlation with human (crowd) judgement.