Reference-less Quality Estimation of Text Simplification Systems

The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between the simplified text and its original version. In this paper, we compare multiple approaches to reference-less quality estimation of sentence-level text simplification systems, based on the dataset used for the QATS 2016 shared task. We distinguish three different dimensions: grammaticality, meaning preservation and simplicity. We show that n-gram-based MT metrics such as BLEU and METEOR correlate the most with human judgment of grammaticality and meaning preservation, whereas simplicity is best evaluated by basic length-based metrics.


Introduction
Text simplification (hereafter TS) has received increasing interest from the scientific community in recent years. It aims at producing a simpler version of a source text that is both easier to read and to understand, thus improving the accessibility of text for people suffering from a range of disabilities such as aphasia (Carroll et al., 1998) or dyslexia (Rello et al., 2013), as well as for second language learners (Xia et al., 2016) and people with low literacy (Watanabe et al., 2009). This topic has been researched for a variety of languages such as English (Zhu et al., 2010; Wubben et al., 2012; Narayan and Gardent, 2014; Xu et al., 2015), French (Brouwers et al., 2014), Spanish (Saggion et al., 2011), Portuguese (Specia, 2010), Italian (Brunato et al., 2015) and Japanese (Goto et al., 2015). [1]
[1] Note that text simplification has also been used as a preprocessing step for other natural language processing tasks such as machine translation (Chandrasekar et al., 1996) and semantic role labelling (Vickrey and Koller, 2008).
One of the main challenges in TS is finding an adequate automatic evaluation metric, which is necessary to avoid time-consuming human evaluation. Any TS evaluation metric should take into account three properties expected from the output of a TS system, namely:
• Grammaticality: how grammatically correct is the TS system output?
• Meaning preservation: how well is the meaning of the source sentence preserved in the TS system output?
• Simplicity: how simple is the TS system output? [2]
[2] There is no unique way to define the notion of simplicity in this context. Previous works often rely on the intuition of human annotators to evaluate the level of simplicity of a TS system output.
TS is often reduced to a sentence-level problem, whereby one sentence is transformed into a simpler version containing one or more sentences. In this paper, we shall make use of the terms source (sentence) and (TS system) output to respectively denote a sentence given as an input to a TS system and the simplified, single or multi-sentence output produced by the system.
TS, seen as a sentence-level problem, is often viewed as a monolingual variant of (sentence-level) MT. The standard approach to automatic TS evaluation is therefore to view the task as a translation problem and to use machine translation (MT) evaluation metrics such as BLEU (Papineni et al., 2002). However, MT evaluation metrics rely on the existence of parallel corpora of source sentences and manually produced reference translations, which are available on a large scale for many language pairs (Tiedemann, 2012). TS datasets are less numerous and smaller. Moreover, they are often automatically extracted from comparable corpora rather than strictly parallel corpora, which results in noisier reference data. For example, the PWKP dataset (Zhu et al., 2010) consists of 100,000 sentences from the English Wikipedia automatically aligned with sentences from the Simple English Wikipedia based on term-based similarity metrics. It has been shown by Xu et al. (2015) that many of PWKP's "simplified" sentences are in fact not simpler than, or even not related to, their corresponding source sentence. Even if better quality corpora such as Newsela do exist (Xu et al., 2015), they are costly to create, often of limited size, and not necessarily open-access.
This creates a challenge for the use of reference-based MT metrics for TS evaluation. However, TS has the advantage of being a monolingual translation-like task, the source being in the same language as the output. This allows for new, non-conventional ways to use MT evaluation metrics, namely by using them to compare the output of a TS system with the source sentence, thus avoiding the need for reference data. That said, such an evaluation method can capture at most two of the three above-mentioned dimensions, namely meaning preservation and, to a lesser extent, grammaticality.
Previous works on reference-less TS evaluation include Štajner et al. (2014), who compare the behaviour of six different MT metrics when used between the source sentence and the corresponding simplified output. They evaluate these metrics with respect to meaning preservation and grammaticality. We extend their work in two directions. Firstly, we extend the comparison to include the degree of simplicity achieved by the system. Secondly, we compare additional features, including those used by Štajner et al. (2016a), both individually, as elementary metrics, and within multi-feature metrics. To our knowledge, no previous work has provided as thorough a comparison across such a wide range and combination of features for the reference-less evaluation of TS.
First, we review available text simplification evaluation methods and traditional quality estimation features. We then present the QATS shared task and the associated dataset, which we use for our experiments. Finally, we compare all methods in a reference-less setting and analyze the results.

Existing evaluation methods

Using MT metrics to compare the output and a reference
TS can be considered as a monolingual translation task. As a result, MT metrics such as BLEU (Papineni et al., 2002), which compare the output of an MT system to a reference translation, have been extensively used for TS (Narayan and Gardent, 2014; Štajner et al., 2015; Xu et al., 2016).
Other successful MT metrics include TER (Snover et al., 2009), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005), but they have not gained much traction in the TS literature. These metrics rely on good quality references, something which is often not available in TS, as discussed by Xu et al. (2015). Moreover, Štajner et al. (2015) and Sulem et al. (2018a) showed that using BLEU to compare the system output with a reference is not a good way to perform TS evaluation, even when good quality references are available. This is especially true when the TS system produces more than one sentence for a single source sentence.

Using MT metrics to compare the output and the source sentence
As mentioned in the Introduction, the fact that TS is a monolingual task means that MT metrics can also be used to compare a system output with its corresponding source sentence, thus avoiding the need for reference data. Following this idea, Štajner et al. (2014) found encouraging correlations between six widely used MT metrics and human assessments of grammaticality and meaning preservation. However, MT metrics are not relevant for the evaluation of simplicity, which is why they did not take this dimension into account. Xu et al. (2016) also explored the idea of comparing the TS system output with its corresponding source sentence, but their metric, SARI, also requires comparing the output with a reference. In fact, this metric is designed to take advantage of more than one reference. It can be applied when only one reference is available for each source sentence, but its results are better when multiple references are available.
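To make the idea of scoring an output against its source concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU, computed between a source sentence and a system output. It is a simplified stand-in and not BLEU or METEOR themselves: there is no brevity penalty, smoothing or synonym matching. The function names and the toy sentences are hypothetical and chosen for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(source_tokens, output_tokens, n):
    """Clipped n-gram precision of the output, using the source as the 'reference'."""
    out_counts = Counter(ngrams(output_tokens, n))
    src_counts = Counter(ngrams(source_tokens, n))
    total = sum(out_counts.values())
    if total == 0:
        return 0.0
    # Each output n-gram is credited at most as often as it occurs in the source.
    clipped = sum(min(count, src_counts[gram]) for gram, count in out_counts.items())
    return clipped / total


# Toy, lower-cased and whitespace-tokenized sentences (hypothetical example).
source = "all three were arrested in the toome area and have been taken to the station".split()
output = "all three were arrested near the toome area".split()

unigram_precision = modified_precision(source, output, 1)
bigram_precision = modified_precision(source, output, 2)
copy_precision = modified_precision(source, source, 2)
```

Note that an output identical to the source obtains the maximal score, which is why such source-based comparisons cannot, on their own, say anything about simplicity.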
Attempts to perform Quality Estimation on the output of TS systems, without using references, include the 2016 Quality Assessment for Text Simplification (QATS) shared task (Štajner et al., 2016b), to which we return in the Methodology section. Sulem et al. (2018b) introduce another approach, named SAMSA. The idea is to evaluate the structural simplicity of a TS system output given the corresponding source sentence. SAMSA is maximized when the simplified text is a sequence of short and simple sentences, each accounting for one semantic event in the original sentence. It relies on an in-depth analysis of the source sentence and the corresponding output, based on a semantic parser and a word aligner. A drawback of this approach is that good quality semantic parsers are only available for a handful of languages. The intuition that sentence splitting is an important sub-task for producing simplified text motivated Narayan et al. (2017) to organize the Split and Rephrase shared task, which was dedicated to this problem.

Other metrics
One can also estimate the quality of a TS system output based on simple features extracted from it. For instance, the QUEST framework for quality estimation in MT gives a number of useful baseline features for evaluating an output sentence (Specia et al., 2013). These features range from simple statistics, such as the number of words in the sentence, to more sophisticated features, such as the probability of the sentence according to a language model. Several teams that participated in the QATS shared task used metrics based on this framework, namely SMH (Štajner et al., 2016a), UoLGP (Rios and Sharoff, 2015) and UoW (Béchara et al., 2015).
Readability metrics such as Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) (Kincaid et al., 1975) have been extensively used for evaluating simplicity.These two metrics, which were shown experimentally to give good results, are linear combinations of the number of words per sentence and the number of syllables per word, using carefully adjusted weights.
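For reference, the two readability formulas can be written down as a short sketch using the standard published coefficients. The functions take raw counts as input; how words, sentences and syllables are counted is left to the caller and is an assumption of this sketch, as are the function names.

```python
def fkgl(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level: higher values indicate harder text."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def fre(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
grade = fkgl(100, 5, 150)
ease = fre(100, 5, 150)
```

Both scores depend only on words per sentence and syllables per word, which makes their relation to the length-based features discussed in this paper explicit.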

Methodology
Our goal is to compare a large number of ways to perform TS evaluation without a reference. To this end, we use the dataset provided in the QATS shared task. We first compare the behaviour of elementary metrics, which range from commonly used metrics such as BLEU to basic metrics based on a single low-level feature such as sentence length. We then compare the effect of aggregating these elementary metrics into more complex ones and compare our results with the state of the art, based on the QATS shared task data and results.

The QATS shared task
The data from the QATS shared task (Štajner et al., 2016b) consists of a collection of 631 pairs of English sentences, each composed of a source sentence extracted from an online corpus and a simplified version thereof, which can contain one or more sentences. This collection is split into a training set (505 sentence pairs) and a test set (126 sentence pairs). Simplified versions were produced automatically using one of several TS systems trained by the shared task organizers. Human annotators labelled each sentence pair using one of the three labels Good, OK and Bad on each of the three dimensions: grammaticality, meaning preservation and simplicity [3]. An overall quality label was then automatically assigned to each sentence pair based on its three manually assigned labels, using a method detailed in Štajner et al. (2016b). The distribution of the labels and examples are presented in FIGURE 1 and TABLE 1.
The goal of the shared task is, for each sentence in the test set, to produce either a label (Good, OK, Bad) or a raw score estimating the overall quality of the simplification for each of the three dimensions. Raw score predictions are evaluated using the Pearson correlation with the ground truth labels, while label predictions are evaluated using the weighted F1-score. The shared task is described in further detail on the QATS website [4].
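As an illustration of the classification measure, the following sketch computes a support-weighted F1-score over the three labels: the per-label F1 is averaged with weights proportional to the number of gold instances of each label. This is the usual definition of the weighted F1-score; whether the shared task scorer handles edge cases in exactly this way is an assumption, and the label lists are made up.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Average of per-label F1 scores, weighted by each label's frequency in gold."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n_gold in support.items():
        true_pos = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        n_pred = sum(1 for p in pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_gold
        if precision + recall == 0:
            f1 = 0.0
        else:
            f1 = 2 * precision * recall / (precision + recall)
        score += (n_gold / total) * f1
    return score


# Hypothetical gold and predicted labels for four sentence pairs.
gold_labels = ["good", "good", "ok", "bad"]
pred_labels = ["good", "ok", "ok", "bad"]
example_score = weighted_f1(gold_labels, pred_labels)
```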

Features
In our experiments, we compared about 60 elementary metrics, which can be organised as follows:
• MT metrics, BLEU 4gram and seven smoothing methods [5] from NLTK (Bird and Loper, 2004).
• Readability metrics and other sentence-level features: FKGL and FRE, numbers of words, characters, syllables...
• Metrics based on the baseline QUEST features (17 features) (Specia et al., 2013), such as statistics on the number of words, word lengths, language model probability and n-gram frequency.
• Metrics based on other features: frequency table position, concreteness as extracted from Brysbaert et al.'s 2014 list, language model probability of words using a convolutional sequence to sequence model from Gehring et al. (2017), comparison methods using pretrained fastText word embeddings (Mikolov et al., 2018) or Skip-thought sentence embeddings (Kiros et al., 2015).
TABLE 2 lists 30 of the elementary metrics that we compared, which are those that we found to correlate the most with human judgments on one or more of the three dimensions (grammaticality, meaning preservation, simplicity).

Experimental setup
Evaluation of elementary metrics. We rank all features by comparing their behaviour with human judgments on the training set. We first compute, for each elementary metric, the Pearson correlation between its results and the manually assigned labels for each of the three dimensions. We then rank our elementary metrics according to the absolute value of the Pearson correlation. [6]
Training and evaluation of a combined metric. We use our elementary metrics as features to train models on the training set, and evaluate their performance on the test set. To this end, we scale the features and reduce their dimensionality with a 25-component PCA [7], then train several regression algorithms [8] and classification algorithms [9] using scikit-learn (Pedregosa et al., 2011). For each dimension, we keep the two models performing best on the test set and add them to the leaderboard of the QATS shared task (TABLE 4), naming them after the regression algorithm they were built with.
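The ranking of elementary metrics can be sketched as follows. The numeric encoding of the labels (Bad = 0, OK = 1, Good = 2), the metric names and the toy values are assumptions made for illustration; they are not taken from the QATS data or from the released code. The sketch covers only the ranking step, not the PCA and model training.

```python
import math

# Assumed numeric encoding of the human labels (not specified in the text).
LABEL_TO_SCORE = {"bad": 0, "ok": 1, "good": 2}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / (norm_x * norm_y)


def rank_metrics(metric_values, labels):
    """Sort metrics by the absolute value of their correlation with the labels."""
    gold = [LABEL_TO_SCORE[label] for label in labels]
    scored = {name: pearson(values, gold) for name, values in metric_values.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical values of two elementary metrics on five sentence pairs.
labels = ["good", "ok", "bad", "good", "bad"]
metric_values = {
    "NoiseFeature": [1, 5, 2, 4, 3],
    "NBOutputWords": [8, 15, 30, 10, 25],
}
ranking = rank_metrics(metric_values, labels)
```

Ranking by absolute value means that a metric that is strongly negatively correlated with the labels, such as an output length count for simplicity, is ranked as high as an equally strong positive one.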

Comparing elementary metrics
TABLE 3 ranks the elementary metrics by their absolute Pearson correlation on each of the three dimensions.
Grammaticality. N-gram-based MT metrics have the highest correlation with human grammaticality judgments. METEOR seems to be the best, probably because of its robustness to synonymy, followed by smoothed BLEU (BLEUSmoothed in 2). This indicates that relevant grammaticality information can be derived from the source sentence. We were expecting that information contained in a language model would help achieve better results (AvgLMProbsOutput), but MT metrics correlate better with human judgments. We deduce that the grammaticality information contained in the source is more specific and more helpful for evaluation than what is learned by the language model.
[6] The code is available on GitHub at https://github.com/facebookresearch/text-simplification-evaluation
[7] We used PCA instead of feature selection because it performed better on the validation set. The number of components was tuned on the validation set as well.
Meaning preservation. It is not surprising that meaning preservation is best evaluated using MT metrics that compare the source sentence to the output sentence, in particular smoothed BLEU, BLEU 3gram and METEOR. Very simple features such as the percentage of words in common between source and output also rank high. Surprisingly, word embedding comparison methods do not perform as well for meaning preservation, even when using word alignment.
Simplicity. The methods that give the best results are the most straightforward ones for assessing simplicity, namely word, character and syllable counts in the output, averaged over the number of output sentences. These simple features even outperform the traditional, more complex metrics FKGL and FRE. As could be expected, we find that metrics with the highest correlation to human simplicity judgments only take the output into account. Exceptions are the NBSourceWords and NBSourcePunct features. Indeed, if the source sentence has a lot of words and punctuation, and is therefore likely to be particularly complex, then the output will most likely be less simple as well. We also expected word concreteness ratings and position in the frequency table to be good indicators of simplicity, but this does not seem to be the case here. Structural simplicity might simply be more important than such more sophisticated components of the human intuition of simple text.
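The length-based simplicity features can be illustrated with a small sketch that counts words, characters and syllables in an output and averages them over the number of output sentences. The sentence splitter, the tokenizer and the vowel-group syllable counter below are rough heuristics assumed for the sake of the example, and the function names and feature keys are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def length_features(text):
    """Word, character and syllable counts averaged over the number of sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    n_sentences = max(1, len(sentences))  # avoid dividing by zero on empty input
    return {
        "words_per_sentence": len(words) / n_sentences,
        "chars_per_sentence": sum(len(w) for w in words) / n_sentences,
        "syllables_per_sentence": sum(count_syllables(w) for w in words) / n_sentences,
    }


source_text = ("All three were arrested in the Toome area and have been taken "
               "to the Serious Crime Suite at Antrim police station.")
output_text = ("All three were arrested in the Toome area. All three have been "
               "taken to the Serious Crime Suite at Antrim police station.")

source_features = length_features(source_text)
output_features = length_features(output_text)
```

On this pair, which is the example from TABLE 1, splitting the source into two sentences roughly halves every per-sentence count, even though the total number of words increases slightly.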
Discussion. Even if counting the number of words or comparing n-grams are good proxies for simplification quality, they are still very superficial features and might miss deeper and more complex information. Moreover, the fact that grammaticality and meaning preservation are best evaluated using n-gram-based comparison metrics might bias TS models towards copying the source sentence and applying fewer modifications.
Syntactic parsing or language modelling might capture more insightful grammatical information and allow for more flexibility in the simplification model. Regarding meaning preservation, semantic analysis or paraphrase detection models would also be good candidates for a deeper analysis.
Warning note. We should be careful when interpreting these results, as the QATS dataset is relatively small. We compute confidence intervals on our results and find them to be non-negligible, yet without putting our general observations into question.

Combination of all features with trained models
We also combine all elementary metrics and train an evaluation model for each of the three dimensions. It is surprising to us that the aggregation of multiple elementary features would score worse than the features themselves. However, we observe a strong discrepancy between the scores obtained on the train and test sets, as illustrated by TABLE 3. We also observed very large confidence intervals in terms of Pearson correlation. For instance, our lasso model scores 0.33 ± 0.17 on the test set for grammaticality. This calls for caution when interpreting Pearson scores on QATS.
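To give a sense of why such intervals are wide on a dataset of this size, here is a sketch of an approximate 95% confidence interval for a Pearson correlation based on the Fisher z-transformation. The text does not state how the reported intervals were computed, so this method is an assumption; it nevertheless yields intervals of the same order of magnitude for the sample sizes involved (126 test pairs, 505 training pairs).

```python
import math


def pearson_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r estimated from n samples.

    Uses the Fisher z-transformation; z_crit=1.96 corresponds to a 95% interval.
    """
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    lower = math.tanh(z - z_crit * standard_error)
    upper = math.tanh(z + z_crit * standard_error)
    return lower, upper


# Correlations and sample sizes quoted in the text.
test_low, test_high = pearson_confidence_interval(0.33, 126)
train_low, train_high = pearson_confidence_interval(0.36, 505)
```

With four times fewer sentence pairs, the interval on the test set is about twice as wide as the one on the training set, since its width shrinks roughly with the square root of the sample size.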
F1-score for classifiers (assigning labels). On the classification task, our models seem to score best for meaning preservation, simplicity and overall, and third for grammaticality. This seems to confirm the importance of considering a large ensemble of elementary features, including length-based metrics, to evaluate simplicity.

Conclusion
Finding accurate ways to evaluate text simplification (TS) without the need for reference data is a key challenge for TS, both for exploring new approaches and for optimizing current models, in particular those relying on unsupervised, often MT-inspired models.
We explore multiple reference-less quality evaluation methods for automatic TS systems, based on data from the 2016 QATS shared task. We rely on the three key dimensions of the quality of a TS system: grammaticality, meaning preservation and simplicity.
Our results show that grammaticality and meaning preservation are best assessed using n-gram-based MT metrics evaluated between the output and the source sentence. In particular, METEOR and smoothed BLEU achieve the highest correlation with human judgments. These approaches even outperform metrics that make extensive use of external data, such as language models. This shows that a lot of useful information can be obtained from the source sentence itself.
Regarding simplicity, we observe that counting the number of characters, syllables and words provides the best results. In other words, given the currently available metrics, the length of a sentence seems to remain the best available proxy for its simplicity.
However, given the small size of the QATS dataset and the high variance observed in our experiments, these results must be taken with a pinch of salt and will need to be confirmed on a larger dataset. Creating a larger annotated dataset, as well as averaging multiple human annotations for each pair of sentences, would help reduce the variance of the experiments and confirm our findings.
In future work, we shall explore richer and more complex features extracted using syntactic and semantic analyzers, such as those used by the SAMSA metric, and paraphrase detection models.
Finally, it remains to be understood how we can optimize the trade-off between grammaticality, meaning preservation and simplicity, in order to build the best possible comprehensive TS metric in terms of correlation with human judgments. Unsurprisingly, optimizing one of these dimensions often leads to lower results on other dimensions (Schwarzer and Kauchak, 2018). For instance, the best way to guarantee grammaticality and meaning preservation is to leave the source sentence unchanged, thus resulting in no simplification at all. Improving TS systems will require better global TS evaluation metrics. This is especially true when considering that TS is in fact a multiply defined task, as there are many different ways of simplifying a text, depending on the different categories of people and applications at whom TS is aimed.

Figure 1: Label distribution on the QATS shared task

Table 1: Examples from the training dataset of QATS. Differences between the original and the simplified version are presented in bold. This table is adapted from Štajner et al. (2016b).
Original: All three were arrested in the Toome area and have been taken to the Serious Crime Suite at Antrim police station.
Simple: All three were arrested in the Toome area. All three have been taken to the Serious Crime Suite at Antrim police station.
Labels: good, good, good; overall: good; modification: syntactic

Table 2: Brief description of 30 of our most relevant elementary metrics

Table 3: Pearson correlation with human judgments of elementary metrics ranked by absolute value on training set (15 best metrics for each dimension).
For instance, METEOR, which performs best on grammaticality, has a 95% confidence interval of 0.36 ± 0.08 on the training set. These results are therefore preliminary and should be validated on other datasets.
TABLE 4a presents our two best regressors in validation for each of the dimensions, and TABLE 4b our two best classifiers.

Table 4: QATS leaderboard. Results in bold are our additions to the original leaderboard. We only select the two models that rank highest during cross-validation.