Automatic Quality Estimation for Natural Language Generation: Ranting (Jointly Rating and Ranking)

We present a recurrent neural network based system for automatic quality estimation of natural language generation (NLG) outputs, which jointly learns to assign numerical ratings to individual outputs and to provide pairwise rankings of two different outputs. The latter is trained using pairwise hinge loss over scores from two copies of the rating network. We use learning to rank and synthetic data to improve the quality of ratings assigned by our system: We synthesise training pairs of distorted system outputs and train the system to rank the less distorted one higher. This leads to a 12% increase in correlation with human ratings over the previous benchmark. We also establish the state of the art on the dataset of relative rankings from the E2E NLG Challenge (Dušek et al., 2019), where synthetic data lead to a 4% accuracy increase over the base model.


Introduction
While automatic output quality estimation (QE) is an established field of research in other areas of NLP, such as machine translation (MT) (Specia et al., 2010, 2018), research on QE in natural language generation (NLG) from structured meaning representations (MRs) such as dialogue acts is relatively recent (Ueffing et al., 2018) and often focuses on output fluency only (Tian et al., 2018; Kann et al., 2018). In contrast to traditional metrics, QE does not rely on gold-standard human reference texts (Specia et al., 2010), which are expensive to obtain, do not cover the full output space, and are not accurate on the level of individual outputs (Reiter, 2018). Automatic QE for NLG has several possible use cases that can improve NLG quality and reliability. For example, rating individual NLG outputs makes it possible to ensure a minimum output quality and to engage a backup (e.g., template-based) NLG system if a certain threshold is not met. Relative ranking of multiple NLG outputs can be used directly within a system to rerank n-best outputs, or to guide system development by selecting optimal system parameters or comparing to the state of the art.
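The two use cases above (quality thresholding with a backup system, and n-best reranking) can be sketched in a few lines. This is purely illustrative: `qe_score`, `template_fallback` and the threshold value are hypothetical names, not part of any released system.

```python
# Illustrative use of a QE model's scores (hypothetical API, not RatPred's).

def rerank(candidates, qe_score):
    """Order n-best NLG outputs by their estimated quality, best first."""
    return sorted(candidates, key=qe_score, reverse=True)

def choose_output(candidates, qe_score, template_fallback, threshold=3.0):
    """Pick the best candidate; fall back to a template if quality is too low."""
    best = rerank(candidates, qe_score)[0]
    return best if qe_score(best) >= threshold else template_fallback
```

Here `qe_score` stands in for a call to a trained QE model returning a rating on the annotation scale.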
In this paper, we present a novel model that jointly learns to perform both tasks: rating individual outputs as well as pairwise ranking. We show that this leads to performance improvements over previously published results. Our model is portable, since we do not assume any specific input schema and only rely on ratings of the text output, which are relatively easy to obtain, e.g. through crowdsourcing for a small number of outputs of an initial NLG system. The model learns to rank or rate according to any criterion annotated in the data, such as adequacy, fluency, or overall quality (see e.g. Wen et al., 2015; Manishina et al., 2016). Our main contributions are as follows:
• A novel, domain- and input-representation-agnostic, and conceptually simple model for NLG QE, which jointly learns ratings and pairwise rankings. It is able to seamlessly switch between the two and is directly applicable to n-way ranking (see Section 3). Crucially, it does not require human-authored references during inference.
• An original methodology for synthetically generating training instances for pairwise ranking based on introducing errors (see Section 4).
• A significant, 12% relative improvement in Pearson correlation with human ratings over results previously published on the dataset of Novikova et al. (2017), as well as the first pairwise ranking results for NLG QE on the E2E ranking dataset of Dušek et al. (2019), with significant improvements over the baseline due to synthetic training instance generation (see Sections 5 and 6).
Both datasets are freely available, and we release our experimental code on GitHub. 1

The Task(s)
The task of NLG QE for ratings is to assign a numerical score to a single NLG output, given its input MR, such as a dialogue act (consisting of the main intent, attributes and values). The score can be, e.g., on a Likert scale in the 1-6 range. In a pairwise ranking task, the QE system is given two outputs of different NLG systems for the same MR and decides which one has better quality (see Figure 1). As opposed to automatic word-overlap-based metrics, such as BLEU (Papineni et al., 2002) or METEOR (Lavie and Agarwal, 2007), no human reference texts for the given MR are required. This widens the scope of possible applications: QE systems can be used for previously unseen MRs.

Model
Our model is a direct extension of the freely available RatPred system (Dušek et al., 2017). The original RatPred model assigns numerical ratings to single outputs. It is a dual encoder (Lu et al., 2017), consisting of two GRU-based recurrent neural networks (Cho et al., 2014) encoding the MR and the system output, followed by fully connected layers and a final linear layer providing the score. The system is trained using squared error loss and uses dropout over embeddings (Hinton et al., 2012).
We make RatPred's encoders bidirectional and add a novel extension to allow pairwise ranking: a second copy of the system output encoder, plus the fully connected layers and the final linear layer (Figure 2). All network parameters are shared among the two copies. This way, the network is able to rate two NLG outputs at once. We add a simple difference operator on top of this; the pairwise rank is computed as the difference between the two predicted scores. In addition to the squared error loss for rating, we incur pairwise hinge loss for ranking. The final loss function is:

L = (1 − I) · (y − ŷ₁)² + I · max(0, 1 − (ŷ₁ − ŷ₂))

Here I indicates whether the current instance is a ranking-based one (value of 1 for ranking and 0 for rating, effectively a mask to only incur the correct loss), y denotes the true score for an NLG output, and ŷ₁ and ŷ₂ denote the scores assigned by the model to (up to) two NLG outputs. 2 Note that ŷ₂ is ignored in rating instances, while the true score y is ignored for ranking. This way, the same network performs ranking and rating jointly, and it can be exposed to training instances of both types in any order. Our model is also directly applicable to n-way ranking: using it to score a group of NLG outputs and comparing the scores is equivalent to ranking each pair.

1 The datasets can be downloaded under the following links: https://github.com/jeknov/EMNLP_17_submission, http://www.macs.hw.ac.uk/InteractionLab/E2E/. Our code is available at https://github.com/tuetschek/ratpred.
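A minimal sketch of the masked joint loss described above; the hinge margin of 1 is our assumption, as the text only specifies squared error for rating and pairwise hinge loss for ranking:

```python
def joint_loss(is_ranking, y_true, score1, score2, margin=1.0):
    """Masked joint loss: squared error for rating instances (is_ranking=0),
    pairwise hinge loss for ranking instances (is_ranking=1).
    score1/score2 are the model's scores; score2 is ignored for rating,
    y_true is ignored for ranking."""
    rating_loss = (y_true - score1) ** 2
    ranking_loss = max(0.0, margin - (score1 - score2))
    return (1 - is_ranking) * rating_loss + is_ranking * ranking_loss
```

In the actual network this is computed over batches of mixed instance types, with the mask selecting the appropriate term per instance.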
Jointly learning to rank and rate was first introduced by Sculley (2010) for support vector machines, and similar approaches have been applied to image classification (Park et al., 2017; Liu et al., 2018) as well as audio classification (Lee et al., 2016). However, we argue that the application to text classification/QE is novel, as is the implementation as a single neural network with two parts that share parameters, capable of training from mixed ranking/rating instances with masking to incur the proper loss.

Synthetic Training Data Generation
We use RatPred's code to generate synthetic rating instances from both NLG outputs and human-authored texts by distorting the text and lowering its score (i.e., randomly removing or adding words; cf. Dušek et al., 2017 and Figure 3 for details). We also create synthetic training pairs by using the same NLG output or human-authored text under two different levels of distortion (e.g., one vs. two artificially introduced errors). The system is then trained to rank the version of the text with fewer errors higher (see Figure 3 for an example). This novel approach can be used to generate synthetic training data for both ranking and rating tasks: in a rating task, the generated ranking instances are simply mixed among the original training instances for rating, and the model uses both kinds for training. Note that synthetic data are never used for validation or testing in any of our setups.
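The pair-generation idea can be sketched as follows. This is a simplified illustration, not RatPred's actual code: the real error introduction also prefers content words over articles and punctuation.

```python
import random

def distort(words, n_errors, vocab, rng):
    """Introduce n_errors random distortions: remove, duplicate,
    insert, or replace a word (simplified version of the scheme)."""
    words = list(words)
    for _ in range(n_errors):
        op = rng.choice(["remove", "duplicate", "insert", "replace"])
        pos = rng.randrange(len(words))
        if op == "remove" and len(words) > 1:
            words.pop(pos)
        elif op == "duplicate":
            words.insert(pos, words[pos])
        elif op == "insert":
            words.insert(pos, rng.choice(vocab))
        else:
            words[pos] = rng.choice(vocab)
    return words

def make_ranking_pair(text, vocab, rng, fewer=0, more=2):
    """The version with fewer distortions is labelled as the better one."""
    better = " ".join(distort(text.split(), fewer, vocab, rng))
    worse = " ".join(distort(text.split(), more, vocab, rng))
    return better, worse
```

In practice the dictionary `vocab` is learned from the training data, and pairs are created for several combinations of error counts.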

Figure 1: Examples of NLG output quality rating (top, from the NEM dataset) and ranking (bottom, from the E2E rankings dataset); RNNLG, ZHANG and TR2 are NLG systems. See Section 5.1 for details on the datasets.

MR: inform_only_match(name='hotel drisco', area='pacific heights')
RNNLG output: the only match i have for you is the hotel drisco in the pacific heights area. → rating: 4

ZHANG output: The Cricketers is a children friendly coffee shop near Café Sicilia with a high customer rating. → better
TR2 output: The Cricketers can be found near the Café Sicilia. Customers give this coffee shop a high rating. It's family friendly. → worse


Experimental Setup

Datasets
We experiment on the following two datasets, both in the restaurant/hotel information domain:
• NEM 3 (Novikova et al., 2017): Likert-scale rated outputs (scores 1-6) of 3 NLG systems over 3 datasets, totalling 2,460 instances.
• E2E system rankings (Dušek et al., 2019): outputs of 21 systems on a single NLG dataset, with 2,979 5-way relative rankings.
We choose these two datasets because they contain human-assessed outputs from a variety of NLG systems. Another candidate is the WebNLG corpus (Gardent et al., 2017), which we leave for future work due to MR format differences.
Although both selected datasets contain ratings for multiple criteria (informativeness, naturalness and quality for NEM, and the latter two for E2E), we follow Dušek et al. (2017) and focus on the overall quality criterion in our experiments, as it takes both semantic accuracy and fluency into account.
We use RatPred's preprocessing, synthetic data generation, and 5-way cross-validation split on the NEM dataset. In addition, we generate synthetic training pairs as described in Section 4. We convert the 5-way rankings from the E2E set to pairwise rankings (Sakaguchi et al., 2014) (leaving out ties), which produces 15,001 instances. We split the data into training, development and test sections in an 8:1:1 ratio, ensuring that each section contains NLG outputs for different MRs (Lampouras and Vlachos, 2016). 4 In addition to the human-assessed NLG outputs themselves, human-authored training data for the NLG systems are also available and are used for synthetic instances. We use a partial delexicalisation (replacing names with placeholders). 5
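The conversion from an n-way ranking to pairwise instances, leaving out ties, can be sketched as:

```python
from itertools import combinations

def nway_to_pairs(ranked):
    """ranked: list of (output, rank) tuples with rank 1 = best.
    Returns (better, worse) pairs; tied outputs produce no instance."""
    pairs = []
    for (a, rank_a), (b, rank_b) in combinations(ranked, 2):
        if rank_a < rank_b:
            pairs.append((a, b))
        elif rank_b < rank_a:
            pairs.append((b, a))
        # equal ranks (ties) are left out
    return pairs
```

A 5-way ranking without ties thus yields up to C(5, 2) = 10 pairwise instances.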

Model Settings
We evaluate our model in several configurations with increasing amounts of synthetic training data. Note that even setups using human references for training MRs (i.e., additional in-domain data) are still "referenceless": they do not use human references for test MRs. Setups using human references for validation and test MRs ("reference-aided"; marked with "*" in Table 1) are not referenceless and are shown mainly for comparison with Dušek et al. (2017).
We use the same network parameters for all setups, selected based on a small-scale grid search on the development data of both sets, taking training speed into consideration. 6 As a result, we use a network with fewer parameters than Dušek et al. (2017), which makes our base setup perform worse than the original base setup despite our use of bidirectional encoders (cf. Section 6). On the other hand, training runs several times faster. We use Adam (Kingma and Ba, 2015) for training, evaluating on the validation set after each epoch and selecting the best-performing configuration. Synthetic data are removed after 50 (out of 100) epochs. Following Dušek et al. (2017), we run all experiments with 5 different random initializations of the networks and report averaged results.

Figure 3: Synthetic data generation example.
MR: inform(name='house of nanking', food=chinese)
RNNLG output (0 errors): house of nanking serves chinese food .
1 error: house of nanking restaurant chinese food .
2 errors: house of nanking serves food chinese food cheaply .
3 errors: food house of nanking house of nanking serves chinese chinese food .
Synthesising errors: The original NLG output is distorted by introducing errors of the following types: words in the text are removed, or duplicated at their original or random positions; random words from a dictionary learned from the training data replace current words or are added at random positions. Words other than articles and punctuation are preferred when making the changes (see Dušek et al., 2017 for details).
Rating instances: We use the same settings as Dušek et al. (2017) for synthetic individual rating instances, generating up to 4 errors and lowering the target rating by 1 each time (by 2 if the original value was 6). We are able to generate more synthetic rating instances since Dušek et al. (2017) did not use all available NLG system outputs due to a bug in their code (cf. Table 1).
Ranking instances: Following our new method, pairs of outputs with different numbers of errors (e.g., 0-1, 1-3) are sampled as synthetic training instances for ranking. In our setting, we introduce up to 4 errors and create instances for all numbers of errors against the original (0-1 through 0-4), plus a set of 5 other, randomly chosen pairs (e.g., 1-3, 2-4). We use both rating and ranking synthetic instances for the NEM data and only ranking synthetic instances for the E2E data.
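Under our reading of the sampling scheme above, the selection of error-count pairs could look like this (an illustration; the exact sampling in the released code may differ):

```python
import random

def sample_error_levels(max_errors=4, n_random=5, rng=None):
    """All pairs of the original (0 errors) vs. 1..max_errors errors,
    plus n_random randomly chosen pairs of nonzero error counts."""
    rng = rng or random.Random()
    pairs = [(0, k) for k in range(1, max_errors + 1)]
    others = [(i, j) for i in range(1, max_errors + 1)
              for j in range(i + 1, max_errors + 1)]
    pairs += rng.sample(others, min(n_random, len(others)))
    return pairs
```

Each (i, j) pair then yields one ranking instance: the text with i errors is to be ranked above the text with j errors.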

Evaluation Metrics
On the NEM data, we follow Dušek et al. (2017) to compare with their results: we use Pearson correlation of system-provided ratings with human ratings as our primary evaluation metric, and we also measure Spearman rank correlation, mean absolute error (MAE) and root mean squared error (RMSE). On the E2E data, we use pairwise ranking accuracy (or Precision@1), a common ranking metric. We also measure mean ranking loss, i.e., the mean score difference in wrongly ranked instances.
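Both main metrics are straightforward to compute; a minimal sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranking_accuracy(scored_pairs):
    """Fraction of pairs where the better output got the higher score;
    each pair is (score_of_better, score_of_worse)."""
    correct = sum(1 for better, worse in scored_pairs if better > worse)
    return correct / len(scored_pairs)
```

In practice, library implementations (e.g., from scipy.stats) can be used for the correlation measures.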

Results and Discussion
The results on the NEM dataset in Table 1 show that our improved synthetic data generation methods bring significant improvements in correlation. 7 On the other hand, they worsen MAE and RMSE scores slightly, probably due to missing supervision on the exact rating in synthetic ranking instances. Compared to Dušek et al. (2017), we get a 12% increase in Pearson correlation in the best referenceless configurations; our best referenceless system even outperforms the reference-aided system of Dušek et al. (2017). Note that the absolute correlations, while still not ideal, are much higher than those achieved by word-overlap-based metrics such as BLEU, which stay well below 0.1.
Our reference-aided setup did not improve with synthetic ranking pairs. This is probably because there are already enough training data for the domain; furthermore, this system is more prone to overfitting the validation set (exploiting validation references during training). The intra-class correlation coefficient (ICC) of 0.45 measuring rater agreement on the NEM data, as reported by Novikova et al. (2017) ('moderate agreement'), also suggests that a certain level of noise may hinder further improvements on this dataset. Table 2 shows our results on the E2E data. Here, all configurations perform well above random chance (i.e., an accuracy of 0.5). Using the synthesised ranking pairs brings a small but statistically significant 8 improvement over the base model (3% using only NLG system outputs, plus an additional 1% if human references from the NLG systems' training data are also used for synthetic pair generation).
We also explored training the system using data from both sets; however, this did not bring performance improvements, probably due to the different text styles of the two datasets.

Table 1 caption (excerpt): Boldface denotes configurations of our system that are significantly better than all previous ones according to the Williams (1959) test (p < 0.01). Values for baseline metrics and the original RatPred system are taken over from Dušek et al. (2017). Configurations marked with "*" use human references for test instances (this includes word-overlap-based metrics such as BLEU).

Related Work
QE has been an active topic in many NLP tasks: image captioning (Anderson et al., 2016), dialogue response generation (Lowe et al., 2017), grammar correction (Napoles et al., 2016) or text simplification (Martin et al., 2018), with MT being perhaps the most prominent area (Specia et al., 2010; Avramidis, 2012; Specia et al., 2018). QE for NLG has recently seen an increased focus on various subtasks, such as title generation (Ueffing et al., 2018; Camargo de Souza et al., 2018) or content selection and ordering (Wiseman et al., 2017). Furthermore, several recent studies focus on predicting NLG fluency only (Tian et al., 2018; Kann et al., 2018). However, apart from our work, RatPred (Dušek et al., 2017) is, to our knowledge, the only general NLG QE system that aims to predict the overall quality of a generated utterance, where quality includes both fluency and semantic coverage of the MR. Note that correct semantic coverage of MRs is a problem for many neural NLG approaches (Gehrmann et al., 2018; Dušek et al., 2019; Nie et al., 2019). Compared to Dušek et al. (2017), our model is able to jointly rate and rank NLG outputs and includes better methods for creating synthetic training data.
Our approach to QE is similar to adversarial evaluation, i.e., distinguishing between human- and machine-generated outputs (Goodfellow et al., 2014). This approach has been employed in generators for random text (Bowman et al., 2016) and dialogue responses (Kannan and Vinyals, 2016; Li et al., 2017; Bruni and Fernandez, 2017). We argue that our approach is more explainable, since users can reason about the ordinal output score.