Semantic Noise Matters for Neural Natural Language Generation

Neural natural language generation (NNLG) systems are known for their pathological outputs, i.e. generating text which is unrelated to the input specification. In this paper, we show the impact of semantic noise on state-of-the-art NNLG models which implement different semantic control mechanisms. We find that cleaned data can improve semantic correctness by up to 97%, while maintaining fluency. We also find that the most common error is omitting information, rather than hallucination.

In this paper, we examine the impact of this type of semantic noise on two state-of-the-art NNLG models with different semantic control mechanisms: TGen (Dušek and Jurčíček, 2016) and SC-LSTM (Wen et al., 2015).In particular, we investigate the systems' ability to produce fact-accurate text, i.e. without omitting or hallucinating information, in the presence of semantic noise. 2 We find that: • training on cleaned data reduces slot-error rate up to 97% on the original evaluation data; • testing on cleaned data is challenging, even for models trained on cleaned data, likely due to increased MR diversity in the cleaned dataset; and • TGen performs better than SC-LSTM, even when cleaner training data is available.We hypothesise that this is due to differences in how the two systems handle semantic input and the degree of delexicalization that they expect.
In addition, we release our code and a cleaned version of the E2E data with this paper. 3

Mismatched Semantics in E2E Data
The E2E dataset contains input MRs and corresponding target human-authored textual references in the restaurant domain.MRs here are sets of attribute-value pairs (see Figure 1).Most MRs in the dataset have multiple references (8.1 on average).These were collected using crowdsourcing, leading to noise when crowd workers did not verbalise all attributes or added information not present in the MR.According to Dušek et al. (2019), the multiple references should help NLG systems abstract from the noise.However, most NLG systems in the E2E challenge in fact produced noisy outputs, suggesting that they were unable to learn to ignore noise in the training input.
Problems with the semantic accuracy in training data is not unique to the E2E dataset.Howcroft et al. (2017) collected a corpus of paraphrases differing with respect to information density for use in training NLG systems and found that subjects' paraphrases dropped about 5% of the slot-value pairs from the original texts and changed the val- ues for approximately 10% of the slot-value pairs.As a result of these changes and the insertion of new facts, only 61% of the corpus contained all and only the intended propositions.This is similar to what Eric et al. (2019) found in their work on the MultiWOZ 2.0 dataset: correcting the dialogue state annotations resulted in changes to about 40% of the dialogue turns in their dataset.These findings suggest that efforts to create more accurate training data-whether through stricter crowdsourcing protocols, conducting follow-up annotations (cf.Eric et al., 2019), or automated cleanup heuristics like we report here-are likely necessary in the NLG and dialogue systems communities.

Cleaning the Meaning Representations
To produce a cleaned version of the E2E data, we used the original human textual references, but paired them with correctly matching MRs. 4o this end, we reimplemented the slot matching script of Reed et al. (2018), which tags MR slots and values using regular expressions.We tuned our expressions based on the first 500 instances from the E2E development set and ran the script on the full dataset, producing corrected MRs for all human references (see Figure 1).The differences against the original MRs allow us to compute the semantic/slot error rate (SER; Wen et al., 2015;Reed et al., 2018;Dušek et al., 2019): To guarantee the integrity of the test set, we removed instances from the TRAIN (training) and DEV (development) sets that overlapped the TEST set.This resulted in 20% reduction for TRAIN and ca.8% reduction for DEV in terms of references (see Table 1).On the other hand, the number of distinct MRs rose sharply after reannotation; the MRs also have more variance in the number of attributes.This means that the cleaned dataset is more complex overall, with fewer references per MR and more diverse MRs.We manually evaluated 200 randomly chosen instances from the cleaned TRAIN set to check the accuracy of the slot matching script.We found that the slot matching script itself has a SER of 4.2%, with 39 instances (19.5%) not 100% correctly rated.This is much lower than the E2E dataset authors' own manual assessment of ca.40% noisy instances (Dušek et al., 2019) and the script's rating of the whole dataset (mean SER: 16.37%),and comparable to the slot matching script of Juraska et al. (2018) evaluated on the same data.5

Evaluating the Impact on Neural NLG
We chose two recent neural end-to-end NLG systems, which represent two different approaches to semantic control and have been widely used and extended by the research community.

TGen
TGen (Dušek and Jurčíček, 2016) is the baseline system used in the E2E challenge. 6TGen is in essence a vanilla sequence-to-sequence (seq2seq) model with attention (Bahdanau et al., 2015) using LSTM cells where input MRs are encoded as sequences of triples in the form (dialogue act, slot, value). 7TGen adds to the standard seq2seq setup a reranker that selects the output with the lowest SER from the decoder output beam (n-best list).SER is estimated based on a classifier trained to identify the MR corresponding to a given text.We use the default TGen parameters for the E2E data, experimenting with three variants: • TGen without reranker: a vanilla seq2seq model with attention (TGen−); • TGen with default reranker: the same augmented with an LSTM encoder and binary classifier for individual slot-value pairs; • TGen with oracle reranker: directly uses the slot matching script to compute SER (TGen+).We fixed the parameters of the main seq2seq generator to see the direct influence of each reranker, without the added effect of random initialization.

SC-LSTM
In contrast to seq2seq architecture used by TGen, the Semantically Controlled LSTM (SC-LSTM, Wen et al., 2015) uses a learned gating mechanism to selectively express parts of the MR during generation.We use the SC-LSTM model provided as part of the RNNLG repository8 with minor changes to improve comparability to TGen.Most importantly, we incorporate the tokenization and normalization used by TGen into RNNLG.Since the word embeddings provided with RNNLG only cover about half of the tokens in the E2E dataset, we use randomly initialised word embeddings (dimension 50; same as TGen).

Evaluation and Results
To measure the effect of noisy data, we compare systems trained on the original data against systems trained using cleaned TRAIN and validation (=DEV) sets; we perform the comparisons both on the original and the cleaned TEST sets.Note that only scores on the same test set are directly comparable as the cleaned TEST set has more diverse MRs and fewer references per MR (i.e.numbers in Tables 2 and 3 cannot be compared across tables; cf.Section 3).

Automatic Metrics
We use freely available word-overlap-based evaluation metrics (WOM) scripts that come with the E2E data (Dušek et al., 2019), 9 supporting BLEU (Papineni et al., 2002), NIST (Doddington, 2002), ROUGE-L (Lin, 2004), METEOR (Lavie and Agarwal, 2007) and CIDEr (Vedantam et al., 2015).In addition, we use our slot matching script for SER (cf.Section 3).We also show detailed results for the percentages of added and missed slots and wrong slot values. 10 The results in Table 2 (top half) for the original setup confirm that the ranking mechanism for TGen is effective for both WOMs and SER, whereas the SC-LSTM seems to have trouble scaling to the E2E dataset.We hypothesise that this is mainly due to the amount of delexicalisation required.However, the main improvement of SER comes from training on cleaned data with up to 97% error reduction with the ranker and 94% without. 11In other words, just cleaning the training data has a much more dramatic effect than just using a semantic control mechanism, such as the reranker (0.97% vs. 4.27% SER).WOMs are slightly lower for TGen trained on the cleaned data, except for NIST, which gives more importance to matching less frequent n-grams.This suggests better preservation of content at the expense of slightly lower fluency.
The results for testing on cleaned data (Table 3, top half) confirm the positive impact of cleaned training data and also show that the cleaned test data is more challenging (cf.Section 3), as reflected in the lower WOMs.This raises the question whether the improved results from clean training data are due to seeing more challenging examples at training time.However, the improved results for training and testing on clean data (i.e.seeing equally challenging examples at training and test time), suggest the increase in performance can be attributed to data accuracy rather than diversity. 9https://github.com/tuetschek/e2emetrics 10Absolute numbers of errors and number of completely correct instances are shown in  Looking at the detailed results for the number of added, missing, and wrong-valued slots (Add, Miss, Wrong), we observe more deletions than insertions, i.e. the models more often fail to realise part of the MR, rather than hallucinating additional information.To investigate whether this effect stems from the training data, we partially cleaned the data of missing or added information only. 12However, the results in bottom halves 12 We only performed these experiments on TGen because of Tables 2 and 3 do not support our hypothesis: we observe the main effect on SER from cleaning the missed slots, reducing both insertions and deletions.Again, one possible explanation is that cleaning the missing slots provided more complex training examples.

Manual Error Analysis
We carried out a detailed manual error analysis of selected systems to confirm the automatic metrics results, performing a blind annotation of semantic and fluency errors (not a human preference rating).We evaluated a sample of 100 outputs on the original test set produced by TGen with the default reranker trained using all four cleaning settings (original data, cleaned missing slots, cleaned added slots, fully cleaned).The results in Table 4 confirm the findings of the automatic of the low performance of SC-LSTM in general.metrics: systems trained on the fully cleaned set or the set with cleaned missing slots have nearperfect performance, with the fully-cleaned one showing a few more slight disfluencies than the other.The systems trained on the original data or with cleaned added slots clearly perform worse in terms of both semantic accuracy and fluency.All fluency problems we found were very slight and no added or wrong-valued slots were found, so missed slots are the main problem.
The manual error analysis also served to assess the accuracy of the SER measuring script on system outputs.Since NNLG tends to use more frequent phrasing, we expected better performance than on the dataset itself, and this proved true: we only found 2 errors in the 400 system outputs (i.e.99.5% of instances and 99.93% of slots were matched correctly).This confirms that the automatic SER numbers reflect the semantic accuracy of individual systems very closely.

Discussion and Related Work
We present a detailed study of semantic errors in NNLG outputs and how these relate to noise in training data.We found that even imperfectly cleaned input data significantly improves semantic accuracy for seq2seq-based generators (up to 97% relative error reduction with the reranker), while only causing a slight decrease in fluency.
Contemporaneous with our work is the effort of Nie et al. (2019), who focus on automatic data cleaning using a NLU iteratively bootstrapped from the noisy data.Their analysis similarly finds that omissions are more common than hallucinations.Correcting for missing slots, i.e. forcing the generator to verbalise all slots during training, leads to the biggest performance improvement.This phenomenon is also observed by Dušek et al. (2018Dušek et al. ( , 2019) ) for systems in the E2E NLG challenge, but stands in contrast to work on related tasks, which mostly reports on hallucinations (i.e.adding information not grounded in the input), as observed for image captioning (Rohrbach et al., 2018), sports report generation (Wiseman et al., 2017), machine translation (Koehn and Knowles, 2017;Lee et al., 2019), and question answering (Feng et al., 2018).These previous works suggest that the most likely case of hallucinations is an over-reliance on language priors, i.e. memorising 'which words go together'.Similar priors could equally exist in the E2E data for omitting a slot; this might be connected with the fact that the E2E test set MRs tend to be longer than training MRs (6.91 slots on average for test MRs vs. 5.52 for training MRs) and that a large part of them is 'saturated', i.e. contains all possible 8 attributes.
Furthermore, in accordance with our observations, related work also reports a relation between hallucinations and data diversity: Rohrbach et al. (2018) observe an increase for "novel compositions of objects at test time", i.e. non-overlapping test and training sets (cf.Section 3); whereas Lee et al. (2019) reports data augmentation as one of the most efficient counter measures.In future work, we plan to experimentally manipulate these factors to disentangle the relative contributions of data cleanliness and diversity.
Original MR: name[Cotto], eatType[coffee shop], food[English], priceRange[less than £20], customer rating[low], area[riverside], near[The  Portland Arms]Human reference 1 (accurate): At the riverside near The Portland Arms, Cotto is a coffee shop that serves English food at less than £20 and has low customer rating.HR 2: Located near The Portland Arms in riverside, the Cotto coffee shop serves English food with a price range of £20 and a low customer rating.

Table 2 :
Table 5 in the Supplementary.Results evaluated on the original test set (averaged over 5 runs with different random initialisation).See Section 5.1 for explanation of metrics.All numbers except NIST and ROUGE-L are percentages.Note that the numbers are not comparable to Table 3 as the test set is different.

Table 3 :
Results evaluated on the cleaned test set (cf.Table 2 for column details; note that the numbers are not comparable to Table 2 as the test set is different).

Table 4 :
Results of manual error analysis of TGen on a sample of 100 instances from the original test set: total absolute numbers of errors we found (added, missed, wrong values, slight disfluencies).