A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation

Recent neural language generation systems often hallucinate content (i.e., produce irrelevant or contradictory facts), especially when trained on loosely corresponding pairs of input structures and texts. To mitigate this issue, we propose integrating a language understanding module for data refinement with self-training iterations, which effectively induces strong equivalence between the input data and the paired text. Experiments on the E2E challenge dataset show that our proposed framework reduces the unaligned noise in the original data-text pairs by more than 50% in relative terms. A vanilla sequence-to-sequence neural NLG model trained on the refined data improves content correctness compared with the current state-of-the-art ensemble generator.


Introduction
Neural models for natural language generation (NLG) based on the encoder-decoder framework have become quite popular recently (Wen et al., 2015; Mei et al., 2016; Wiseman et al., 2017; Wen et al., 2017; Chisholm et al., 2017; Nie et al., 2018, inter alia). Although appealing for producing fluent and diverse sentences, neural NLG models often suffer from severe content hallucination (Reiter, 2018a): the generated texts often contain information that is irrelevant to, or contradicts, the input.
Given that similar issues have been reported or noticed less often in the latest neural machine translation systems, we believe that the issue for neural NLG originates on the data side. Current datasets used for training neural NLG systems often include instances in which the input structure and the output text do not contain the same amount of information (Perez-Beltrachini and Gardent, 2017). Datasets originally intended for surface realisation ("how to say") without a focus on content selection ("what to say") are no exception. Table 1 depicts an example, where the attribute Rating=5 out of 5 in the input meaning representation (MR) is not verbalised in a human-written reference text, while the word restaurant in the reference refers to an attribute value EatType=Restaurant that is not contained in the MR. Without explicit alignments between MRs and the corresponding utterances for guidance, neural systems trained on such data often produce unexpected errors.
Previous work attempted to inject indirect semantic control into the encoder-decoder architecture (Wen et al., 2015; Dušek and Jurcicek, 2016; Agarwal et al., 2018) or to encourage consistency during training (Chisholm et al., 2017), without essentially changing the noisy training data. One exception is the Slug2Slug system (Juraska et al., 2018), where the authors use an aligner with manually written heuristic rules to filter out unrealized attributes from the data.
In this paper, we propose a simple, automatic recipe towards reducing hallucination for neural surface realisers by enhancing the semantic equivalence between pairs of MRs and utterances. The steps include: (1) build a language understanding module (ideally well-calibrated) that tries to parse the MR from an utterance; (2) use it to reconstruct the correct attribute values revealed in the reference texts; (3) with proper confidence thresholding, conduct self-training to iteratively recover data pairs with identical or equivalent semantics.
Experiments on the E2E challenge benchmark (Novikova et al., 2017b) show that our framework reduces the unaligned noise in the original MR-text pairs by more than 50% in relative terms, and that a vanilla sequence-to-sequence model trained on the refined data improves content correctness in both human and automatic evaluations when compared with the current state-of-the-art neural ensemble system (Juraska et al., 2018).

Approach
Our proposed framework consists of a neural natural language understanding (NLU) module with iterative data refinement to induce semantically equivalent MR-text pairs from a dataset containing a moderate level of noise.

Notation
Formally, given a corpus with paired meaning representations and text descriptions $\{(R, X)\}_{i=1}^{N}$, the input MR $R = (r_1, \ldots, r_M)$ is a set of slot-value pairs $r_j = (s_j, v_j)$, where each $r_j$ contains a slot $s_j$ (e.g., rating) and a value $v_j$ (e.g., 5 out of 5). The corpus has $M$ pre-defined slots, and each slot $s_j$ has $K_j$ unique categorical values $v_j \in (c_{j,1}, \ldots, c_{j,K_j})$. The corresponding utterance $X = (x_1, \ldots, x_T)$ is a sequence of words describing the MR.
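As an illustration of this notation, the following minimal Python sketch represents an MR as a mapping from slots to values and collects the categorical values observed for each slot. The names `Example` and `candidate_values` are hypothetical and introduced here only for illustration; they are not part of the described system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """One MR-text pair (R, X); the class name is illustrative only."""
    mr: Dict[str, str]     # slot s_j -> value v_j, e.g. {"rating": "5 out of 5"}
    utterance: List[str]   # tokenised text X = (x_1, ..., x_T)


def candidate_values(corpus: List[Example]) -> Dict[str, List[str]]:
    """Collect the K_j unique categorical values observed for each slot s_j."""
    values: Dict[str, set] = {}
    for example in corpus:
        for slot, value in example.mr.items():
            values.setdefault(slot, set()).add(value)
    # Sort so that the candidate order c_{j,1}, ..., c_{j,K_j} is deterministic.
    return {slot: sorted(vals) for slot, vals in values.items()}
```

For a toy corpus of two pairs that share a name but differ in rating, `candidate_values` returns one candidate for the name slot and two for the rating slot.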

Neural NLU Model
As shown in Figure 1, the NLU model consists of a self-attentive encoder and an attentive scorer.
Self-Attentive Encoder. The encoder produces vector representations of the slot-value pairs in the MR and of its paired utterance. A slot-value pair $r$ can be treated as a short sequence $W = (w_1, \ldots, w_n)$ by concatenating the words in its slot and value. The word sequence $W$ is first represented as a sequence of word embedding vectors $(v_1, \ldots, v_n)$ from a pre-trained embedding matrix $E$, and then passed through a bidirectional LSTM layer to yield the contextualized representations $U^{sv} = (u^{sv}_1, \ldots, u^{sv}_n)$. To produce a summary context vector for $U^{sv}$, we adopt the same self-attention structure as in Zhong et al. (2018) to obtain the sentence vector $c^s$, due to the effectiveness of self-attention modules over variable-length sequences. Similarly, we obtain the contextualized representations of the utterance $X$.

[Figure 1: The NLU model, with a self-attentive encoder over the slot-value pair (e.g., Name = The Golden Palace) and the output text (e.g., "The Golden Palace is …"), followed by an attentive scorer.]

Attentive Scorer. The scorer calculates the semantic similarity between a slot-value pair $r$ (e.g., Price=Cheap) and the utterance $X$ (e.g., the reference in Table 1). First, an attention layer is applied to select the words in $X$ that are most salient with respect to $r$, which yields the attentive representation $d$ of the utterance $X$. Given the sentence vector $c^s$ of the slot-value pair $r$ and the attentive vector $d$ of the utterance $X$, the normalized semantic similarity $p(r \mid X)$ is defined by Eq. (1).

Model Inference. Each utterance $X$ is parsed into an MR $R^e = (r^e_1, \ldots, r^e_M)$, with each slot-value pair $r^e_j = (s_j, v_j)$ determined by selecting the candidate value $v_j$ with the maximum semantic similarity for each slot $s_j$:
$$v_j = c_{j,k^*}, \qquad k^* = \arg\max_{k} \; p\big((s_j, c_{j,k}) \mid X\big) \tag{2}$$
where $c_{j,k}$ denotes the $k$th categorical value for the $j$th slot. Since an utterance may not describe any information about a specific slot $s$, we add a NONE value as a candidate value for each slot.
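Model Training. The NLU model is optimized by minimizing the cross-entropy loss:
$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{j=1}^{M} \log p(r_{i,j} \mid X_i; \theta) \tag{3}$$
where $\theta$ denotes the model parameters, $r_{i,j}$ denotes the $j$th slot-value pair in the $i$th training MR, $X_i$ is the paired utterance, and $p(\cdot \mid X_i; \theta)$ is the normalized semantic similarity of Eq. (1).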
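The inference and training steps can be sketched in a few lines of plain Python. This is only a sketch under stated assumptions: the raw scores are made-up numbers standing in for the scorer's output, and normalising them with a softmax over the candidate values of one slot is an assumption made here for illustration. The helper names `parse_slot` and `cross_entropy` are hypothetical.

```python
import math

NONE = "NONE"  # extra candidate for a slot the utterance does not mention


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parse_slot(raw_scores):
    """Pick the candidate value with the highest normalised similarity.

    raw_scores maps each candidate value of one slot (including NONE)
    to an unnormalised score; returns (best_value, probability).
    """
    values = list(raw_scores)
    probs = softmax([raw_scores[v] for v in values])
    best = max(range(len(values)), key=probs.__getitem__)
    return values[best], probs[best]


def cross_entropy(raw_scores, gold_value):
    """Negative log-probability of the gold value for one slot."""
    values = list(raw_scores)
    probs = softmax([raw_scores[v] for v in values])
    return -math.log(probs[values.index(gold_value)])
```

With hypothetical scores such as `{"cheap": 2.0, "moderate": 0.5, NONE: -1.0}`, `parse_slot` selects `cheap`, and the loss for that slot is the negative log of the probability it returns. Summing `cross_entropy` over all slots and training pairs gives a loss of the kind described above.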

Iterative Data Refinement
An NLU model can be inaccurate when trained on noisy data-text pairs. However, models trained on data with a moderate level of noise can still be well-calibrated. This enables an iterative relabeling procedure, where we take only the MRs produced by the NLU model with high confidence, together with their utterances, as new training MR-text pairs to bootstrap the NLU training.
Algorithm 1 describes the training procedure. We first pre-train the NLU model on the original data-text pairs for $N_{pre}$ iterations. The NLU model then parses an MR for every utterance in the training data, and the resulting pairs can be used as new training examples (Line 4). However, because the NLU results are inaccurate, we use only a small portion φ of them with high confidence (φ is set to 40% on the validation set). Moreover, as each MR consists of up to $M$ slots, some of which are unreliable, we filter out the slot-value pairs whose slot probability is below average (Lines 8-14). Finally, the NLU model is fine-tuned on the new training corpus $D^e$. This process is repeated for $N_{tune}$ epochs. The final NLU model is then used to parse all utterances in the training corpus. The resulting MRs, paired with the original utterances, form the refined training corpus for NLG.
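The refinement loop described above can be sketched as follows. This is a minimal sketch, not the authors' Algorithm 1: the `nlu` object with `train(pairs)` and `parse(utterance)` methods is a hypothetical interface, the MR-level confidence is assumed to be the mean of its slot probabilities, and "below average" is assumed to mean below the mean slot probability within the same MR.

```python
EPS = 1e-9  # tolerance so that slots exactly at the average are kept


def mr_confidence(slot_probs):
    """Assumed MR-level confidence: mean probability of its predicted slots."""
    if not slot_probs:
        return 0.0
    return sum(slot_probs.values()) / len(slot_probs)


def filter_slots(mr, slot_probs):
    """Drop slot-value pairs whose probability is below the MR's average."""
    average = mr_confidence(slot_probs)
    return {s: v for s, v in mr.items() if slot_probs[s] >= average - EPS}


def select_confident(parsed, phi=0.4):
    """Keep the top-phi fraction of parsed MRs, ranked by confidence.

    parsed is a list of (mr, slot_probs, utterance) triples; the result
    is a list of (filtered_mr, utterance) pairs, i.e. the corpus D^e.
    """
    ranked = sorted(parsed, key=lambda t: mr_confidence(t[1]), reverse=True)
    keep = int(len(ranked) * phi)
    return [(filter_slots(mr, probs), utt) for mr, probs, utt in ranked[:keep]]


def refine(nlu, corpus, n_pre, n_tune, phi=0.4):
    """Pre-train, self-train for n_tune rounds, then re-parse the corpus.

    corpus is a list of (mr, utterance) pairs; nlu is a hypothetical model
    exposing train(pairs) and parse(utterance) -> (mr, slot_probs).
    """
    for _ in range(n_pre):
        nlu.train(corpus)                      # pre-train on the original pairs
    for _ in range(n_tune):
        parsed = []
        for _, utterance in corpus:
            mr, slot_probs = nlu.parse(utterance)
            parsed.append((mr, slot_probs, utterance))
        nlu.train(select_confident(parsed, phi))  # fine-tune on D^e
    # The refined corpus pairs each original utterance with its parsed MR.
    return [(nlu.parse(utterance)[0], utterance) for _, utterance in corpus]
```

Keeping the selection and filtering steps as separate functions makes it easy to try other confidence definitions without touching the training loop.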

Setup
Dataset. Our experiments are conducted on the E2E challenge dataset (Novikova et al., 2017b), which aims at verbalizing all information from the MR. It has 42,061, 4,672 and 4,693 MR-text pairs for training, validation and testing, respectively. Note that every input MR in this dataset has 8.65 different references on average. The test set has 630 unique input MRs. We examine the effectiveness of our proposed method in two aspects: (1) reducing the noise in data-text pairs (NLU), and (2) reducing hallucinated content in surface realisation (NLG).
Automatic metrics. The well-crafted rule-based aligner built by Juraska et al. (2018) is adopted to approximately reflect the semantic correctness of NLU and NLG models. The error rate is calculated by matching the slot values in the output utterance: $Err = M / N$, where $N$ is the total number of MR-text pairs, and $M$ is the number of wrong MR-text pairs, i.e., those that contain missing or conflicting slots in the realization given the input MR. BLEU-4 (Papineni et al., 2002) is also reported, although currently neither BLEU nor any other automatic metric can be convincingly used for evaluating language generation (Novikova et al., 2017a; Chaganty et al., 2018; Reiter, 2018b, inter alia).
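The error rate can be illustrated with a short Python sketch. The predicate `has_slot_error` is a hypothetical stand-in for the rule-based aligner, and `naive_missing` is a deliberately crude toy check (verbatim substring matching) assumed here only to make the example runnable; it is far simpler than the actual aligner.

```python
def error_rate(pairs, has_slot_error):
    """Err = M / N over a list of (mr, text) pairs.

    has_slot_error(mr, text) is a hypothetical predicate that returns True
    when the text has missing or conflicting slots with respect to the MR.
    """
    if not pairs:
        raise ValueError("need at least one MR-text pair")
    wrong = sum(1 for mr, text in pairs if has_slot_error(mr, text))
    return wrong / len(pairs)


def naive_missing(mr, text):
    """Toy check (assumption): flag a pair if any slot value is not found verbatim."""
    lowered = text.lower()
    return any(value.lower() not in lowered for value in mr.values())
```

For instance, a list of two pairs in which only one text mentions every slot value of its MR gives an error rate of 0.5 under this toy check.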
Human Evaluation. We randomly sample 100 data-text pairs from the test set and ask three crowd workers to manually annotate missed (M), added (A), and contradicted (C) slot values in NLG outputs with respect to the input MR, or exact match (E) if all slot values have been realized in the given utterance and it contains no additional hallucinated information. When evaluating the NLU systems, the directions of missed and added slots are reversed.

Main Results
NLU Results. One challenge in the E2E dataset is the need to account for the noise in the corpus, as some of the MR-text pairs are not semantically equivalent due to the data collection process (Dušek et al., 2018). We examine the performance of the NLU module by comparing the noise in the reconstructed MR-text pairs with that in the original ones, in both the training and test sets. Table 2 shows the automatic results. Applying our NLU model with iterative data refinement yields a 23.33% absolute error reduction for the refined MR-text pairs on the test set. Human evaluation in Table 3 shows that our proposed method achieves a 16.69% improvement in information equivalence between MR-text pairs. These results confirm the effectiveness of our method in reducing the unaligned data noise, and the large improvement (i.e., 15.09%) on exact match when applying the self-training algorithm suggests the importance of iterative data refinement.
NLG Results. Table 4 presents the automatic results of different neural NLG systems. Seq2Seq+aug+iter achieves a BLEU score comparable to that of Slug2Slug, with a 4.44% error reduction on content correctness over Slug2Slug. Seq2Seq+aug+iter also largely improves content correctness over the baseline Seq2Seq, with a 67.3% error reduction. In addition, we replace our NLU module with the rule-based aligner crafted by Juraska et al. (2018) for data refinement, to inspect the difference between our proposed method and manually designed rich heuristics. These two models (Seq2Seq+aug+iter and Seq2Seq+aligner) achieve comparable performance, while our approach is fully automatic and requires no domain knowledge. The human evaluation results are shown in Table 5. Seq2Seq+aug+iter improves accuracy on exact match by 2.59% over Slug2Slug. Specifically, Slug2Slug augments the original training data only by using an aligner to delete slot values that are not realized in the utterance; it cannot handle the situation where the given utterance contains incorrect or additional slot values, which leads to more contradicted errors. Our method can complement and correct the original MR with additional slot values described in the paired texts, which effectively alleviates the generation of contradicted facts. However, because the NLU model is imperfect, our method may ignore some of the slot values realized in utterances and produce some additional errors.

Table 6 (excerpt). Utterance: Located in riverside, near Café Sicilia, is the Phoenix, a French pub that is family-friendly and has average prices and an average rating.

Table 7: Sentences generated by different NLG systems.
TGen: The Mill is a high priced family friendly fast food pub located near Café Sicilia in the riverside area.
Slug2Slug: children friendly pub in the riverside area near Café Sicilia. It has a high price range and a high customer rating
Seq2Seq: The Mill is a family friendly pub located near Café Sicilia.
Seq2Seq+aug+iter: The Mill is a children friendly fast food pub near Café Sicilia in the riverside area. It has a high price range and an average customer rating.

Case Study
Example for refined data. Table 6 depicts a pair whose originally inaccurate MR is corrected by the NLU module with iterative refinement. Our proposed method is capable of reducing the unaligned noise in the original data.
Example for NLG. Table 7 shows the sentences generated by different NLG systems. Seq2Seq, without any semantic control, tends to generate shorter descriptions. Slug2Slug and TGen, which use a reranker to control content coverage, realise more of the input information but still miss one piece of it, and Slug2Slug produces a contradicted fact (i.e., customer rating). Our proposed method, Seq2Seq+aug+iter trained on refined MR-text pairs, verbalises all the input information correctly, which shows the importance of data quality in terms of strong equivalence between MR and utterance.

Discussion
In this paper, we present a simple recipe to reduce the hallucination problem in neural language generation: introducing a language understanding module to implement confidence-based iterative data refinement. We find that our proposed method can effectively reduce the noise in the original MR-text pairs from the E2E dataset and improve the content coverage for standard neural surface realisation (with no focus on content selection). However, the approach as presented still has two clear limitations. One is that this simple approach implicitly assumes a moderate level of noise in the original data, which makes it possible to bootstrap a well-calibrated NLU module. We are still working on solutions for cases with much heavier noise (Perez-Beltrachini and Lapata, 2018; Wiseman et al., 2017), where heavy manual intervention or external knowledge would likely be needed.
The other limitation of this preliminary work is that it currently overlooks the challenges of lexical choice for quantities, degrees, temporal expressions, etc., which are rather difficult to learn merely from data and are likely to require additional commonsense knowledge. An example case is in Table 6, where the original priceRange=20-25 is refined to priceRange=moderate, which enhances the correspondence between the MR and the text but sidesteps the lexical choice for numbers, which requires localised numerical commonsense. A refined system would need additional modules for lexical choice.

A Effect of φ on NLG model
The parameter φ controls the proportion of MRs produced by the NLU model that are used for iterative training. Figure 2 shows its influence on the content coverage of NLG. The experimental results show that NLG models trained on data produced by self-training achieve an error reduction in content coverage. As the NLU model can introduce inaccurate instances during iterative data augmentation, setting φ between 20% and 40% yields better results than using all the MRs produced by the NLU model for self-training.