Slot-consistent NLG for Task-oriented Dialogue Systems with Iterative Rectification Network

Data-driven approaches using neural networks have achieved promising performance in natural language generation (NLG). However, neural generators are prone to making mistakes, e.g., neglecting an input slot value or generating a redundant one. Prior works refer to this as the hallucination phenomenon. In this paper, we study slot consistency for building reliable NLG systems, in which all slot values of the input dialogue act (DA) are properly generated in the output sentence. We propose the Iterative Rectification Network (IRN) for improving general NLG systems to produce both correct and fluent responses. It applies a bootstrapping algorithm to sample training candidates and uses reinforcement learning to incorporate discrete rewards related to slot inconsistency into training. Comprehensive studies on multiple benchmark datasets show that the proposed methods significantly reduce the slot error rate (ERR) for all strong baselines. Human evaluations also confirm their effectiveness.


Introduction
Natural Language Generation (NLG), as a critical component of task-oriented dialogue systems, converts a meaning representation, i.e., a dialogue act (DA), into natural language sentences. Traditional methods (Stent et al., 2004; Konstas and Lapata, 2013; Wong and Mooney, 2007) are mostly pipeline-based, dividing the generation process into sentence planning and surface realization. Despite their robustness, they heavily rely on handcrafted rules and domain-specific knowledge. In addition, the sentences generated by rule-based approaches are rather rigid, lacking the variety of human language. More recently, neural network based models (Wen et al., 2015a,b; Dušek and Jurčíček, 2016; Tran and Nguyen, 2017a) have attracted much attention. They implicitly learn sentence planning and surface realization end-to-end with cross entropy objectives. For example, Dušek and Jurčíček (2016) employ an attentive encoder-decoder model, which applies an attention mechanism over input slot-value pairs. Although neural generators can be trained end-to-end, they suffer from the hallucination phenomenon (Balakrishnan et al., 2019). The examples in Table 1 show a misplacement error for an unseen slot AREA and a missing error for the slot NAME by an end-to-end trained model, when compared against its input DA. Motivated by this observation, we define slot consistency of NLG systems in this paper as the property that all slot values of the input DA appear in the output sentence without misplacement. We also observe that, for task-oriented dialogue systems, input DAs mostly have simple logical forms, enabling retrieval-based methods, e.g., K-Nearest Neighbour (KNN), to handle the majority of test cases. Furthermore, there exists a discrepancy between the training criterion of cross entropy loss and the evaluation metric of slot error rate (ERR), similar to that observed in neural machine translation (Ranzato et al., 2015).
Therefore, it is beneficial to use training methods that integrate the evaluation metrics in their objectives.
In this paper, we propose the Iterative Rectification Network (IRN) to improve slot consistency for general NLG systems. IRN consists of a pointer rewriter and an experience replay buffer. The pointer rewriter iteratively rectifies slot-inconsistent generations from KNN or data-driven NLG systems. The experience replay buffer, of a fixed size, collects candidates, i.e., mistaken cases, for training IRN. Leveraging the above observations, we further introduce retrieval-based bootstrapping to sample pseudo mistaken cases as candidates for enriching the training data. To align the training objective with the evaluation metrics, we use REINFORCE (Williams, 1992) to incorporate slot consistency and other discrete rewards into the training objective.
Extensive experiments show that the proposed model, KNN + IRN, significantly outperforms all previous strong approaches. When applying IRN to improve the slot consistency of prior NLG baselines, we observe large reductions in their slot error rates. Finally, the effectiveness of the proposed methods is further confirmed with BLEU scores, case analysis and human evaluations.

Delexicalization
Inputs to NLG are structured meaning representations, i.e., DAs, each consisting of an act type and a list of slot-value pairs. Each slot-value pair represents the type of information and its content, while the act type controls the style of the sentence. To improve generalization, the delexicalization technique (Wen et al., 2015a,b; Dušek and Jurčíček, 2016; Tran and Nguyen, 2017a) is widely used: all values in a reference sentence are replaced by their corresponding slots from the DA, creating pairs of delexicalized input DAs and output templates.
Hence, the most important step in NLG is to generate templates correctly given an input DA. However, this step can introduce missing and misplaced slots because of modeling errors or unaligned training data (Balakrishnan et al., 2019; Nie et al., 2019; Juraska et al., 2018). Lexicalization follows template generation, replacing the slots in a template with the corresponding values from the DA.
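As a concrete illustration, a minimal delexicalization/lexicalization round trip might look like the sketch below, assuming slot values occur verbatim in the reference sentence (real corpora require alignment; the function names are ours):

```python
def delexicalize(sentence, slot_values):
    """Replace each slot value in the sentence with a $SLOT$ placeholder."""
    template = sentence
    for slot, value in slot_values.items():
        template = template.replace(value, f"${slot.upper()}$")
    return template

def lexicalize(template, slot_values):
    """Fill $SLOT$ placeholders back in with values from the input DA."""
    sentence = template
    for slot, value in slot_values.items():
        sentence = sentence.replace(f"${slot.upper()}$", value)
    return sentence

da_values = {"name": "pickwick hotel", "area": "centre"}
reference = "the pickwick hotel is in the centre"
template = delexicalize(reference, da_values)
# template == "the $NAME$ is in the $AREA$"
```

Training then pairs the delexicalized DA with the template, and lexicalization restores the concrete values at generation time.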

Problem Statement
Formally, we denote a delexicalized input DA as a set x = {x_1, x_2, · · · , x_N} that consists of an act type and some slots. A universal set S contains all possible slots. The output template y = [y_1, y_2, · · · , y_M] from an NLG system f(x) is a sequence of tokens (words and slots).
We define a slot extraction function g as

g(z) = { z_i | z_i ∈ z, z_i ∈ S },  (1)

where z can be the DA x or the template y, i.e., g collects the slots occurring in z.
A slot-consistent NLG system f(x) satisfies the following constraint:

g(f(x)) = g(x).  (2)

To avoid trivial solutions, we require that f(x) ≠ x.
However, due to the hallucination phenomenon, generated templates may miss or misplace slot values (Wen et al., 2015a), which is hard to avoid in neural approaches.

KNN-based NLG System
A KNN-based NLG system f_KNN is composed of a distance function ρ and a template set Y = {y_1, y_2, · · · , y_Q}, collected from the Q delexicalized sentences in the training corpus.
Given an input DA x, the distance to a template y_q is defined as

ρ(x, y_q) = #(g(x) ∪ g(y_q)) − #(g(x) ∩ g(y_q)),  (3)

where the function # computes the size of a set. During evaluation, the system f_KNN first ranks the templates in Y by the distance function ρ and then selects the top k (beam size) templates.
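This retrieval step can be sketched as follows (our reconstruction; the toy slot extractor stands in for g, and all names are illustrative):

```python
def g(tokens):
    """Toy slot extractor: collect $SLOT$ placeholders from a token sequence."""
    return {t for t in tokens if t.startswith("$")}

def rho(x_slots, template):
    """Distance = #(union) - #(intersection), i.e., symmetric difference size."""
    t_slots = g(template)
    return len(x_slots | t_slots) - len(x_slots & t_slots)

def knn_select(x_slots, templates, k=1):
    """Rank stored templates by rho and return the top-k (the beam)."""
    return sorted(templates, key=lambda y: rho(x_slots, y))[:k]

templates = [
    ("the", "$NAME$", "is", "in", "the", "$AREA$"),
    ("the", "$NAME$", "serves", "$FOOD$", "food"),
]
best = knn_select({"$NAME$", "$AREA$"}, templates, k=1)[0]
```

A template whose slot set exactly matches the DA has distance zero and is retrieved first.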
Architecture

Figure 1 shows the architecture of the Iterative Rectification Network. It consists of two components: a pointer rewriter, which produces templates with improved performance metrics, and an experience replay buffer, which gathers and samples training data. The improvements in slot consistency are obtained via an iterative rewriting process. Assume that, at iteration k, we have a template y^(k) that is not slot-consistent with the input DA, i.e., g(y^(k)) ≠ g(x). The pointer rewriter then rewrites it as

y^(k+1) = φ_PR(x, y^(k)).  (4)

The recursion ends once g(y^(k)) = g(x) or a certain number of iterations is reached.

Figure 1: IRN consists of two modules: an experience replay buffer and a pointer rewriter. The experience replay buffer collects mistaken cases from the NLG baseline, the template database and IRN itself (the red dashed arrow), whereas the pointer rewriter outputs templates with improved performance metrics. In each epoch of rectification, IRN obtains sample cases for training from the buffer and trains the pointer rewriter with metrics such as slot consistency using a policy-based reinforcement learning technique. We omit some trivial connections for brevity.
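The outer rewriting loop can be sketched as follows (g and the rewriter are stand-ins for the components described above; the iteration cap is our own choice, and the toy rewriter below only appends missing slots):

```python
def iterative_rectify(x, y0, g, pointer_rewriter, max_iters=5):
    """Rewrite y0 until its slot set matches the DA's or the cap is reached."""
    y = y0
    for _ in range(max_iters):
        if g(y) == g(x):
            break  # slot-consistent: stop rewriting
        y = pointer_rewriter(x, y)
    return y

# Toy demonstration: a "rewriter" that fixes one missing slot per call.
toy_g = lambda seq: {t for t in seq if t.startswith("$")}

def toy_rewriter(x, y):
    missing = sorted(toy_g(x) - toy_g(y))
    return y + [missing[0]] if missing else y

fixed = iterative_rectify(["$NAME$", "$AREA$"], ["$NAME$"], toy_g, toy_rewriter)
```

In the actual model the rewriter is a learned network that can also drop misplaced slots, not just add missing ones.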

Pointer Rewriter
The pointer rewriter φ_PR is trained to iteratively correct the candidate y^(k) given a DA x. This correction is conducted time-recurrently: at each position j of the rewritten template, there is a state h_j representing the past history of the pointer rewriter and an action a_j taken according to a policy π.

State
We use an autoregressive model, in particular an LSTM, to compute the state h_j given its past state h_{j−1}, the input x and the past output y_{j−1}:

h_j = LSTM(h_{j−1}, [x; y_{j−1}; c_j]),  (5)

where the DA x is represented by a one-hot representation (Wen et al., 2015a,b) and c_j is a context representation over the input template y^(k), to be described in Eq. (6). The operation [;] denotes vector concatenation.
Action For position j in the output template, the action a_j is in a space consisting of two categories: template copy, c(i), which copies a token from the input template y^(k) at position i, and word and slot generation, w, which generates a word or a slot at that position. For a length-M input template y^(k), the action a_j is therefore in the set {w, c(1), · · · , c(M)}. The action sequence a for a length-N output template is [a_1, · · · , a_N].

Template Copy
The model φ_PR uses an attentive pointer to decide, for position j, which token to copy from the candidate y^(k). Each token y^(k)_i in the candidate is represented by its embedding. For position j in the output template, the model uses the hidden state h_j to compute attentive weights over all tokens in y^(k), with the weight on token y^(k)_i given by

α_{j,i} ∝ exp(v_a^T tanh(W_h h_j + W_y y^(k)_i)),  (6)

normalized with a Softmax over i, where v_a, W_h and W_y are learnable parameters.
Word and Slot Generation The other candidate for position j is a word or a slot key from a predefined vocabulary. The action w uses a distribution over words and slot keys,

p^Vocab_j = Softmax(W_v h_j),  (7)

which depends on the state h_j; the matrix W_v is learnable.
Policy The probabilities of the above actions are computed as

π(c(i) | h_j) = λ_j α_{j,i},   π(w | h_j) = 1 − λ_j,  (8)

where π(c(i) | h_j) is the probability of copying the i-th token from the input template y^(k) to position j, and π(w | h_j) is the probability of generating a word or slot key from the distribution p^Vocab_j in Eq. (7). The weight λ_j is a real value between 0 and 1, computed with a Sigmoid operation as λ_j = Sigmoid(v_h^T h_j). With this policy, the pointer rewriter performs greedy search to decide whether to copy or generate a token at each position.
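As a numeric illustration of the copy/generate mixture (shapes, names and values are ours, not the paper's implementation): the gate λ_j splits one unit of probability mass between the copy attention and the vocabulary distribution, so the combined distribution stays normalized.

```python
import math

def policy_distribution(h_j, v_h, alpha_j, p_vocab_j):
    """Mix copy and generation probabilities with a Sigmoid gate lambda_j."""
    lam = 1.0 / (1.0 + math.exp(-sum(v * h for v, h in zip(v_h, h_j))))
    p_copy = [lam * a for a in alpha_j]           # pi(c(i) | h_j) per position i
    p_gen = [(1.0 - lam) * p for p in p_vocab_j]  # (1 - lambda_j) over the vocab
    return p_copy, p_gen

p_copy, p_gen = policy_distribution(
    h_j=[0.5, -0.2], v_h=[1.0, 1.0],
    alpha_j=[0.7, 0.3],            # attention over a 2-token input template
    p_vocab_j=[0.25, 0.25, 0.5],   # toy 3-word vocabulary distribution
)
```

Greedy decoding would then pick the single highest-probability entry across both lists.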

Experience Replay Buffer
The experience replay buffer provides training samples for IRN from three sources. The first is off-the-shelf NLG systems; the second is the pointer rewriter from the last iteration. Both yield real mistaken samples, which are stored in a case set C in the buffer. These samples are off-policy, as the case set C can contain samples from many iterations before. The third source is a bootstrapping algorithm, whose samples are stored in a set Ω.

Iterative Data Aggregation
The replayed experiences should be progressive, reflecting improvements in the iterative training of IRN. We therefore design an iterative data aggregation procedure, shown in Algorithm 1. The experience replay buffer B is defined as a fixed-size set B = C ∪ Ω. For a total of E epochs, it randomly provides mistaken samples for training the pointer rewriter φ_PR at each epoch. Importantly, the contents of both C and Ω vary from epoch to epoch. C initially consists of real mistaken samples from the baseline system (lines 3–8); later on, it is gradually filled with samples from IRN (lines 14–19). The samples in Ω reflect a general distribution of training samples from a template database T (line 10). Finally, the algorithm aggregates the two groups of mistaken samples (line 11) and uses them to train the model φ_PR (line 12).
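One epoch of this aggregation might look like the following sketch (the names and the exact mixing policy are our assumptions, not the paper's code):

```python
import random

def aggregate_epoch(real_cases, pseudo_cases, buffer_size):
    """One epoch of buffer filling: mix real mistaken cases (C) with
    bootstrapped pseudo cases (Omega) into a fixed-size training sample."""
    pool = list(real_cases) + list(pseudo_cases)
    random.shuffle(pool)
    return pool[:buffer_size]

batch = aggregate_epoch(["real1", "real2"], ["pseudo1", "pseudo2"], buffer_size=3)
```

Across epochs, the real cases gradually shift from baseline mistakes to mistakes made by IRN itself.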
Bootstrapping via Retrieval Relying solely on real mistaken samples exposes the system to a data scarcity problem: real samples are heavily biased towards certain slots, and the number of real mistaken samples can be small. To address this problem, we introduce a bootstrapping algorithm, described in Algorithm 2. It uses a template database T, built from the delexicalized NLG training corpus and organized as pairs (x, z) of DA and reference template.

Figure 2: Correcting a candidate given a reference template; d_c, d_l and d_π are inferred by simple rules. (The figure contrasts a mistaken template with its reference, marking extractive slots, function words, a noun phrase and an ambiguity.)
At each turn, the algorithm first randomly samples a pair (x, z) from the template database T (line 3). Then, for every pair (x̂, ẑ) in T, it measures whether (x̂, ẑ) is slot-inconsistent with respect to (x, z), and adds any pair within a certain threshold distance (a hyperparameter) to a set Z (lines 5–11). This threshold is usually set to a small number so that the selected samples are close enough to (x, z); in practice, we set it to 2. Finally, the algorithm randomly samples from Z (line 12) and inserts the result into the output set Ω. The bootstrapping process stops when the number of generated samples reaches a limit K.
These samples, which we refer to as pseudo samples in the following, give a wider coverage of the training distribution than the real mistaken samples: because they are drawn from the general distribution of templates, they contain semantics not seen in the real mistaken cases. We will demonstrate through experiments that this effectively addresses the data scarcity problem.
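Algorithm 2, as we read it, can be sketched as follows (a toy sketch; the slot extractor g, the database layout and all names are our own):

```python
import random

def bootstrap(template_db, g, threshold=2, limit=5):
    """Sample pseudo mistaken cases: pair a reference (x, z) with a near-miss
    template z_hat whose slot set differs from z's by at most `threshold`."""
    pseudo = []
    while len(pseudo) < limit:
        x, z = random.choice(template_db)
        near = [z_hat for _, z_hat in template_db
                if g(z_hat) != g(z) and len(g(z_hat) ^ g(z)) <= threshold]
        if near:
            # (DA, slot-inconsistent candidate, reference) training triple
            pseudo.append((x, random.choice(near), z))
    return pseudo

g = lambda seq: {t for t in seq if t.startswith("$")}
db = [
    ({"$NAME$", "$AREA$"}, ("$NAME$", "is", "in", "$AREA$")),
    ({"$NAME$", "$FOOD$"}, ("$NAME$", "serves", "$FOOD$")),
]
cases = bootstrap(db, g, threshold=2, limit=2)
```

Each triple looks like a real mistaken case (the candidate disagrees with the reference on a few slots) but is synthesized from clean templates.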

Training with Supervised Learning and Distant Supervision
One key idea behind the proposed IRN model is to apply distant supervision to the actions of template copy and generation. We diagram the motivation in Figure 2. During training, only the candidate y and its reference z are given; the exact actions that convert y into z have to be inferred from the two templates, for which we use simple rules. First, the rules check whether each reference token z_j exists in the candidate y; the output is a label d^c consisting of 1s and 0s, indicating whether each token in the reference template is present in or absent from the candidate. Second, the rules locate the original position d^l_j in the candidate for each token j in the reference template if d^c_j = 1, and use −1 if d^c_j = 0. Finally, the action label d^π for the policy is inferred, with w for d^l_j = −1 and c(i) for d^l_j = i. We may use the extracted labels for supervised learning, minimizing the loss

L_SL = − Σ_{j=1}^{L} log π(d^π_j | h_j),  (9)

where L is the length of the ground truth and π(d^π_j | h_j) is the likelihood of action d^π_j at position j given state h_j.
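The rule-based label inference can be sketched as follows (a toy version that takes the first match; the names are ours, and real rules must handle repeated tokens and ambiguity):

```python
def infer_labels(candidate, reference):
    """Infer distant-supervision labels: d_c (is the token in the candidate?),
    d_l (its position there, -1 if absent), d_pi (copy/generate action)."""
    d_c, d_l, d_pi = [], [], []
    for tok in reference:
        if tok in candidate:
            i = candidate.index(tok)  # first match; real rules disambiguate
            d_c.append(1); d_l.append(i); d_pi.append(("c", i))
        else:
            d_c.append(0); d_l.append(-1); d_pi.append(("w", tok))
    return d_c, d_l, d_pi

cand = ["$NAME$", "is", "nice"]
ref = ["$NAME$", "is", "in", "$AREA$"]
d_c, d_l, d_pi = infer_labels(cand, ref)
# d_c == [1, 1, 0, 0], d_l == [0, 1, -1, -1]
```

Tokens found in the candidate become copy actions c(i); the rest become generation actions w.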
However, there are several issues with using the labels produced by distant supervision for training. First, the importance of each token in the candidate differs: noun phrases (colored blue in Figure 2) are critical and should be copied, while function words (colored red) are of little relevance and can be generated by IRN itself; distant supervision treats them the same. Second, rule-based matching may cause semantic ambiguity (the dashed black line). Lastly, the cross entropy training criterion is not directly related to the slot error rate used for evaluation. To address these issues, we use reinforcement learning to obtain the optimal actions.

Training with Policy-based Reinforcement Learning
In this section, we describe another method to train IRN. We apply policy gradient (Williams, 1992) to optimize models with discrete rewards.

Rewards
Slot Consistency This reward relates to the correctness of the output templates. Given the set of slot-value pairs g(y) from the output template generated by IRN and the set of slot-value pairs g(x) extracted from the input DA, the reward is zero when the two sets are equal; otherwise, it is negative, with its value set to the cardinality of the difference between the two sets:

r_SC = −#((g(y) ∪ g(x)) − (g(y) ∩ g(x))).  (10)

Language Fluency This reward relates to the naturalness of the surface form realized by a response generation method. Following Wen et al. (2015a,b), we first train a backward language model on the reference texts from the training data. Then the perplexity (PPL) of the surface form obtained by lexicalizing the output template ŷ is measured with this language model and used as the language fluency reward:

r_LM = −PPL(ŷ).  (11)

Distant Supervision We also derive a reward from the distant supervision in Section 4. For a length-N reference template, the reward is

r_DS = (1/N) Σ_{j=1}^{N} log π(d^π_j | h_j),  (12)

where d^π_j is the inferred action label. The final reward for an action sequence a is a weighted sum of the rewards discussed above:

r(a) = γ_SC r_SC + γ_LM r_LM + γ_DS r_DS,  (13)

where γ_SC + γ_LM + γ_DS = 1; we set them to equal values in this work. The reward is observed after the last token of the utterance is generated.
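The slot consistency reward, for instance, can be computed directly from the two slot sets: zero when they match, otherwise minus the size of their symmetric difference (our reading of the definition above; the function name is ours):

```python
def r_slot_consistency(slots_out, slots_da):
    """Slot-consistency reward: 0 if the slot sets match, otherwise minus the
    size of their symmetric difference (missing plus redundant slots)."""
    return -len((slots_out | slots_da) - (slots_out & slots_da))

# One missing slot ($AREA$) -> reward -1; a perfect match -> reward 0.
penalty = r_slot_consistency({"$NAME$"}, {"$NAME$", "$AREA$"})
```

The other two rewards require a trained language model and the policy's likelihoods, so they are not reproduced here.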

Policy Gradient
We use supervised learning with the loss in Eq. (9) to initialize our model on the labels extracted by distant supervision. After convergence, we continue tuning the model with the policy gradient described in this section. The policy model in φ_PR generates a sequence of actions a, not necessarily identical to d^π, which produces an output template y for computing the slot consistency reward in Eq. (10) and the language fluency reward in Eq. (11). With these rewards, the final reward is computed with Eq. (13). The gradient to back-propagate is estimated using REINFORCE as

∇_θ J ≈ (r(a) − b) Σ_j ∇_θ log π(a_j | h_j),  (14)

where θ denotes the model parameters and r(a) − b is the advantage function of REINFORCE, with baseline b. Through experiments, we find that b = BLEU(y, z) performs better (Weaver and Tao, 2001) than tricks such as simply averaging the likelihood, (1/N) Σ_{j=1}^{N} log π(a_j | h_j).

Experimental Setup

We compare against strong baselines: HLSTM and SCLSTM (Wen et al., 2015a,b), TGen (Dušek and Jurčíček, 2016), ARoA (Tran and Nguyen, 2017b) and RALSTM (Tran and Nguyen, 2017a). Following these prior works, the evaluation metrics are BLEU and the slot error rate (ERR), computed as

ERR = (p + q) / N,  (15)

where N is the total number of slots in the DA, and p and q are the numbers of missing and redundant slots in the generated template, respectively. We follow the baseline performances reported in Tran and Nguyen (2017b) and use the open-source toolkits RNNLG and TGen to build the NLG systems HLSTM, SCLSTM and TGen. We reimplement the baselines ARoA and RALSTM since their source code is not available.
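The ERR metric above is straightforward to compute from the two slot sets (function name ours):

```python
def slot_error_rate(da_slots, template_slots):
    """ERR = (p + q) / N: missing plus redundant slots over slots in the DA."""
    p = len(da_slots - template_slots)  # missing slots
    q = len(template_slots - da_slots)  # redundant slots
    return (p + q) / len(da_slots)

err = slot_error_rate({"$NAME$", "$AREA$"}, {"$NAME$", "$PRICE$"})
# p = 1 ($AREA$ missing), q = 1 ($PRICE$ redundant), N = 2 -> err == 1.0
```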

Main Results
We first compare our model, i.e., IRN + KNN, with all the strong baselines mentioned above. Table 2 shows that the proposed model significantly outperforms previous baselines on both BLEU score and ERR. Compared with the current state-of-the-art model, RALSTM, it reduces ERR by factors of 1.45, 1.38, 1.45 and 1.80 on the SF Restaurant, SF Hotel, Laptop and Television datasets, respectively. Furthermore, it improves BLEU scores by 3.59%, 1.45%, 2.29% and 3.33% on these datasets, respectively. The BLEU improvements can be attributed to the language fluency reward r_LM.
To verify whether IRN helps improve the slot consistency of general NLG models, we further equip strong baselines, including HLSTM, TGen and RALSTM, with IRN. We evaluate their performances on the SF Restaurant and Television datasets. As shown in Table 3, the method consistently reduces ERR and also improves BLEU scores for all baselines on both datasets.
In conclusion, our model, IRN (+ KNN), not only achieves state-of-the-art performance but also contributes to improved slot consistency for general NLG systems.

Ablation Study
We perform a set of ablation experiments on the SCLSTM + IRN model on the Laptop dataset to understand the relative contributions of the data aggregation algorithms in Sec. 3.2 and the rewards in Sec. 5.1.

Effect of Reward Designs
The results in Table 4 show that removing the slot consistency reward r_SC or the distant supervision reward r_DS from the advantage function dramatically degrades ERR. Language-fluency-related signals, i.e., the BLEU baseline b and the reward r_LM, also have a positive impact on BLEU and ERR, though their effects are smaller than those of r_SC and r_DS.

Effect of Data Algorithms
Using only candidates from the baselines degrades performance to approximately that of the baseline SCLSTM, which shows that incorporating candidates from IRN is important. The model without bootstrapping, even when including candidates from IRN, performs worse than the SCLSTM in Table 3, which shows that bootstrapping with generic samples from the template database is critical.

Human Evaluation
We evaluate IRN and some strong baselines on the Television dataset. Given an input DA, we ask human evaluators to score the generated surface realizations from our model and the baselines in terms of informativeness and naturalness. Informativeness measures whether the output utterance contains all the information specified in the DA, without inserting extra slots or missing an input slot; naturalness measures whether the utterance mimics a response from a human (both ratings are out of 5). Table 5 shows that RALSTM + IRN notably outperforms RALSTM in informativeness, relatively by 4.97%, from 4.63 to 4.86. In terms of naturalness, the improvement is from 4.01 to 4.07, relatively by 1.50%. Meanwhile, IRN improves TGen by 5.12% on informativeness and 3.23% on naturalness.
These subjective assessments are consistent with the observations in Table 3; both verify the effectiveness of the proposed method. Table 6 presents a sample from the Television dataset and shows the progress made by IRN. Given an input DA, the baseline HLSTM outputs (third row) a template that misses the slot $AUDIO$ and inserts the slot $PRICE$. The output template from the first iteration of IRN removes the inserted $PRICE$ slot. The second iteration improves language fluency but makes no progress on slot consistency. The third iteration achieves slot consistency, after which a natural language response, though slightly different from the reference text, is generated via lexicalization.

Related Work
Conventional approaches to NLG are mostly pipeline-based, dividing the task into sentence planning and surface realization (Dethlefs et al., 2013; Stent et al., 2004; Walker et al., 2002). Oh and Rudnicky (2000) introduce a class-based n-gram language model and a rule-based reranker. Ratnaparkhi (2002) addresses the limitations of n-gram language models by using more sophisticated syntactic dependency trees. Mairesse and Young (2014) employ a phrase-based generator that learns from a semantically aligned corpus. Despite their robustness, these models are costly to create and maintain as they heavily rely on handcrafted rules.
Recent works (Wen et al., 2015b; Dušek and Jurčíček, 2016; Tran and Nguyen, 2017a) build data-driven models based on end-to-end learning. Wen et al. (2015a) combine two recurrent neural network (RNN) based models with a CNN reranker to generate the required utterances. Wen et al. (2015b) introduce a novel SC-LSTM with an additional reading cell to jointly learn the gating mechanism and the language model. Dušek and Jurčíček (2016) present an attentive neural generator that applies an attention mechanism over the input DA. Tran and Nguyen (2017b,a) employ a refiner component to select and aggregate the semantic elements produced by the encoder. More recently, domain adaptation (Wen et al., 2016) and unsupervised learning (Bahuleyan et al., 2018) for NLG have also received much attention.
We are also inspired by the post-edit paradigm (Xia et al., 2017), which uses a second-pass decoder to improve the translation quality.
A recent method (Wu et al., 2019) defines an auxiliary loss that checks whether the object words exist in the expected system response of a task-oriented dialogue system. It would be interesting to apply this auxiliary loss in the proposed method. On the other hand, the REINFORCE (Williams, 1992) algorithm applied in this paper is more general than that of Wu et al. (2019), as it can incorporate other metrics, such as BLEU.
Nevertheless, end-to-end neural generators suffer from the hallucination problem, and it is hard to prevent them from generating slot-inconsistent utterances (Balakrishnan et al., 2019). Balakrishnan et al. (2019) attempt to alleviate this issue by employing a tree-structured meaning representation and a constrained decoding technique; however, the tree-shaped structure requires additional human annotation.

Conclusion
We have proposed the Iterative Rectification Network (IRN) to improve the slot consistency of general NLG systems. In this method, retrieval-based bootstrapping is introduced to sample pseudo mistaken cases from the training corpus to enrich the original training data. We also employ policy-based reinforcement learning to train the models with discrete rewards that are consistent with the evaluation metrics. Extensive experiments show that the proposed model significantly outperforms previous methods, both in correctness measured by slot error rate and in naturalness measured by BLEU score. Human evaluation and case studies also confirm the effectiveness of the proposed method.