Natural Language Generation for Spoken Dialogue System using RNN Encoder-Decoder Networks

Natural language generation (NLG) is a critical component of a spoken dialogue system. This paper presents a Recurrent Neural Network based Encoder-Decoder architecture, in which an LSTM-based decoder is introduced to select and aggregate semantic elements produced by an attention mechanism over the input elements, and to produce the required utterances. The proposed generator can be trained jointly on sentence planning and surface realization to produce natural language sentences. The proposed model was extensively evaluated on four different NLG datasets. The experimental results showed that the proposed generators not only consistently outperform the previous methods across all the NLG domains but also show an ability to generalize to a new, unseen domain and to learn from multi-domain datasets.


Introduction
Natural Language Generation (NLG) plays a critical role in Spoken Dialogue Systems (SDS); its task is to convert a meaning representation produced by the Dialogue Manager into natural language utterances. Conventional approaches still rely on comprehensive hand-tuned templates and rules that require expert knowledge of linguistic representation, including rule-based generators (Mirkovic et al., 2011), corpus-based n-gram models (Oh and Rudnicky, 2000), and a trainable generator (Stent et al., 2004).
Recently, Recurrent Neural Network (RNN) based approaches have shown promising performance in tackling NLG problems. RNN-based models have been applied to NLG as a joint training model (Wen et al., 2015a,b) and as an end-to-end training model (Wen et al., 2016c). A recurring problem in such systems is the need for annotated datasets for particular dialogue acts (DAs). To ensure that the generated utterance represents the intended meaning of the given DA, the previous RNN-based models were further conditioned on a 1-hot vector representation of the DA. Wen et al. (2015a) introduced a heuristic gate to ensure that all the slot-value pairs were accurately captured during generation. Wen et al. (2015b) subsequently proposed a Semantically Conditioned Long Short-term Memory generator (SC-LSTM) which jointly learned the DA gating signal and the language model. More recently, Encoder-Decoder networks (??), especially the attention-based models (Wen et al., 2016b; Mei et al., 2015), have been explored to solve NLG tasks. Approaches based on the Attentional RNN Encoder-Decoder (ARED) (Bahdanau et al., 2014) have also shown improved performance on a variety of tasks, e.g., image captioning (Xu et al., 2015; Yang et al., 2016) and text summarization (Rush et al., 2015; Nallapati et al., 2016).
While the RNN-based generators with a DA gating vector can prevent undesirable semantic repetitions, the ARED-based generators show signs of adapting better to a new domain. However, none of the models shows a significant advantage from out-of-domain data. To better analyze how models generalize to a new, unseen domain and how they leverage out-of-domain sources, we propose a new architecture which is an extension of the ARED model. In order to better select, aggregate and control the semantic information, a Refinement Adjustment LSTM-based component (RALSTM) is introduced on the decoder side. The proposed model can learn from unaligned data by jointly training sentence planning and surface realization to produce natural language sentences. We conducted experiments on four different NLG domains and found that the proposed methods significantly outperformed the state-of-the-art methods in terms of BLEU (Papineni et al., 2002) and slot error rate ERR scores (Wen et al., 2015b). The results also showed that our generators could scale to new domains by leveraging out-of-domain data. To sum up, we make three key contributions in this paper:
• We present an LSTM-based component called the RALSTM cell, applied on the decoder side of an ARED model, resulting in an end-to-end generator that empirically shows significantly improved performance in comparison with the previous approaches.
• We conduct extensive experiments to evaluate the models when trained from scratch on each in-domain dataset.
• We empirically assess the models' ability to learn from multi-domain datasets by pooling all available training datasets, and to adapt to a new, unseen domain when fed only a limited amount of in-domain data.
We review related work in Section 2. Section 3 details the proposed model, and Section 4 describes the datasets, experimental setups, and evaluation metrics. The resulting analysis is presented in Section 5. We conclude with a brief summary and future work in Section 6.

Related Work
Recently, RNN-based models have shown promising performance in tackling NLG problems. Zhang and Lapata (2014) proposed a generator using RNNs to create Chinese poetry. Xu et al. (2015), Karpathy and Fei-Fei (2015), and Vinyals et al. (2015) applied RNN-based generators to image captioning. Wen et al. (2015b) proposed the SC-LSTM generator, which introduced a control sigmoid gate to the LSTM cell to jointly learn the gating mechanism and the language model. A recurring problem in such systems is the lack of sufficient domain-specific annotated data. Wen et al. (2016a) proposed an out-of-domain model which was trained on counterfeited data by using semantically similar slots from the target domain instead of the slots belonging to the out-of-domain dataset. The results showed that the model can achieve satisfactory performance with a small amount of in-domain data by fine-tuning the out-of-domain trained model on the target domain.
More recently, RNN encoder-decoder based models with an attention mechanism (Bahdanau et al., 2014) have shown improved performance in various tasks. Yang et al. (2016) proposed a review network for image captioning, which reviews all the information encoded by the encoder and produces a compact thought vector. Mei et al. (2015) proposed an RNN encoder-decoder based model that uses two attention layers to jointly train content selection and surface realization. Closer to our work, Wen et al. (2016b) proposed an attentive encoder-decoder based generator which computed the attention mechanism over the slot-value pairs. The model showed domain scalability when only a very limited amount of data is available.
Moving from a limited domain dialogue system to an open domain dialogue system raises some issues. Therefore, it is important to build an open domain dialogue system that can make as much use as possible of existing abilities from other domains. Several works have tackled this problem, such as using RNN-based networks for multi-domain dialogue state tracking (Mrkšić et al., 2015), using a procedure to train multi-domain models via multiple adaptation steps (Wen et al., 2016a), or adapting SDS components to new domains (Williams, 2013).

Recurrent Neural Language Generator
The recurrent language generator proposed in this paper is based on a neural language generator (Wen et al., 2016b), which consists of three main components: (i) an Encoder that incorporates the target meaning representation (MR) as the model inputs, (ii) an Aligner that aligns and controls the semantic elements, and (iii) an RNN Decoder that generates output sentences. The generator architecture is shown in Figure 1. The Encoder first encodes the MR into input semantic elements, which are then aggregated and selected by the Aligner using an attention-based mechanism. The input to the RNN Decoder at each time step is a 1-hot encoding of a token $w_t$ and an attentive DA representation $d_t$. At each time step $t$, the RNN Decoder also computes how much of the feature value vector $s_{t-1}$ is retained for the next computational steps, and adds this information to the RNN output, which represents the probability distribution of the next token $w_{t+1}$. At generation time, we can sample from this conditional distribution to obtain the next token in a generated sentence, and feed it as the next input to the RNN Decoder. This process finishes when an end sign is generated (Karpathy and Fei-Fei, 2015), or some constraints are reached (Zhang and Lapata, 2014). The model can produce a sequence of tokens which can finally be lexicalized to form the required utterance.
Figure 1: Unrolled presentation of the RNN-based neural language generator. The Encoder part is a BiLSTM, the Aligner is an attention mechanism over the encoded inputs, and the Decoder is the proposed RALSTM model conditioned on a 1-hot representation vector $s$. The fading color of the vector $s$ indicates retaining information for future computational time steps.
Figure 2: The RALSTM cell proposed in this paper, which consists of three components: a Refinement Cell, a traditional LSTM Cell, and an Adjustment Cell. At time step $t$, the Refinement Cell computes new input tokens $x_t$ based on the original input tokens and the attentional DA representation $d_t$, while the Adjustment Cell calculates how much information of the slot-value pairs can be generated by the LSTM Cell.
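The generation procedure described above (sample a token, feed it back, stop at an end sign or a length constraint, then lexicalize) can be sketched as follows. This is only an illustration: `step_fn` is a hypothetical stand-in for the trained decoder, and the token names `<s>`, `</s>`, and `SLOT_NAME` as well as the example slot value are assumptions made for the example, not taken from the paper.

```python
import random


def generate(step_fn, start_token="<s>", end_token="</s>", max_len=30, seed=0):
    """Sample tokens one at a time until the end token or max_len is reached.

    step_fn (hypothetical) maps the tokens generated so far to a dict
    {token: probability} for the next token.
    """
    rng = random.Random(seed)
    tokens = [start_token]
    while len(tokens) < max_len:
        dist = step_fn(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample the next token from the predicted distribution.
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == end_token:
            break
        # Feed the sampled token back as the next decoder input.
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


def lexicalize(tokens, slot_values):
    """Replace slot placeholder tokens with their surface values."""
    return " ".join(slot_values.get(tok, tok) for tok in tokens)


def toy_step(prefix):
    """Deterministic toy decoder used only to exercise the loop."""
    script = ["SLOT_NAME", "is", "nice", "</s>"]
    return {script[len(prefix) - 1]: 1.0}


delexicalized = generate(toy_step)
utterance = lexicalize(delexicalized, {"SLOT_NAME": "Red Door Cafe"})
```

The `max_len` argument plays the role of the "constraints" stopping condition; a real decoder would return a full distribution over the vocabulary rather than the single-token toy distribution used here.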

Encoder
Slots and values are treated as separate parameters on the encoder side. The encoder embeds the source information into a vector representation $z_i$, which is the concatenation of the embedding vectors of each slot-value pair and is computed by:
$$z_i = u_i \oplus v_i$$
where $u_i$, $v_i$ are the $i$-th slot and value embedding vectors, respectively, and $\oplus$ is vector concatenation. The index $i$ runs over the $L$ given slot-value pairs. In this work, we use a 1-layer Bidirectional LSTM (Bi-LSTM) to encode the sequence of slot-value pair embeddings. The Bi-LSTM consists of forward and backward LSTMs which read the sequence of slot-value pairs from left to right and from right to left to produce the forward and backward sequences of hidden states $(\overrightarrow{e}_1, \ldots, \overrightarrow{e}_L)$ and $(\overleftarrow{e}_1, \ldots, \overleftarrow{e}_L)$, respectively. We then obtain the sequence of encoded hidden states $E = (e_1, e_2, \ldots, e_L)$, where $e_i$ is the sum of the forward hidden state $\overrightarrow{e}_i$ and the backward one $\overleftarrow{e}_i$ as follows:
$$e_i = \overrightarrow{e}_i + \overleftarrow{e}_i$$
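A minimal sketch of the encoder data flow may help: concatenate each slot embedding with its value embedding, run the sequence in both directions, and sum the two hidden states per position. To keep the example short and dependency-free, `toy_rnn` is a deliberately simplified recurrence standing in for an LSTM; it is an assumption of this sketch, not the paper's encoder.

```python
import math


def concat(u, v):
    """Slot-value representation: concatenation of slot and value embeddings."""
    return list(u) + list(v)


def toy_rnn(inputs):
    """Stand-in for an LSTM (assumption): h_i = tanh(0.5 * h_{i-1} + 0.5 * z_i)."""
    h = [0.0] * len(inputs[0])
    states = []
    for z in inputs:
        h = [math.tanh(0.5 * h_prev + 0.5 * z_k) for h_prev, z_k in zip(h, z)]
        states.append(h)
    return states


def encode(slot_embs, value_embs):
    """Return one encoded state per slot-value pair (forward + backward)."""
    z = [concat(u, v) for u, v in zip(slot_embs, value_embs)]
    forward = toy_rnn(z)               # reads pairs left to right
    backward = toy_rnn(z[::-1])[::-1]  # reads right to left, then re-aligned
    return [[f + b for f, b in zip(fs, bs)] for fs, bs in zip(forward, backward)]


# Two slot-value pairs with 2-dimensional slot and 1-dimensional value embeddings.
E = encode([[1.0, 0.0], [0.0, 1.0]], [[0.5], [-0.5]])
```

Summing rather than concatenating the two directions keeps each encoded state the same size as a single-direction hidden state.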

Aligner
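The Aligner utilizes an attention mechanism to calculate the DA representation as follows:
$$\beta_{t,i} = \frac{\exp e_{t,i}}{\sum_{j} \exp e_{t,j}}$$
where
$$e_{t,i} = a(e_i, h_{t-1})$$
and $\beta_{t,i}$ is the weight of the $i$-th slot-value pair calculated by the attention mechanism. The alignment model $a$ is computed by:
$$a(e_i, h_{t-1}) = v_a^{\top} \tanh(W_a e_i + U_a h_{t-1})$$
where $v_a$, $W_a$, $U_a$ are the weight matrices to learn. Finally, the Aligner calculates the dialogue act embedding $d_t$ as follows:
$$d_t = \mathbf{a} \oplus \sum_{i} \beta_{t,i} e_i$$
where $\mathbf{a}$ is the vector embedding of the action type.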
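The Aligner can be illustrated with a small additive-attention sketch in the style of Bahdanau et al. (2014). The function and argument names below are chosen for this example only, and the toy weights are arbitrary; the sketch assumes the scores are normalized with a softmax and that the action-type embedding is concatenated with the attention-weighted sum of encoded states.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x_k for w, x_k in zip(row, x)) for row in W]


def align(E, h_prev, action_emb, v_a, W_a, U_a):
    """Return attention weights over slot-value pairs and the DA embedding."""
    Uh = matvec(U_a, h_prev)
    scores = []
    for e_i in E:
        We = matvec(W_a, e_i)
        # Additive alignment score: v_a^T tanh(W_a e_i + U_a h_{t-1}).
        scores.append(sum(v * math.tanh(a + b) for v, a, b in zip(v_a, We, Uh)))
    # Softmax over the scores (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    beta = [x / total for x in exps]
    # Attention-weighted sum of the encoded states.
    context = [sum(b * e_i[k] for b, e_i in zip(beta, E)) for k in range(len(E[0]))]
    # DA embedding: action-type embedding concatenated with the context.
    return beta, list(action_emb) + context


identity = [[1.0, 0.0], [0.0, 1.0]]
beta, d_t = align(
    E=[[1.0, 0.0], [0.0, 1.0]],
    h_prev=[0.0, 0.0],
    action_emb=[1.0],
    v_a=[1.0, 1.0],
    W_a=identity,
    U_a=identity,
)
```

In this symmetric toy setup both slot-value pairs receive the same score, so the weights are uniform; with learned parameters the weights shift toward the pairs most relevant to the current decoder state.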

RALSTM Decoder
The proposed semantic RALSTM cell applied on the Decoder side consists of three components: a Refinement cell, a traditional LSTM cell, and an Adjustment cell.
Firstly, instead of feeding the original input token $w_t$ into the RNN cell, the input is recomputed by using a semantic gate as follows:
$$r_t = \sigma(W_{rd} d_t + W_{rh} h_{t-1}), \qquad x_t = r_t \odot w_t$$
where $W_{rd}$ and $W_{rh}$ are weight matrices. Element-wise multiplication $\odot$ plays a part in word-level matching, which not only learns the vector similarity but also preserves information about the two vectors. $W_{rh}$ acts like a key phrase detector that learns to capture the pattern of generation tokens or the relationship between multiple tokens. In other words, the new input $x_t$ contains information from the original input token $w_t$, the DA representation $d_t$, and the hidden context $h_{t-1}$. $r_t$ is called a Refinement gate because the input tokens are refined by combining gating information from the attentive DA representation $d_t$ and the previous hidden state $h_{t-1}$. In this way, we can represent the whole sentence based on the refined inputs.
Secondly, the traditional LSTM network proposed by Hochreiter and Schmidhuber (2014), in which the input gate $i_t$, forget gate $f_t$, and output gate $o_t$ are introduced to control information flow, is computed as follows:
$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W_{4n,4n} \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}$$
where $n$ is the hidden layer size and $W_{4n,4n}$ denotes the model parameters. The cell memory value $c_t$ is modified to depend on the DA representation as:
$$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t + \tanh(W_{cr} r_t), \qquad \tilde{h}_t = o_t \odot \tanh(c_t)$$
where $\tilde{h}_t$ is the output.
Thirdly, inspired by the work of Wen et al. (2015b), in which the generator was further conditioned on a 1-hot representation vector $s$ of the given dialogue act, and the work of Lu et al. (2016), which proposed a visual sentinel gate to decide whether the model should attend to the image or to the sentinel gate, an additional gating cell is introduced on top of the traditional LSTM to gate another controlling vector $s$. Figure 6 shows how RALSTM controls the DA vector $s$.
First, starting from the 1-hot vector of the DA $s_0$, at each time step $t$ the proposed cell computes how much the LSTM output $\tilde{h}_t$ affects the DA vector, which is computed as follows:
$$a_t = \sigma(W_{ax} x_t + W_{ah} \tilde{h}_t), \qquad s_t = s_{t-1} \odot a_t$$
where $W_{ax}$, $W_{ah}$ are weight matrices to be learned. $a_t$ is called an Adjustment gate since its task is to control what information of the given DA has been generated and what information should be retained for future time steps. Second, we consider how much the information preserved in the DA vector $s_t$ can contribute to the output, in which an additional output is computed by applying the output gate $o_t$ to the remaining information in $s_t$ as follows:
$$h_a = o_t \odot \tanh(W_{os} s_t)$$
where $W_{os}$ is a weight matrix that projects the DA representation into the output space, and $h_a$ is the Adjustment cell output. The final RALSTM output is a combination of the outputs of the traditional LSTM cell and the Adjustment cell, computed as follows:
$$h_t = \tilde{h}_t + h_a$$
Finally, the output distribution is computed by applying a softmax function $g$, and the distribution can be sampled to obtain the next token:
$$P(w_{t+1} \mid w_t, \ldots, w_0, \mathrm{DA}) = g(W_{ho} h_t), \qquad w_{t+1} \sim P(w_{t+1} \mid w_t, w_{t-1}, \ldots, w_0, \mathrm{DA})$$
where $\mathrm{DA} = (s, z)$.
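The two gates that distinguish the RALSTM cell from a plain LSTM can be sketched in a few lines. This is a hedged illustration of the gating pattern described in the text, not the authors' implementation: the inner LSTM is left out, so its output `h_tilde` and output gate `o_t` are passed in as given values, the input token `w_t` is assumed to be a dense embedding of the same size as the gate, and all function names are chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x_k for w, x_k in zip(row, x)) for row in W]


def refine(w_t, d_t, h_prev, W_rd, W_rh):
    """Refinement gate: rescale the input token using the DA and hidden context."""
    pre = [a + b for a, b in zip(matvec(W_rd, d_t), matvec(W_rh, h_prev))]
    r_t = [sigmoid(p) for p in pre]
    x_t = [r * w for r, w in zip(r_t, w_t)]  # element-wise product
    return r_t, x_t


def adjust(s_prev, x_t, h_tilde, o_t, W_ax, W_ah, W_os):
    """Adjustment gate: decay the DA vector and add its projection to the output."""
    pre = [a + b for a, b in zip(matvec(W_ax, x_t), matvec(W_ah, h_tilde))]
    a_t = [sigmoid(p) for p in pre]
    s_t = [s * a for s, a in zip(s_prev, a_t)]  # retained DA information
    h_a = [o * math.tanh(p) for o, p in zip(o_t, matvec(W_os, s_t))]
    h_t = [h + extra for h, extra in zip(h_tilde, h_a)]
    return s_t, h_t


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Toy sizes: token dim 2, DA embedding dim 3, hidden dim 2, DA vector dim 3.
r_t, x_t = refine([1.0, 2.0], [0.1, 0.2, 0.3], [0.0, 0.0], zeros(2, 3), zeros(2, 2))
s_t, h_t = adjust([1.0, 0.0, 1.0], x_t, [0.3, -0.2], [0.5, 0.5],
                  zeros(3, 2), zeros(3, 2), zeros(2, 3))
```

Because every entry of the adjustment gate lies strictly between 0 and 1, each entry of the DA vector can only shrink in magnitude over time, which matches the intuition that slots already realized in the output should fade from the control vector.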

Training
The objective function was the negative log-likelihood, computed by:
$$F(\theta) = -\sum_{t=1}^{T} y_t^{\top} \log p_t$$
where $y_t$ is the ground truth token distribution, $p_t$ is the predicted token distribution, and $T$ is the length of the input sentence. The proposed generators were trained by treating each sentence as a mini-batch, with $l_2$ regularization added to the objective function for every 5 training examples. The models were initialized with pretrained GloVe word embedding vectors (Pennington et al., 2014) and optimized by using stochastic gradient descent and back propagation through time (Werbos, 1990). An early stopping mechanism was implemented to prevent over-fitting by using a validation set, as suggested by Mikolov (2010).
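As a small worked illustration of the objective, the sketch below computes the negative log-likelihood of one sentence. It assumes the ground truth distribution at each step is 1-hot, so the inner product with the log of the predicted distribution reduces to the log-probability of the correct token; the function name and the toy numbers are made up for the example.

```python
import math


def sentence_nll(target_ids, predicted_dists):
    """Negative log-likelihood of one sentence.

    target_ids[t] is the index of the ground truth token at step t, and
    predicted_dists[t] is the predicted distribution over the vocabulary.
    """
    return -sum(math.log(p_t[y_t]) for y_t, p_t in zip(target_ids, predicted_dists))


# Two-step sentence over a two-word vocabulary.
loss = sentence_nll([0, 1], [[0.5, 0.5], [0.25, 0.75]])
```

Here the loss equals $\log 2 + \log(4/3)$, the sum of the per-step surprisals of the two correct tokens.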

Decoding
The decoding consists of two phases: (i) over-generation, and (ii) reranking. In the over-generation phase, the generator, conditioned on both representations of the given DA, uses a beam search to generate a set of candidate responses. In the reranking phase, the cost of the generator is computed to form the reranking score $R$ as follows:
$$R = F(\theta) + \lambda \, \mathrm{ERR}$$
where $F(\theta)$ is the cost of the generator and $\lambda$ is a trade-off constant that is set to a large value in order to severely penalize nonsensical outputs. The slot error rate ERR, which is the proportion of slots that are either missing from or redundant in the generated utterance, is computed by:
$$\mathrm{ERR} = \frac{p + q}{N}$$
where $N$ is the total number of slots in the DA, and $p$, $q$ are the numbers of missing and redundant slots, respectively.
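The reranking step can be sketched as follows, assuming the slot error rate is the number of missing plus redundant slots divided by the number of slots in the DA, and that the reranking score adds the weighted error rate to the generator cost (lower is better). The candidate format, a tuple of text, cost, and the list of slots found in the text, is an assumption of this example; how slots are detected in a generated utterance is outside its scope.

```python
from collections import Counter


def slot_error_rate(required_slots, generated_slots):
    """(missing + redundant) / total number of slots in the DA."""
    required = Counter(required_slots)
    generated = Counter(generated_slots)
    missing = sum((required - generated).values())    # p
    redundant = sum((generated - required).values())  # q
    total = sum(required.values())                    # N
    return (missing + redundant) / total if total else 0.0


def rerank(candidates, required_slots, lam=1000.0, top_k=5):
    """Sort candidates by cost + lam * ERR and keep the best top_k texts."""
    scored = []
    for text, cost, slots in candidates:
        score = cost + lam * slot_error_rate(required_slots, slots)
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0])
    return [text for _, text in scored[:top_k]]


required = ["name", "food", "area"]
err = slot_error_rate(required, ["name", "food", "food"])  # one missing, one redundant
best = rerank(
    [
        ("a", 2.0, ["name", "food", "area"]),
        ("b", 1.0, ["name", "food"]),
        ("c", 3.0, ["name", "food", "area"]),
    ],
    required,
    top_k=2,
)
```

With a large trade-off constant, candidate `b` is pushed to the bottom despite having the lowest generator cost, because it omits a required slot.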

Experiments
We extensively conducted a set of experiments to assess the effectiveness of the proposed models by using several metrics, datasets, and model architectures, in order to compare to prior methods.

Datasets
We assessed the proposed models on four different NLG domains: finding a restaurant, finding a hotel, buying a laptop, and buying a television. The Restaurant and Hotel datasets were collected by Wen et al. (2015b), while the Laptop and TV datasets were released by Wen et al. (2016a) with a much larger input space but only one training example for each DA, so that the system must learn partial realization of concepts and be able to recombine and apply them to unseen DAs. This makes the NLG tasks for the Laptop and TV domains much harder. The dataset statistics are shown in Table 1.

Experimental Setups
The generators were implemented using the TensorFlow library (Abadi et al., 2016) and trained with a training, validation and testing ratio of 3:1:1. The hidden layer size and beam size were set to 80 and 10, respectively, and the generators were trained with a dropout rate of 70%. We performed 5 runs with different random initializations of the network, and training was terminated by early stopping. We then chose the model that yields the highest BLEU score on the validation set, as shown in Table 2. Since the trained models can differ depending on the initialization, we also report the results which were averaged over 5 randomly initialized networks. Note that, except for the results reported in Table 2, all the results shown were averaged over 5 randomly initialized networks. We set $\lambda$ to 1000 to severely discourage the reranker from selecting utterances which contain either redundant or missing slots. For each DA, we over-generated 20 candidate sentences and selected the top 5 realizations after reranking. Moreover, in order to better understand the effectiveness of our proposed methods, we: (i) performed ablation experiments to demonstrate the contribution of each proposed cell (Tables 2, 3), (ii) trained the models on the Laptop domain with a varied proportion of training data, from 10% to 100% (Figure 3), (iii) trained general models by merging all the data from the four domains together and tested them in each individual domain (Figure 4), and (iv) trained adaptation models on the merged data from the restaurant and hotel domains, then fine-tuned the model on the laptop domain with a varied amount of adaptation data (Figure 5).

Evaluation Metrics and Baselines
The generator performance was assessed on two evaluation metrics, BLEU and the slot error rate ERR, by adopting code from an open source benchmark toolkit for Natural Language Generation. We compared the proposed models against three strong baselines which have been recently published as state-of-the-art NLG benchmarks:
• HLSTM proposed by Wen et al. (2015a) which used a heuristic gate to ensure that all of the slot-value information was accurately captured when generating.
• SCLSTM proposed by Wen et al. (2015b) which can jointly learn the gating signal and language model.
• Enc-Dec proposed by Wen et al. (2016b) which applied the attention-based encoder-decoder architecture.

Results
We conducted extensive experiments on our models and compared them against the previous methods. Overall, the proposed models consistently achieve better performance on both evaluation metrics across all domains in all test cases.

Model Comparison in an Unseen Domain
The ablation studies (Tables 2, 3) demonstrate the contribution of different model components, in which the models were assessed without the Adjustment cell (w/o A) or without the Refinement cell (w/o R). The Adjustment cell clearly contributes to reducing the slot error rate ERR, since it can effectively prevent undesirable slot-value pair repetitions by gating the DA vector $s$. A comparison between the ARED-based models (denoted by ♯ in Table 2) shows that the proposed models not only achieve a higher BLEU score but also reduce the slot error rate ERR by a large margin, about 2% to 4%, on every dataset. Moreover, a comparison between the models that gate the DA vector also indicates that the proposed models (w/o R, RALSTM) perform significantly better on both evaluation metrics across the four domains than the SCLSTM model. The RALSTM cell without the Refinement cell is similar to the SCLSTM cell; however, it obtained much better results than the SCLSTM baselines. This suggests that the LSTM encoder and the Aligner are necessary for effectively learning, at least in part, the correlated order between slot-value representations in the DAs, especially for the unseen domain where there is only one training example for each DA. Table 3 further demonstrates the stability of our models, since the pattern of results stays unchanged compared to Table 2.
Figure 3 shows a comparison of three models (Enc-Dec, SCLSTM, and RALSTM) which were trained from scratch on the unseen laptop domain with a varied proportion of training data, from 1% to 100%. It shows that the RALSTM outperforms the previous models in all cases, while the Enc-Dec has a much greater ERR score than the other two models.
A comparison of the top responses generated by different models for some input DAs is shown in Table 4. While the previous models still produce some errors (missing and misplaced information), the proposed models (RALSTM and the All2* models trained by pooling all datasets together) can generate appropriate sentences. We also found that the proposed models tend to generate more complete and concise sentences than the other models. These observations support the importance of the proposed components: the Refinement cell in aggregating and selecting the attentive information, and the Adjustment cell in controlling the feature vector (see examples in Figure 6).
Figure 4 shows a performance comparison of the general models described in Section 4.2. The results are consistent with Figure 3, in that the RALSTM performs better than the Enc-Dec and SCLSTM on all domains in terms of the BLEU and ERR scores, while the Enc-Dec has difficulties in reducing the ERR score. This indicates the contribution of the proposed Refinement and Adjustment cells to the original ARED architecture: the Refinement cell with attentional gating can effectively select and aggregate the information before passing it into the traditional LSTM cell, while the Adjustment cell with DA vector gating can effectively control the information flow during generation.
Figure 5 shows the domain scalability of the three models, in which the models were first trained on the merged out-of-domain Restaurant and Hotel datasets, and their parameters were then fine-tuned with a varied amount of in-domain training data (laptop domain). The RALSTM model outperforms the previous models both when sufficient in-domain data is used (Figure 5-left) and when only limited in-domain data is used (Figure 5-right). Figure 5-right also indicates that the RALSTM model can adapt to a new, unseen domain faster than the previous models.

Conclusion and Future Work
We present an extension of the ARED model, in which an RALSTM component is introduced to select and aggregate semantic elements produced by the Encoder, and to generate the required sentence. We assessed the proposed models on four NLG domains and compared them to the state-of-the-art generators. The proposed models empirically show consistent improvement over the previous methods in both the BLEU and ERR evaluation metrics. The proposed models also show an ability to extend to a new, unseen domain regardless of how much in-domain training data was provided. In the future, it would be interesting to apply the proposed model to other tasks that can be modeled with the encoder-decoder architecture, e.g., image captioning, reading comprehension, and machine translation.