Neural-based Natural Language Generation in Dialogue using RNN Encoder-Decoder with Semantic Aggregation

Natural language generation (NLG) is an important component in spoken dialogue systems. This paper presents a model called Encoder-Aggregator-Decoder, which is an extension of a Recurrent Neural Network based Encoder-Decoder architecture. The proposed Semantic Aggregator consists of two components: an Aligner and a Refiner. The Aligner is a conventional attention calculated over the encoded input information, while the Refiner is another attention or gating mechanism stacked over the attentive Aligner in order to further select and aggregate the semantic elements. The proposed model can jointly train sentence planning and surface realization to produce natural language utterances. The model was extensively assessed on four different NLG domains, and the experimental results showed that the proposed generator consistently outperforms the previous methods on all of them.


Introduction
Natural Language Generation (NLG) plays a critical role in a Spoken Dialogue System (SDS), and its task is to convert a meaning representation produced by the dialogue manager into natural language sentences. Conventional approaches to NLG follow a pipeline which typically breaks down the task into sentence planning and surface realization. Sentence planning decides the order and structure of a sentence, and is followed by surface realization, which converts the sentence structure into the final utterance. Previous approaches to NLG still rely on extensive hand-tuned templates and rules that require expert knowledge of linguistic representation. Common and widely used approaches to NLG include rule-based generators (Cheyer and Guzzoni, 2014), corpus-based n-gram generators (Oh and Rudnicky, 2000), and trainable generators (Ratnaparkhi, 2000).
Recurrent Neural Network (RNN)-based approaches have recently shown promising results in NLG tasks.
The RNN-based models have been used for NLG as a joint training model (Wen et al., 2015a,b) and an end-to-end training network (Wen et al., 2016c). A recurring problem in such systems is the need for annotated corpora for specific dialogue acts (DAs).
More recently, the attention-based RNN Encoder-Decoder (AREncDec) approaches (Bahdanau et al., 2014) have been explored to tackle the NLG problems (Wen et al., 2016b; Mei et al., 2015; Dušek and Jurčíček, 2016b,a). The AREncDec-based models have also shown improved results on various tasks, e.g., image captioning (Xu et al., 2015; Yang et al., 2016) and machine translation (Luong et al., 2015). To ensure that the generated utterance represents the intended meaning of the given DA, the previous RNN-based models were conditioned on a 1-hot vector representation of the DA. Wen et al. (2015a) proposed a Long Short-Term Memory-based (HLSTM) model which introduced a heuristic gate to guarantee that the slot-value pairs were accurately captured during generation. Subsequently, Wen et al. (2015b) proposed an LSTM-based generator (SC-LSTM) which jointly learned the controlling signal and language model. Wen et al. (2016b) proposed an AREncDec-based generator (ENCDEC) which applied an attention mechanism on the slot-value pairs. Although these RNN-based generators have worked well, they still have some drawbacks, and none of these models significantly outperforms the others in solving NLG tasks. The HLSTM cannot handle cases such as binary slots (i.e., yes and no) and slots that take the don't care value, since these slots cannot be directly delexicalized; the SCLSTM model has limited ability to generalize to unseen domains; and the ENCDEC model has difficulty preventing undesirable semantic repetitions during generation.

Table 1: Order issue in natural language generation, in which an incorrectly generated sentence has wrongly ordered slots.
Input DA: Compare(name=Triton 52; ecorating=A+; family=L7; name=Hades 76; ecorating=C; family=L9)
INCORRECT: The Triton 52 has an A+ eco rating and is in the L9 product family, the Hades 76 is in the L7 product family and has a C eco rating.
CORRECT: The Triton 52 is in the L7 product family and has an A+ eco rating, the Hades 76 is in the L9 product family and has a C eco rating.
To address the above issues, we propose a new architecture, Encoder-Aggregator-Decoder, an extension of the AREncDec model, in which the proposed Aggregator has two main components: (i) an Aligner, which computes the attention over the input sequence, and (ii) a Refiner, which is another attention or gating mechanism used to further select and aggregate the semantic elements. The proposed model can learn from unaligned data by jointly training the sentence planning and surface realization to produce natural language sentences. We conduct comprehensive experiments on four NLG domains and find that the proposed method significantly outperforms the previous methods regarding BLEU (Papineni et al., 2002) and slot error rate ERR scores (Wen et al., 2015b). We also find that our generator can produce high-quality utterances with more correctly ordered slots than those of the previous methods (see Table 1). To sum up, we make two key contributions in this paper:
• We present a semantic component called Aggregator which is easily integrated into existing (attentive) RNN encoder-decoder architectures, resulting in an end-to-end generator that empirically improves performance in comparison with the previous approaches.
• We present several different choices of attention and gating mechanisms which can be effectively applied to the proposed semantic Aggregator.
In Section 2, we review related work. The proposed model is presented in Section 3. Section 4 describes the datasets, experimental setups, and evaluation metrics. The results and analysis are presented in Section 5. We conclude with a brief summary and future work in Section 6.

Related Work
Conventional approaches to NLG traditionally divide the task into a pipeline of sentence planning and surface realization. The conventional methods still rely on handcrafted rule-based generators or rerankers. Oh and Rudnicky (2000) proposed a class-based n-gram language model (LM) generator which can learn to generate the sentences for a given dialogue act and then select the best sentences using a rule-based reranker. Ratnaparkhi (2000) later addressed some of the limitations of the class-based LMs by proposing a method based on a syntactic dependency tree. A phrase-based generator based on factored LMs, which can learn from a semantically aligned corpus, was introduced by Mairesse and Young (2014).
Recently, RNN-based approaches have shown promising results in the NLG domain. Vinyals et al. (2015) and Karpathy and Fei-Fei (2015) applied RNNs in a multi-modal setting to generate captions for images. Zhang and Lapata (2014) also proposed a generator using RNNs to create Chinese poetry.
For task-oriented dialogue systems, Wen et al. (2015a) combined two RNN-based models with a CNN reranker to generate required utterances. Wen et al. (2015b) proposed the SC-LSTM generator, which added a "reading" cell to the traditional LSTM cell to learn the gating mechanism and language model jointly. A recurring problem in such systems is the lack of sufficient domain-specific annotated corpora. Wen et al. (2016a) proposed an out-of-domain model which is trained on counterfeited datasets by using semantically similar slots from the target-domain dataset instead of the slots belonging to the out-of-domain dataset. The empirical results indicated that the model can obtain satisfactory results with a small amount of in-domain data by fine-tuning the target-domain on the out-of-domain trained model. More recently, attentional RNN encoder-decoder based models (Bahdanau et al., 2014) have shown improved results in a variety of tasks. Yang et al. (2016) presented a review network for the image captioning task, which produces a compact thought vector by reviewing all the input information encoded by the encoder. Mei et al. (2015) proposed an attentional RNN encoder-decoder based model by introducing two layers of attention to model content selection and surface realization. Closer to our work, Wen et al. (2016b) proposed an attentive encoder-decoder based generator, which applied the attention mechanism over the slot-value pairs. The model showed domain scalability when only a very limited proportion of training data was available.

Recurrent Neural Language Generator
The recurrent language generator proposed in this paper is based on a neural net language generator (Wen et al., 2016b) which consists of three components: an encoder to incorporate the target meaning representation as the model inputs, an aggregator to align and control the encoded information, and a decoder to generate output sentences. The generator architecture is shown in Figure 1. While the decoder typically uses an RNN model, there is a variety of ways to choose the encoder because it depends on the nature of the meaning representation and the interaction between semantic elements. The encoder first encodes the input meaning representation, then the aggregator, with a feature selecting or an attention-based mechanism, is used to aggregate and select the input semantic elements. The input to the RNN decoder at each time step is a 1-hot encoding of a token and the aggregated input vector. The output of the RNN decoder represents the probability distribution of the next token given the previous token, the dialogue act representation, and the current hidden state. At generation time, we can sample from this conditional distribution to obtain the next token in a generated sentence, and feed it as the next input to the RNN decoder. This process finishes when a stop sign is generated (Karpathy and Fei-Fei, 2015), or some constraint is reached (Zhang and Lapata, 2014). The network can generate a sequence of tokens which can be lexicalized to form the required utterance.

Figure 2: The RNN Encoder-Aggregator-Decoder for NLG proposed in this paper. The output side is an RNN network while the input side is a DA embedding with an aggregation mechanism. The Aggregator consists of two parts: an Aligner and a Refiner. The lower part, the Aligner, is an attention over the DA representation calculated by a Bidirectional RNN. Note that the action type embedding $a$ is not included in the attention mechanism since its task is controlling the style of the sentence. The higher part, the Refiner, computes the new input token $x_t$ based on the original input token $w_t$ and the dialogue act attention $d_t$. There are several choices for the Refiner, i.e., a gating mechanism or an attention mechanism.

Gated Recurrent Unit
The encoder and decoder of the proposed model use a Gated Recurrent Unit (GRU) network proposed by Bahdanau et al. (2014), which maps an input sequence $W = [w_1, w_2, .., w_T]$ to a sequence of states $H = [h_1, h_2, .., h_T]$. In the GRU update, $\odot$ denotes element-wise multiplication, $r_i$ and $u_i$ are called the reset and update gates respectively, and $\tilde{h}_i$ is the candidate activation.
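The GRU update named in this section can be written out explicitly. The block below is a sketch of one common GRU formulation using the symbols from the text ($w_i$, $h_i$, $r_i$, $u_i$, $\tilde{h}_i$, $\odot$); the weight matrix names ($W_{rw}$, $W_{rh}$, $W_{uw}$, $W_{uh}$, $W_{hw}$, $W_{hh}$), the omission of bias terms, and the choice of which term the update gate scales are all assumptions made for illustration.

```latex
\begin{aligned}
r_i &= \sigma\left(W_{rw} w_i + W_{rh} h_{i-1}\right) \\
u_i &= \sigma\left(W_{uw} w_i + W_{uh} h_{i-1}\right) \\
\tilde{h}_i &= \tanh\left(W_{hw} w_i + r_i \odot W_{hh} h_{i-1}\right) \\
h_i &= u_i \odot h_{i-1} + (1 - u_i) \odot \tilde{h}_i
\end{aligned}
```

Here $\sigma$ is the sigmoid function. With $u_i$ close to 1 the previous state is carried over almost unchanged, which is how the unit retains long-range information.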

Encoder
The encoder uses a separate parameterization of the slots and values. It encodes the source information into a distributed vector representation $z_i$, which is a concatenation of the embedding vectors of each slot-value pair, and is computed by:
$$z_i = o_i \oplus v_i$$
where $o_i$ and $v_i$ are the i-th slot and value embeddings, respectively, and $\oplus$ is vector concatenation. The index i runs over the given slot-value pairs. In this study, we use a Bidirectional GRU (Bi-GRU) to encode the sequence of slot-value pair embeddings. The Bi-GRU consists of forward and backward GRUs. The forward GRU reads the sequence of slot-value pairs from left to right and calculates the forward hidden states $(\overrightarrow{s}_1, .., \overrightarrow{s}_K)$. The backward GRU reads the slot-value pairs from right to left, resulting in a sequence of backward hidden states $(\overleftarrow{s}_1, .., \overleftarrow{s}_K)$. We then obtain the sequence of hidden states $S = [s_1, s_2, .., s_K]$, where $s_i$ is the sum of the forward hidden state $\overrightarrow{s}_i$ and the backward one $\overleftarrow{s}_i$:
$$s_i = \overrightarrow{s}_i + \overleftarrow{s}_i$$

Aggregator
The Aggregator consists of two components: an Aligner and a Refiner. The Aligner computes the dialogue act representation, while the choice of Refiner can vary. Firstly, the Aligner calculates the dialogue act embedding $d_t$ as follows:
$$d_t = a \oplus \sum_{i} \alpha_{t,i} s_i$$
where $a$ is the vector embedding of the action type, $\oplus$ is vector concatenation, and $\alpha_{t,i}$ is the weight of the i-th slot-value pair, calculated by the attention mechanism using an alignment model $a(\cdot, \cdot)$ with learned weight matrices $v_a$, $W_a$, and $U_a$. Secondly, the Refiner calculates the new input $x_t$ based on the original input token $w_t$ and the DA representation. There are several choices for the Refiner, such as a gating mechanism or an attention mechanism. For each input token $w_t$, the selected mechanism computes the new input $x_t$ based on the dialogue act representation $d_t$ and the input token embedding $w_t$, and is formulated by:
$$x_t = f_R(d_t, w_t)$$
where $f_R$ is a refinement function, in which each input token is refined (or filtered) by the dialogue act attention information before being passed to the RNN decoder. In this way, we can represent the whole sentence based on this refined input using the RNN model.
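The Aligner step can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes a Bahdanau-style additive alignment model, e_{t,i} = v_a^T tanh(W_a s_i + U_a h_{t-1}), normalized with a softmax. Which of W_a and U_a is applied to the encoder state, the use of the previous decoder state h_{t-1}, and all shapes are assumptions.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def aligner(S, h_prev, a, W_a, U_a, v_a):
    """Compute the dialogue act embedding d_t for one decoding step.

    S:      (K, n) encoded slot-value pairs s_1..s_K
    h_prev: (n,)   previous decoder state h_{t-1} (assumed attention query)
    a:      (m,)   action-type embedding, kept outside the attention
    W_a, U_a: (k, n) and v_a: (k,) alignment-model weights (assumed shapes)
    """
    # Assumed alignment model: e_{t,i} = v_a^T tanh(W_a s_i + U_a h_{t-1})
    scores = np.tanh(S @ W_a.T + h_prev @ U_a.T) @ v_a  # shape (K,)
    alpha = softmax(scores)                              # attention weights
    context = alpha @ S                                  # sum_i alpha_{t,i} s_i
    d_t = np.concatenate([a, context])                   # a (+) context
    return d_t, alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, n, m, k = 3, 4, 2, 5
    d_t, alpha = aligner(
        rng.normal(size=(K, n)), rng.normal(size=n), rng.normal(size=m),
        rng.normal(size=(k, n)), rng.normal(size=(k, n)), rng.normal(size=k),
    )
    print(alpha, d_t.shape)
```

Keeping the action-type embedding out of the weighted sum and concatenating it afterwards mirrors the statement that the action type controls sentence style rather than competing for attention with the slot-value pairs.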
Attention Mechanism: Inspired by the work of Cui et al. (2016), in which an attention-over-attention was introduced for solving reading comprehension tasks, we place another attention, applied for the Refiner, over the attentive Aligner, resulting in a model called Attentional Refiner over Attention (ARoA).
• ARoA with Vector (ARoA-V): We use a simple attention in which each input token representation is weighted by $\beta_t$ according to the dialogue act attention, where $V_{ra}$ is a refinement attention vector which is used to determine the dialogue act attention strength, and $\sigma$ is the sigmoid function, which normalizes the weight $\beta_t$ to between 0 and 1.
• ARoA with Matrix (ARoA-M): ARoA-V uses only a vector $V_{ra}$ to weight the DA attention. It may be better to use a matrix to control the attention information. Equation 7 is modified to use a refinement attention matrix $W_{aw}$.
• ARoA with Context (ARoA-C): The attention in ARoA-V and ARoA-M may not capture the relationship between multiple tokens. To add context information to the attention process, we modify the attention weights in Equation 8 with additional history information $h_{t-1}$, where $W_{aw}$ and $W_{ah}$ are parameters to learn and $V_{ra}$ is the refinement attention vector as above, which now contains both DA attention and context information.
Gating Mechanism: We use simple element-wise operators (multiplication or addition) to gate the information between the two vectors $d_t$ and $w_t$:
• Multiplication (GR-MUL): The element-wise multiplication plays a part in word-level matching, which not only learns the vector similarity but also preserves information about the two vectors.
• Addition (GR-ADD): The two vectors are combined by element-wise addition.
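The five Refiner variants can be compared side by side in code. The sketch below is one plausible reading of the descriptions above, not a definitive implementation: the scalar weight beta_t = sigmoid(V_ra^T d_t), the way V_ra is produced from W_aw and W_ah, and the projection matrix W_gd (a hypothetical name, introduced so that d_t and w_t have matching sizes before the element-wise gate) are all assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def aroa_v(d_t, w_t, V_ra):
    # ARoA-V: a learned vector V_ra scores d_t; the scalar beta_t rescales w_t.
    beta_t = sigmoid(V_ra @ d_t)
    return beta_t * w_t


def aroa_m(d_t, w_t, W_aw):
    # ARoA-M: the attention vector is derived from the token via a matrix.
    V_ra = W_aw @ w_t
    return sigmoid(V_ra @ d_t) * w_t


def aroa_c(d_t, w_t, h_prev, W_aw, W_ah):
    # ARoA-C: the attention vector also sees the history h_{t-1}.
    V_ra = W_aw @ w_t + W_ah @ h_prev
    return sigmoid(V_ra @ d_t) * w_t


def gr_mul(d_t, w_t, W_gd):
    # GR-MUL: element-wise product of the (projected) DA vector and the token.
    return (W_gd @ d_t) * w_t


def gr_add(d_t, w_t, W_gd):
    # GR-ADD: element-wise sum of the (projected) DA vector and the token.
    return (W_gd @ d_t) + w_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, e, n = 6, 4, 5  # sizes of d_t, w_t, h_{t-1}
    d_t, w_t, h_prev = rng.normal(size=p), rng.normal(size=e), rng.normal(size=n)
    print(aroa_v(d_t, w_t, rng.normal(size=p)))
    print(aroa_c(d_t, w_t, h_prev, rng.normal(size=(p, e)), rng.normal(size=(p, n))))
    print(gr_add(d_t, w_t, rng.normal(size=(e, p))))
```

In every variant the output x_t has the same size as the token embedding w_t, so the Refiner can be swapped without changing the decoder input size.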

Decoder
The decoder uses a simple GRU model as described in Section 3.1. In this work, we propose to apply the DA representation and the refined inputs deeper into the GRU cell. Firstly, the GRU reset and update gates are further influenced by the DA attentive information $d_t$: the gates are modified with additional weights $W_{rd}$ and $W_{ud}$, which act like background detectors that learn to control the style of the generated sentence. In this way, the reset and update gates learn not only the long-term dependency but also the attention information from the dialogue act and the previous hidden state. Secondly, the candidate activation $\tilde{h}_t$ is also modified to depend on the DA representation. The hidden state is then computed as in the GRU of Section 3.1. Finally, the output distribution is computed by applying a softmax function $g$, and the distribution is sampled to obtain the next token.
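To make the modified decoder cell concrete, the block below sketches one form consistent with the description: the refined input $x_t$ replaces the raw token, and $d_t$ enters the reset gate, the update gate, and the candidate activation. Only $W_{rd}$ and $W_{ud}$ are named in the text; every other weight name ($W_{rx}$, $W_{rh}$, $W_{ux}$, $W_{uh}$, $W_{hx}$, $W_{hh}$, $W_{hd}$, $W_{ho}$), the exact way $d_t$ enters the candidate activation, and the conditioning shown in the output distribution are assumptions.

```latex
\begin{aligned}
r_t &= \sigma\left(W_{rx} x_t + W_{rh} h_{t-1} + W_{rd} d_t\right) \\
u_t &= \sigma\left(W_{ux} x_t + W_{uh} h_{t-1} + W_{ud} d_t\right) \\
\tilde{h}_t &= \tanh\left(W_{hx} x_t + r_t \odot W_{hh} h_{t-1} + W_{hd} d_t\right) \\
h_t &= u_t \odot h_{t-1} + (1 - u_t) \odot \tilde{h}_t \\
P(w_{t+1} \mid w_t, \ldots, w_0, d_t) &= g\left(W_{ho} h_t\right), \qquad w_{t+1} \sim P(w_{t+1} \mid w_t, \ldots, w_0, d_t)
\end{aligned}
```

Setting $W_{rd}$, $W_{ud}$, and $W_{hd}$ to zero recovers an ordinary GRU step on $x_t$, which shows that the dialogue act terms act purely as additional conditioning signals.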

Training
The objective function was the negative log-likelihood, computed by:
$$-\sum_{t=1}^{T} y_t^{\top} \log p_t$$
where $y_t$ is the ground truth word distribution, $p_t$ is the predicted word distribution, and $T$ is the length of the input sequence. The proposed generators were trained by treating each sentence as a mini-batch, with $l_2$ regularization added to the objective function for every 10 training examples. The pre-trained word vectors (Pennington et al., 2014) were used to initialize the model. The generators were optimized by using stochastic gradient descent and back propagation through time (Werbos, 1990). To prevent over-fitting, we implemented early stopping using a validation set as suggested by Mikolov (2010).

Table 3: Comparison of the performance of the proposed model variants on four datasets in terms of the BLEU and the error rate ERR (%) scores; bold denotes the best and italic shows the second best model. The first two models apply a gating mechanism to the Refiner component, while the last three models use the attention over attention mechanism. The results were averaged over 5 randomly initialized networks.


Decoding
The decoding consists of two phases: (i) overgeneration, and (ii) reranking. In the overgeneration phase, the generator conditioned on the given DA uses a beam search to generate a set of candidate responses. In the reranking phase, the cost of the generator is combined with the slot error rate ERR to form the reranking score $R$, with ERR weighted by a trade-off constant $\lambda$ that is set to a large value in order to severely penalize nonsensical outputs. The slot error rate ERR, which measures the generated slots that are either redundant or missing, is computed by:
$$\mathrm{ERR} = \frac{p + q}{N}$$
where $N$ is the total number of slots in the DA, and $p$ and $q$ are the numbers of missing and redundant slots, respectively. Note that the ERR reranking criterion cannot handle arbitrary slot-value pairs such as binary slots or slots that take the don't care value, because these slots cannot be delexicalized and matched.
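The reranking phase can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: it assumes the score has the additive form R = cost + lambda * ERR with lower being better, that each candidate arrives with its generator cost and the list of delexicalized slots it realized, and that slots are matched by name. The function names and the candidate tuple layout are hypothetical.

```python
from collections import Counter


def slot_error_rate(required_slots, generated_slots):
    """ERR = (p + q) / N, with p missing and q redundant slots."""
    required = Counter(required_slots)
    generated = Counter(generated_slots)
    missing = sum((required - generated).values())    # p
    redundant = sum((generated - required).values())  # q
    n = len(required_slots)                           # N
    # Guard for a DA without slots (an assumption; the text does not cover it).
    return (missing + redundant) / n if n else 0.0


def rerank(candidates, required_slots, lam=1000.0, top_k=5):
    """Sort candidates by R = cost + lam * ERR (assumed form), lowest first.

    candidates: list of (utterance, cost, generated_slots) tuples.
    """
    scored = [
        (cost + lam * slot_error_rate(required_slots, slots), utterance)
        for utterance, cost, slots in candidates
    ]
    scored.sort(key=lambda pair: pair[0])
    return scored[:top_k]


if __name__ == "__main__":
    required = ["name", "family", "ecorating"]
    candidates = [
        ("SLOT_NAME is great", 2.0, ["name"]),
        ("SLOT_NAME is in the SLOT_FAMILY family with a SLOT_ECORATING eco rating",
         5.0, ["name", "family", "ecorating"]),
    ]
    for score, utterance in rerank(candidates, required):
        print(round(score, 2), utterance)
```

With a large lambda, a fluent but incomplete candidate is ranked below a complete one even when its generator cost is lower, which is the behavior the large trade-off constant is meant to enforce.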

Experiments
We conducted an extensive set of experiments to assess the effectiveness of our model using several metrics, datasets, and model architectures, in order to compare to prior methods.

Datasets
We assessed the proposed models using four different NLG domains: finding a restaurant, finding a hotel, buying a laptop, and buying a television. The datasets contain about 13K distinct DAs in the Laptop domain and 7K distinct DAs in the TV domain. Both the Laptop and TV datasets have a much larger input space but only one training example for each DA, so the system must learn partial realizations of concepts and be able to recombine and apply them to unseen DAs. As a result, the NLG tasks for the Laptop and TV datasets are much harder.

Experimental Setups
The generators were implemented using the TensorFlow library (Abadi et al., 2016) and trained by partitioning each of the datasets into training, validation, and testing sets in the ratio 3:1:1. The hidden layer size was set to 80 in all cases, and the generators were trained with a 70% dropout rate. We performed 5 runs with different random initializations of the network, and training was terminated by early stopping as described in Section 3.5. We selected the model that yields the highest BLEU score on the validation set, as shown in Table 2. Since the trained models can differ depending on the initialization, all results other than those reported in Table 2 were averaged over 5 randomly initialized networks. The decoding procedure used beam search with a beam width of 10. We set λ to 1000 to severely discourage the reranker from selecting utterances which contain either redundant or missing slots. For each DA, we over-generated 20 candidate utterances and selected the top 5 realizations after reranking. Moreover, in order to better understand the effectiveness of our proposed methods, we (1) trained the models on the Laptop domain with a varied proportion of training data, from 10% to 100% (Figure 3), and (2) trained general models by merging all the data from the four domains together and tested them in each individual domain (Figure 4).
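For readers reproducing the setup, the stated values can be collected in one place together with a 3:1:1 partition. The sketch below is illustrative only: the helper name split_3_1_1, the shuffling step, the seed, and the configuration key names are assumptions, and whether the 70% figure is a drop or keep probability is not specified in the text.

```python
import random

# Values stated in the experimental setup; the key names are hypothetical.
CONFIG = {
    "hidden_size": 80,
    "dropout": 0.7,         # "70%" as stated; drop vs. keep is not specified
    "beam_width": 10,
    "lambda_err": 1000,     # trade-off constant in the reranking score
    "overgenerate": 20,     # candidates generated per DA
    "top_k": 5,             # realizations kept after reranking
    "n_runs": 5,            # random initializations
}


def split_3_1_1(examples, seed=0):
    """Shuffle and partition examples into train/validation/test at 3:1:1."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # shuffling is an assumption
    n_train = len(items) * 3 // 5
    n_valid = len(items) // 5
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_3_1_1(range(100))
    print(len(train), len(valid), len(test))
```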

Evaluation Metrics and Baselines
The generator performance was assessed using two objective evaluation metrics: the BLEU score and the slot error rate ERR. We conducted extensive experiments on the proposed models with varied setups of the Refiner and compared them against the previous methods. Overall, the proposed models consistently achieve better performance on both evaluation metrics across all domains.

Table 2 shows a comparison between the AREncDec-based models (the models marked with the ♯ symbol), in which the proposed models significantly reduce the slot error rate across all datasets by a large margin of about 2% to 4%, while also improving the BLEU score over the previous approaches. Table 3 further shows the stability of our models, since the pattern of results stays unchanged compared to that in Table 2. The ARoA-M model shows the best performance over all four domains, while, interestingly, the GR-ADD model with a simple addition operator for the Refiner obtains the second best performance. These results demonstrate the importance of the proposed Refiner component in aggregating and selecting the attentive information.

Figure 3 illustrates a comparison of four models (ENCDEC, SCLSTM, ARoA-M, and GR-ADD) which were trained from scratch on the Laptop dataset with varying proportions of training data, from 10% to 100%. It clearly shows that the BLEU score increases while the slot error rate decreases as more training data is provided. Figure 4 presents a performance comparison of the general models described in Section 4.2. Not surprisingly, the two proposed models still obtain higher BLEU scores, while the ENCDEC has difficulties in reducing the ERR score in all cases. Both proposed models show their ability to generalize to the unseen domains (TV and Laptop datasets), since they consistently outperform the previous methods regardless of how much training data was provided or which training method was used.
These results indicate the contribution of the proposed Refiner component to the original AREncDec architecture: the Refiner, with a gating or attention mechanism, can effectively aggregate the information before it is passed to the RNN decoder. Figure 5 shows the different attention behavior of the proposed models in a sentence. While all three models could attend to the slot tokens and their surrounding words, the ARoA-C model with context shows its ability to attend to consecutive words. Table 4 shows a comparison of responses generated for some DAs by different models. The previous approaches (ENCDEC, HLSTM) still have missing and misplaced information, whereas the proposed models can generate complete and correctly ordered sentences.

Conclusion and Future Work
We present an extension of an Attentional RNN Encoder-Decoder model named Encoder-Aggregator-Decoder, in which a Refiner component is introduced to select and aggregate the semantic elements produced by the encoder. We also present several different choices of gating and attention mechanisms which can be effectively applied to the Refiner. The extension, which is easily integrated into an RNN Encoder-Decoder, shows its ability to refine the inputs and control the flow of information before they are passed to the RNN decoder. We evaluated the proposed model on four domains and compared it to the previous generators. The proposed models empirically show consistent improvement over the previous methods in both the BLEU and ERR evaluation metrics. In the future, it would be interesting to further investigate hybrid models which integrate gating and attention mechanisms in order to leverage the advantages of both mechanisms.

Table 4: Comparison of top responses generated for some input dialogue acts between different models. Errors are marked in color (missing, misplaced slot-value pair). † and ♮ denote the baselines and the proposed models, respectively.

Model: Generated responses in Laptop domain
Input DA: compare(name='aristaeus 59'; screensizerange='large'; resolution='1080p'; name='charon 61'; screensizerange='medium'; resolution='720p')
Reference: Compared to aristaeus 59 which is in the large screen size range and has 1080p resolution, charon 61 is in the medium screen size range and has 720p resolution. Which one do you prefer?
ENCDEC †: the aristaeus 59 has a large screen , the charon 61 has a medium screen and 1080p resolution [1080p, 720p]
HLSTM †: the aristaeus 59 has a large screen size range and has a 1080p resolution and 720p resolution [720p, charon 61, medium]
SCLSTM †: the aristaeus 59 has a large screen and 1080p resolution , the charon 61 has a medium screen and 720p resolution
GR-ADD ♮: the aristaeus 59 has a large screen size and 1080p resolution , the charon 61 has a medium screen size and 720p resolution
GR-MUL ♮: the aristaeus 59 has a large screen size and 1080p resolution , the charon 61 has a medium screen size and 720p resolution .
ARoA-V ♮: the aristaeus 59 has a large screen size and 1080p resolution , the charon 61 has a medium screen size , and has a 720p resolution
ARoA-M ♮: the aristaeus 59 has a large screen and 1080p resolution , the charon 61 has a medium screen and 720p resolution
ARoA-C ♮: the aristaeus 59 has a large screen size and 1080p resolution , the charon 61 has a medium screen size range and 720p resolution