Dual Latent Variable Model for Low-Resource Natural Language Generation in Dialogue Systems

Recent deep learning models have shown improving results in natural language generation (NLG) when sufficient annotated data is available. However, a modest amount of training data may harm such models' performance. Thus, how to build a generator that can exploit as much knowledge as possible from low-resource data is a crucial issue in NLG. This paper presents a variational neural generation model to tackle the NLG problem of limited labeled data, in which we integrate variational inference into an encoder-decoder generator and introduce a novel auxiliary auto-encoding with an effective training procedure. Experiments showed that the proposed methods not only outperform previous models when sufficient training data is available but also perform acceptably well when training data is scarce.


Introduction
Natural language generation (NLG) plays a critical role in spoken dialogue systems (SDSs), where the NLG task is mainly to convert a meaning representation produced by the dialogue manager, i.e., a dialogue act (DA), into natural language responses. SDSs are typically developed for specific domains, e.g., flight reservations (Levin et al., 2000), buying a TV or a laptop (Wen et al., 2015b), searching for a hotel or a restaurant (Wen et al., 2015a), and so forth. Such systems often require well-defined ontology datasets that are extremely time-consuming and expensive to collect. There is, thus, a need to build NLG systems that can work acceptably well when training data is in short supply.
There are two potential solutions to the above-mentioned problem: domain adaptation training and model design for low-resource training. First, domain adaptation training aims at learning, from a sufficient source domain, a model that can perform acceptably well on a different target domain with limited labeled target data. Domain adaptation generally involves two different datasets, one from a source domain and the other from a target domain. Despite providing promising results for low-resource problems, these methods still need adequate training data on the source domain side.
Second, model design for the low-resource setting has not been well studied in the NLG literature. Generation models have achieved very good performance when provided with sufficient labeled data (Wen et al., 2015b,a). However, small training sets easily result in worse generation models under supervised learning. Thus, this paper presents an explicit way to construct an effective low-resource generator. In summary, we make the following contributions, in which we: (i) propose a variational approach to NLG that enables the generator not only to outperform previous methods when there is sufficient training data but also to perform acceptably well on low-resource data; (ii) present a variational generator that can also adapt faster to a new, unseen domain using a limited amount of in-domain data; (iii) investigate the effectiveness of the proposed method in different scenarios, including ablation studies, scratch training, domain adaptation, and semi-supervised training with varied proportions of the dataset.
RNN Encoder-Decoder models integrated with an attention mechanism, such as Enc-Dec (Wen et al., 2016b) and RALSTM (Tran and Nguyen, 2017), have proved to work well only when sufficient in-domain data is provided, since a modest dataset may harm their performance.
In this context, one potential solution is domain adaptation learning. The source domain, in this scenario, typically contains a sufficient amount of annotated data such that a model can be efficiently built, while there is often little or no labeled data in the target domain. Examples include a phrase-based statistical generator (Mairesse et al., 2010) using graphical models and active learning, and a multi-domain procedure (Wen et al., 2016a) based on data counterfeiting and discriminative training. However, the question remains of how to build a generator that works well directly on a scarce dataset.
Neural variational frameworks for generative models of text have been studied extensively. Chung et al. (2015) proposed a recurrent latent variable model for sequential data by integrating latent random variables into the hidden state of an RNN. A hierarchical multi-scale recurrent neural network was proposed to learn both hierarchical and temporal representations (Chung et al., 2016), while Bowman et al. (2015) presented a variational autoencoder for unsupervised generative language modeling. Sohn et al. (2015) proposed a deep conditional generative model for structured output prediction, whereas Zhang et al. (2016) introduced a variational neural machine translation model that incorporates a continuous latent variable to model the underlying semantics of sentence pairs. To address the exposure-bias problem, a purely convolutional and deconvolutional seq2seq autoencoder has been proposed, Yang et al. (2017) proposed using a dilated CNN decoder in a latent-variable model, and Semeniuta et al. (2017) proposed a hybrid VAE architecture with convolutional and deconvolutional components.

Variational Natural Language Generator
We assume the existence of a continuous latent variable $z$ from an underlying semantic space of DA-utterance pairs $(d, u)$, so that we explicitly model this space together with the variable $d$ to guide the generation process, i.e., $p(u|z, d)$. The original conditional probability $p(u|d)$ modeled by a vanilla encoder-decoder network is thus reformulated as follows:

$$p(u|d) = \int_z p(u, z|d)\,dz = \int_z p(u|z, d)\,p(z|d)\,dz \qquad (1)$$

This latent variable enables us to model the underlying semantic space as a global signal for generation. However, incorporating the latent variable into the probabilistic model raises two difficulties: (i) modeling the intractable posterior $p(z|d, u)$, and (ii) whether the latent variable $z$ can be modeled effectively in a low-resource setting.

Figure 1: Illustration of the proposed variational models as directed graphs. (a) VNLG: joint learning of both the variational parameters $\phi$ and the generative model parameters $\theta$. (b) DualVAE: red and blue arrows form a standard VAE (parameterized by $\phi'$ and $\theta'$) as an auxiliary auto-encoding to the VNLG model, which is denoted by red and black arrows.
To address these difficulties, we propose an encoder-decoder based variational model for natural language generation (VNLG) by integrating a variational autoencoder (Kingma and Welling, 2013) into an encoder-decoder generator (Tran and Nguyen, 2017). Figure 1-(a) shows a graphical model of VNLG. We employ deep neural networks to approximate the prior $p(z|d)$, the true posterior $p(z|d, u)$, and the decoder $p(u|z, d)$. To tackle the first issue, the intractable posterior is approximated from both the DA and the utterance, $q_\phi(z|d, u)$, under the above assumption. In contrast, the prior is conditioned on the DA only, $p_\theta(z|d)$, because the DA and the utterance of a training pair usually share the same semantic information; e.g., the DA inform(name='ABC'; area='XYZ') contains the key information of the corresponding utterance "The hotel ABC is in XYZ area". An underlying semantic space that encodes information from both the prior and the posterior offers the generator a potential solution to the second issue. Lastly, in the generative process, given an observed DA $d$, the output $u$ is generated by the decoder network $p_\theta(u|z, d)$ under the guidance of the global signal $z$, which is drawn from the prior distribution $p_\theta(z|d)$. Following Sohn et al. (2015), the variational lower bound can be written as:

$$\mathcal{L}(\theta, \phi, d, u) = -\mathrm{KL}\big(q_\phi(z|d, u)\,\|\,p_\theta(z|d)\big) + \mathbb{E}_{q_\phi(z|d, u)}\big[\log p_\theta(u|z, d)\big] \le \log p(u|d) \qquad (2)$$

Variational Encoder Network
The encoder consists of two networks: (i) a Bidirectional LSTM (BiLSTM) which encodes the sequence of slot-value pairs $\{sv_i\}_{i=1}^{T_{DA}}$ by separate parameterization of slots and values (Wen et al., 2016b); and (ii) a shared CNN/RNN Utterance Encoder which encodes the corresponding utterance. The encoder network thus produces both the DA representation vector $h_D$, which flows into the inference and decoder networks, and the utterance representation vector $h_U$, which flows into the posterior approximator (see Suppl. 1.1).

Variational Inference Network
This section models both the prior $p_\theta(z|d)$ and the posterior $q_\phi(z|d, u)$ using neural networks.
Neural Posterior Approximator: We approximate the intractable posterior distribution of $z$ to simplify posterior inference. We first project both the DA and utterance representations onto the latent space:

$$h'_z = g(W_z [h_D; h_U] + b_z) \qquad (3)$$

where $W_z$ and $b_z$ are matrix and bias parameters, respectively, $d_z$ is the dimensionality of the latent space, and we set $g(\cdot)$ to be ReLU in our experiments. We then approximate the posterior as:

$$q_\phi(z|d, u) = \mathcal{N}(z; \mu_1, \sigma_1^2 I) \qquad (4)$$

where the mean $\mu_1$ and standard deviation $\sigma_1$ are outputs of the neural network:

$$\mu_1 = W_{\mu_1} h'_z + b_{\mu_1}, \qquad \log \sigma_1^2 = W_{\sigma_1} h'_z + b_{\sigma_1} \qquad (5)$$

Neural Prior: We model the prior as follows:

$$p_\theta(z|d) = \mathcal{N}(z; \mu'_1, {\sigma'_1}^2 I)$$

where $\mu'_1$ and $\sigma'_1$ of the prior are neural models based only on the Dialogue Act representation; they take the same form as those of the posterior $q_\phi(z|d, u)$ in Eq. 3 and 5, except for the absence of $h_U$. To obtain a representation of the latent variable $z$, we re-parameterize it as follows:

$$h_z = \mu_1 + \sigma_1 \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

Note that the parameters of the prior and the posterior are independent of each other. Moreover, during decoding we set $h_z$ to the mean of the prior $p_\theta(z|d)$, i.e., $\mu'_1$, due to the absence of the utterance $u$. In order to integrate the latent variable $h_z$ into the decoder, we use a non-linear transformation to project it onto the output space for generation:

$$h_e = g(W_e h_z + b_e) \qquad (7)$$

where $h_e \in \mathbb{R}^{d_e}$.
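The inference step above can be illustrated with a minimal, dependency-free sketch. It assumes plain Python lists for vectors and matrices; the helper names (`affine`, `gaussian_params`, `sample_latent`) and the toy weights are hypothetical and are not taken from the authors' implementation. The posterior path concatenates $h_D$ and $h_U$; a prior network would call the same helpers on $h_D$ alone with its own, separate weights.

```python
import math
import random


def affine(W, x, b):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(x):
    """Element-wise ReLU, the choice of g(.) reported in the text."""
    return [max(0.0, v) for v in x]


def gaussian_params(h, W_mu, b_mu, W_logvar, b_logvar):
    """Linear heads for the mean and the log-variance; returns (mu, sigma)."""
    mu = affine(W_mu, h, b_mu)
    sigma = [math.exp(0.5 * lv) for lv in affine(W_logvar, h, b_logvar)]
    return mu, sigma


def sample_latent(mu, sigma, rng=None):
    """Training: mu + sigma * eps with eps ~ N(0, I). Decoding (rng=None): the mean."""
    if rng is None:
        return list(mu)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Toy values (hypothetical): a 2-d DA vector, a 1-d utterance vector, d_z = 2.
h_D, h_U = [1.0, -2.0], [0.5]
W_z = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
b_z = [0.0, 0.0]
h_proj = relu(affine(W_z, h_D + h_U, b_z))  # projection of [h_D; h_U]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
mu, sigma = gaussian_params(h_proj, identity, [0.0, 0.0], zeros, [0.0, 0.0])

h_z_train = sample_latent(mu, sigma, random.Random(0))  # stochastic, training time
h_z_decode = sample_latent(mu, sigma)                   # deterministic, decoding time
```

Predicting the log-variance rather than the standard deviation keeps $\sigma$ positive without any constraint on the linear layer, which is the usual reason for this parameterization.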

Variational Decoder Network
Given a DA $d$ and the latent variable $z$, the decoder calculates the probability of the generated utterance $u$ as a joint probability of ordered conditionals:

$$p(u|z, d) = \prod_{t=1}^{T} p(u_t | u_{<t}, z, d)$$

The RALSTM cell (Tran and Nguyen, 2017) is slightly modified in order to integrate the representation of the latent variable, i.e., $h_e$, into the computational cell (see Suppl. 1.3), so that the latent variable can affect the hidden representation through the gates. This allows the model to indirectly take advantage of the underlying semantic information from the latent variable $z$. In addition, when the model encounters unseen dialogue acts, the semantic representation $h_e$ can benefit the generation process (see Table 1).
We finally obtain the VNLG model with RNN Utterance Encoder (R-VNLG) or with CNN Utterance Encoder (C-VNLG).

Variational CNN-DCNN Model
This standard VAE model (left side in Figure 2) acts as an auxiliary auto-encoding for utterances (used at training time) alongside the VNLG generator. The model consists of two components: the CNN Utterance Encoder, shared with the VNLG model, computes the latent representation vector $h_U$ (see Suppl. 1.1.3), and a Deconvolutional CNN Decoder decodes the latent representation $h_e$ back to the source text (see Suppl. 2.1). Specifically, given the vector representation $h_U$, we apply another linear transformation to obtain the distribution parameters $\mu_2 = W_{\mu_2} h_U + b_{\mu_2}$ and $\log \sigma_2^2 = W_{\sigma_2} h_U + b_{\sigma_2}$. We then re-parameterize them to obtain a latent representation $h_{zu} = \mu_2 + \sigma_2 \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. In order to integrate the latent variable $h_{zu}$ into the DCNN Decoder, we use the shared non-linear transformation of Eq. 7 (denoted by the black dashed line in Figure 2):

$$h_e = g(W_e h_{zu} + b_e)$$

The entire resulting model, named DualVAE, which incorporates the VNLG with the Variational CNN-DCNN model, is depicted in Figure 2.

Training VNLG Model
Inspired by the work of Zhang et al. (2016), we also employ the Monte-Carlo method to approximate the expectation over the posterior in Eq. 2, i.e.,

$$\mathbb{E}_{q_\phi(z|d, u)}\big[\log p_\theta(u|z, d)\big] \approx \frac{1}{M} \sum_{m=1}^{M} \log p_\theta(u|d, h_z^{(m)})$$

where $M$ is the number of samples. In this work, the joint training objective $\mathcal{L}_{VNLG}$ for a training pair $(d, u)$ is formulated as:

$$\mathcal{L}_{VNLG}(\theta, \phi, d, u) = -\mathrm{KL}\big(q_\phi(z|d, u)\,\|\,p_\theta(z|d)\big) + \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T} \log p_\theta(u_t | u_{<t}, d, h_z^{(m)}) \qquad (9)$$

where $h_z^{(m)} = \mu_1 + \sigma_1 \odot \epsilon^{(m)}$ with $\epsilon^{(m)} \sim \mathcal{N}(0, I)$, and $\theta$ and $\phi$ denote the decoder and encoder parameters, respectively. The first term is the KL divergence between two Gaussian distributions, and the second term is the approximated expectation. We simply set $M = 1$, which reduces the second term to the objective of a conventional generator. Since the objective function in Eq. 9 is differentiable, we can jointly optimize the parameters $\theta$ and the variational parameters $\phi$ using standard gradient ascent techniques. However, the KL divergence term tends to become very small during training (Bowman et al., 2015). As a result, the decoder does not take advantage of information from the latent variable $z$. We therefore apply a KL cost annealing strategy that encourages the model to encode meaningful representations into the latent vector $z$, in which we gradually anneal the weight of the KL term from 0 to 1. This helps our model reach solutions with a non-zero KL term.
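The two ingredients of this objective, the closed-form KL divergence between diagonal Gaussians and the annealed KL weight, can be sketched as follows. This is an illustration only: the function names are hypothetical, and the linear shape of the schedule and the `warmup_steps` parameter are assumptions, since the text only states that the weight is annealed from 0 to 1.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
    return total


def kl_weight(step, warmup_steps):
    """Annealing weight rising from 0 to 1 (linear schedule: an assumption)."""
    return min(1.0, step / float(warmup_steps))


def annealed_vnlg_objective(log_likelihood, kl, step, warmup_steps):
    """Quantity to maximize with M = 1: reconstruction term minus the weighted KL term."""
    return log_likelihood - kl_weight(step, warmup_steps) * kl


# Posterior N(1, 1) against prior N(0, 1) in one dimension.
kl = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
early = annealed_vnlg_objective(-10.0, kl, step=0, warmup_steps=100)
late = annealed_vnlg_objective(-10.0, kl, step=100, warmup_steps=100)
```

With a weight of 0 at the start of training, the posterior is free to move away from the prior, so the decoder sees an informative latent vector before the KL penalty takes full effect.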

Training Variational CNN-DCNN Model
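The objective function $\mathcal{L}_{CNN\text{-}DCNN}$ of the Variational CNN-DCNN model is the standard VAE lower bound, which is maximized:

$$\mathcal{L}_{CNN\text{-}DCNN}(\theta', \phi', u) = -\mathrm{KL}\big(q_{\phi'}(z|u)\,\|\,p_{\theta'}(z)\big) + \mathbb{E}_{q_{\phi'}(z|u)}\big[\log p_{\theta'}(u|z)\big]$$

where $\theta'$ and $\phi'$ denote the decoder and encoder parameters of this model, respectively. During training, we also use a denoising autoencoder setup in which we slightly perturb the input by swapping some arbitrary word pairs.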
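The denoising perturbation can be sketched in a few lines. The text does not say how many pairs are swapped or whether the swapped words must be adjacent, so the `num_swaps` parameter and the choice of arbitrary positions below are assumptions, and `swap_word_pairs` is a hypothetical helper name.

```python
import random


def swap_word_pairs(tokens, num_swaps=1, rng=None):
    """Return a copy of `tokens` with `num_swaps` randomly chosen pairs exchanged."""
    rng = rng or random.Random()
    noisy = list(tokens)
    if len(noisy) < 2:
        return noisy  # nothing to swap
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(noisy)), 2)  # two distinct positions
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy


utterance = "the hotel ABC is in XYZ area".split()
noisy_utterance = swap_word_pairs(utterance, num_swaps=2, rng=random.Random(0))
```

The autoencoder is then asked to reconstruct the clean `utterance` from the encoding of `noisy_utterance`, so the word multiset is unchanged and only the order is corrupted.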

Joint Training Dual VAE Model
To allow the model to explore and balance the maximization of the variational lower bounds of the Variational CNN-DCNN model and the VNLG model, we train them jointly with the following objective:

$$\mathcal{L}_{DualVAE} = \mathcal{L}_{VNLG} + \alpha\,\mathcal{L}_{CNN\text{-}DCNN}$$

where $\alpha$ controls the relative weight between the two variational losses. During training, we anneal the value of $\alpha$ from 1 to 0, so that the learned dual latent variable gradually focuses less on the reconstruction objective of the CNN-DCNN model and retains only those features that are useful for the generation objective.
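A small sketch of how the weight $\alpha$ could be scheduled and applied is given below. The text only states that $\alpha$ moves from 1 to 0, so the linear decay and the `total_steps` parameter are assumptions, and both function names are hypothetical.

```python
def alpha_schedule(step, total_steps):
    """Weight of the auxiliary auto-encoding loss, decaying from 1 to 0 (linear: an assumption)."""
    return max(0.0, 1.0 - step / float(total_steps))


def joint_objective(l_vnlg, l_cnn_dcnn, alpha):
    """Weighted sum of the two lower bounds; both are quantities to maximize."""
    return l_vnlg + alpha * l_cnn_dcnn


start = joint_objective(-2.0, -4.0, alpha_schedule(0, 10))   # both terms count fully
middle = joint_objective(-2.0, -4.0, alpha_schedule(5, 10))  # auxiliary term at half weight
end = joint_objective(-2.0, -4.0, alpha_schedule(10, 10))    # generation objective only
```

Note that this schedule runs in the opposite direction to the KL annealing weight: the auxiliary reconstruction signal is strongest early in training and fades as the generator takes over.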

Joint Cross Training Dual VAE Model
To allow the DualVAE model to explore and encode useful information from the Dialogue Act into the latent variable, we further apply cross training between the two VAEs by simply replacing the RALSTM Decoder of the VNLG model with the DCNN Utterance Decoder. The resulting training objective is denoted $\mathcal{L}_{DA\text{-}DCNN}$, and the full model is trained with a joint cross-training objective.

Experiments
We assessed the proposed models on four different NLG domains: finding a restaurant and finding a hotel (Wen et al., 2015a), and buying a laptop and buying a television (Wen et al., 2016b).

Evaluation Metrics and Baselines
Generator performance was evaluated using two metrics, the BLEU score and the slot error rate ERR, by adopting code from an NLG toolkit *. We compared the proposed models against strong baselines which have recently been published as NLG benchmarks on these datasets, including HLSTM, SCLSTM, Enc-Dec (Wen et al., 2016b), and RALSTM (Tran and Nguyen, 2017).
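For readers unfamiliar with the slot error rate, the sketch below assumes the definition commonly used with these datasets, ERR = (p + q) / N, where N is the number of slots in the dialogue act and p and q are the numbers of missing and redundant slots in the generated utterance. This definition is not spelled out in the text above, and the function is a hypothetical illustration rather than the toolkit's scoring code.

```python
from collections import Counter


def slot_error_rate(required_slots, generated_slots):
    """ERR = (missing + redundant) / number of required slots (assumed definition)."""
    required = Counter(required_slots)
    generated = Counter(generated_slots)
    total = sum(required.values())
    if total == 0:
        return 0.0
    missing = sum((required - generated).values())    # in the DA, not realized
    redundant = sum((generated - required).values())  # realized, not in the DA
    return (missing + redundant) / float(total)


# A compare(...) act over two laptops: six slots in total.
da_slots = ["name", "battery", "drive", "name", "battery", "drive"]
# An output that realizes everything except the second drive slot.
output_slots = ["name", "battery", "drive", "name", "battery"]
err = slot_error_rate(da_slots, output_slots)  # one missing slot out of six
```

Using multisets rather than sets matters for acts such as compare(...), where the same slot type legitimately appears twice.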

Experimental Setups
In this work, the CNN Utterance Encoder consists of $L = 3$ layers. For a sentence of length $T = 73$, embedding size $d = 100$, stride lengths $s = \{2, 2, 2\}$, and numbers of filters $k = \{300, 600, 100\}$ with filter sizes $h = \{5, 5, 16\}$, this results in feature maps $V$ of sizes $\{35 \times 300, 16 \times 600, 1 \times 100\}$, the last of which corresponds to the latent representation vector $h_U$. The hidden layer size and beam width were set to 100 and 10, respectively, and the models were trained with a dropout keep probability of 70%. We performed 5 runs with different random initializations of the network, and training was terminated using early stopping. For variational inference, we set the latent variable size to 300. We used the Adam optimizer with an initial learning rate of 0.001; after 5 epochs, the learning rate was decayed every epoch at an exponential rate of 0.95.
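The reported feature-map lengths follow from the standard output-length formula for a convolution without padding, floor((T - h) / s) + 1, as the short check below shows. The no-padding assumption and the exact reading of the learning-rate schedule (constant for the first 5 epochs, then multiplied by 0.95 per epoch, with epochs counted from 1) are interpretations of the text, and the function names are hypothetical.

```python
def conv_output_lengths(input_length, filter_sizes, strides):
    """Sequence length after each unpadded 1-D convolution layer."""
    lengths = []
    length = input_length
    for h, s in zip(filter_sizes, strides):
        length = (length - h) // s + 1
        lengths.append(length)
    return lengths


def learning_rate(epoch, base_lr=0.001, decay=0.95, hold_epochs=5):
    """Constant for the first `hold_epochs` epochs, then exponential decay per epoch."""
    return base_lr * decay ** max(0, epoch - hold_epochs)


lengths = conv_output_lengths(73, filter_sizes=[5, 5, 16], strides=[2, 2, 2])
feature_maps = list(zip(lengths, [300, 600, 100]))  # (length, number of filters)
```

The last filter size of 16 equals the length of the second feature map, which is why the final layer collapses the sequence to a single 100-dimensional vector.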

Results and Analysis
We evaluated the models in the following scenarios: (i) scratch training, where models were trained from scratch using 10% (scr10), 30% (scr30), and 100% (scr100) of the in-domain data; and (ii) domain adaptation training, where models were pre-trained from scratch using all source domain data and then fine-tuned on the target domain using only 10% of the target data. Overall, the proposed models work well in low-resource scenarios. The proposed models obtained state-of-the-art performance on both evaluation metrics across all domains in all training scenarios.

* https://github.com/shawnwun/RNNLG

Integrating Variational Inference
We compare the encoder-decoder RALSTM model with its variants that integrate variational inference (R-VNLG and C-VNLG), as shown in Figure 3 and Table 1.
The results clearly show that the variational generators not only provide compelling evidence of adapting to a new, unseen domain when target domain data is scarce, i.e., from 1% to 7% (Figure 3), but also preserve the power of the original RALSTM on the generation task, since their performance is very competitive with that of RALSTM (Table 1, scr100). Table 1, scr10 further shows the necessity of this integration: the VNLGs achieved a significant improvement over the RALSTM in the scr10 scenario, where the models were trained from scratch with only a limited amount of training data (10%). These results indicate that the proposed variational method can learn the underlying semantics of the existing DA-utterance pairs, which is especially useful information in the low-resource setting.
Furthermore, the R-VNLG model has slightly better results than the C-VNLG when sufficient training data is provided (scr100). In contrast, with modest training data (scr10), the latter model improves on the former by a large margin in terms of both BLEU and ERR scores across all four datasets. Take the Hotel domain, for example: the C-VNLG model (79.98 BLEU, 8.67% ERR) is far ahead of the RALSTM (68.55 BLEU, 22.53% ERR). Thus, the remaining experiments focus on the C-VNLG, since it shows clear promise for constructing dual latent variable models that deal with low-resource in-domain data. We leave the R-VNLG for future investigation.

Table 1: Results evaluated on four domains by training models from scratch with 10%, 30%, and 100% in-domain data, respectively. The results were averaged over 5 randomly initialized networks. The bold and italic faces denote the best and second best models in each training scenario, respectively.

Ablation Studies
The ablation studies (Table 1) demonstrate the contribution of each model component, in which we incrementally train the baseline RALSTM, the C-VNLG (= RALSTM + variational inference), the DualVAE (= C-VNLG + Variational CNN-DCNN), and the CrossVAE (= DualVAE + cross training) models. Generally, while all models work well when there is sufficient training data, the performance of the proposed models also increases as model components are added. The trend is consistent across all training cases regardless of how much training data was provided. Take, for example, the scr100 scenario, in which the CrossVAE model mostly outperformed all the previous strong baselines with regard to both the BLEU and the slot error rate ERR scores.
On the other hand, the previous methods showed severely impaired performance, with low BLEU scores and high slot error rates ERR, when trained from scratch with only 10% of the in-domain data (scr10). In contrast, by integrating variational inference, the C-VNLG model, for example in the Hotel domain, significantly improves the BLEU score from 68.55 to 79.98 and reduces the slot error rate ERR by a large margin, from 22.53 to 8.67, compared to the RALSTM baseline. Moreover, the proposed models perform much better than the previous ones in the scr10 scenario, with the CrossVAE and DualVAE models mostly obtaining the best and second best results, respectively. The CrossVAE model trained in the scr10 scenario, in some cases, achieved results close to those of the HLSTM, SCLSTM, and ENCDEC models trained on all training data (scr100). Take, for example, the most challenging dataset, Laptop, in which the DualVAE and CrossVAE obtained competitive BLEU scores, at 50.16 and 50.85 respectively, which are close to those of the HLSTM (51.30 BLEU), SCLSTM (51.09 BLEU), and ENCDEC (51.01 BLEU), while the slot error rate ERR scores are also close to those of the previous models or even better in some cases, for example DualVAE (2.44 ERR), CrossVAE (2.39 ERR), and ENCDEC (4.24 ERR). There are also some cases in the TV domain where the proposed models (in scr10) have results close to or better than the previous ones (trained on scr100). These results indicate that the proposed models can efficiently encode useful information into the latent variable to better generalize to unseen dialogue acts, addressing the second difficulty with low-resource data.
The scr30 section further confirms the effectiveness of the proposed methods, in which the CrossVAE and DualVAE still mostly rank as the best and second-best models compared with the baselines. The proposed models also show superior ability to leverage the existing small training data to obtain very good performance, which is in many cases even better than that of the previous methods trained on 100% of the in-domain data. Take the TV domain, for example, in which the CrossVAE in scr30 achieves good BLEU and slot error rate ERR scores, at 53.07 BLEU and 0.82 ERR, which are not only competitive with the RALSTM (53.73 BLEU, 0.49 ERR) but also outperform the previous models in the scr100 training scenario, such as HLSTM (52.40 BLEU, 2.65 ERR), SCLSTM (52.35 BLEU, 2.41 ERR), and ENCDEC (51.42 BLEU, 3.38 ERR). This further indicates the need for integrating variational inference, the additional auxiliary auto-encoding, and the joint and cross training.

Model comparison on unseen domain
In this experiment, we trained four models (ENCDEC, SCLSTM, RALSTM, and CrossVAE) from scratch in the most difficult unseen domain, Laptop, with an increasing proportion of training data, from 1% to 100%. The results are shown in Figure 4. They clearly show that the BLEU score increases and the slot error rate ERR decreases as the models are trained on more data. The CrossVAE model is clearly better than the previous models (ENCDEC, SCLSTM, RALSTM) in all cases. While the performance of the CrossVAE and RALSTM models starts to saturate around 30% and 50%, respectively, the ENCDEC model seems to keep improving as more training data is provided. The figure also confirms that the CrossVAE trained on 30% of the data can achieve better performance than the previous models trained on 100% of the in-domain data.

Domain Adaptation
We further examine the domain scalability of the proposed methods by training the CrossVAE and SCLSTM models in adaptation scenarios, in which we first trained the models on out-of-domain data and then fine-tuned the model parameters using a small amount (10%) of in-domain data. The results are shown in Table 2.
Both the SCLSTM and CrossVAE models can take advantage of "close" dataset pairs, i.e., Restaurant ↔ Hotel and TV ↔ Laptop, to achieve better performance than on "different" dataset pairs, e.g., Laptop ↔ Restaurant. Moreover, Table 2 clearly shows that the SCLSTM is limited in its ability to scale to another domain, with very low BLEU and high ERR scores. This adaptation scenario, along with scr10 and scr30 in Table 1, demonstrates that the SCLSTM cannot work well with low-resource in-domain training data.
On the other hand, the CrossVAE model again shows its ability to leverage out-of-domain data to better adapt to a new domain. Especially when Laptop, the most difficult unseen domain, is the target domain, the CrossVAE model obtains good results in terms of both a low slot error rate ERR, around 1.90%, and a high BLEU score, around 50.00 points. Surprisingly, the CrossVAE model trained in the scr10 scenario in some cases achieves better performance than the adaptation model first trained with 30% of the out-of-domain data, which is in turn better than the adaptation model trained on 100% of the out-of-domain data (denoted by ξ).
Preliminary experiments on semi-supervised training were also conducted, in which we trained the CrossVAE model with the same 10% of in-domain labeled data as in the other scenarios plus 50% of in-domain unlabeled data, obtained by keeping only the utterance u of a given dialogue act-utterance pair (d, u); this setting is denoted by semi-U50-L10. The results showed CrossVAE's ability to leverage the unlabeled data to achieve slightly better results than in the scratch scenario. All these results suggest that the proposed models can perform acceptably well in scratch, domain adaptation, and semi-supervised training when in-domain training data is in short supply.

Comparison on Generated Outputs
We present the top responses generated in different scenarios for the TV (Table 3) and Laptop (Table 4) domains, which further show the effectiveness of the proposed methods.
On the one hand, previous models trained in the scr10 and scr30 scenarios produce a diverse range of output error types, including missing, misplaced, redundant, or wrong slots, or misspelled information, resulting in a very high slot error rate ERR. The ENCDEC, HLSTM and SCLSTM models in Table 3-DA 1, for example, tend to generate outputs with redundant slots (e.g., SLOT_HDMIPORT, SLOT_NAME, SLOT_FAMILY), missing slots (e.g., [l7 family], [4 hdmi port -s]), or even in some cases irrelevant slots (e.g., SLOT_AUDIO, eco rating), resulting in inadequate utterances.
On the other hand, the proposed models can effectively leverage the knowledge from only a few existing training instances to better generalize to unseen dialogue acts, leading to satisfactory responses. For example, in Table 3, the proposed methods generate the required number of slots, resulting in complete utterances (DualVAE-10, CrossVAE-10, DualVAE-30, CrossVAE-30), or acceptable outputs with far fewer errors, e.g., mis-ordered slots in the generated utterances (C-VNLG-30).
For a much easier dialogue act in Table 3-DA 2, previous models still produce some erroneous outputs, whereas the proposed methods seem to combine some specific slots into phrases, giving concise outputs. For example, instead of generating the phrase "the proteus 73 is a television", the proposed models tend to concisely produce "the proteus 73 television". The trend is mostly consistent with that in Table 4.

Conclusion and Future Work
We present an approach to low-resource NLG that integrates variational inference and introduces a novel auxiliary auto-encoding. Experiments showed that the models can perform acceptably well using a scarce dataset. The ablation studies demonstrate that the variational generator contributes to learning the underlying semantics of DA-utterance pairs, while the variational CNN-DCNN plays an important role in encoding useful information into the latent variable. In future work, we will further investigate the proposed models with adversarial, semi-supervised, or unsupervised training.

Table note: Model-X denotes a model trained on X of the training data, i.e., 10%, 30%, or 100%.

Model: Generated Responses from Laptop Domain
DA: compare(name='satellite pallas 21'; battery='4 hour'; drive='500 gb'; name='satellite dinlas 18'; battery='3.5 hour'; drive='1 tb')
Reference: compared to satellite pallas 21 which can last 4 hour and has a 500 gb drive , satellite dinlas 18 can last 3.5 hour and has a 1 tb drive . which one do you prefer
Enc-Dec-10: the satellite pallas 21 has a 500 gb drive , the satellite dinlas 18 has a 4 hour battery life and a 3.5 hour battery life and a SLOT_BATTERY battery life and a 1 tb drive
HLSTM-10: the satellite pallas 21 has a 4 hour battery life and a 500 gb drive . which one do you prefer [satellite pallas 18] [3.5 hour battery] [1 tb drive]
SCLSTM-10: the satellite pallas 21 has a 4 hour battery , and has a 3.5 hour battery life and a 500 gb drive and a 1 tb drive [satellite dinlas 18]
C-VNLG-10: the satellite pallas 21 has a 500 gb drive and a 4 hour battery life . the satellite dinlas 18 has a 3.5 hour battery life and a SLOT_BATTERY battery life [1 tb drive]
DualVAE-10: the satellite pallas 21 has a 4 hour battery life and a 500 gb drive and the satellite dinlas 18 with a 3.5 hour battery life and is good for business computing . which one do you prefer [1 tb drive]
CrossVAE-10: the satellite pallas 21 with 500 gb and a 1 tb drive . the satellite dinlas 18 with a 4 hour battery and a SLOT_DRIVE drive . which one do you prefer [3.5 hour battery]
Enc-Dec-30: the satellite pallas 21 has a 500 gb drive with a 1 tb drive and is the satellite dinlas 18 with a SLOT_DRIVE drive for 4 hour -s . which one do you prefer [3.5 hour battery]
HLSTM-30: the satellite pallas 21 is a 500 gb drive with a 4 hour battery life . the satellite dinlas 18 has a 3.5 hour battery life . which one do you prefer [1 tb drive]
SCLSTM-30: the satellite pallas 21 has a 500 gb drive . the satellite dinlas 18 has a 4 hour battery life . the SLOT_NAME has a 3.5 hour battery life . which one do you prefer [1 tb drive]
C-VNLG-30: which one do you prefer the satellite pallas 21 with a 4 hour battery life , the satellite dinlas 18 has a 500 gb drive and a 3.5 hour battery life and a 1 tb drive . which one do you prefer
DualVAE-30: satellite pallas 21 has a 500 gb drive and a 4 hour battery life while the satellite dinlas 18 with a 3.5 hour battery life and a 1 tb drive . [OK]
CrossVAE-30: the satellite pallas 21 has a 500 gb drive with a 4 hour battery life . the satellite dinlas 18 has a 1 tb drive and a 3.5 hour battery life . which one do you prefer [OK]