Improving Variational Autoencoder for Text Modelling with Timestep-Wise Regularisation

The Variational Autoencoder (VAE) is a popular and powerful model applied to text modelling to generate diverse sentences. However, an issue known as posterior collapse (or KL loss vanishing) arises when the VAE is used for text modelling: the approximate posterior collapses to the prior, so the model ignores the latent variables entirely and degrades to a plain language model during text generation. This issue is particularly prevalent when RNN-based VAE models are employed for text modelling. In this paper, we propose a simple, generic architecture called Timestep-Wise Regularisation VAE (TWR-VAE), which can effectively avoid posterior collapse and can be applied to any RNN-based VAE model. The effectiveness and versatility of our model are demonstrated on different tasks, including language modelling and dialogue response generation.

However, there is a challenging optimisation issue of VAEs known as posterior collapse (a.k.a. KL loss vanishing), where the variational posterior collapses to the prior and the latent variable is ignored by the model during generation (Bowman et al., 2016). This is particularly prevalent when employing VAE-RNN architectures for text modelling. When posterior collapse happens, the decoder is downgraded to a simpler language model and the VAE cannot learn good latent representations of the data (Sønderby et al., 2016; Yang et al., 2017). Different strategies have been proposed to address this issue, such as annealing the KL term in the VAE loss function (Bowman et al., 2016; Sønderby et al., 2016; Fu et al., 2019), replacing the recurrent decoder with convolutional neural networks (CNNs) (Yang et al., 2017; Semeniuta et al., 2017), using a more sophisticated prior distribution such as the von Mises-Fisher (vMF) distribution (Xu and Durrett, 2018), and adding mutual information into the VAE objectives (Phuong et al., 2018). While the aforementioned strategies have shown some effectiveness in tackling the posterior collapse issue, they either require careful engineering of the balance between the reconstruction loss and the KL loss (Bowman et al., 2016; Sønderby et al., 2016; Fu et al., 2019) or rely on more sophisticated model structures (Yang et al., 2017; Semeniuta et al., 2017; Xu and Durrett, 2018; Phuong et al., 2018).
In this paper, we propose a simple and robust architecture called Timestep-Wise Regularisation VAE (TWR-VAE), which can effectively alleviate the VAE posterior collapse issue in text modelling. Existing VAE-RNN models for text modelling only impose KL regularisation on the latent variable of the RNN encoder at the final timestep, forcing that latent variable to be close to a Gaussian prior. In contrast, our TWR-VAE imposes KL regularisation on the latent variables of every timestep of the RNN encoder, which we dub timestep-wise regularisation. We hypothesise that timestep-wise regularisation is crucial for avoiding posterior collapse and learning good representations of data, and that it allows a more robust model learning process. In addition, the proposed timestep-wise regularisation strategy is generic and in theory can be applied to any existing VAE-RNN model, e.g., vanilla RNN and GRU-based VAE models. TWR-VAE shares some similarity with existing VAE-RNN models, in that the input to the decoder is the latent variable sampled from the variational posterior at the final timestep of the encoder. While this is a reasonable design choice, we also explore two model variants of TWR-VAE, namely TWR-VAE_mean and TWR-VAE_sum. At each timestep, both variants sample a latent variable from the timestep-dependent variational posterior of the encoder. TWR-VAE_mean averages the sampled latent variables and uses the result as input to the decoder, whereas TWR-VAE_sum performs vector addition on the sampled latent variables instead.
To demonstrate the effectiveness of our method, we select a number of strong baseline models and conduct comprehensive evaluations on two benchmark tasks involving five public datasets. For the language modelling task, experimental results show that TWR-VAE can effectively alleviate the posterior collapse issue and consistently gives better predictive performance than all baselines, as evidenced by both quantitative (e.g., negative log likelihood and perplexity) and qualitative evaluation. For the dialogue response generation task, our model yields better or comparable performance to the state-of-the-art baselines on three evaluation metrics (i.e., BLEU (Zhao et al., 2017), BOW embedding (Liu et al., 2016) and Dist (Liu et al., 2016)). Manual examination also shows that the dialogue responses generated by our model are more diverse and contentful than those of the baselines, while our model is simpler in design. Our two model variants also show comparable performance to the best baseline, although they are not as strong as TWR-VAE.
In summary, the contributions of our paper are three-fold: (1) we propose a simple and robust method that can effectively alleviate the posterior collapse issue of the VAE via timestep-wise regularisation; (2) our approach is generic and can be applied to any RNN-based VAE model; (3) our approach outperforms the state-of-the-art on language modelling and yields better or comparable performance on dialogue response generation. The code of TWR-VAE is available at: https://github.com/ruizheliUOA/TWR-VAE.
Several different types of methods have been proposed to address this issue. KL annealing is the most common and basic solution, used in almost all works (Bowman et al., 2016; Sønderby et al., 2016; Semeniuta et al., 2017; He et al., 2019; Fu et al., 2019; Fang et al., 2019). Another line of approaches attempts to weaken the decoder of the VAE to avoid posterior collapse, such as introducing word dropout and historyless decoding into the decoder (Bowman et al., 2016), replacing the decoder with different CNNs (Yang et al., 2017; Semeniuta et al., 2017), and adding skip connections to the decoder (Dieng et al., 2019). Others have tried to solve this issue by introducing new regularisers (Zhao et al., 2019; Goyal et al., 2017; Tolstikhin et al., 2018), using more sophisticated prior distributions (Tomczak and Welling, 2018; Xu and Durrett, 2018), etc.
More recently, Fu et al. (2019) used a cyclical annealing schedule to alleviate the KL loss vanishing issue. He et al. (2019) proposed a lagging inference network that updates the encoder multiple times before a single decoder update, addressing the issue from the perspective of training dynamics. Zhu et al. (2020) applied batch normalisation to the parameters of the approximate posterior, ensuring that the KL term has a positive lower bound.

Methodology
In this section, we introduce the proposed Timestep-Wise Regularisation VAE (TWR-VAE) model as well as its two model variants. We briefly introduce the background of VAE before describing the technical details of the proposed models.

Background of VAE
A variational autoencoder is a generative model designed to generate data via a latent variable $z$.
For a dataset $X = \{x_i\}_{i=1}^{N}$ of $N$ i.i.d. data points, the data generation process has two steps: (1) a latent variable $z$ is sampled from a prior distribution $P_\theta(z)$; (2) a data point $x_i$ is generated from the conditional distribution $P_\theta(x_i|z)$. Training a VAE amounts to optimising the marginal likelihood $P_\theta(x_i) = \int P_\theta(z) P_\theta(x_i|z)\,dz$. However, both the marginal likelihood $P_\theta(x_i)$ and the true posterior distribution $P_\theta(z|x_i) = P_\theta(x_i|z)P_\theta(z)/P_\theta(x_i)$ are intractable. In order to train the VAE, an encoder $Q_\phi(z|x_i)$ is used to approximate the true posterior $P_\theta(z|x_i)$. In this way, a data point $x_i$ is encoded as a distribution over $z$ via the encoder $Q_\phi(z|x_i)$, and the latent code $z$ is fed into the decoder $P_\theta(x_i|z)$ to decode a distribution over values of $x_i$.
In general, the VAE is trained to maximise the marginal log likelihood $\log P_\theta(x_1, \ldots, x_N) = \sum_{i=1}^{N} \log P_\theta(x_i)$ over the whole training dataset. This is essentially equivalent to maximising the following evidence lower bound (ELBO), which consists of two terms (Kingma and Welling, 2014):

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{Q_\phi(z|x_i)}[\log P_\theta(x_i|z)] - D_{KL}\big(Q_\phi(z|x_i)\,\|\,P(z)\big) \quad (1)$$

The first term is the expected reconstruction error, indicating how well the model can reconstruct data given a latent variable. The second term is the KL-divergence of the approximate posterior from the prior, i.e., a regularisation pushing the learned posterior to be as close to the prior as possible. The basic VAE-RNN model (Figure 1(a)) follows the aforementioned ELBO (i.e., Eq. 1). As the encoder is an RNN, a latent variable (denoted $z_T$) is sampled from the variational posterior at the final timestep $T$, and $z_T$ is then used as the input to the decoder. Therefore, the ELBO of a basic VAE-RNN model becomes:

$$\mathcal{L} = \mathbb{E}_{Q_\phi(z_T|x_i)}[\log P_\theta(x_i|z_T)] - D_{KL}\big(Q_\phi(z_T|x_i)\,\|\,P(z_T)\big) \quad (2)$$

Note that the total number of timesteps $T$ is also the length of the input sentence. As discussed, optimising the ELBO in Eq. 2 is prone to the posterior collapsing to the prior (Bowman et al., 2016). This happens because the second term of Eq. 2 reaches its global minimum when $Q_\phi(z_T|x_i) = P(z_T)$, which makes $x_i$ and $z_T$ independent. As a result, the decoder (i.e., the reconstruction term) no longer depends on $z_T$ and fits the training data as a plain language model.
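To make Eq. 1 concrete, the following is a minimal PyTorch sketch of the standard Gaussian ELBO, assuming a diagonal Gaussian posterior and a standard normal prior (the variable names are ours, not from the paper):

```python
import torch

def elbo(recon_log_prob, mu, logvar):
    # recon_log_prob: log P_theta(x_i | z), summed over the sequence, shape (batch,)
    # mu, logvar: parameters of Q_phi(z | x_i), shape (batch, latent_dim)
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), per example.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    # Eq. 1: reconstruction term minus KL regularisation (to be maximised).
    return recon_log_prob - kl
```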

Variational Autoencoder with Timestep-Wise Regularisation (TWR-VAE)
In this section, we introduce the proposed Timestep-Wise Regularisation VAE (TWR-VAE) model, a general architecture that can effectively mitigate the posterior collapse issue frequently observed in VAE models with an RNN-based backbone.
Our model design is motivated by a noticeable defect shared by the VAE-RNN based models of previous works (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018; Dieng et al., 2019). That is, the general architecture of all these models, as shown in Figure 1(a), only imposes a standard normal prior on the last hidden state of the RNN encoder, which potentially leads to learning a suboptimal representation of the latent variable. In addition, to avoid posterior collapse, it is important to learn good latent representations of data at the early stage of decoder training, so that the decoder can easily adopt them to generate controllable observations (Fu et al., 2019). Our hypothesis is that, to learn a good representation of the data, it is crucial to impose the standard normal prior on the hidden states of all timesteps of the RNN-based encoder, which allows a better regularisation of the model learning process, especially during the early stages.
The architecture of the proposed TWR-VAE model is shown in Figure 1(b); it is implemented with a one-layer LSTM for both the encoder and the decoder. For each timestep $t$, we feed the hidden state $h_t$ into two linear transformation layers to estimate $\mu_t$ and $\Sigma_t$, the parameters of the variational posterior, i.e., a normal distribution corresponding to $h_t$. We then impose KL regularisation on all timestep-wise variational posteriors rather than on the posterior of the last timestep alone. Formally, given input $X = \{x_i\}_{i=1}^{N}$, the ELBO of our model for each data point $x_i$ is defined as:

$$\mathcal{L}_{\text{TWR-VAE}} = \mathbb{E}_{Q_\phi(z_T|x_i^{1:T})}[\log P_\theta(x_i|z_T)] - \frac{1}{T}\sum_{t=1}^{T} D_{KL}\big(Q_\phi(z_t|x_i^{1:t})\,\|\,P(z_t)\big) \quad (3)$$

where $T$ is the length of the input sentence, and $\theta$ and $\phi$ are the parameters of the decoder and the encoder, respectively. Note that TWR-VAE is similar to existing VAE-RNN models (Xu and Durrett, 2018; Fu et al., 2019; He et al., 2019) in that it passes a single $z_T$ at the final timestep to the decoder. However, there is a crucial difference: while existing models only impose KL regularisation on the last timestep, TWR-VAE imposes timestep-wise KL regularisation and averages the KL loss over all timesteps, i.e., the second term of Eq. 3.
In practice, the expectation in the reconstruction term of Eq. 3 is approximated with a Monte Carlo estimator:

$$\mathbb{E}_{Q_\phi(z_T|x_i^{1:T})}[\log P_\theta(x_i|z_T)] \approx \frac{1}{M}\sum_{m=1}^{M} \log P_\theta(x_i|z_T^{(m)}), \quad z_T^{(m)} \sim Q_\phi(z_T|x_i^{1:T}) \quad (4)$$

where $M$ indicates the total number of times that we randomly sample for the approximation.
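As an illustration of the timestep-wise regularisation in Eq. 3, the sketch below shows an encoder that produces a Gaussian posterior at every LSTM timestep and averages the per-timestep KL terms; the `agg` switch anticipates the two variants introduced in the next subsection. This is a minimal sketch under our own naming and is not the released implementation:

```python
import torch
import torch.nn as nn

class TWREncoder(nn.Module):
    """One-layer LSTM whose hidden state at *every* timestep is mapped to a
    Gaussian posterior; the KL loss is averaged over all T timesteps
    (second term of Eq. 3). Hyperparameters here are illustrative."""

    def __init__(self, emb_dim=512, hid_dim=256, z_dim=32):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.to_mu = nn.Linear(hid_dim, z_dim)      # estimates mu_t from h_t
        self.to_logvar = nn.Linear(hid_dim, z_dim)  # estimates log-variance from h_t

    def forward(self, x_emb, agg="last"):
        h, _ = self.rnn(x_emb)                      # (batch, T, hid_dim)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterised sample z_t at every timestep (see Appendix B).
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Timestep-wise KL against N(0, I), averaged over the T timesteps.
        kl_t = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # (batch, T)
        kl = kl_t.mean(dim=1)
        # Decoder input: last timestep (TWR-VAE), mean or sum (the two variants).
        if agg == "last":
            z_dec = z[:, -1]
        elif agg == "mean":
            z_dec = z.mean(dim=1)
        else:  # "sum"
            z_dec = z.sum(dim=1)
        return z_dec, kl
```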

TWR-VAE_mean and TWR-VAE_sum
In TWR-VAE, the input to the decoder is the latent variable sampled from the variational posterior at the final timestep of the encoder. While this is a reasonable design choice, we also explore two model variants of TWR-VAE, namely TWR-VAE_mean and TWR-VAE_sum (see Figure 1(c)). At each timestep, both variants sample a latent variable from the timestep-dependent variational posterior of the encoder.
For TWR-VAE_mean, the timestep-wise latent variables $\{z_t\}_{t=1}^{T}$ are sampled first and then averaged before being fed to the decoder. This leads to a different reconstruction loss compared to TWR-VAE (Eq. 3):

$$\mathbb{E}[\log P_\theta(x_i|\bar{z})], \quad \text{where } \bar{z} = \frac{1}{T}\sum_{t=1}^{T} z_t, \quad z_t \sim Q_\phi(z_t|x_i^{1:t}) \quad (5)$$

TWR-VAE_sum instead performs vector addition on the sampled latent variables $\{z_t\}_{t=1}^{T}$, and the corresponding reconstruction loss is:

$$\mathbb{E}[\log P_\theta(x_i|\hat{z})], \quad \text{where } \hat{z} = \sum_{t=1}^{T} z_t \quad (6)$$

For both TWR-VAE_mean and TWR-VAE_sum, the KL loss term is the same as in TWR-VAE, i.e., the second term of Eq. 3.

Experiment
We represent the input data with 512-dimensional word2vec embeddings (Mikolov et al., 2013) and set the dimension of the hidden layers of both the one-layer encoder and the decoder to 256. Appendix D gives more details.
We compare our TWR-VAE model with five strong baselines, including: VAE-LSTM, a VAE with an LSTM encoder and decoder that uses KL annealing to tackle the posterior collapse issue (Bowman et al., 2016); SA-VAE; and BN-VAE (Zhu et al., 2020), among others.
We report performance on four metrics: negative log likelihood (NLL); perplexity (PPL); the KL-divergence, which measures the distance between two probability distributions; and the mutual information between the input $x$ and the latent variable $z$, which measures how much information about $x$ is captured by $z$. Following Dieng et al. (2019) and He et al. (2019), the mutual information is formulated as

$$I(x, z_T) = \mathbb{E}_{x}\big[D_{KL}\big(Q_\phi(z_T|x)\,\|\,P(z_T)\big)\big] - D_{KL}\big(Q_\phi(z_T)\,\|\,P(z_T)\big)$$

where the second term is the KL divergence between the aggregated posterior and the prior, estimated with Monte Carlo estimators (see Appendix E for the full derivation).

Results. As shown in Table 2, our TWR-VAE outperforms all baselines on all datasets. Compared to the strongest baseline, BN-VAE, our model reduces NLL by 11.8 and PPL by 24.1 on average across the three datasets, showing superior performance in reconstructing input sentences. As also shown in Table 2, the two variants of TWR-VAE yield better performance than the baselines. For instance, TWR-VAE_mean outperforms all baselines on the PTB and Yahoo datasets and yields comparable results to BN-VAE on Yelp. This shows the effectiveness of our strategy of regularising timestep-wise variational posteriors.

Model generalisability and ablation studies. We also evaluate the model's generalisability by examining how well our timestep-wise regulariser works in different RNN architectures. To this end, we tested Basic-VAE_RNN and Basic-VAE_GRU (i.e., vanilla RNN and GRU models with KL annealing), as well as TWR-VAE_RNN and TWR-VAE_GRU (vanilla RNN and GRU with the timestep-wise regulariser). Experimental results in Table 3 show that our TWR models outperform the corresponding basic models on all evaluation metrics, regardless of the encoder architecture. This demonstrates the generalisability of our proposed architecture.
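To make the mutual information metric above concrete, the following is a Monte Carlo sketch of its estimation, assuming diagonal Gaussian posteriors and a standard normal prior (our own minimal implementation, not the paper's code):

```python
import math
import torch

def mutual_info(mu, logvar):
    """Estimate I(x, z) = E_x[ KL(Q(z|x) || P(z)) ] - KL(Q(z) || P(z)),
    with the aggregated-posterior term estimated from one z sample per input.
    mu, logvar: posterior parameters, shape (batch, z_dim)."""
    n, _ = mu.shape
    # First term: closed-form Gaussian KL, averaged over the batch.
    kl_per_x = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    # One reparameterised sample per input.
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()              # (n, d)
    # log Q(z) approximated via the aggregated posterior (1/n) * sum_j Q(z|x_j).
    var = logvar.exp()
    log_density = -0.5 * (((z.unsqueeze(1) - mu) ** 2) / var
                          + logvar + math.log(2 * math.pi)).sum(-1)   # (n, n)
    log_qz = torch.logsumexp(log_density, dim=1) - math.log(n)
    # log P(z) under the standard normal prior.
    log_pz = -0.5 * (z.pow(2) + math.log(2 * math.pi)).sum(-1)
    return kl_per_x.mean() - (log_qz - log_pz).mean()
```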
In addition, to understand how the proportion of timesteps on which KL regularisation is imposed affects the performance of our model, we run a battery of experiments with varying proportion settings. Concretely, we impose KL regularisation on the last 25%, 50%, and 75% of the timesteps of the TWR-VAE encoder, respectively. (NB: the KL regularisation is imposed on the final timestep for all model variants.) The results in Table 3 show that TWR-VAE_LSTM-last25 has the lowest performance on NLL and PPL, and that performance improves as a higher proportion of timesteps is regularised. In addition, comparing these three model variants with the baseline VAE-LSTM (which only imposes KL regularisation on the final timestep) shows that our models can effectively mitigate posterior collapse. This observation indicates that imposing KL regularisation on earlier timesteps is an effective strategy for mitigating posterior collapse. Moreover, the more timesteps we impose the KL regularisation on, the better the performance the model can yield (in terms of NLL and PPL).

Latent representation interpolation. We perform latent representation interpolation to assess how well the latent space of $z$ is learned by TWR-VAE compared with the strongest baseline, BN-VAE. Given a pair of sentences $x^1$ and $x^2$, we sample their latent codes $z_T^1$ and $z_T^2$ from the encoder and interpolate between them with $z_T = (1 - \alpha)\, z_T^1 + \alpha\, z_T^2$, varying the mixture weight $\alpha$.

Table 4: An example of interpolating the latent representation of two input sentences using BN-VAE and TWR-VAE on the Yelp15 test set (see the example for the Yahoo test set in Appendix G).

Input 1: this is the worst restaurant experience i 've ever had ! not only is this place super slow in service but the food was not fresh !
Input 2: i went to this place last month with my best friend and the food was good i love the coffee designs and the service was friendly .

BN-VAE:
α = 0: this place the worst restaurant i i have ever had . i only was the restaurant a overpriced , the , the food is not good and i
α = 0.2: this place joke ! the food was ok the was horrible . i ask for drink and came back to me . i will go back .
α = 0.4: this place joke ! the food was good horrible . i ask for a drink and check on me . i ask for a drink and check on me .
α = 0.6: i was try this place. disappointed . the food was not good it was just ok . the service was good the food was not price .
α = 0.8: i went lunch and the chicken and waffles . the food was good the service was horrible . i will go back .
α = 1: i went here this place for night and my family friend and i food was great . had the atmosphere and and the service was great . i

TWR-VAE:
α = 0: this is the worst restaurant i 've ever been ! service only was we restaurant was slow service but the food was not fresh !
α = 0.2: i love this place the food was very slow ! service is always slow and the food is not a good value so this was not my first choice .
α = 0.4: i have never been in this restaurant before the food was just ok and the service is very slow ! i will not continue to go back to this place .
α = 0.6: i have been here a few times now and the food was good !!! the food is good and i would recommend to and return
α = 0.8: i went here this past weekend to see how good the food was and my husband had the same thing i would recommend for the price .
α = 1: i went to this place for night and my family friend and the food was good and would the service and the service was friendly .
Table 4 shows example outputs obtained by varying the mixture weight $\alpha$. It can be observed that our model learns smoother representations than BN-VAE: the sentences generated from continuous samples in the latent code space preserve more consistent topical information along the interpolation path. Fewer UNK tokens occur in the sentences generated by our model, which suggests that the quality of the representations learned by our model is better than that of BN-VAE. In addition to qualitative evaluation, we also evaluate the outputs quantitatively with ROUGE (Lin, 2004), which compares the generated sentences against the human references. Concretely, for each sentence pair, we compute the ROUGE-1, ROUGE-2 and ROUGE-L F1 scores between the two input sentences (i.e., the references) and each interpolated sentence. The ROUGE scores averaged over all sentence pairs in the test set, for different $\alpha$ settings, are plotted in Figure 2. It can be observed that as the mixture weight $\alpha$ increases, the ROUGE values of our model smoothly decrease w.r.t. the first reference and increase w.r.t. the second, showing a smooth transition in sentence interpolation. One can also note that our model has higher ROUGE scores than BN-VAE at $\alpha = 0$ for reference one and at $\alpha = 1$ for reference two, showing that our model better learns latent representations and reconstructs the input sentences.
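A sketch of the interpolation procedure, assuming hypothetical `encode` and `decode` helpers that map a sentence to its final-timestep latent code and greedily decode a sentence from a latent code:

```python
def interpolate(encode, decode, x1, x2, alphas=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    # encode(x) -> z_T (a tensor); decode(z) -> generated sentence (a string).
    z1, z2 = encode(x1), encode(x2)
    # Linear mixture z_T = (1 - alpha) * z1 + alpha * z2 for each weight alpha.
    return [decode((1 - a) * z1 + a * z2) for a in alphas]
```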

Dialogue Response Generation
In addition to language modelling, we further evaluate how well our proposed architecture can help alleviate the problem of "generic responses" in dialogue systems (Huang et al., 2020; Wang et al., 2020). Dialogue systems built upon the sequence-to-sequence (seq2seq) model were found to tend to generate generic and dull responses, such as "I don't know" or "thank you" (Li et al., 2016). One effective solution is to use a more flexible intermediate representation between the encoder and the decoder of a seq2seq model with the help of a VAE, which models dialogue as a one-to-many problem and can therefore generate less generic responses. Such VAE-based dialogue response generators, similar to Shen et al. (2018), also face the problem of posterior collapse. Zhao et al. (2017) first addressed this issue by proposing the conditional VAE (CVAE) model, which utilises KL annealing and a bag-of-words loss. To test TWR-VAE on the dialogue response generation task, we extend TWR-VAE following the architecture of CVAE.
We represent each dialogue conversation as a combination of the dialogue context $c$ (with context window size $J$), the response utterance $x$ (the $(J+1)$-th utterance), and a latent representation $z$ which encodes the information of the context and captures a latent distribution over valid responses. Dialogue response generation can then be defined as $P_\theta(x|c) = \int P_\theta(x|z, c) P_\theta(z|c)\,dz$. Here, a variational posterior $Q_\phi(z|x, c)$ is used to approximate the true posterior $P_\theta(z|x, c)$. The ELBO of TWR-VAE can then be written as:

$$\mathcal{L} = \mathbb{E}_{Q_\phi(z_T|x, c)}[\log P_\theta(x|z_T, c)] - \frac{1}{T}\sum_{t=1}^{T} D_{KL}\big(Q_\phi(z_t|x^{1:t}, c)\,\|\,P_\theta(z_t|c)\big)$$

Setup. We conducted experiments on two popular benchmark datasets, namely Switchboard (SW) (Godfrey and Holliman, 1997) and Dailydialog (DD) (Li et al., 2017b). For dataset statistics, please refer to Table 1. Following the implementation of CVAE, we pair each response with 10 context utterances (i.e., $J = 10$) from both speakers. The utterance encoder is a one-layer bidirectional GRU with a hidden size of 300; both the context encoder and the decoder use a one-layer GRU with a hidden size of 300. The dimension of the latent variable is 200. Appendix F gives more details.
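Since the prior here is the learned conditional $P_\theta(z_t|c)$ rather than a fixed standard normal, the per-timestep KL term takes the general Gaussian-to-Gaussian closed form. A minimal sketch, assuming both distributions are diagonal Gaussians parameterised by the recognition and prior networks:

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ).
    The q-parameters would come from the recognition network Q_phi(z|x, c) and
    the p-parameters from the prior network P_theta(z|c); TWR-VAE averages this
    term over the encoder timesteps."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q - logvar_p).exp()
                  + (mu_q - mu_p).pow(2) / logvar_p.exp()
                  - 1).sum(-1)
```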
Apart from comparing TWR-VAE to CVAE and iVAE, we further report the results of two other competitive models for dialogue response generation, i.e., SeqGAN (Li et al., 2017a) and a conditional Wasserstein autoencoder called WAE (Gu et al., 2019).

Table 6: Four sample responses generated by iVAE and our model on the SW (top) and DD (bottom) datasets, given the context as input. The corresponding topic and target response (gold standard) are also listed. The generated utterances are different possible responses from the two models. We only show the last utterance of the dialogue context here due to the space limit (the actual context window is 10).

Following prior works (Gu et al., 2019; Fang et al., 2019), we report performance on three evaluation metrics: (1) BLEU scores as proposed by Zhao et al. (2017), which evaluate how many n-grams of the multiple generated responses match the references. Zhao et al. (2017) defined BLEU precision (BLEU-P) and recall (BLEU-R) as the average and maximum BLEU score, respectively, and defined BLEU-F as the combination of BLEU-P and BLEU-R; n < 4 is used in our evaluation. (2) BOW embedding (Liu et al., 2016), a cosine similarity of bag-of-words embeddings between the generated response and the reference. Three variants of BOW embedding were tested: Greedy, the average cosine similarity between word embeddings of the two utterances that are greedily matched (Rus and Lintean, 2012); Average, the cosine similarity between the averaged word embeddings of the two utterances (Mitchell and Lapata, 2008); and Extreme, the cosine similarity between the largest extreme values in the word embeddings of the two utterances (Pennington et al., 2014). (3) Dist (Gu et al., 2019), which measures the diversity of the generated dialogue responses by calculating the ratio of unique n-grams (n = 1, 2) over all n-grams in the generated responses. Two types of Dist (intra-dist and inter-dist) were tested, calculated within a single sampled response and between different responses, respectively. For each context in the test set, we generate 10 responses with each model and calculate the aforementioned metrics averaged over all responses.

Experiment Results. As shown in Table 5, our model yields a stable improvement over most evaluation metrics compared to the baselines. Specifically, there are significant improvements on Dist for SW and on BLEU for DD, respectively, indicating that our model can generate relevant, contentful and diverse dialogue responses. There are some metrics on which our model does not outperform the state-of-the-art baselines, but the differences are small. We also show in Table 6 two example responses generated by TWR-VAE and the best baseline, iVAE. In the first example, our model generates more topically relevant responses than iVAE, which implies that the latent variable of TWR-VAE can capture hidden topic information in the dialogue conversation. In the second example, the responses generated by TWR-VAE are more diverse and contentful than the baseline's, and their content can also provide more topics and facilitate the continuation of the conversation.
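As an illustration of the Dist metric described above, a minimal sketch implementing its definition (our own implementation, not the paper's code):

```python
from collections import Counter

def dist_n(responses, n=1):
    """Dist-n: ratio of unique n-grams to total n-grams over a list of
    tokenised responses. Inter-dist pools the different sampled responses
    for one context; intra-dist is computed on a single response."""
    ngrams = Counter()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0
```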

Conclusion
In this paper, in order to address the posterior collapse issue of the VAE in text modelling, we propose a simple and generic model called Timestep-Wise Regularisation VAE, which imposes KL regularisation on the latent variables of every timestep of the encoder. Empirical results on language modelling show that our model gives better performance than all baselines while avoiding posterior collapse. Ablation studies show that the timestep-wise regularisation can easily be applied to different RNN-based VAE models and improve their performance. In addition, we evaluate the timestep-wise regularisation on the dialogue response generation task, and the results suggest that our model yields better or comparable performance to the state-of-the-art and can generate relevant, contentful and diverse responses.

B The reparameterisation trick for our timestep-wise latent variables

If TWR-VAE directly samples $z_t$ from $Q_\phi(z_t|x_i^{1:t})$, this sampling operation is non-differentiable. A reparameterisation trick was proposed by Kingma and Welling (2014) to solve this issue. Nevertheless, our TWR-VAE samples multiple $z_t$ at different timesteps, so we modify the form of each $Q_\phi(z_t|x_i^{1:t})$ such that the mean and covariance do not directly depend on $z_{t-1}$. Using the reparameterisation trick, $z_t$ can be sampled as:

$$z_t = \mu_t + \sigma_t \odot \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)$$

where $h_t$ is the hidden state of the LSTM at timestep $t$, and the mean $\mu_t$ and covariance $\sigma_t$ are calculated from $h_t$ via two linear transformation layers.
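A small runnable check of the reparameterised sampling above, showing that gradients flow back into $\mu_t$ and the log-variance (the dimensions here are arbitrary and illustrative):

```python
import torch

# Posterior parameters for one timestep, tracked by autograd.
mu_t = torch.zeros(32, requires_grad=True)
logvar_t = torch.zeros(32, requires_grad=True)

eps_t = torch.randn(32)                      # eps_t ~ N(0, I)
z_t = mu_t + (0.5 * logvar_t).exp() * eps_t  # z_t = mu_t + sigma_t * eps_t

z_t.sum().backward()                         # gradients flow to mu_t, logvar_t
assert mu_t.grad is not None and logvar_t.grad is not None
```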
C The derivation of the gradient optimisation of θ and φ (Eq. 4)

When optimising $\theta$ and $\phi$, we use the Monte Carlo method (Metropolis and Ulam, 1949) to construct a Monte Carlo estimator, which yields unbiased gradients of $\theta$:

$$\nabla_\theta \, \mathbb{E}_{Q_\phi(z_T|x_i^{1:T})}[\log P_\theta(x_i|z_T)] = \mathbb{E}_{Q_\phi(z_T|x_i^{1:T})}[\nabla_\theta \log P_\theta(x_i|z_T)] \approx \frac{1}{M}\sum_{m=1}^{M} \nabla_\theta \log P_\theta(x_i|z_T^{(m)})$$

which is an unbiased Monte Carlo gradient estimator approximating the expectation, where $M$ indicates the total number of times that we randomly sample $z_T^{(m)}$ from $Q_\phi(z_T|x_i^{1:T})$ for the approximation.
When applying a similar method to obtain the unbiased gradients of $\phi$, there is an obstacle: the expectation itself is taken with respect to $Q_\phi$, so the gradient operator cannot be moved inside the expectation directly, i.e.,

$$\nabla_\phi \, \mathbb{E}_{Q_\phi(z|x_i)}[\log P_\theta(x_i|z)] \neq \mathbb{E}_{Q_\phi(z|x_i)}[\nabla_\phi \log P_\theta(x_i|z)]$$

However, we can tackle this issue by using the reparameterisation trick proposed by Kingma and Welling (2014). We choose a differentiable and invertible function $g_\phi(x, \epsilon)$ with a random variable $\epsilon$ to replace sampling from $Q_\phi(z|x_i)$, namely $z = g_\phi(x, \epsilon)$, where $\epsilon \sim P(\epsilon)$. We choose $\mathcal{N}(0, I)$ as $P(\epsilon)$, and we can then use the Monte Carlo estimator to approximate the expectation:

$$\nabla_\phi \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[\log P_\theta(x_i|g_\phi(x_i, \epsilon))] \approx \frac{1}{M}\sum_{m=1}^{M} \nabla_\phi \log P_\theta(x_i|g_\phi(x_i, \epsilon_m))$$

Overall, the gradients of $\theta$ and $\phi$ of the ELBO can be formed by combining these estimators for the reconstruction term with the analytic gradients of the Gaussian KL term in Eq. 3.

D Training Details for Language Modelling
We represent the input data with 512-dimensional word2vec embeddings (Mikolov et al., 2013) and set the dimension of the hidden layers of both the one-layer encoder and the decoder to 256. The dimension of the latent variable is 32. No gradient clipping is used during training. The Adam optimiser (Kingma and Ba, 2015) is used for training, with an initial learning rate of 1e-4 and a weight decay of 1e-5. Each sentence in a mini-batch is padded to the maximum length of that batch, and the maximum batch size allowed is 64.
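A sketch of this training configuration in PyTorch; the LSTM stands in for the full model, and the tensors are dummy data (all names are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for the full TWR-VAE (one-layer encoder/decoder, hidden size 256).
model = nn.LSTM(input_size=512, hidden_size=256, num_layers=1, batch_first=True)

# Adam with the stated learning rate and weight decay; no gradient clipping.
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Each mini-batch is padded to the longest sentence in that batch (batch size <= 64).
batch = nn.utils.rnn.pad_sequence(
    [torch.randn(10, 512), torch.randn(7, 512)], batch_first=True)  # (2, 10, 512)
```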

Table 7: An example of interpolating the latent representation of two input sentences using BN-VAE and TWR-VAE on the Yahoo test set.

Input 1: wher can i find a poem called " in flight " ? it has something to do with death dunno
Input 2: where can i find dinosaur books for my 3 yr old son ? just check with your local library .

BN-VAE:
α = 0: can can i find a list about " UNK the " ? i is to to do with the . .
α = 0.2: can tell me what is the name of the song on the UNK and the UNK ? i think it is a UNK song .
α = 0.4: where can i find a list of all the UNK in the world ? i need to find a list of the UNK and UNK of the UNK .
α = 0.6: where can i find a list of all the UNK in the world ? i need to find a list of the UNK and UNK of the UNK .
α = 0.8: where can i find a list of all the UNK in the world ? i need to find a list of the UNK and UNK of the UNK .
α = 1: where can i find a UNK ? free son year old son ? i go out the local library . they

TWR-VAE:
α = 0: where can i find a pic in " in touch attendant ? it has been to do with someone and what
α = 0.2: in my opinion what can be done ? it 's a poem for me on myspace .com and some people have no clue
α = 0.4: where can i find an old testament to find out how old it was ? i 'm looking at a photograph of albert einstein .
α = 0.6: where can i find an old book for someone who has an old son ? i need to know how to do it !!
α = 0.8: where can i find info on my research for an anatomy book ? try these links to your local newspaper . good luck
α = 1: where can i find info for my son year old son ? try be out your local library . good
E The derivation of the mutual information I(x, z_T)

The mutual information can be written as $I(x, z_T) = \mathbb{E}_{p_d(x)}\big[D_{KL}\big(Q_\phi(z_T|x)\,\|\,P(z_T)\big)\big] - D_{KL}\big(Q_\phi(z_T)\,\|\,P(z_T)\big)$, which follows from the identity $\mathbb{E}_{p_d(x)}\big[D_{KL}\big(Q_\phi(z_T|x)\,\|\,P(z_T)\big)\big] = I(x, z_T) + D_{KL}\big(Q_\phi(z_T)\,\|\,P(z_T)\big)$, where $Q_\phi(z_T) = \mathbb{E}_{p_d(x)}[Q_\phi(z_T|x)]$ is the aggregated posterior.
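A compact reconstruction of the derivation (this is the standard decomposition of the averaged KL term, written in our notation):

```latex
\begin{align*}
\mathbb{E}_{p_d(x)}\!\left[D_{KL}\!\big(Q_\phi(z_T|x)\,\|\,P(z_T)\big)\right]
 &= \mathbb{E}_{p_d(x)}\,\mathbb{E}_{Q_\phi(z_T|x)}\!\left[\log\frac{Q_\phi(z_T|x)}{P(z_T)}\right] \\
 &= \mathbb{E}_{p_d(x)}\,\mathbb{E}_{Q_\phi(z_T|x)}\!\left[\log\frac{Q_\phi(z_T|x)}{Q_\phi(z_T)}\right]
  + \mathbb{E}_{Q_\phi(z_T)}\!\left[\log\frac{Q_\phi(z_T)}{P(z_T)}\right] \\
 &= I(x, z_T) + D_{KL}\!\big(Q_\phi(z_T)\,\|\,P(z_T)\big).
\end{align*}
```

Rearranging gives the formula used in the Experiment section, with the second term estimated by Monte Carlo.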

F Training Details for Dialogue Response Generation
Our model follows the implementation details of the CVAE (Zhao et al., 2017). The word embedding size is 200, initialised from Glove embeddings pre-trained on Twitter (Pennington et al., 2014). The utterance encoder is a one-layer bidirectional GRU with a hidden size of 300, and both the context encoder and the decoder use a one-layer GRU with a hidden size of 300. The recognition network is a one-layer feed-forward network, and the prior network is a two-layer feed-forward network with a tanh non-linearity for Gaussian prior sampling. The dimension of the latent variable is 200. The context window size $J$ is 10. The initial weights for the recognition and prior networks are sampled from a uniform distribution over [-0.02, 0.02]. The vocabulary size is 10,000 and all out-of-vocabulary words are mapped to an "<unk>" token. A greedy decoding mode is used to sample dialogue responses, ensuring that the randomness comes only from the latent variables. The entire model is trained with the Adam optimiser, with an initial learning rate of 1e-4 and a weight decay of 1e-5. Gradient clipping is not used.

G Examples of the latent representation interpolation on the Yahoo test dataset
As shown in Table 7, fewer UNK tokens and repeated words occur in the interpolated sentences generated by our model compared to BN-VAE. Figure 3 shows that our model has higher ROUGE scores than BN-VAE at $\alpha = 0$ for reference one and at $\alpha = 1$ for reference two. Moreover, the ROUGE-L scores of our model are even higher than the ROUGE-1 scores of BN-VAE at $\alpha \in \{0.1, 0.2, 0.3\}$ for reference one and at $\alpha \in \{0.7, 0.8, 0.9\}$ for reference two.

Figure 1: Architectures of the proposed TWR-VAE models and the basic VAE-RNN model.

Figure 2: The average ROUGE-1, ROUGE-2 and ROUGE-L F1 scores between the two input references and the 11 interpolations of each group, using BN-VAE and TWR-VAE on the Yelp15 test dataset (Appendix G shows the results on the Yahoo dataset).

Figure 3: The average ROUGE-1, ROUGE-2 and ROUGE-L F1 scores between the two input references and the 11 interpolations of each group, using BN-VAE and TWR-VAE on the Yahoo test dataset.

Table 1: The statistics of the PTB, Yelp 2015, Yahoo, SW and DD datasets.

Timestep-wise regularisation allows robust model learning and can effectively mitigate posterior collapse (see §4 Experiment for a detailed discussion). Compared to the HR-VAE of Li et al. (2019b), our model does not concatenate the cell state of the encoder at each timestep, and the dimension of the latent variable of TWR-VAE is only 32, whereas for HR-VAE it is 512, which is much larger. This enables the proposed TWR-VAE model to have fewer parameters than the HR-VAE. In addition, the training speed of TWR-VAE is six times faster than that of HR-VAE, achieved by parallelising the timestep-wise KL regularisation. Following Kingma and Welling (2014), a reparameterisation trick is used to make the timestep-wise latent variable sampling differentiable. During the gradient optimisation of θ and φ, we use the Monte Carlo method (Metropolis and Ulam, 1949) to construct a Monte Carlo estimator, which obtains unbiased gradients of θ and φ (see Appendices B and C for the detailed derivation).

Table 2: Language modelling results of all baselines and our models on the PTB, Yelp15 and Yahoo test datasets. The results of all baselines are reported based on Li et al. (2019a) and Zhu et al. (2020). ↓ denotes lower is better and ↑ higher is better.

Table 3: Ablation study results of all variants of our model on the Yelp15 and Yahoo test datasets.