A Hierarchical Latent Structure for Variational Conversation Modeling

Variational autoencoders (VAEs) combined with hierarchical RNNs have emerged as a powerful framework for conversation modeling. However, they suffer from the notorious degeneration problem, where the decoders learn to ignore latent variables and reduce to vanilla RNNs. We empirically show that this degeneracy occurs mostly due to two reasons. First, the expressive power of hierarchical RNN decoders is often high enough to model the data using only their decoding distributions, without relying on the latent variables. Second, the conditional VAE structure, whose generation process is conditioned on a context, makes the range of training targets very sparse; that is, the RNN decoders can easily overfit to the training data while ignoring the latent variables. To solve the degeneration problem, we propose a novel model named Variational Hierarchical Conversation RNN (VHCR), which involves two key ideas: (1) using a hierarchical structure of latent variables, and (2) exploiting an utterance drop regularization. With evaluations on two datasets, Cornell Movie Dialog and Ubuntu Dialog Corpus, we show that our VHCR successfully utilizes latent variables and outperforms state-of-the-art models for conversation generation. Moreover, it can perform several new utterance control tasks, thanks to its hierarchical latent structure.


Introduction
Conversation modeling has long been an interest of natural language research. Recent approaches for data-driven conversation modeling mostly build upon recurrent neural networks (RNNs) (Vinyals and Le, 2015; Sordoni et al., 2015b; Shang et al., 2015; Li et al., 2017; Serban et al., 2016). Serban et al. (2016) use a hierarchical RNN structure to model the context of conversation. Serban et al. (2017) further exploit an utterance latent variable in the hierarchical RNNs by incorporating the variational autoencoder (VAE) framework (Kingma and Welling, 2014; Rezende et al., 2014).
VAEs enable us to train a latent variable model for natural language, which grants several advantages. First, latent variables can learn an interpretable holistic representation, such as topics, tones, or high-level syntactic properties. Second, latent variables can model the inherently abundant variability of natural language by encoding its global and long-term structure, which is hard to capture with shallow generative processes (e.g. vanilla RNNs) where the only source of stochasticity is the sampling of output words.
In spite of such appealing properties of latent variable models for natural language, VAEs suffer from the notorious degeneration problem (Bowman et al., 2016; Chen et al., 2017) that occurs when a VAE is combined with a powerful decoder such as an autoregressive RNN. This issue makes VAEs ignore latent variables and eventually behave as vanilla RNNs. Chen et al. (2017) also note this degeneration issue, showing from a bits-back coding perspective that a VAE with an RNN decoder prefers to model the data using its decoding distribution rather than using latent variables. To resolve this issue, several heuristics have been proposed to weaken the decoder and enforce the use of latent variables. For example, Bowman et al. (2016) propose KL annealing and word drop regularization. However, these heuristics are not a complete solution; for example, we observe that they fail to prevent the degeneracy in VHRED (Serban et al., 2017), a conditional VAE model equipped with hierarchical RNNs for conversation modeling.
The objective of this work is to propose a novel VAE model that significantly alleviates the degeneration problem. Our analysis reveals that the causes of the degeneracy are two-fold. First, the hierarchical structure of autoregressive RNNs is powerful enough to predict a sequence of utterances without the need of latent variables, even with the word drop regularization. Second, we newly discover that the conditional VAE structure, where an utterance is generated conditioned on a context, i.e. a previous sequence of utterances, induces severe data sparsity. Even with a large-scale training corpus, there exist only very few target utterances for a given context. Hence, the hierarchical RNNs can easily memorize the context-to-utterance relations without relying on latent variables.
We propose a novel model named Variational Hierarchical Conversation RNN (VHCR), which involves two novel features to alleviate this problem. First, we introduce a global conversational latent variable along with local utterance latent variables to build a hierarchical latent structure. Second, we propose a new regularization technique called utterance drop. We show that our hierarchical latent structure is not only crucial for facilitating the use of latent variables in conversation modeling, but also delivers several additional advantages, including control over the global context in which a conversation takes place.
Our major contributions are as follows: (1) We reveal that the existing conditional VAE model with hierarchical RNNs for conversation modeling (e.g. Serban et al. (2017)) still suffers from the degeneration problem, and that this problem is caused by the data sparsity per context that arises from the conditional VAE structure, as well as by the use of powerful hierarchical RNN decoders.
(2) We propose a novel Variational Hierarchical Conversation RNN (VHCR), which has two distinctive features: a hierarchical latent structure and a new regularization of utterance drop. To the best of our knowledge, our VHCR is the first VAE conversation model that exploits a hierarchical latent structure.
(3) With evaluations on two benchmark datasets, Cornell Movie Dialog (Danescu-Niculescu-Mizil and Lee, 2011) and Ubuntu Dialog Corpus (Lowe et al., 2015), we show that our model improves conversation performance in multiple metrics over state-of-the-art methods, including HRED (Serban et al., 2016) and VHRED (Serban et al., 2017) with existing degeneracy solutions such as the word drop (Bowman et al., 2016) and the bag-of-words loss (Zhao et al., 2017).

Related Work
Conversation Modeling. One popular approach for conversation modeling is to use RNN-based encoders and decoders, such as (Vinyals and Le, 2015; Sordoni et al., 2015b; Shang et al., 2015). Hierarchical recurrent encoder-decoder (HRED) models (Sordoni et al., 2015a; Serban et al., 2016, 2017) consist of an utterance encoder and decoder, and a context RNN which runs over utterance representations to model the long-term temporal structure of conversation.
Recently, latent variable models such as VAEs have been adopted in language modeling (Bowman et al., 2016; Zhang et al., 2016; Serban et al., 2017). The VHRED model (Serban et al., 2017) integrates the VAE with the HRED to model Twitter and Ubuntu IRC conversations by introducing an utterance latent variable. This makes a conditional VAE where the generation process is conditioned on the context of conversation. Zhao et al. (2017) further make use of discourse act labels to capture the diversity of conversations.
Degeneracy of Variational Autoencoders. For sequence modeling, VAEs are often merged with the RNN encoder-decoder structure (Bowman et al., 2016; Serban et al., 2017; Zhao et al., 2017), where the encoder predicts the posterior distribution of a latent variable z, and the decoder models the output distributions conditioned on z. However, Bowman et al. (2016) report that a VAE with an RNN decoder easily degenerates; that is, it learns to ignore the latent variable z and falls back to a vanilla RNN. They propose two techniques to alleviate this issue: KL annealing and word drop. Chen et al. (2017) interpret this degeneracy in the context of bits-back coding and show that a VAE equipped with autoregressive models such as RNNs often ignores the latent variable to minimize the code length needed for describing data. They propose to constrain the decoder to selectively encode the information of interest in the latent variable. However, their empirical results are limited to the image domain. Zhao et al. (2017) use an auxiliary bag-of-words loss on the latent variable to force the model to use z. That is, they train an auxiliary network that predicts a bag-of-words representation of the target utterance based on z. Yet this loss works in the opposite direction to the original objective of VAEs, which minimizes the minimum description length. Thus, it may be in danger of forcibly moving information that is better modeled in the decoder into the latent variable.

Approach
We assume that the training set consists of N i.i.d. samples of conversations {c_1, c_2, ..., c_N}, where each c_i is a sequence of utterances (i.e. sentences) {x_i1, x_i2, ..., x_in_i}. Our objective is to learn the parameters of a generative network θ using Maximum Likelihood Estimation (MLE):

θ* = argmax_θ Σ_{i=1}^{N} log p_θ(c_i).   (1)

We first briefly review the VAE, and explain the degeneracy issue before presenting our model.

Preliminary: Variational Autoencoder
We follow the notation of Kingma and Welling (2014). A datapoint x is generated from a latent variable z, which is sampled from some prior distribution p(z), typically a standard Gaussian distribution N(z|0, I). We assume parametric families for the conditional distribution p_θ(x|z). Since it is intractable to compute the log-marginal likelihood log p_θ(x), we approximate the intractable true posterior p_θ(z|x) with a recognition model q_φ(z|x) and maximize the variational lower bound:

log p_θ(x) ≥ −KL(q_φ(z|x) || p(z)) + E_{q_φ(z|x)}[log p_θ(x|z)].   (2)

Eq. 2 decomposes into two terms: a KL divergence term and a reconstruction term. The KL divergence measures the amount of information encoded in the latent variable z. In the extreme where the KL divergence is zero, the model completely ignores z, i.e. it degenerates. The expectation term can be stochastically approximated by sampling z from the variational posterior q_φ(z|x). The gradients to the recognition model can be efficiently estimated using the reparameterization trick (Kingma and Welling, 2014).

VHRED

Serban et al. (2017) propose the Variational Hierarchical Recurrent Encoder-Decoder (VHRED) model for conversation modeling. It integrates an utterance latent variable z^utt_t into the HRED structure (Sordoni et al., 2015a), which consists of three RNN components: an encoder RNN, a context RNN, and a decoder RNN. Given a previous sequence of utterances x_1, ..., x_{t−1} in a conversation, the VHRED generates the next utterance x_t as:

h^enc_{t−1} = f^enc_θ(x_{t−1}),   (3)
h^cxt_t = f^cxt_θ(h^cxt_{t−1}, h^enc_{t−1}),   (4)
p_θ(z^utt_t | x_{<t}) = N(z | μ_t, σ_t^2 I),   (5)
μ_t = MLP_θ(h^cxt_t),   (6)
σ_t = softplus(MLP_θ(h^cxt_t)),   (7)
x_t ∼ p_θ(x_t | z^utt_t, x_{<t}) = f^dec_θ(x_t | h^cxt_t, z^utt_t).   (8)

At time step t, the encoder RNN f^enc_θ takes the previous utterance x_{t−1} and produces an encoder vector h^enc_{t−1} (Eq. 3). The context RNN f^cxt_θ models the context of the conversation by updating its hidden state with the encoder vector (Eq. 4). The context h^cxt_t defines the conditional prior p_θ(z^utt_t | x_{<t}), which is a factorized Gaussian distribution whose mean μ_t and diagonal variance σ_t are given by feed-forward neural networks (Eq. 5-7). Finally, the decoder RNN f^dec_θ generates the utterance x_t, conditioned on the context vector h^cxt_t and the latent variable z^utt_t (Eq. 8). We make two important notes: (1) the context RNN can be viewed as a high-level decoder, and together with the decoder RNN they comprise a hierarchical RNN decoder. (2) The VHRED follows a conditional VAE structure where each utterance x_t is generated conditioned on the context h^cxt_t (Eq. 5-8). The variational posterior is a factorized Gaussian distribution whose mean μ′_t and diagonal variance σ′_t are predicted from the target utterance and the context:

q_φ(z^utt_t | x_{≤t}) = N(z | μ′_t, σ′_t^2 I), where μ′_t = MLP_φ(x_t, h^cxt_t) and σ′_t = softplus(MLP_φ(x_t, h^cxt_t)).
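To make the preliminaries concrete, the following is a minimal schematic sketch (our own, not the authors' implementation) of the reparameterization trick, the closed-form Gaussian KL term of Eq. 2, and a single VHRED generation step (Eq. 3-8); the `f_*` networks are stand-in callables for the actual RNNs and MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients can flow
    # through mu and log_var (Kingma and Welling, 2014)
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    # KL( N(mu, sigma^2 I) || N(0, I) ) in closed form; this is the
    # term that collapses to zero when the model degenerates
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def vhred_step(f_enc, f_cxt, prior_net, f_dec, x_prev, h_cxt_prev):
    # One generation step of VHRED (Eq. 3-8); all f_* are placeholders
    h_enc = f_enc(x_prev)                 # Eq. 3: encode previous utterance
    h_cxt = f_cxt(h_cxt_prev, h_enc)      # Eq. 4: update context RNN state
    mu, log_var = prior_net(h_cxt)        # Eq. 5-7: conditional prior
    z_utt = reparameterize(mu, log_var)   # sample utterance latent variable
    x_t = f_dec(h_cxt, z_utt)             # Eq. 8: decode next utterance
    return x_t, h_cxt
```

The same `gaussian_kl` quantity is what Fig. 1 and Fig. 2 track during training.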

The Degeneration Problem
A known problem of a VAE that incorporates an autoregressive RNN decoder is the degeneracy that ignores the latent variable z. In other words, the KL divergence term in Eq. 2 goes to zero and the decoder fails to learn any dependency between the latent variable and the data. Eventually, the model behaves as a vanilla RNN. This problem is first reported for the sentence VAE (Bowman et al., 2016), in which the following two heuristics are proposed to alleviate the problem by weakening the decoder.
First, KL annealing scales the KL divergence term of Eq. 2 by a KL multiplier λ, which gradually increases from 0 to 1 during training:

L = −λ · KL(q_φ(z|x) || p(z)) + E_{q_φ(z|x)}[log p_θ(x|z)].

This helps the optimization process avoid local optima of zero KL divergence in early training. Second, the word drop regularization randomly replaces some conditioned-on word tokens in the RNN decoder with the generic unknown word token (UNK) during training. Normally, the RNN decoder predicts each next word in an autoregressive manner, conditioned on the previous sequence of ground-truth (GT) words. By randomly replacing a GT word with an UNK token, the word drop regularization weakens the autoregressive power of the decoder and forces it to rely on the latent variable to predict the next word. The word drop probability is normally set to 0.25, since using a higher probability may degrade model performance (Bowman et al., 2016).
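The two heuristics can be sketched as follows; this is a minimal illustration in plain NumPy, where the function names and the linear annealing schedule are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def kl_multiplier(step, anneal_steps):
    # KL annealing: lambda rises linearly from 0 to 1, then stays at 1
    return min(1.0, step / anneal_steps)

def word_drop(tokens, unk_id, p=0.25, rng=None):
    # Word drop: replace each conditioned-on ground-truth token with
    # the UNK token with probability p, weakening the decoder
    rng = rng or np.random.default_rng()
    tokens = np.array(tokens, copy=True)
    mask = rng.random(tokens.shape) < p
    tokens[mask] = unk_id
    return tokens
```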
However, we observe that these tricks do not solve the degeneracy of the VHRED in conversation modeling. The example in Fig. 1 shows that the VHRED learns to ignore the utterance latent variable as the KL divergence term falls to zero.

Empirical Observation on Degeneracy
The decoder RNN of the VHRED in Eq. 8 conditions on two information sources: the deterministic h^cxt_t and the stochastic z^utt_t. In order to check whether the presence of the deterministic source h^cxt_t causes the degeneration, we drop h^cxt_t and condition the decoder only on the stochastic utterance latent variable z^utt_t:

x_t ∼ f^dec_θ(x_t | z^utt_t).   (13)

While this model achieves higher values of KL divergence than the original VHRED, as training proceeds it again degenerates, with the KL divergence term reaching zero (Fig. 2).
To gain insight into the degeneracy, we examine how the conditional prior p_θ(z^utt_t | x_{<t}) (Eq. 5) of the utterance latent variable changes during training, using the model above (Eq. 13). Fig. 2 plots the ratio E[σ_t^2]/Var(μ_t), where E[σ_t^2] is the within variance of the priors, and Var(μ_t) is the between variance of the priors. Note that this ratio is closely related to the classical Analysis of Variance (ANOVA) (Lomax and Hahs-Vaughn, 2013). The ratio gradually falls to zero, implying that the priors degenerate to separate point masses as training proceeds. Moreover, we find that the degeneracy of the priors coincides with the degeneracy of the KL divergence, as shown in Fig. 2. This is intuitively natural: if the prior is already narrow enough to specify the target utterance, there is little pressure to encode any more information in the variational posterior for the reconstruction of the target utterance.
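The within/between-variance ratio plotted in Fig. 2 can be computed as follows; this is our own sketch, where `mus` and `sigmas` stack the conditional prior parameters μ_t and σ_t over many contexts (rows) and latent dimensions (columns).

```python
import numpy as np

def prior_collapse_ratio(mus, sigmas):
    # E[sigma_t^2] / Var(mu_t): within variance of the conditional priors
    # over the between variance of their means, averaged over dimensions.
    # A ratio near zero means the priors have shrunk to point masses.
    within = np.mean(sigmas ** 2)            # E[sigma_t^2]
    between = np.mean(np.var(mus, axis=0))   # Var(mu_t) across contexts
    return within / between
```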
This empirical observation implies that the fundamental reason behind the degeneration originates from the combination of two factors: (1) the strong expressive power of the hierarchical RNN decoder, and (2) the training data sparsity caused by the conditional VAE structure. The VHRED is trained to predict a next target utterance x_t conditioned on the context h^cxt_t, which encodes information about the previous utterances {x_1, ..., x_{t−1}}. However, conditioning on the context makes the range of training targets x_t very sparse; even in a large-scale conversation corpus such as Ubuntu Dialog (Lowe et al., 2015), there exist one or very few target utterances per context. Therefore, hierarchical RNNs, given their autoregressive power, can easily overfit to the training data without using the latent variable. Consequently, the VHRED does not encode any information in the latent variable, i.e. it degenerates. This explains why the word drop fails to prevent the degeneracy in the VHRED. The word drop only regularizes the decoder RNN; the context RNN remains powerful enough to predict a next utterance in a given context even with the weakened decoder RNN. Indeed, we observe that using a larger word drop probability such as 0.5 or 0.75 only slows down, but fails to stop, the KL divergence from vanishing.

Variational Hierarchical Conversation RNN (VHCR)
As discussed, we argue that the two main causes of the degeneration are (i) the expressiveness of the hierarchical RNN decoders, and (ii) the conditional VAE structure that induces data sparsity. This finding suggests that in order to train a non-degenerate latent variable model, we need a design that provides an appropriate way to regularize the hierarchical RNN decoders and alleviate the data sparsity per context, while remaining capable of modeling the complex structure of conversation. Based on these insights, we propose a novel VAE structure named Variational Hierarchical Conversation RNN (VHCR), whose graphical model is illustrated in Fig. 3. Below we first describe the model, and then discuss its unique features.

We introduce a global conversation latent variable z^conv which is responsible for generating a sequence of utterances of a conversation c = {x_1, ..., x_n}. Overall, the VHCR builds upon the hierarchical RNNs, following the VHRED (Serban et al., 2017). One key update is to form a hierarchical latent structure, by using the global latent variable z^conv per conversation, along with the local latent variables z^utt_t injected at each utterance (Fig. 3). For inference of z^conv, we use a bi-directional RNN, denoted by f^conv, which runs over the utterance vectors generated by the encoder RNN. The posteriors for the local variables z^utt_t are then conditioned on z^conv.

Our solution to the degeneration problem is based on two ideas. The first idea is to build a hierarchical latent structure of z^conv for a conversation and z^utt_t for each utterance. As z^conv is independent of the conditional structure, it does not suffer from the data sparsity problem. However, the expressive power of the hierarchical RNN decoders makes the model still prone to ignore the latent variables z^conv and z^utt_t. Therefore, our second idea is to apply an utterance drop regularization to effectively regularize the hierarchical RNNs and facilitate the use of latent variables. That is, at each time step, the utterance encoder vector h^enc_t is randomly replaced with a generic unknown vector h^unk with probability p. This regularization weakens the autoregressive power of the hierarchical RNNs and also alleviates the data sparsity problem, since it injects noise into the context vector h^cxt_t which conditions the decoder RNN. The difference from the word drop (Bowman et al., 2016) is that our utterance drop suppresses the hierarchical RNN decoders as a whole, while the word drop only weakens the lower-level decoder RNN. Fig. 4 confirms that with an utterance drop probability of 0.25, the VHCR effectively learns to use latent variables, achieving a significant degree of KL divergence.
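The utterance drop itself is simple to state in code. The sketch below is ours, under the assumption that `h_unk` is a generic (or learned) unknown vector of the same dimensionality as the encoder vectors.

```python
import numpy as np

def utterance_drop(h_enc_seq, h_unk, p=0.25, rng=None):
    # Replace each utterance encoder vector h_enc_t with the generic
    # unknown vector h_unk with probability p, injecting noise into the
    # context RNN and weakening the hierarchical decoder as a whole
    rng = rng or np.random.default_rng()
    out = np.array(h_enc_seq, copy=True, dtype=float)
    mask = rng.random(len(out)) < p
    out[mask] = h_unk
    return out
```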

Effectiveness of Hierarchical Latent Structure
Is the hierarchical latent structure of the VHCR crucial for effective utilization of latent variables?
We investigate this question by applying the utterance drop to the VHRED, which lacks any hierarchical latent structure. We observe that the KL divergence still vanishes (Fig. 4), even though the utterance drop injects considerable noise into the context h^cxt_t. We argue that the utterance drop weakens the context RNN, which consequently fails to predict a reasonable prior distribution for z^utt_t (Eq. 5-7). If the prior is far away from the region of z^utt_t that can generate a correct target utterance, encoding information about the target in the variational posterior incurs a large KL divergence penalty. If the penalty outweighs the gain of the reconstruction term in Eq. 2, the model learns to ignore z^utt_t in order to maximize the variational lower bound of Eq. 2.
On the other hand, the global variable z^conv allows the VHCR to predict a reasonable prior for the local variable z^utt_t even in the presence of the utterance drop regularization. That is, z^conv can act as a guide for z^utt_t by encoding the information shared by the local variables. This reduces the KL divergence penalty induced by encoding information in z^utt_t to an affordable degree, at the cost of the KL divergence incurred by using z^conv. This trade-off is indeed a fundamental strength of hierarchical models, which provide parsimonious representations: if there exists any information shared among the local variables, it is coded in the global latent variable, reducing the code length by effectively reusing the information. The remaining local variability is handled by the decoding distribution and the local latent variables.
The global variable z^conv provides further benefits by representing the latent global structure of a conversation, such as its topic, length, and tone. Moreover, it allows us to control such global properties, which is impossible for models without a hierarchical latent structure.

Results
We first describe our experimental setting, such as datasets and baselines (section 4.1). We then report quantitative comparisons using three different metrics (sections 4.2-4.4). Finally, we present qualitative analyses, including several utterance control tasks that are enabled by the hierarchical latent structure of our VHCR (section 4.5). We defer implementation details and additional experimental results to the appendix.
Performance Measures. Automatic evaluation of conversational systems is still a challenging problem (Liu et al., 2016). Based on the literature, we report three quantitative metrics: (i) the negative log-likelihood (the variational bound for variational models), (ii) embedding-based metrics (Serban et al., 2017), and (iii) human evaluation via Amazon Mechanical Turk (AMT).

Results of Negative Log-likelihood
Table 1 summarizes the per-word negative log-likelihood (NLL) evaluated on the test sets of the two datasets. For variational models, we instead present the variational bound of the negative log-likelihood in Eq. 2, which consists of the reconstruction error term and the KL divergence term. The KL divergence term measures how much each model utilizes the latent variables.
We observe that the HRED attains the lowest NLL. Variational models show higher NLLs, because they are regularized methods that are forced to rely more on latent variables. Independently of NLL values, we later show that the latent variable models often show better generalization performance in terms of embedding-based metrics and human evaluation. In the VHRED, the KL divergence term gradually vanishes even with the word drop regularization; thus, early stopping is necessary to obtain a meaningful KL divergence.
The VHRED with the bag-of-words loss (bow) achieves the highest KL divergence, but at the cost of high NLL values. That is, the variational lower bound minimizes the minimum description length, and the bow loss works in the opposite direction by forcing latent variables to encode a bag-of-words representation of utterances.
Our VHCR achieves a stable KL divergence without any auxiliary objective, and its NLL is lower than that of the VHRED + bow model. Table 2 summarizes how the global and local latent variables are used in the VHCR. We observe that the VHCR encodes a significant amount of information in the global variable z^conv as well as in the local variables z^utt_t, indicating that the VHCR successfully exploits its hierarchical latent structure.

Results of Embedding-Based Metrics
The embedding-based metrics (Serban et al., 2017; Rus and Lintean, 2012) measure the textual similarity between the words in the model response and the ground truth. We represent words using Word2Vec embeddings trained on the Google News Corpus. The average metric projects each utterance to a vector by taking the mean over the word embeddings in the utterance, and computes the cosine similarity between the model response vector and the ground-truth vector. The extrema metric is similar to the average metric, except that it takes the extremum of each dimension instead of the mean. The greedy metric first finds the best non-exclusive word alignment between the model response and the ground truth, and then computes the mean over the cosine similarities between the aligned words.
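For concreteness, the three metrics can be sketched as follows. This is our own NumPy sketch operating on per-word embedding matrices (rows are words, columns are embedding dimensions); the actual evaluation uses pretrained Word2Vec vectors.

```python
import numpy as np

def _cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_metric(resp, gt):
    # Mean word embedding per utterance, then cosine similarity
    return _cosine(resp.mean(axis=0), gt.mean(axis=0))

def extrema_metric(resp, gt):
    # Keep, per dimension, the value with the largest magnitude
    def extrema(vecs):
        idx = np.argmax(np.abs(vecs), axis=0)
        return vecs[idx, np.arange(vecs.shape[1])]
    return _cosine(extrema(resp), extrema(gt))

def greedy_metric(resp, gt):
    # Each response word greedily aligns to its most similar GT word
    sims = np.array([[_cosine(r, g) for g in gt] for r in resp])
    return float(sims.max(axis=1).mean())
```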
Table 3 compares the different methods on the three embedding-based metrics. Each model generates a single response (1-turn) or three consecutive responses (3-turn) for a given context. For the 3-turn cases, we report the average of the metrics measured over the three turns. We use greedy decoding for all models. Our VHCR achieves the best results in most metrics. The HRED is the worst on the Cornell Movie dataset, but outperforms the VHRED and VHRED + bow on the Ubuntu Dialog dataset. Although the VHRED + bow shows the highest KL divergence, its performance is similar to that of the VHRED, and worse than that of the VHCR. This suggests that a higher KL divergence does not necessarily lead to better performance; it is more important for the models to balance the modeling power of the decoder and the latent variables. The VHCR uses a more sophisticated hierarchical latent structure, which better reflects the structure of natural language conversations.

Results of Human Evaluation
Table 4 reports human evaluation results via Amazon Mechanical Turk (AMT). The VHCR outperforms the baselines on both datasets, although the performance improvement on Cornell Movie Dialog is less significant than that on Ubuntu. We empirically find that the Cornell Movie dataset is small in size but very diverse and complex in content and style, and the models often fail to generate sensible responses for the context. The performance gap with the HRED is the smallest, suggesting that the VAE models without a hierarchical latent structure have overfitted to the Cornell Movie dataset.

Qualitative Analyses
Comparison of Predicted Responses. Table 5 compares the responses generated by the algorithms. Overall, the VHCR creates more consistent responses within the context of a given conversation. This is presumably due to the global latent variable z^conv, which provides a more direct and effective way to handle the global context of a conversation. The context RNN of the baseline models can handle long-term context to some extent, but not as well as the VHCR.
Interpolation on z^conv. We present an example of one advantage of the hierarchical latent structure of the VHCR, which cannot be achieved by the other existing models. Table 6 shows how the generated responses vary according to interpolation on z^conv. We randomly sample two z^conv from the standard Gaussian prior as references (i.e. the top and the bottom rows of Table 6), and interpolate points between them. We generate 3-turn conversations conditioned on each given z^conv. We see that z^conv controls the overall tone and content of conversations; for example, the tone of the responses is friendly in the first sample, but gradually becomes hostile as z^conv changes.
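The interpolation itself is straightforward; the sketch below is ours, and in the experiment each interpolated point would be fed to the VHCR decoder to generate a 3-turn conversation.

```python
import numpy as np

def interpolate_latents(z_a, z_b, n_points=5):
    # Linear interpolation between two latent vectors, endpoints included
    alphas = np.linspace(0.0, 1.0, n_points)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]
```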
Generation on a Fixed z^conv. We also study how fixing the global conversation latent variable z^conv affects conversation generation. Table 7 shows an example, where we randomly fix a reference z^conv from the prior, and generate multiple examples of 3-turn conversations using randomly sampled local variables z^utt_t. We observe that z^conv heavily affects the form of the first utterance; in the examples, the first utterances all start with a "where" phrase. At the same time, the responses show variations according to the different local variables z^utt_t. These examples show that the hierarchical latent structure of the VHCR allows both global and fine-grained control over generated conversations.

Discussion
We introduced the Variational Hierarchical Conversation RNN (VHCR) for conversation modeling. We noted that the degeneration problem in existing VAE models such as the VHRED is persistent, and proposed a hierarchical latent variable model with a new utterance drop regularization.

Table 4: Results of human evaluation via AMT. Human turkers are asked to choose which response is more appropriate in a given context, without knowing which algorithms generated which responses. For each pair of models, we carry out three evaluation batches, each consisting of 100 random test samples evaluated by five unique humans. We report mean preferences with a ±90% confidence interval.
We apply KL annealing to all variational models, where the KL multiplier λ gradually increases from 0 to 1 over 15,000 steps on Cornell Movie Dialog and over 250,000 steps on Ubuntu Dialog. For both the word drop and the utterance drop, we use a drop probability of 0.25.

C Experimental Results
Tables 8-11 show additional sample generation results.

D Human Evaluation
We perform a human evaluation study on Amazon Mechanical Turk (AMT). We first filter out contexts that contain the generic unknown word (unk) token from the test set. Using these contexts, we generate model response samples. Samples that contain fewer than 4 tokens are removed. The order of the samples and the order of the model responses are randomly shuffled. The evaluation procedure is as follows: given a context and two model responses, a Turker decides which response is more appropriate in the given context. In the case where the Turker thinks that the two responses are about equally good or bad, or does not understand the context, we ask the Turker to choose "tie". We randomly select 100 samples to build a batch for a human intelligence test (HIT). For each pair of models, we perform 3 HITs on AMT and each HIT is evaluated by 5 unique humans. In total, we obtain 9,000 preferences over 90 HITs.

Figure 1: Degeneration of the VHRED. The KL divergence term continuously decreases as training proceeds, meaning that the decoder ignores the latent variable z^utt_t. We train the VHRED on Cornell Movie Dialog Corpus with word drop and KL annealing.

Figure 2: The average ratio E[σ_t^2]/Var(μ_t) when the decoder is conditioned only on z^utt_t. The ratio drops to zero as training proceeds, indicating that the conditional priors p_θ(z^utt_t | x_{<t}) degenerate to separate point masses.

Figure 3: Graphical representation of the Variational Hierarchical Conversation RNN (VHCR). The global latent variable z^conv provides a global context in which the conversation takes place.

Figure 4: Comparison of KL divergences. The VHCR with the utterance drop shows a high and stable KL divergence, indicating active use of latent variables. w.d. and u.d. denote the word drop and the utterance drop, respectively.

Table 1: Results of negative log-likelihood. The inequalities denote the variational bounds. w.d., u.d., and bow denote the word drop, the utterance drop, and the auxiliary bag-of-words loss, respectively.

Table 3: Results of embedding-based metrics on 1-turn and 3-turn responses of the models per context.

Table 5: Qualitative comparison of generated responses. The top two rows show samples from Cornell Movie Dialog, while the bottom two rows are from Ubuntu Dialog.

Table 6: An example of interpolated 3-turn responses over z^conv on Cornell Movie Dialog.

Table 7: An example of 3-turn responses conditioned on sampled z^utt_t for a single fixed z^conv.
where is she? → she's the only one who knows where she is, she's going to be all right. → oh, you're the only one who's gon na be. she's a <unk>.
where's my wife? → you've got to get out of here, you know? you're the one who's gon na be here.

Table 8: An example of interpolated 3-turn responses over z^conv on Cornell Movie Dialog.

Table 9: An example of 3-turn responses conditioned on sampled z^utt_t for a single fixed z^conv.
→ you're a good man, are n't you?

Table 10: Comparison of generated responses on Cornell Movie Dialog.

Table 11: Comparison of generated responses on Ubuntu Dialog.
i download the <unk>, and want to install it with sudo dpkg i google earth stable current i386.deb, it tells me google earth → the error suggests running sudo apt get f install → i tried that, it comes the same error