A Semi-Supervised Stable Variational Network for Promoting Replier-Consistency in Dialogue Generation

Neural sequence-to-sequence models for dialog systems tend to favor uninformative and non-replier-specific responses because they lack guidance from global and relevant information. Existing methods model the generation process with a neural variational network based on a simple Gaussian. However, the information sampled from the latent space usually becomes useless due to the KL divergence vanishing issue, and the highly abstract global variables easily dilute the personal features of the replier, leading to non-replier-specific responses. Therefore, a novel Semi-Supervised Stable Variational Network (SSVN) is proposed to address these issues. We use a unit hyperspherical distribution, namely the von Mises-Fisher (vMF) distribution, as the latent space of a semi-supervised model, which obtains stable KL performance by setting a fixed variance and hence enhances the global information representation. Meanwhile, an unsupervised extractor is introduced to automatically distill the replier-tailored features, which are then injected into a supervised generator to encourage replier-consistency. Experimental results on two large conversation datasets show that our model significantly outperforms competitive baseline models and can generate diverse and replier-specific responses.


Introduction
Dialog systems, which aim to generate relevant and fluent responses in a replier-consistent way, have received increasing attention due to their numerous applications (Grosz, 2016; Chen et al., 2017a). Recently, Seq2Seq neural networks (Sutskever et al., 2014) have demonstrated excellent results on open-domain conversation (Shang et al., 2015; Sordoni et al., 2015; Vinyals and V. Le, 2015; Yao et al., 2015). However, lacking guidance from global and relevant information, they inherently tend to generate trivial and uninformative responses (e.g., "I don't know") rather than meaningful and replier-specific ones (Li et al., 2016). Existing neural variational methods with a Gaussian latent space use a latent variable as global information in the decoder to strengthen the generation (Serban et al., 2017; Zhao et al., 2017; Chen et al., 2018). However, they face the problems of (1) latent space futility and (2) replier-consistency decay.
(1) During training, the model tends to gain more from lowering the Kullback-Leibler (KL) divergence, which pushes the approximate posterior toward the Gaussian prior and leaves the latent space of the former unused. Thus, the latent variables in this space become worthless as global guidance for the decoder. To address this issue, most previous work (Xie et al., 2017; Yang et al., 2017; Chen et al., 2017) has suggested a weaker decoder to match the Gaussian samples, which essentially sacrifices generative capacity.
(2) The speakers in a dyadic conversation have different linguistic characteristics, sentiments and personalities. However, the latent variable is learned conditioned on the holistic context without any distinction between speakers, in particular without singling out the replier. This dilutes the personal features of the replier and decreases replier-consistency. Current methods (Li et al., 2016; Zhang et al., 2018) normally resort to manually provided personal information to promote replier-consistency, but they cannot be migrated to other datasets.
Inspired by the effectiveness of the vMF distribution in solving KL-vanishing in the unsupervised setting (Xu and Durrett, 2018) and the success of the Variational Auto-Encoder (VAE) in capturing latent features of real data (Davidson et al., 2018), we propose a Semi-Supervised Stable Variational Network (SSVN) framework to address the above issues. It consists of an unsupervised personal feature extractor (a VAE with vMF) and a supervised information-enhanced generator (a CVAE with vMF). To maintain the consistency of replier features, the extractor encodes only the previous utterances from the replier and produces a personally tailored latent variable. On top of this, the generator fuses the replier-tailored latent variable with its own vMF-distributed global information to facilitate diverse and replier-specific responses.
In general, our contributions are as follows:
• A semi-supervised stable variational network is proposed to solve the latent space futility issue and promote replier-consistency.
• To the best of our knowledge, our model is the first to use the vMF distribution in a semi-supervised framework for dialogue generation, which can enhance the global information by alleviating the KL divergence vanishing problem.
• An unsupervised personal feature extractor is designed to acquire the replier-specific features automatically.
• The experimental results on two large conversation datasets validate the effectiveness of our model.
• The different roles of vMF in the extractor and the generator are analyzed. Surprisingly, we find that the extractor can alleviate KL-vanishing to some extent.
Related Work

Neural Variational Network
The variational autoencoder (VAE) is one of the most popular generative models. The principal idea is to encode the data x into a probability distribution over a latent variable z, then sample latent variables from this distribution and inject them into a directed decoder network to reconstruct x. The model parameters are optimized by maximizing a reparameterized variational lower bound. Based on this process, the conditional VAE (CVAE) (Sohn et al., 2015) can be conditioned on certain attributes to improve diversity. In the dialog generation task, Serban et al. (2017) employ the CVAE to acquire a global latent variable as a holistic representation in a hierarchical setting. Zhao et al. (2017) regard the latent variable as global dialog act information and feed it directly to the decoder to control the dialog act of a response. To maintain long-term memory of the previous utterances, Chen et al. (2018) utilize a higher-level abstract variable to retrieve and update memory cells.

Latent Space Futility
As for the latent space futility issue, also called KL-vanishing in Shen et al. (2018), most previous work has suggested a weaker decoder to encourage the simple Gaussian samples to be used, such as a word drop-out technique in the decoder (Xie et al., 2017; Zhao et al., 2017) or replacing the RNN decoder with a CNN counterpart (Yang et al., 2017; Chen et al., 2017; Semeniuta et al., 2017). These methods run contrary to our original intention because they reduce generation capacity. Other efforts focus on changing the prior and posterior: Rezende and Mohamed (2015) and Kingma et al. (2016) utilize a normalizing flow to transform the sampled variables; Shen et al. (2018) introduce an AE module to make the original distribution more complex. Extending the latter direction without increasing model complexity, we simply replace the Gaussian with the vMF distribution to strengthen the KL term. Unlike the single vMF-based VAEs implemented in other work (Davidson et al., 2018; Guu et al.; Xu and Durrett, 2018), we apply vMF in a semi-supervised dialog model to generate diverse and replier-specific responses.

Replier-Consistency Decay
To emphasize replier-consistency, Li et al. (2016) capture the persona characteristics of Twitter users by encapsulating background information and speaking style into distributed embeddings (one per user), which are used to improve consistency for the same person. Zhang et al. (2018) present a persona-provided dialogue dataset and train dialog models conditioned on the given configurable but persistent profile information. However, the above work relies heavily on manually provided persona information and is difficult to migrate to other common conversation corpora. In contrast, our work focuses on automatically extracting the individual features of the replier from the original conversation text to enhance the replier-consistency of responses, without any corpus restriction.

Task Description
Given a series of dialogue context utterances $(x_1, x_2, \ldots, x_n)$, where $x_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N_i})$, our task is to generate a response $y = (w_{y,1}, w_{y,2}, \ldots, w_{y,N_y})$ that not only relies on the global information but also reflects the personal features of the replier. In this paper, we employ the vMF distribution to stimulate the potential of the latent space, impelling the extractor to condense feature-augmented individual information and the generator to generalize useful global guidance. The overview of SSVN is illustrated in Figure 1.

von Mises-Fisher
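The von Mises-Fisher (vMF) distribution is defined over points on the unit hypersphere and is parameterized by a direction vector $\mu \in \mathbb{R}^d$ indicating the mean direction and a concentration parameter $\kappa \in \mathbb{R}_{\geq 0}$. The PDF of the vMF distribution for a unit vector $z \in \mathbb{R}^d$ is defined as:
$$f_d(z; \mu, \kappa) = C_d(\kappa) \exp(\kappa \mu^{\top} z), \qquad C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)},$$
where $\|\mu\| = 1$, $C_d$ is the normalization constant, and $I_\rho$ stands for the modified Bessel function of the first kind at order $\rho$.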
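As a minimal sketch of how this density can be evaluated numerically, the following Python snippet computes the log-density using only the standard library. The helper names `log_bessel_i` and `vmf_log_pdf` are illustrative and do not come from the paper. The Bessel function is approximated with a truncated power series summed in log space, which is an assumption that is adequate for moderate κ; a library routine such as SciPy's `scipy.special.ive` would likely be preferred in practice.

```python
import math


def log_bessel_i(nu, x, terms=500):
    """Approximate log I_nu(x) for x > 0 with a truncated power series.

    I_nu(x) = sum_m (x/2)^(2m+nu) / (m! * Gamma(m+nu+1)), summed in log space.
    """
    logs = [(2 * m + nu) * math.log(x / 2.0)
            - math.lgamma(m + 1) - math.lgamma(m + nu + 1)
            for m in range(terms)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def vmf_log_pdf(z, mu, kappa):
    """log f_d(z; mu, kappa) for unit vectors z and mu, with kappa > 0."""
    d = len(mu)
    # log C_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2 pi) - log I_{d/2-1}(kappa)
    log_c = ((d / 2.0 - 1) * math.log(kappa)
             - (d / 2.0) * math.log(2 * math.pi)
             - log_bessel_i(d / 2.0 - 1, kappa))
    return log_c + kappa * sum(a * b for a, b in zip(mu, z))
```

For d = 2 the density reduces to the von Mises distribution on the circle, which gives a simple way to check that it integrates to one.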

Personal Feature Extractor
To enhance replier-consistency, the personal feature extractor, implemented as a VAE with vMF, encodes the context utterances from the replier, $x^r$ (Figure 1), into a random latent variable $z^r$, based on which the decoder reconstructs $x^r$. Because the integral over the latent variable $z^r$ is high-dimensional and intractable, we set a recognition network $q_{\phi_e}(z^r|x^r)$ as a variational approximation to the true posterior $p(z^r|x^r)$, then apply variational inference to optimize the evidence lower bound (ELBO):
$$\mathcal{L}(\theta_e, \phi_e; x^r) = \mathbb{E}_{q_{\phi_e}(z^r|x^r)}\left[\log p_{\theta_e}(x^r|z^r)\right] - \mathrm{KL}\left(q_{\phi_e}(z^r|x^r) \,\|\, p_{\theta_e}(z^r)\right)$$
Utterance & Local Context Encoder Concretely, we employ a hierarchical encoder to encode $x^r$: the utterance encoder, based on a bidirectional RNN (Schuster and Paliwal, 1997), deterministically reads each utterance $x^r_i$ and outputs a fixed-size vector, which the local context encoder takes as input to obtain the final hidden state $h^r_l$ as the summary of $x^r$.
Prior/Posterior Distribution Since we assume the latent space follows a vMF distribution, the prior is $p_{\theta_e}(z^r) \sim \mathrm{vMF}(\cdot, \kappa^e_{prior} = 0)$ and the variational posterior is $q_{\phi_e}(z^r|x^r) \sim \mathrm{vMF}(\mu^e_{pos}, \kappa^e_{pos})$, where $\mu^e_{pos}$ is the output of the recognition network and $\kappa^e_{pos}$ is set to a constant:
$$\mu^e_{pos} = \frac{f^e_{pos}(h^r_l)}{\|f^e_{pos}(h^r_l)\|}$$
where $f^e_{pos}(\cdot)$ is a linear transformation, and $\|\cdot\|$ stands for the 2-norm to ensure the normalization.
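With the uniform distribution as our prior, the KL divergence can be computed as:
$$\mathrm{KL}\left(\mathrm{vMF}(\mu^e_{pos}, \kappa^e_{pos}) \,\|\, \mathrm{vMF}(\cdot, 0)\right) = \kappa^e_{pos} \frac{I_{d/2}(\kappa^e_{pos})}{I_{d/2-1}(\kappa^e_{pos})} + \left(\frac{d}{2} - 1\right) \log \kappa^e_{pos} - \frac{d}{2} \log(2\pi) - \log I_{d/2-1}(\kappa^e_{pos}) + \frac{d}{2} \log \pi + \log 2 - \log \Gamma\left(\frac{d}{2}\right) \tag{6}$$
Since Eq. 6 depends only on the fixed constant $\kappa^e_{pos}$, not on $\mu^e_{pos}$, this term resolves the latent space futility problem by preventing the KL from collapsing to zero.
Replier Decoder During reconstruction, the decoder receives the concatenation of the replier's context $h^r_l$ and the personal latent variable $z^r$ as its initial hidden state, then generates tokens sequentially under the following probability distribution:
$$p_{\theta_e}(x^r \mid z^r, h^r_l) = \prod_{i=1}^{l} \prod_{j=1}^{N_i} p_{\theta_e}(w_{i,j} \mid w_{<(i,j)}, z^r, h^r_l)$$
where $l$ is the number of turns of the replier's context $x^r$; $N_i$ is the length of the $i$-th utterance ($x^r_i$) in $x^r$; $w_{i,j}$ is the $j$-th token in $x^r_i$; and $w_{<(i,j)}$ denotes the tokens preceding it.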
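The claim that the extractor's KL term depends only on the concentration and the dimension can be checked numerically. The sketch below evaluates the closed-form KL between vMF(µ, κ) and the uniform distribution on the hypersphere, following the standard derivation from the density above. The function names are illustrative, and the Bessel function is again a truncated-series approximation (an assumption, not the paper's implementation). Note that µ does not appear in the signature at all.

```python
import math


def log_bessel_i(nu, x, terms=500):
    """Approximate log I_nu(x) for x > 0 with a truncated power series."""
    logs = [(2 * m + nu) * math.log(x / 2.0)
            - math.lgamma(m + 1) - math.lgamma(m + nu + 1)
            for m in range(terms)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def kl_vmf_uniform(kappa, d):
    """KL( vMF(mu, kappa) || uniform on the unit sphere in R^d ), kappa > 0."""
    nu = d / 2.0 - 1
    log_i_nu = log_bessel_i(nu, kappa)
    # Mean resultant length A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa)
    mean_len = math.exp(log_bessel_i(nu + 1, kappa) - log_i_nu)
    return (kappa * mean_len
            + nu * math.log(kappa)
            - (d / 2.0) * math.log(2 * math.pi)
            - log_i_nu
            + (d / 2.0) * math.log(math.pi)
            + math.log(2.0)
            - math.lgamma(d / 2.0))


if __name__ == "__main__":
    # With d = 50 (the latent size reported later), a larger kappa gives a larger, fixed KL.
    for kappa in (10.0, 50.0, 100.0):
        print(kappa, kl_vmf_uniform(kappa, 50))
```

Because the value is fixed once κ and d are chosen, it acts as a constant offset in the extractor's ELBO rather than a quantity the optimizer can drive to zero.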

Information-Enhanced Generator
Similar to the extractor, the information-enhanced generator, based on a CVAE, also employs a recognition network $q_{\phi_g}(z|x, y)$ to approximate the true posterior $p(z|x, y)$. Correspondingly, its ELBO can be calculated as:
$$\mathcal{L}(\theta_g, \phi_g; x, y) = \mathbb{E}_{q_{\phi_g}(z|x,y)}\left[\log p_{\theta_g}(y|x, z)\right] - \mathrm{KL}\left(q_{\phi_g}(z|x, y) \,\|\, p_{\theta_g}(z|x)\right)$$
When an external personal feature $z^r$ is considered, the ELBO of the generator is rewritten as:
$$\mathcal{L}(\theta_g, \phi_g; x, y, z^r) = \mathbb{E}_{q_{\phi_g}(z|x,y,z^r)}\left[\log p_{\theta_g}(y|x, z, z^r)\right] - \mathrm{KL}\left(q_{\phi_g}(z|x, y, z^r) \,\|\, p_{\theta_g}(z|x, z^r)\right)$$
Notice that $z^r$ only participates in the generation process $p_{\theta_g}(y|x, z, z^r)$: the approximate posterior $q_{\phi_g}(z|x, y, z^r) \sim \mathrm{vMF}(\mu^g_{pos}, \kappa^g_{pos})$ is conditioned on the dialog context $x$ and the corresponding response $y$, and the prior $p_{\theta_g}(z|x, z^r) \sim \mathrm{vMF}(\mu^g_{prior}, \kappa^g_{prior})$ depends on $x$.
Utterance & Global Context Encoder The hierarchical encoder in this part uses the utterance encoder shared with the extractor to encode the utterances $x_1, x_2, \ldots, x_n, y$ into the corresponding representations $h^u_1, h^u_2, \ldots, h^u_n, h^u_y$ in order. Thereafter, the utterance vectors $h^u_1, h^u_2, \ldots, h^u_n$ are fed to the global context encoder to compute the representation of the whole dialog context, $h^c_n$. Based on these, the approximate posterior and prior are determined by the following operations:
$$\mu^g_{pos} = \frac{f^g_{pos}(h^c_n, h^u_y)}{\|f^g_{pos}(h^c_n, h^u_y)\|}, \qquad \mu^g_{prior} = \frac{f^g_{prior}(h^c_n)}{\|f^g_{prior}(h^c_n)\|}$$
where $f^g_{pos}(\cdot)$ and $f^g_{prior}(\cdot)$ are both linear transformations. $\kappa^g_{pos}$ and $\kappa^g_{prior}$ are constants with equal values.
Prior/Posterior Distribution Since the prior is no longer $\mathrm{vMF}(\cdot, 0)$, the KL term in the generator must be recalculated between the vMF posterior and the vMF prior defined above.
Response Decoder We employ an RNN decoder similar to the one in the extractor, extending it to condition on the personal feature $z^r$ by concatenating $z^r$ to the decoder input at each time step. In the generative process, $\sigma$ is the sigmoid function; $e_{w_{y,i}}$ is the word embedding of the $i$-th word in response $y$; $s^R_t$ denotes the hidden state at time step $t$; $V$ and $b$ are learnable parameters; and $p_{vocab}$ stands for the probability distribution over the vocabulary.
Then the objective function of the decoder is given by:
$$\log p_{\theta_g}(y|x, z, z^r) = \sum_{i=1}^{N_y} \log p_{vocab}(w_{y,i})$$
where $p_{vocab}(w_{y,i})$ is the probability of generating the word $w_{y,i}$ and $N_y$ is the length of the response $y$.
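For reference, the KL divergence between two vMF distributions of the same dimension can be worked out directly from the vMF density. The derivation below is a sketch under that definition, writing $q = \mathrm{vMF}(\mu_1, \kappa_1)$ for the posterior and $p = \mathrm{vMF}(\mu_2, \kappa_2)$ for the prior and using the standard fact that $\mathbb{E}_q[z] = A_d(\kappa_1)\mu_1$. It is not necessarily arranged in the same way as the generator's KL term in the paper.

```latex
\begin{aligned}
\mathrm{KL}(q \,\|\, p)
 &= \mathbb{E}_{q}\!\left[\log C_d(\kappa_1) + \kappa_1 \mu_1^{\top} z
      - \log C_d(\kappa_2) - \kappa_2 \mu_2^{\top} z\right] \\
 &= \log \frac{C_d(\kappa_1)}{C_d(\kappa_2)}
      + A_d(\kappa_1)\left(\kappa_1 - \kappa_2\, \mu_2^{\top}\mu_1\right),
      \qquad A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)} \\
 &= \kappa\, A_d(\kappa)\left(1 - \mu_2^{\top}\mu_1\right)
      \quad \text{when } \kappa_1 = \kappa_2 = \kappa .
\end{aligned}
```

In the equal-concentration case the normalizer terms cancel, so the value is determined by κ and by the angle between the two mean directions.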

Training Objective
The entire SSVN model integrates the two modules in Figure 1, i.e., the unsupervised extractor and the supervised generator, which can be optimized simultaneously in one framework. Thus, the overall objective of SSVN is to maximize a combination of the generator and extractor objectives, in which a hyperparameter λ controls the balance between response generation (generator) and personality reconstruction (extractor).
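The exact form of the combined objective is not reproduced in this text. One form that would be consistent with the later discussion of λ (a small λ weakens the generator, a large λ weakens the reconstruction) is a convex combination of the two ELBOs. This is an assumption made here for illustration only, with $\mathcal{L}_g$ and $\mathcal{L}_e$ denoting the generator and extractor lower bounds:

```latex
\mathcal{L}_{SSVN} = \lambda\, \mathcal{L}_{g}(\theta_g, \phi_g; x, y, z^r)
                   + (1 - \lambda)\, \mathcal{L}_{e}(\theta_e, \phi_e; x^r),
\qquad \lambda \in [0, 1]
```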

Sampling Technique for vMF
Similar to Xu and Durrett (2018), we utilize the rejection sampling scheme of Wood (1994) to sample a value $w \in [-1, 1]$, then draw a random unit vector $\upsilon$ tangent to the hypersphere at the mean vector $\mu$. Based on these, our latent variable $z$ is given by $z = w\mu + \upsilon\sqrt{1 - w^2}$.
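The sampling procedure described above can be sketched in plain Python as follows. This is an illustrative implementation of Wood's (1994) rejection scheme, not the authors' code. The function name `sample_vmf` and the choice to pass an explicit `random.Random` instance are assumptions, and the sketch ignores the reparameterization details needed for gradient-based training.

```python
import math
import random


def sample_vmf(mu, kappa, rng):
    """Draw one sample from vMF(mu, kappa); mu is a unit vector of length d >= 2, kappa > 0."""
    d = len(mu)
    m = d - 1
    # Wood (1994): rejection-sample w, the component of z along mu.
    b = (-2.0 * kappa + math.sqrt(4.0 * kappa ** 2 + m ** 2)) / m
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + m * math.log(1.0 - x0 ** 2)
    while True:
        beta = rng.betavariate(m / 2.0, m / 2.0)
        w = (1.0 - (1.0 + b) * beta) / (1.0 - (1.0 - b) * beta)
        u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
        if kappa * w + m * math.log(1.0 - x0 * w) - c >= math.log(u):
            break
    # Uniform unit tangent vector: Gaussian noise with its mu-component removed.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    proj = sum(gi * mi for gi, mi in zip(g, mu))
    v = [gi - proj * mi for gi, mi in zip(g, mu)]
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    v = [vi / norm_v for vi in v]
    # z = w * mu + sqrt(1 - w^2) * v lies on the unit hypersphere.
    scale = math.sqrt(max(0.0, 1.0 - w * w))
    return [w * mi + scale * vi for mi, vi in zip(mu, v)]


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [1.0] + [0.0] * 9
    z = sample_vmf(mu, 50.0, rng)
    print(sum(c * c for c in z))  # squared norm, close to 1
```

Only the scalar w depends on κ, while the tangent direction is uniform; this is why a fixed κ yields a fixed spread around µ regardless of where µ points.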

Datasets
The proposed model is evaluated on two datasets. The first is the Cornell Movie Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011), which contains more than 80,000 fictional movie conversations. To normalize the length (in turns) of the dialogs, we divide the original conversations into segments of 3-10 consecutive utterances. The second is the Ubuntu Dialogue Corpus (Lowe et al., 2015), for which we use the same train-validation-test split as in Chen et al. (2018). It contains about 500,000 multi-turn dialogues collected from the Ubuntu Internet Relay Chat channel, each of which starts with an Ubuntu-related technical problem and is followed by responses about solutions.
In both datasets, the last utterance in a conversation is regarded as the response and the remaining ones as the input context. The detailed statistics are shown in Table 1.
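The context/response split described above, together with the selection of the replier's own earlier utterances for the extractor, might look like the sketch below. The function name `split_dialog` is hypothetical, and the strict alternation of two speakers is an assumption made for illustration; the text does not say how speaker turns are identified in the corpora.

```python
def split_dialog(utterances):
    """Split one dialog into (context, response, replier_context).

    utterances: list of utterance strings in chronological order.
    """
    if len(utterances) < 2:
        raise ValueError("a dialog needs at least one context utterance and a response")
    context, response = utterances[:-1], utterances[-1]
    # Assumption: two speakers strictly alternate, so the replier (who speaks the
    # response) also produced every second utterance counting back from it.
    replier_context = context[-2::-2][::-1]
    return context, response, replier_context


if __name__ == "__main__":
    dialog = ["a1", "b1", "a2", "b2", "a3"]
    print(split_dialog(dialog))
```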

Baselines
We compare SSVN with the following models:
S2SA: the standard Seq2Seq model with the attention mechanism (Vinyals and V. Le, 2015).
HRED: a hierarchical encoder framework to model multi-turn dialogs.
VHRED: HRED augmented with a global latent variable (Serban et al., 2017).
HVMN: an encoder-decoder network containing the hierarchical structure and the variational memory (Chen et al., 2018).

Experimental Details
Our model is implemented using the TensorFlow framework (Abadi et al., 2016) with the following parameter settings. We set the word embedding size to 200 and initialize the embeddings randomly. The shared utterance encoder is a 2-layer bidirectional GRU with 600 hidden units per layer, while both context encoders and both decoders are unidirectional GRUs with a hidden size of 600. The dimensions of the latent variables $z$ and $z^r$ are both set to 50. We use the Adam algorithm (Kingma and Ba, 2014) to update the parameters with an initial learning rate of 0.001. During training, we employ the early-stopping strategy (Caruana et al., 2000) to select the best models using the variational lower bound on the validation set.
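For convenience, the settings above can be collected in a single configuration object. The dictionary below is only a summary sketch: the key names are hypothetical and do not correspond to any released code, while the values are those stated in the paragraph above.

```python
# Hypothetical key names; values are the ones reported in the text.
SSVN_CONFIG = {
    "embedding_size": 200,
    "utterance_encoder": {"cell": "GRU", "layers": 2, "bidirectional": True, "hidden_size": 600},
    "context_encoder": {"cell": "GRU", "bidirectional": False, "hidden_size": 600},
    "decoder": {"cell": "GRU", "bidirectional": False, "hidden_size": 600},
    "latent_size_z": 50,
    "latent_size_z_r": 50,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "model_selection": "early stopping on the validation lower bound",
}
```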

Evaluation Metrics
We use both automatic and human evaluations to analyze the model's performance.
Automatic Evaluation Metrics In our experiment, three embedding-based metrics (Average, Greedy, Extreme) (Liu et al., 2016) are employed to measure the semantic relevance between generated responses and ground truths. In addition, we adopt Distinct-1 and Distinct-2 (Li et al., 2016) to evaluate the diversity of responses.
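As an illustration of the diversity metrics, the sketch below counts distinct n-grams over a set of tokenized responses and returns both the number of distinct n-grams and their ratio to the total number of n-grams. The function name `distinct_n` and the choice to pool n-grams over all responses are assumptions; the exact tokenization and aggregation used in the experiments are not specified here.

```python
def distinct_n(responses, n):
    """Return (number of distinct n-grams, distinct ratio) over tokenized responses."""
    seen = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    ratio = len(seen) / total if total else 0.0
    return len(seen), ratio


if __name__ == "__main__":
    responses = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
    print(distinct_n(responses, 1))  # Distinct-1 count and ratio
    print(distinct_n(responses, 2))  # Distinct-2 count and ratio
```

Reporting the count alongside the ratio matters because short responses can achieve a high ratio with few distinct n-grams.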
Human Evaluation To assess how well the models maintain the replier's consistency, we conduct a human evaluation. Specifically, we randomly sample 300 contexts from the test set and apply 5 models to generate responses for each context. For each response, three annotators are recruited to give a 4-graded judgement with the following criteria:
1: the response is ungrammatical or semantically irrelevant; or inconsistent with the replier's features (e.g., linguistic characteristics, sentiments and personalities); or has wrong logic;
2: the response is only weakly related semantically and is too trivial (e.g., "I don't know");
3: the response is semantically relevant and informative, but has no obvious consistency with the replier's personal features;
4: the response is not only semantically related and informative, but also consistent with the individual features of the replier.

Evaluation Results
Automatic Evaluation The metric-based evaluation results are shown in Table 2. From the results, we can observe that:
(1) HRED performs better than S2SA, indicating that the hierarchical structure is beneficial.
(2) VHRED outperforms HRED on all metrics on Cornell, which demonstrates that the latent variables provide useful global guidance. Conversely, VHRED performs worse than HRED on the three embedding-based metrics on Ubuntu, which is consistent with Chen et al. (2018) and can be attributed to the domain-specific nature of the dataset.
(3) On top of VHRED, HVMN introduces a memory network to enhance long-term memory and obtains the best performance among the baseline models.
(4) Compared with all the baselines, our SSVN model achieves the highest scores on all metrics on both datasets, indicating that SSVN best fits the ground truth semantically and generates more informative responses. Meanwhile, sign tests show that the improvements of SSVN are statistically significant (p-value<0.01).
(5) Noticeably, the models trained on Ubuntu consistently have more distinct n-grams than the same models trained on Cornell, while the distinct ratios do not differ much. The reason is that the Ubuntu dataset has more words per utterance on average than the Cornell data (see the statistics in Table 1), which forces the models to produce longer responses.
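The significance claim above relies on sign tests. A minimal sketch of a two-sided exact sign test on paired per-example scores is given below. The function name `sign_test_p_value` is illustrative, and the pairing of scores per test example and the dropping of ties are assumptions, since the text does not describe how the test was set up.

```python
import math


def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test on paired scores; tied pairs are dropped."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this extreme under a fair coin, one tail.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # System A beats system B on all 10 paired examples.
    print(sign_test_p_value([1.0] * 10, [0.0] * 10))
```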

Human Evaluation
The human evaluation results on the Cornell data are shown in Table 4, in which the score distribution values represent the percentages of responses belonging to each grade, and Fleiss' kappa (Fleiss and Cohen, 1973) is employed to evaluate the inter-annotator agreement. From the results, we have the following observations:
(1) The percentage of replier-specific responses (i.e., grade '4') for the SSVN model is 22.69%, which is much higher than that of the baselines, indicating that the personal feature extractor can effectively capture the personal features of the replier.
(2) The SSVN model generates many more informative responses (i.e., 71.56% labeled as '3+4') and far fewer generic responses (i.e., 20.28% labeled as '2') than all the baselines. The results are in line with the metric-based evaluation above.
(3) The kappa scores of the models are all higher than 0.4, demonstrating that the annotators reach a fair agreement. Meanwhile, sign tests also show that the human evaluation improvements of SSVN over the baselines are significant on the Cornell dataset (p-value<0.01).
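For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' kappa for items rated by a fixed number of annotators on the four grades used here. The function name and the input layout (one list of grades per response) are illustrative assumptions.

```python
def fleiss_kappa(ratings, categories=(1, 2, 3, 4)):
    """Fleiss' kappa for ratings[i] = list of grades given to item i.

    Every item must be rated by the same number (>= 2) of annotators.
    Undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Per-item agreement: fraction of annotator pairs that agree.
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(row[j] for row in counts) / (n_items * n_raters)
              for j in range(len(categories))]
    p_exp = sum(s * s for s in shares)
    return (p_bar - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Three annotators, four responses, perfect agreement.
    print(fleiss_kappa([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]))
```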

Table 3: Ablation results on Cornell (columns: Model, Extractor, Generator, Average, Greedy, Extreme, Distinct).

Discussions
Model Ablation To investigate the effect of different parts, we conduct a set of experiments on Cornell by removing the extractor or modifying the distributions of the extractor and generator. From the results listed in Table 3, we can observe that:
(1) Removing the extractor (denoted as SVN) makes the distinct ratios and numbers drop dramatically, while the embedding-based metric scores are only slightly lower than those of SSVN. This indicates that the personal features learned by the extractor not only maintain replier-consistency but also improve the diversity of responses. In addition, SSVN$_{Gau-E}$ (replacing the vMF distribution with a Gaussian in the extractor) performs better than SVN but worse than SSVN, demonstrating that the vMF-Extractor is more effective than the Gau-Extractor.
(2) As for the generator, when the Gaussian is used as the latent space (denoted as SSVN$_{Gau-G}$), the embedding-based performance deteriorates dramatically whereas the distinct numbers decrease only slightly, indicating that the vMF-Generator is better than the Gau-Generator at producing responses semantically close to the ground truth. Notably, the distinct ratios of SSVN$_{Gau-G}$ rise unexpectedly, which is investigated in (3).
(3) To understand this phenomenon, we conduct an experiment on SSVN$_{Gau}$, composed of a Gau-Extractor and a Gau-Generator. We can see that SSVN$_{Gau-G}$ and SSVN$_{Gau}$ obtain the best distinct ratios among the ablation models, but their distinct numbers are not the highest. The results indicate that, whatever distribution the latent space of the extractor follows, the Gau-Generator always tends to produce informative but very short responses. Meanwhile, their worst embedding-based scores show that the responses generated by the Gau-Generator deviate significantly from the ground truth semantically.
The Effect of vMF on KL Besides the metric-based performance, we also evaluate the effectiveness of different settings in solving the latent space futility problem. Figure 2 visualizes the evolution of the KL loss in both the extractor and the generator. We can see that:
(1) In Extractor KL, the Gau-Extractors (i.e., SSVN$_{Gau}$ and SSVN$_{Gau-E}$) have a KL cost close to 0 at the beginning and never recover, while the vMF-Extractors (i.e., SSVN$_{Gau-G}$ and SSVN) keep a constant KL value, as expected from Eq. (6). The results indicate that the vMF in the extractor can mitigate KL-vanishing and capture more meaningful personal features.
(2) Surprisingly, in Generator KL, the KL loss shows an upward trend in the Gau-Generators (i.e., SSVN$_{Gau}$ and SSVN$_{Gau-G}$). The reason is that the personal features from the extractor effectively strengthen the expressiveness of the latent space in the generator; thus the response decoder is encouraged to exploit the latent variables and the latent space futility problem is alleviated.
(3) Compared with the Gau-Generators in Generator KL, the vMF-Generators (i.e., SVN, SSVN$_{Gau-E}$ and SSVN) have much higher KL values, indicating that vMF is a better choice than Gaussian for solving the KL-vanishing problem. Meanwhile, the KL values are relatively stable, which experimentally demonstrates that the KL cost mainly depends on the last term in Eq. (14) and that the variable term has little effect on it. Last but not least, the KL cost can be changed explicitly by setting different kappa values.
Impact of the Coefficient λ Recall that Eq. (18) gives SSVN the capacity to balance response generation and personality reconstruction. Here we analyze the effects of different values of the coefficient λ on the quality of responses. Figure 3 shows the performance for varying λ. Notably, the embedding-based metrics follow a similar trend, as do Distinct-1 and Distinct-2, so we only consider Average and Distinct-1 as the major analysis items. As we can see, the evolution of Average and Distinct-1 in Figure 3 can be broadly divided into three stages: the generation adynamic stage, the mutual promotion stage and the reconstruction rout stage.
(1) The first stage shows that as λ increases, Average increases monotonically while diversity decreases. This is because a lower λ gives the model less incentive to optimize the generator, which makes the response decoder incapable of utilizing the higher-quality personal features, resulting in diverse but semantically inappropriate responses.
(2) When λ moves into the second stage, Average and diversity improve simultaneously, implying that response generation and personality reconstruction achieve the expected balance.
(3) In the reconstruction rout stage, although the model focuses on response generation, a larger λ does not improve Average but instead increases diversity. The result indicates that the unhelpful personal features produced by the thwarted personality reconstruction part act as a random disturbance: they increase the diversity of the responses but severely bias the response decoder semantically.
Observed from the curves of all metrics, the best performance on the embedding-based metrics is achieved at λ = 0.7, while the diversity reaches its peak in the mutual promotion stage. Thus, we set λ to 0.7 in all previous experiments.
Case Study Besides the quantitative analysis, we also present some examples (see Table 5) from different models to analyze the performance of the methods qualitatively. They are chosen randomly from the responses produced by the proposed model and shown together with the corresponding contexts and the outputs of the baselines. From case 1, we can observe that SSVN extracts the replier's personal feature that speaker A prefers to acquire further information from others, which guides the generator to produce an interrogative response that promotes replier-consistency. Meanwhile, SSVN also extracts the firm attitude of the replier in case 2 and the pleading tone of the replier in case 3. By contrast, the baselines tend to produce poor responses, such as ones containing more 'unk' tokens.
Error Analysis To improve the performance of SSVN in the future, we take the worse cases (i.e., grades '1' and '2') in human judgement as examples to analyze our errors. Specifically, we divide the errors with grade '1' into grammatical errors, replier-inconsistency and logic contradiction, which account for 19.79%, 31.49% and 48.72%, respectively. We find that 1) logic contradiction makes up most of the errors, as SSVN pays little attention to this issue; 2) although the personal features of the replier are considered, 31.49% of the cases are still replier-inconsistent, indicating that only strengthening the VAE with the vMF distribution may not be a perfect approach to personal feature extraction. As for grade '2', the consistency of the replier's features improves response diversity significantly, but the model still suffers from the "safe response" problem, as the baselines do.
The above analysis sheds light on our future directions: 1) modeling the logic consistency between the context and response; 2) exploring advanced methods for extracting personal features; 3) improving the response diversity.

Conclusion and Future Work
In this work, we propose a semi-supervised stable variational network to address the latent space futility and replier-consistency decay issues. Unlike traditional variational dialog models, the proposed model selects the vMF as the prior and posterior to resolve the latent space futility issue, and integrates an unsupervised extractor to obtain replier-tailored personal features that ensure replier-consistency. Experimental results on two dialog datasets demonstrate the effectiveness of our model, especially on replier-consistency in terms of human evaluation. However, the error analysis shows that there are still challenges in dialogue generation, which we would like to explore in the future.