Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation

The encoder-decoder dialog model is one of the most prominent methods used to build dialog systems in complex domains. Yet it is limited because it cannot output interpretable actions as traditional systems do, which hinders humans from understanding its generation process. We present an unsupervised discrete sentence representation learning method that can integrate with any existing encoder-decoder dialog model for interpretable response generation. Building upon variational autoencoders (VAEs), we present two novel models, DI-VAE and DI-VST, that improve VAEs and can discover interpretable semantics via either auto-encoding or context prediction. Our methods have been validated on real-world dialog datasets to discover semantic representations and enhance encoder-decoder models with interpretable generation.


Introduction
Classic dialog systems rely on developing a meaning representation to represent the utterances from both the machine and human users (Larsson and Traum, 2000; Bohus et al., 2007). The dialog manager of a conventional dialog system outputs the system's next action in a semantic frame that usually contains hand-crafted dialog acts and slot values (Williams and Young, 2007). Then a natural language generation module is used to generate the system's output in natural language based on the given semantic frame. This approach generalizes poorly to more complex domains because it soon becomes intractable to manually design a frame representation that covers all of the fine-grained system actions. The recently developed neural dialog system is one of the most prominent frameworks for developing dialog agents in complex domains. The basic model is based on encoder-decoder networks and can learn to generate system responses without the need for hand-crafted meaning representations and other annotations. Although generative dialog models have advanced rapidly (Serban et al., 2016; Li et al., 2016), they cannot provide interpretable system actions as conventional dialog systems do. This inability limits the effectiveness of generative dialog models in several ways. First, having interpretable system actions enables humans to understand the behavior of a dialog system and better interpret the system's intentions. Also, modeling the high-level decision-making policy in dialogs enables useful generalization and data-efficient domain adaptation (Gašić et al., 2010). Therefore, the motivation of this paper is to develop an unsupervised neural recognition model that can discover interpretable meaning representations of utterances (denoted as latent actions) as a set of discrete latent variables from a large unlabelled corpus, as shown in Figure 1.
The discovered meaning representations will then be integrated with encoder-decoder networks to achieve interpretable dialog generation while preserving all the merits of neural dialog systems.
We focus on learning discrete latent representations instead of dense continuous ones because discrete variables are easier to interpret (van den Oord et al., 2017) and can naturally correspond to categories in natural languages, e.g. topics, dialog acts, etc. Despite the difficulty of learning discrete latent variables in neural networks, the recently proposed Gumbel-Softmax offers a reliable way to back-propagate through discrete variables (Maddison et al., 2016; Jang et al., 2016). However, we found that a simple combination of sentence variational autoencoders (VAEs) (Bowman et al., 2015) and Gumbel-Softmax fails to learn meaningful discrete representations. We then highlight the anti-information limitation of the evidence lower bound objective (ELBO) in VAEs and improve it by proposing Discrete Information VAE (DI-VAE), which maximizes the mutual information between data and latent actions. We further enrich the learning signals beyond auto-encoding by extending Skip Thought (Kiros et al., 2015) to Discrete Information Variational Skip Thought (DI-VST), which learns sentence-level distributional semantics. Finally, an integration mechanism is presented that combines the learned latent actions with encoder-decoder models.
The proposed systems are tested on several real-world dialog datasets. Experiments show that the proposed methods significantly outperform standard VAEs and can discover meaningful latent actions from these datasets. Also, experiments confirm the effectiveness of the proposed integration mechanism and show that the learned latent actions can control the sentence-level attributes of the generated responses and provide human-interpretable meaning representations.

Related Work
Our work is closely related to research in latent variable dialog models. The majority of models are based on Conditional Variational Autoencoders (CVAEs) (Serban et al., 2016; Cao and Clark, 2017) with continuous latent variables to better model the response distribution and encourage diverse responses. Subsequent work further introduced dialog acts to guide the learning of the CVAEs. Discrete latent variables have also been used for task-oriented dialog systems (Wen et al., 2017), where the latent space is used to represent intention. The second line of related work enriches the dialog context encoder with more fine-grained information than the dialog history. Li et al. (2016) captured speakers' characteristics by encoding background information and speaking style into distributed embeddings. Xing et al. (2016) maintain a topic encoding of the conversation based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to encourage the model to output more topic-coherent responses.
The proposed method also relates to sentence representation learning using neural networks. Most work learns continuous distributed representations of sentences from various learning signals (Hill et al., 2016), e.g. Skip Thought learns representations by predicting the previous and next sentences (Kiros et al., 2015). Another area of work has focused on learning regularized continuous sentence representations, which enable sentence generation by sampling the latent space (Bowman et al., 2015; Kim et al., 2017). There is less work on discrete sentence representations due to the difficulty of passing gradients through discrete outputs. The recently developed Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) and vector quantization (van den Oord et al., 2017) enable us to train discrete variables. Notably, discrete variable models have been proposed to discover document topics (Miao et al., 2016) and for semi-supervised sequence transduction (Zhou and Neubig, 2017). Our work differs from these as follows: (1) We focus on learning interpretable variables; in prior research the semantics of latent variables are mostly ignored in the dialog generation setting. (2) We improve the learning objective for discrete VAEs and overcome the well-known posterior collapse issue (Bowman et al., 2015; Chen et al., 2016). (3) We focus on unsupervised learning of salient features in dialog responses instead of hand-crafted features.

Proposed Methods
Our formulation contains three random variables: the dialog context c, the response x and the latent action z. The context often contains the discourse history in the format of a list of utterances. The response is an utterance that contains a list of word tokens. The latent action is a set of discrete variables that define high-level attributes of x. Before introducing the proposed framework, we first identify two key properties that are essential in order for z to be interpretable:
1. z should capture salient sentence-level features about the response x.
2. The meaning of latent symbols z should be independent of the context c.
The first property is self-evident. The second can be explained: assume z contains a single discrete variable with K classes. Since the context c can be any dialog history, if the meaning of each class changes given a different context, then it is difficult to extract an intuitive interpretation by only looking at all responses with class k ∈ [1, K]. Therefore, the second property looks for latent actions that have context-independent semantics so that each assignment of z conveys the same meaning in all dialog contexts.
With the above definition of interpretable latent actions, we first introduce a recognition network R: $q_R(z|x)$ and a generation network G. The role of R is to map a sentence to the latent variable z, and the generator G defines the learning signals that will be used to train z's representation. Notably, our recognition network R does not depend on the context c, unlike in prior work (Serban et al., 2016). The motivation of this design is to encourage z to capture context-independent semantics, which is further elaborated in Section 3.4. With the z learned by R and G, we then introduce an encoder-decoder network F: $p_F(x|z, c)$ and a policy network π: $p_\pi(z|c)$. At test time, given a context c, the policy network and encoder-decoder work together to generate the next response via $\tilde{x} = p_F(x|z \sim p_\pi(z|c), c)$. In short, R, G, F and π are the four components that comprise our proposed framework. The next section first focuses on developing R and G for learning interpretable z and then moves on to integrating R with F and π in Section 3.3.

Learning Sentence Representations from Auto-Encoding
Our baseline model is a sentence VAE with a discrete latent space. We use an RNN as the recognition network to encode the response x. Its last hidden state $h^R_{|x|}$ is used to represent x. We define z to be a set of M K-way categorical variables, $z = \{z_1 ... z_m ... z_M\}$, where the posterior of each $z_m$ is a softmax distribution $q_R(z_m|x)$ computed from $h^R_{|x|}$. During training, we use the Gumbel-Softmax trick to sample from this distribution and obtain low-variance gradients. To map the latent samples to the initial state of the decoder RNN, we define $\{e_1 ... e_m ... e_M\}$ where $e_m \in \mathbb{R}^{K \times D}$ and D is the generator cell size. Thus the initial state of the generator is $h^G_0 = \sum_{m=1}^{M} e_m(z_m)$. Finally, the generator RNN is used to reconstruct the response given $h^G_0$. VAEs are trained to maximize the evidence lower bound objective (ELBO) (Kingma and Welling, 2013). For simplicity, later discussion drops the subscript m in $z_m$ and assumes a single latent z. Since each $z_m$ is independent, we can easily extend the results below to multiple variables.
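The sampling step and the mapping from latent samples to the generator's initial state can be illustrated with a minimal sketch. It assumes the standard Gumbel-Softmax relaxation with a temperature parameter and uses toy sizes; the function names, the temperature value and the random logits are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a K-way categorical variable."""
    noisy = []
    for logit in logits:
        u = rng.random()
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def generator_initial_state(samples, embeddings):
    """Sum over m of z_m (length K) times e_m (K x D), giving a length-D state."""
    D = len(embeddings[0][0])
    h0 = [0.0] * D
    for z_m, e_m in zip(samples, embeddings):
        for k, weight in enumerate(z_m):
            for d in range(D):
                h0[d] += weight * e_m[k][d]
    return h0

rng = random.Random(0)
M, K, D = 3, 5, 4  # toy sizes, chosen only for illustration
logits = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(M)]
embeddings = [[[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]
              for _ in range(M)]
samples = [gumbel_softmax_sample(row, temperature=0.5, rng=rng) for row in logits]
h0 = generator_initial_state(samples, embeddings)
```

With a one-hot sample the sum simply selects one row of each embedding table; with a relaxed sample it mixes rows, which is what keeps the mapping differentiable.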

Anti-Information Limitation of ELBO
It is well known that sentence VAEs are hard to train because of the posterior collapse issue. Many empirical solutions have been proposed: weakening the decoder, adding auxiliary losses, etc. (Bowman et al., 2015; Chen et al., 2016). We argue that the posterior collapse issue lies in ELBO itself, and we offer a novel decomposition to understand its behavior. First, instead of writing ELBO for a single data point, we write it as an expectation over a dataset:
$$\mathcal{L}_{VAE} = \mathbb{E}_x\left[\mathbb{E}_{q_R(z|x)}[\log p_G(x|z)] - \mathrm{KL}(q_R(z|x) \| p(z))\right] \quad (1)$$
We can expand the KL term as Eq. 2 (derivations in Appendix A.1) and rewrite ELBO as Eq. 3:
$$\mathbb{E}_x[\mathrm{KL}(q_R(z|x) \| p(z))] = I(Z, X) + \mathrm{KL}(q(z) \| p(z)) \quad (2)$$
$$\mathcal{L}_{VAE} = \mathbb{E}_{q_R(z|x)p(x)}[\log p_G(x|z)] - I(Z, X) - \mathrm{KL}(q(z) \| p(z)) \quad (3)$$
where $q(z) = \mathbb{E}_x[q_R(z|x)]$ and I(Z, X) is the mutual information between Z and X. This expansion shows that the KL term in ELBO is trying to reduce the mutual information between latent variables and the input data, which explains why VAEs often ignore the latent variable, especially when equipped with powerful decoders.
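The claim that the averaged KL term splits into a mutual information part and a marginal KL part can be checked numerically on a toy example. The sketch below assumes a discrete z, a uniform prior and an empirical data distribution over three equally likely data points; the posterior values are made up for illustration.

```python
import math

def kl(q, p):
    """KL(q || p) for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

# Toy posteriors q(z|x_n) for N = 3 equally likely data points and K = 3 classes.
posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
N = len(posteriors)
K = len(posteriors[0])
prior = [1.0 / K] * K

# Aggregated posterior q(z) = E_x[q(z|x)].
q_z = [sum(row[k] for row in posteriors) / N for k in range(K)]

# The ELBO regularizer averaged over the data: E_x[KL(q(z|x) || p(z))].
expected_kl = sum(kl(row, prior) for row in posteriors) / N

# Mutual information I(Z, X) = E_x[KL(q(z|x) || q(z))].
mutual_info = sum(kl(row, q_z) for row in posteriors) / N

# Distance between the aggregated posterior and the prior.
marginal_kl = kl(q_z, prior)

# expected_kl equals mutual_info + marginal_kl up to floating-point error.
gap = abs(expected_kl - (mutual_info + marginal_kl))
```

Because `mutual_info` enters the averaged KL term with a positive sign, minimizing that term pushes the mutual information toward zero, which is the behavior the decomposition is meant to expose.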

VAE with Information Maximization and Batch Prior Regularization
A natural solution to correct the anti-information issue in Eq. 3 is to maximize both the data likelihood lower bound and the mutual information between z and the input data:
$$\mathcal{L}_{VAE} + I(Z, X) = \mathbb{E}_{q_R(z|x)p(x)}[\log p_G(x|z)] - \mathrm{KL}(q(z) \| p(z)) \quad (4)$$
Therefore, jointly optimizing ELBO and mutual information simply cancels out the information-discouraging term. Also, we can still sample from the prior distribution for generation because of the $\mathrm{KL}(q(z) \| p(z))$ term. Eq. 4 is similar to the objectives used in adversarial autoencoders (Makhzani et al., 2015; Kim et al., 2017). Our derivation provides a theoretical justification for their superior performance. Notably, Eq. 4 arrives at the same loss function proposed in infoVAE (Zhao S et al., 2017). However, our derivation is different, offering a new way to understand ELBO behavior.
The remaining challenge is how to minimize $\mathrm{KL}(q(z) \| p(z))$, since q(z) is an expectation over q(z|x). When z is continuous, prior work has used adversarial training (Makhzani et al., 2015; Kim et al., 2017) or Maximum Mean Discrepancy (MMD) (Zhao S et al., 2017) to regularize q(z). It turns out that minimizing $\mathrm{KL}(q(z) \| p(z))$ for discrete z is much simpler than for its continuous counterparts. Let $x_n$ be a sample from a batch of N data points. Then we have:
$$q(z) \approx \frac{1}{N} \sum_{n=1}^{N} q(z|x_n) = q'(z) \quad (5)$$
where q'(z) is a mixture of softmax from the posteriors $q(z|x_n)$ of each $x_n$. We can approximate $\mathrm{KL}(q(z) \| p(z))$ by:
$$\mathrm{KL}(q'(z) \| p(z)) = \sum_{k=1}^{K} q'(z = k) \log \frac{q'(z = k)}{p(z = k)} \quad (6)$$
We refer to Eq. 6 as Batch Prior Regularization (BPR). When N approaches infinity, q'(z) approaches the true marginal distribution q(z).
In practice, we only need to use the data from each mini-batch, assuming that the mini-batches are randomized. Last, BPR is fundamentally different from multiplying the KL term in a VAE by a coefficient < 1 to anneal it (Bowman et al., 2015). This is because BPR involves a non-linear log-sum-exp operation. For later discussion, we denote our discrete infoVAE with BPR as DI-VAE.
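To make the difference between BPR and the usual per-example KL term concrete, here is a small sketch that computes both on one mini-batch. It assumes a single K-way latent variable and a uniform prior; the function names and the batch values are illustrative, not taken from the paper's code.

```python
import math

def batch_prior_regularization(posteriors, prior):
    """KL(q'(z) || p(z)), where q'(z) is the mini-batch average of q(z|x_n)."""
    N = len(posteriors)
    K = len(prior)
    q_prime = [sum(row[k] for row in posteriors) / N for k in range(K)]
    return sum(q * math.log(q / p) for q, p in zip(q_prime, prior) if q > 0.0)

def per_example_kl(posteriors, prior):
    """Standard ELBO regularizer: the mean of KL(q(z|x_n) || p(z)) over the batch."""
    total = 0.0
    for row in posteriors:
        total += sum(q * math.log(q / p) for q, p in zip(row, prior) if q > 0.0)
    return total / len(posteriors)

K = 4
prior = [1.0 / K] * K
# Four confident posteriors, each concentrating on a different class.
batch = [[0.97 if k == n else 0.01 for k in range(K)] for n in range(K)]

bpr = batch_prior_regularization(batch, prior)   # close to 0: batch average is uniform
elbo_kl = per_example_kl(batch, prior)           # large: each posterior is far from uniform

# With a single example the batch average is the posterior itself,
# so the two regularizers coincide.
single_bpr = batch_prior_regularization(batch[:1], prior)
single_kl = per_example_kl(batch[:1], prior)
```

In this toy batch the per-example KL penalizes every confident posterior, while BPR is near zero because the posteriors are informative individually yet match the prior on average.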

Learning Sentence Representations from the Context
DI-VAE infers sentence representations by reconstructing the input sentence. Past research in distributional semantics has suggested that the meaning of language can be inferred from the adjacent context (Harris, 1954; Hill et al., 2016). The distributional hypothesis is especially applicable to dialog since utterance meaning is highly contextual. For example, the dialog act is a well-known utterance feature and depends on dialog state (Austin, 1975; Stolcke et al., 2000). Thus, we introduce a second type of latent action based on sentence-level distributional semantics. Skip Thought (ST) is a powerful sentence representation that captures contextual information (Kiros et al., 2015). ST uses an RNN to encode a sentence, and then uses the resulting sentence representation to predict the previous and next sentences. Inspired by ST's robust performance across multiple tasks (Hill et al., 2016), we adapt our DI-VAE to Discrete Information Variational Skip Thought (DI-VST) to learn discrete latent actions that model distributional semantics of sentences. We use the same recognition network from DI-VAE to output z's posterior distribution $q_R(z|x)$. Given the samples from $q_R(z|x)$, two RNN generators are used to predict the previous sentence $x_p$ and the next sentence $x_n$. Finally, the learning objective is to maximize:
$$\mathcal{L}_{DI\text{-}VST} = \mathbb{E}_{q_R(z|x)p(x)}[\log(p^n_G(x_n|z)\, p^p_G(x_p|z))] - \mathrm{KL}(q(z) \| p(z))$$

Integration with Encoder Decoders
We now describe how to integrate a given $q_R(z|x)$ with an encoder-decoder and a policy network. Let the dialog context c be a sequence of utterances. Then a dialog context encoder network can encode the dialog context into a distributed representation $h^e = F^e(c)$. The decoder $F^d$ can generate the response $\tilde{x} = F^d(h^e, z)$ using samples from $q_R(z|x)$. Meanwhile, we train π to predict the aggregated posterior $\mathbb{E}_{p(x|c)}[q_R(z|x)]$ from c via maximum likelihood training. This model is referred to as the Latent Action Encoder Decoder (LAED), with the following objective:
$$\mathcal{L}_{LAED} = \mathbb{E}_{q_R(z|x)p(x,c)}[\log p_\pi(z|c) + \log p_F(x|z, c)]$$
Also, simply augmenting the inputs of the decoders with latent actions does not guarantee that the generated response exhibits the attributes of the given action. Thus we use the controllable text generation framework (Hu et al., 2017) by introducing $\mathcal{L}_{Attr}$, which reuses the same recognition network $q_R(z|x)$ as a fixed discriminator to penalize the decoder if its generated responses do not reflect the attributes in z.
Since it is not possible to propagate gradients through the discrete outputs of $F^d$ at each word step, we use a deterministic continuous relaxation (Hu et al., 2017) by replacing the output of $F^d$ with the probability of each word. Let $o_t$ be the normalized probability at step t ∈ [1, |x|]; the inputs to $q_R$ at time t are then the sum of word embeddings weighted by $o_t$, i.e. $h^R_t = \mathrm{RNN}(h^R_{t-1}, E o_t)$, where E is the word embedding matrix. Finally, this loss is combined with $\mathcal{L}_{LAED}$ through a hyperparameter λ, i.e. $\mathcal{L}_{LAED} + \lambda \mathcal{L}_{Attr}$, to obtain Attribute Forcing LAED.
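The continuous relaxation described above replaces a hard word lookup with a probability-weighted mixture of word embeddings. A minimal sketch follows, assuming a hypothetical 3-word vocabulary with 2-dimensional embeddings; the embedding table is stored here as one row per word, so the product written as E o_t in the text becomes a weighted sum of rows.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(word_probs, embedding_rows):
    """Return the probability-weighted sum of word embeddings."""
    dim = len(embedding_rows[0])
    mixed = [0.0] * dim
    for prob, vector in zip(word_probs, embedding_rows):
        for d in range(dim):
            mixed[d] += prob * vector[d]
    return mixed

# Hypothetical vocabulary of 3 words with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

o_t = softmax([2.0, 0.5, -1.0])                  # decoder word distribution at one step
soft_input = soft_embedding(o_t, E)              # differentiable input to the recognition RNN
hard_input = soft_embedding([0.0, 1.0, 0.0], E)  # one-hot limit: plain lookup of word 1
```

The soft input is a smooth function of the decoder's output distribution, so a loss computed by the recognition network can in principle be back-propagated into the decoder, which a hard argmax lookup would not allow.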

Relationship with Conditional VAEs
It is not hard to see that $\mathcal{L}_{LAED}$ is closely related to the objective of CVAEs for dialog generation (Serban et al., 2016), which is:
$$\mathcal{L}_{CVAE} = \mathbb{E}_{q(z|x,c)}[\log p(x|z, c)] - \mathrm{KL}(q(z|x, c) \| p(z|c))$$
Despite their similarities, we highlight the key differences that prevent CVAEs from achieving interpretable dialog generation. First, $\mathcal{L}_{CVAE}$ encourages I(x, z|c) (Agakov, 2005), which learns a z that captures context-dependent semantics. More intuitively, z in a CVAE is trained to generate x via p(x|z, c), so the meaning of the learned z can only be interpreted along with its context c. This violates our goal of learning context-independent semantics. Our methods learn a $q_R(z|x)$ that depends only on x and train $q_R$ separately to ensure that the semantics of z are interpretable on their own.

Experiments and Results
The proposed methods are evaluated on four datasets. The first corpus is Penn Treebank (PTB) (Marcus et al., 1993), used to evaluate sentence VAEs (Bowman et al., 2015). We used the version pre-processed by Mikolov et al. (2010). The second dataset is the Stanford Multi-Domain Dialog (SMD) dataset, which contains 3,031 human-Woz, task-oriented dialogs collected from 3 different domains (navigation, weather and scheduling) (Eric and Manning, 2017). The other two datasets are chat-oriented: Daily Dialog (DD) and Switchboard (SW) (Godfrey and Holliman, 1997), which are used to test whether our methods can generalize beyond task-oriented dialogs to open-domain chatting. DD contains 13,118 multi-turn human-human dialogs annotated with dialog acts and emotions (Li et al., 2017). SW has 2,400 human-human telephone conversations that are annotated with topics and dialog acts. SW is a more challenging dataset because it is transcribed from speech, which contains complex spoken language phenomena, e.g. hesitation, self-repair, etc.

Comparing Discrete Sentence Representation Models
The first experiment used PTB and DD to evaluate the performance of the proposed methods in learning discrete sentence representations. We implemented DI-VAE and DI-VST using GRU-RNN (Chung et al., 2014) and trained them using Adam (Kingma and Ba, 2014). Besides the proposed methods, the following baselines are compared. Unregularized models: removing the $\mathrm{KL}(q \| p)$ term from DI-VAE and DI-VST leads to a simple discrete autoencoder (DAE) and discrete skip thought (DST) with stochastic discrete hidden units. ELBO models: the basic discrete sentence VAE (DVAE) or variational skip thought (DVST) that optimizes ELBO with the regularization term $\mathrm{KL}(q(z|x) \| p(z))$. We found that standard training failed to learn informative latent actions for either DVAE or DVST because of posterior collapse. Therefore, KL-annealing (Bowman et al., 2015) and bag-of-word loss are used to force these two models to learn meaningful representations. We also include the results for a VAE with continuous latent variables reported on the same PTB data. Additionally, we report the perplexity from a standard GRU-RNN language model (Zaremba et al., 2014). The evaluation metrics include reconstruction perplexity (PPL), $\mathrm{KL}(q(z) \| p(z))$ and the mutual information between input data and latent variables, I(x, z). Intuitively, a good model should achieve low perplexity and KL distance, and simultaneously achieve high I(x, z). The discrete latent space for all models uses M = 20 and K = 10. Table 1 shows that all models achieve better perplexity than an RNNLM, which shows that they manage to learn a meaningful q(z|x). First, for auto-encoding models, DI-VAE achieves the best results in all metrics compared to the other methods. We found that DAEs quickly learn to reconstruct the input but are prone to overfitting during training, which leads to lower performance on the test data compared to DI-VAE.
Also, since there is no regularization term in the latent space, q(z) is very different from p(z), which prevents us from generating sentences from the latent space. In fact, DI-VAE enjoys the same linear interpolation properties reported in Bowman et al. (2015) (see Appendix A.2). As for DVAE, it achieves zero I(x, z) in standard training and only manages to learn some information when trained with KL-annealing and bag-of-word loss. On the other hand, our methods achieve robust performance without the need for additional processing. Similarly, the proposed DI-VST achieves the lowest PPL and a similar KL compared to the strongly regularized DVST. Interestingly, although DST achieves the highest I(x, z), its PPL is not further improved. These results confirm the effectiveness of the proposed BPR in regularizing q(z) while learning a meaningful posterior q(z|x).
In order to understand BPR's sensitivity to batch size N, a follow-up experiment varied the batch size from 2 to 60 (if N = 1, DI-VAE is equivalent to DVAE). Figure 2 shows that as N increases, perplexity and I(x, z) improve monotonically, while $\mathrm{KL}(q \| p)$ only increases from 0 to 0.159. Beyond N = 30, the performance plateaus. Therefore, using mini-batches is an efficient trade-off between q(z) estimation and computation speed.
The last experiment in this section investigates the relation between representation learning and the dimension of the latent space. We set a fixed budget by restricting the maximum number of modes to be about 1000, i.e. $K^M \approx 1000$. We then vary the latent space size and report the same evaluation metrics. Table 2 shows that models with multiple small latent variables perform significantly better than those with few large latent variables.
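The mode budget is simple arithmetic: M independent K-way variables can jointly take K to the power M distinct values. The short sketch below lists a few hypothetical (K, M) pairs near the budget of 1000; these pairs are examples chosen here and are not claimed to be the configurations evaluated in Table 2.

```python
def num_modes(K, M):
    """Number of joint assignments of M independent K-way categorical variables."""
    return K ** M

# Hypothetical configurations near a budget of about 1000 modes.
candidates = [(1000, 1), (32, 2), (10, 3), (4, 5), (2, 10)]
budget_table = {(K, M): num_modes(K, M) for K, M in candidates}

# Sizes mentioned elsewhere in the text.
interpretation_setting = num_modes(5, 3)   # M = 3, K = 5
```

Configurations with many small variables reach the same number of modes with far fewer output units (K times M) than a single large variable, which is one way to read the comparison reported in Table 2.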

Interpreting Latent Actions
The next question is how to interpret the meaning of the learned latent action symbols. To achieve this, the latent action of an utterance $x_n$ is obtained from a greedy mapping: $a_n = \arg\max_k q_R(z = k|x_n)$.
We set M = 3 and K = 5, so that there are at most 125 different latent actions, and each $x_n$ can now be represented by $a_1$-$a_2$-$a_3$, e.g. "How are you?" → 1-4-2. Even assuming that we have access to data manually clustered according to certain classes (e.g. dialog acts), it is unfair to use classic cluster measures (Vinh et al., 2010) to evaluate the clusters from latent actions. This is because the uniform prior p(z) evenly distributes the data over all possible latent actions, so frequent classes are expected to be assigned to several latent actions. Thus we utilize the homogeneity metric (Rosenberg and Hirschberg, 2007), which measures whether each latent action contains only members of a single class. We tested this on SW and DD, which contain human-annotated features, and we report the latent actions' homogeneity w.r.t. these features in Table 3.

Table 3: Homogeneity of the latent actions with respect to the annotated features on SW and DD.
Model | SW Act | SW Topic | DD Act | DD Emotion
DI-VAE | 0.48 | 0.08 | 0.18 | 0.09
DI-VST | 0.33 | 0.13 | 0.34 | 0.12

On DD, results show that DI-VST works better than DI-VAE in terms of creating actions that are more coherent for emotion and dialog acts. The results are interesting on SW, since DI-VST performs worse on dialog acts than DI-VAE. One reason is that the dialog acts in SW are more fine-grained (42 acts) than the ones in DD (5 acts), so that distinguishing utterances based on the words in x is more important than the information in the neighbouring utterances.
We then apply the proposed methods to SMD, which has no manual annotation and contains task-oriented dialogs. Two experts are shown 5 randomly selected utterances from each latent action and are asked to give an action name that can describe as many of the utterances as possible. Then an Amazon Mechanical Turk study is conducted to evaluate whether other utterances from the same latent action match these titles. Five workers see the action name and a different group of 5 utterances from that latent action.
They are asked to select all utterances that belong to the given actions, which tests the homogeneity of the utterances falling in the same cluster. Negative samples are included to prevent random selection. Table 4 shows that both methods work well and DI-VST achieved better homogeneity than DI-VAE.
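The greedy mapping and the homogeneity score used in this section can be sketched in a few lines. The code assumes the usual entropy-based definition of homogeneity, h = 1 - H(C|K)/H(C), for classes C and clusters K; the labels and cluster identifiers are toy values, and the indices are zero-based.

```python
import math
from collections import Counter

def greedy_action(posteriors):
    """Map M posteriors to an action id such as '1-0' via a per-variable argmax."""
    picks = [max(range(len(q)), key=lambda k: q[k]) for q in posteriors]
    return "-".join(str(k) for k in picks)

def homogeneity(classes, clusters):
    """h = 1 - H(C|K) / H(C); equals 1.0 when every cluster holds a single class."""
    n = len(classes)
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    joint_counts = Counter(zip(classes, clusters))
    h_c = -sum((c / n) * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0  # a single class is trivially homogeneous
    h_c_given_k = -sum(
        (count / n) * math.log(count / cluster_counts[k])
        for (_, k), count in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c

action = greedy_action([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])

labels = ["question", "question", "inform", "inform"]
# The frequent class "question" is split over two latent actions, each of them pure.
pure = homogeneity(labels, ["0-1", "2-0", "1-1", "1-1"])
# All utterances fall into one latent action that mixes both classes.
mixed = homogeneity(labels, ["0-0", "0-0", "0-0", "0-0"])
```

The first clustering scores 1.0 even though one class is spread over two latent actions, which is the reason homogeneity suits a model whose uniform prior spreads frequent classes over several actions.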
Since DI-VAE is trained to reconstruct its input and DI-VST is trained to model the context, they group utterances in different ways. For example, DI-VST would group "Can I get a restaurant", "I am looking for a restaurant" into one action where

Dialog Response Generation with Latent Actions
Finally, we implement an LAED as follows. The encoder is a hierarchical recurrent encoder (Serban et al., 2016) with bi-directional GRU-RNNs as the utterance encoder and a second GRU-RNN as the discourse encoder. The discourse encoder outputs its last hidden state $h^e_{|x|}$. The decoder is another GRU-RNN, and its initial state is obtained by $h^d_0 = h^e_{|x|} + \sum_{m=1}^{M} e_m(z_m)$, where z comes from the recognition network of the proposed methods. The policy network π is a 2-layer multi-layer perceptron (MLP) that models $p_\pi(z|h^e_{|x|})$. We use up to the previous 10 utterances as the dialog context and denote the LAED using DI-VAE latent actions as AE-ED and the one using DI-VST as ST-ED.
First we need to confirm whether an LAED can generate responses that are consistent with the semantics of a given z. To answer this, we use a pre-trained recognition network R to check if a generated response carries the attributes in the given action. We generate dialog responses on a test dataset via $\tilde{x} = F(z \sim \pi(c), c)$ with greedy RNN decoding. The generated responses are passed into R, and we measure attribute accuracy by counting $\tilde{x}$ as correct if $z = \arg\max_k q_R(k|\tilde{x})$.
Table 6: Results for attribute accuracy with and without attribute loss.
The results in Table 6 show that the generated responses are highly consistent with the given latent actions. Also, latent actions from DI-VAE achieve higher attribute accuracy than the ones from DI-VST, because z from auto-encoding is explicitly trained for x reconstruction. Adding $\mathcal{L}_{Attr}$ is effective in forcing the decoder to take z into account during its generation, which helps the most on the more challenging open-domain chatting data, e.g. SW and DD. The accuracy of ST-ED on SW is worse than on the other two datasets. The reason is that SW contains many short utterances that can be either a continuation of the same speaker or a new turn from the other speaker, whereas the responses in the other two domains are always followed by a different speaker. The more complex context pattern in SW may require special treatment, which we leave for future work.
The second experiment checks if the policy network π is able to predict the right latent action given just the dialog context. We report both accuracy, i.e. $\arg\max_k q_R(k|x) = \arg\max_{k'} p_\pi(k'|c)$, and the perplexity of $p_\pi(z|c)$. The perplexity measure is more useful for open-domain dialogs because decision-making in complex dialogs is often one-to-many given a similar context. These scores are reported for the three dialog datasets and provide useful insights into the complexity of a dialog dataset. For example, achieving high accuracy on open-domain chatting is harder than on the task-oriented SMD data. Also, it is intuitive that predicting system actions is easier than predicting user actions on SMD. In general, the prediction scores for ST-ED are higher than the ones for AE-ED. The reason is related to our previous discussion about the granularity of the latent actions. Since latent actions from DI-VST mainly model the type of utterances used in certain types of context, it is easier for the policy network to predict latent actions from DI-VST. Therefore, choosing the type of latent actions is a design choice and depends on the type of interpretability that is needed.
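The two policy metrics discussed above can be sketched as follows. The code assumes a single latent variable, takes the recognition network's argmax as the reference action, and uses one common definition of perplexity, the exponential of the average negative log-probability assigned to the reference; the paper does not spell out its exact formula, so this definition is an assumption.

```python
import math

def policy_scores(reference_actions, policy_probs):
    """Return (accuracy, perplexity) of a policy against reference latent actions.

    reference_actions: class indices, e.g. the argmax of q_R(k|x) per response.
    policy_probs: one distribution p_pi(.|c) per dialog context.
    """
    correct = 0
    log_likelihood = 0.0
    for gold, probs in zip(reference_actions, policy_probs):
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        correct += int(predicted == gold)
        log_likelihood += math.log(probs[gold])
    n = len(reference_actions)
    return correct / n, math.exp(-log_likelihood / n)

# Toy example with K = 2: the second context is genuinely ambiguous.
accuracy, perplexity = policy_scores(
    [0, 1, 1],
    [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]],
)

# A policy that is uniform over K = 4 actions has perplexity 4.
_, uniform_ppl = policy_scores([2], [[0.25, 0.25, 0.25, 0.25]])
```

In the toy example the ambiguous context counts as a plain error for accuracy, while perplexity still gives the policy partial credit for placing half of its probability mass on the reference action, which is why it is the more informative measure in one-to-many settings.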
We finish with an example generated from the two variants of LAED on SMD, as shown in Table 8. Given a dialog context, our systems are able to output a probability distribution over different latent actions that have interpretable meanings along with their natural language realizations.
Table 8: Interpretable dialog generation on SMD with top probable latent actions. AE-ED predicts more fine-grained but more error-prone actions.

Conclusion and Future Work
This paper presents a novel unsupervised framework that enables the discovery of discrete latent actions and interpretable dialog response generation. Our main contributions reside in the two sentence representation models, DI-VAE and DI-VST, and their integration with encoder-decoder models. Experiments show that the proposed methods outperform strong baselines in learning discrete latent variables and showcase the effectiveness of interpretable dialog response generation. Our findings also suggest promising future research directions, including learning better context-based latent actions and using reinforcement learning to adapt policy networks. We believe that this work is an important step towards creating generative dialog models that can not only generalize to large unlabelled datasets in complex domains but also be explainable to human users.