Conditional Generation and Snapshot Learning in Neural Dialogue Systems

Recently a variety of LSTM-based conditional language models (LM) have been applied across a range of language generation tasks. In this work we study various model architectures and different ways to represent and aggregate the source information in an end-to-end neural dialogue system framework. A method called snapshot learning is also proposed to facilitate learning from supervised sequential signals by applying a companion cross-entropy objective function to the conditioning vector. The experimental and analytical results demonstrate firstly that competition occurs between the conditioning vector and the LM, and the differing architectures provide different trade-offs between the two. Secondly, the discriminative power and transparency of the conditioning vector are key to providing both model interpretability and better performance. Thirdly, snapshot learning leads to consistent performance improvements independent of which architecture is used.


Introduction
Recurrent Neural Network (RNN)-based conditional language models (LM) have been shown to be very effective in solving a number of real world problems, such as machine translation (MT) (Cho et al., 2014) and image caption generation (Karpathy and Fei-Fei, 2015). Recently, RNNs were applied to the task of generating sentences from an explicit semantic representation (Wen et al., 2015a). Attention-based methods (Mei et al., 2016) and Long Short-term Memory (LSTM)-like (Hochreiter and Schmidhuber, 1997) gating mechanisms (Wen et al., 2015b) have both been studied to improve generation quality. Although it is now clear that LSTM-based conditional LMs can generate plausible natural language, less effort has been put into comparing the different model architectures. Furthermore, conditional generation models are typically tested on relatively straightforward tasks that are conditioned on a single source (e.g. a sentence or an image) and where the goal is to optimise a single metric (e.g. BLEU). In this work, we study the use of conditional LSTMs in the generation component of neural network (NN)-based dialogue systems, which depend on multiple conditioning sources and must optimise multiple metrics.
Neural conversational agents (Vinyals and Le, 2015; Shang et al., 2015) are direct extensions of the sequence-to-sequence model (Sutskever et al., 2014) in which a conversation is cast as a source to target transduction problem. However, these models are still far from real world applications because they lack any capability for supporting domain specific tasks, for example, being able to interact with databases (Sukhbaatar et al., 2015; Yin et al., 2016) and aggregate useful information into their responses. Recent work by Wen et al. (2016a), however, proposed an end-to-end trainable neural dialogue system that can assist users to complete specific tasks. Their system used both distributed and symbolic representations to capture user intents, and these collectively condition a NN language generator to generate system responses. Due to the diversity of the conditioning information sources, the best way to represent and combine them is non-trivial.
In Wen et al. (2016a), the objective function for learning the dialogue policy and language generator depends solely on the likelihood of the output sentences. However, this sequential supervision signal may not be informative enough to learn a good conditioning vector representation, resulting in a generation process which is dominated by the LM. This can often lead to inappropriate system outputs.
In this paper, we therefore also investigate the use of snapshot learning, which attempts to mitigate this problem by heuristically applying companion supervision signals to a subset of the conditioning vector. This idea is similar to deeply supervised nets (Lee et al., 2015) in which the final cost from the output layer is optimised together with the companion signals generated from each intermediary layer. We have found that snapshot learning offers several benefits: (1) it consistently improves performance; (2) it learns discriminative and robust feature representations and alleviates the vanishing gradient problem; (3) it appears to learn transparent and interpretable subspaces of the conditioning vector.

Related Work
Machine learning approaches to task-oriented dialogue system design have cast the problem as a partially observable Markov Decision Process (POMDP) (Young et al., 2013) with the aim of using reinforcement learning (RL) to train dialogue policies online through interactions with real users (Gašić et al., 2013). In order to make RL tractable, the state and action space must be carefully designed (Young et al., 2010) and the understanding (Henderson et al., 2014; Mrkšić et al., 2015) and generation (Wen et al., 2015b; Wen et al., 2016b) modules were assumed available or trained standalone on supervised corpora. Due to the underlying hand-coded semantic representation (Traum, 1999), the conversation is far from natural and the comprehension capability is limited. This motivates the use of neural networks to model dialogues from end to end as a conditional generation problem.
Interest in generating natural language using NNs can be attributed to the success of RNN LMs for large vocabulary speech recognition (Mikolov et al., 2010; Mikolov et al., 2011). Sutskever et al. (2011) showed that plausible sentences can be obtained by sampling characters one by one from the output layer of an RNN. By conditioning an LSTM on a sequence of characters, Graves (2013) showed that machines can synthesise handwriting indistinguishable from that of a human. Later on, this conditional generation idea was tried in several research fields, for example, generating image captions by conditioning an RNN on a convolutional neural network (CNN) output (Karpathy and Fei-Fei, 2015; Xu et al., 2015); translating a source to a target language by conditioning a decoder LSTM on top of an encoder LSTM (Cho et al., 2014; Bahdanau et al., 2015); or generating natural language by conditioning on a symbolic semantic representation (Wen et al., 2015b; Mei et al., 2016). Among all these methods, attention-based mechanisms (Bahdanau et al., 2015; Hermann et al., 2015; Ling et al., 2016) have been shown to be very effective at improving performance using a dynamic source aggregation strategy.
To model dialogue as conditional generation, a sequence-to-sequence learning (Sutskever et al., 2014) framework has been adopted. Vinyals and Le (2015) trained the same model on several conversation datasets and showed that the model can generate plausible conversations. However, Serban et al. (2015b) discovered that the majority of the generated responses are generic due to the maximum likelihood criterion, which was later addressed by Li et al. (2016a) using a maximum mutual information decoding strategy. Furthermore, the lack of a consistent system persona was also studied in Li et al. (2016b). Despite its demonstrated potential, a major barrier for this line of research is data collection. Many works (Lowe et al., 2015; Serban et al., 2015a; Dodge et al., 2016) have investigated conversation datasets for developing chat bot or QA-like general purpose conversation agents. However, collecting data to develop goal oriented dialogue systems that can help users to complete a task in a specific domain remains difficult. In a recent work by Wen et al. (2016a), this problem was addressed by designing an online, parallel version of Wizard-of-Oz data collection (Kelley, 1984) which allows large scale and cheap in-domain conversation data to be collected using Amazon Mechanical Turk. An NN-based dialogue model was also proposed to learn from the collected dataset and was shown to be able to assist human subjects to complete specific tasks.
Snapshot learning can be viewed as a special form of weak supervision (also known as distant or self supervision) (Craven and Kumlien, 1999; Snow et al., 2004), in which supervision signals are heuristically labelled by matching unlabelled corpora with entities or attributes in a structured database. It has been widely applied to relation extraction (Mintz et al., 2009) and information extraction (Hoffmann et al., 2011) in which facts from a knowledge base (e.g. Freebase) were used as objectives to train classifiers. Recently, self supervision was also used in memory networks (Hill et al., 2016) to improve the discriminative power of memory attention. Conceptually, snapshot learning is related to curriculum learning (Bengio et al., 2009). Instead of learning easier examples before difficult ones, snapshot learning creates an easier target for each example. In practice, snapshot learning is similar to deeply supervised nets (Lee et al., 2015) in which companion objectives are generated from intermediary layers and optimised together with the output objective.

Neural Dialogue System
The testbed for this work is a neural network-based task-oriented dialogue system proposed by Wen et al. (2016a). The model casts dialogue as a source to target sequence transduction problem (modelled by a sequence-to-sequence architecture (Sutskever et al., 2014)) augmented with the dialogue history (modelled by a belief tracker (Henderson et al., 2014)) and the current database search outcome (modelled by a database operator). The model consists of both encoder and decoder modules. The details of each module are given below.

Encoder Module
At each turn $t$, the goal of the encoder is to produce a distributed representation of the system action $m_t$, which is then used to condition a decoder to generate the next system response in skeletal form. It consists of four submodules: intent network, belief tracker, database operator, and policy network.

Intent Network The intent network takes a sequence of tokens and converts it into a sentence embedding representing the user intent using an LSTM network. The hidden layer of the LSTM at the last encoding step $z_t$ is taken as the representation. As mentioned in Wen et al. (2016a), this representation can be viewed as a distributed version of the speech act (Traum, 1999) used in traditional systems.

Belief Trackers In addition to the intent network, the neural dialogue system uses a set of slot-based belief trackers (Henderson et al., 2014; Mrkšić et al., 2015) to track user requests. By taking each user input as new evidence, the task of a belief tracker is to maintain a multinomial distribution $p$ over values $v \in V_s$ for each informable slot $s$, and a binary distribution for each requestable slot. These probability distributions $p^s_t$ are called belief states of the system. The belief states $p^s_t$, together with the intent vector $z_t$, can be viewed as the system's comprehension of the user requests up to turn $t$.

Database Operator Based on the belief states $p^s_t$, a DB query is formed by taking the union of the maximum values of each informable slot. A vector $x_t$ representing different degrees of matching in the DB (no match, 1 match, ... or more than 5 matches) is produced by counting the number of matched entities and expressing it as a 6-bin 1-hot encoding. If the number of matches is not zero, an associated entity pointer is maintained which identifies one of the matching DB entities selected at random. The entity pointer is updated if the current entity no longer matches the search criteria; otherwise it stays the same.

Policy Network Based on the vectors $z_t$, $p^s_t$, and $x_t$ from the above three modules, the policy network combines them into a single action vector $m_t$ by a three-way matrix transformation,

$$m_t = \tanh\Big(W_{zm} z_t + W_{xm} x_t + \sum_{s \in G} W^s_{pm} p^s_t\Big), \qquad (1)$$

where matrices $W_{zm}$, $W^s_{pm}$, and $W_{xm}$ are parameters and $G$ is the domain ontology.
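The database operator can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the implementation used in the system: the function name `db_match_vector`, the dictionary layout of the belief states and database, the special values "dontcare" and "none", and the choice to let the last bin cover five or more matches are all hypothetical.

```python
from typing import Dict, List


def db_match_vector(belief: Dict[str, Dict[str, float]],
                    database: List[Dict[str, str]]) -> List[int]:
    """Return a 6-bin one-hot vector describing how many entities match.

    belief maps each informable slot to a distribution over its values
    (assumed layout); database is a list of entity records.
    """
    # The query takes the most probable value of each informable slot.
    query = {slot: max(dist, key=dist.get) for slot, dist in belief.items()}

    # Assumption: "dontcare" and "none" place no constraint on the search.
    unconstrained = ("dontcare", "none")
    matches = sum(
        all(value in unconstrained or entity.get(slot) == value
            for slot, value in query.items())
        for entity in database
    )

    # Bins 0..4 hold exact counts; the last bin is assumed to mean "5 or more".
    bins = [0] * 6
    bins[min(matches, 5)] = 1
    return bins
```

Because the vector only records the degree of matching, the identity of the selected entity has to be tracked separately, which is the role of the entity pointer described above.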

Decoder Module
Conditioned on the system action vector $m_t$ provided by the encoder module, the decoder module uses a conditional LSTM LM to generate the required system output token by token in skeletal form. The final system response can then be formed by substituting the actual values of the database entries into the skeletal sentence structure.
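The final substitution step can be sketched as follows. This is only an illustration: the function name `lexicalise` and the entity record layout are hypothetical, and the sketch assumes that value tokens take the form `[v.slot]`, as in the examples used elsewhere in this paper.

```python
import re
from typing import Dict


def lexicalise(skeleton: str, entity: Dict[str, str]) -> str:
    """Replace each [v.slot] token with the entity's value for that slot."""

    def substitute(match):
        slot = match.group(1)
        # Tokens with no corresponding database field are left unchanged.
        return entity.get(slot, match.group(0))

    return re.sub(r"\[v\.(\w+)\]", substitute, skeleton)


# Hypothetical entity and skeletal response.
entity = {"name": "Example Bistro", "area": "south"}
print(lexicalise("[v.name] is in the [v.area] part of town .", entity))
```

Generating in skeletal form keeps the output vocabulary small, since the decoder never has to produce entity names or addresses directly.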

Conditional Generation Network
In this paper we study and analyse three different variants of LSTM-based conditional generation architectures:

Language Model Type The most straightforward way to condition the LSTM network on additional source information is to concatenate the conditioning vector $m_t$ together with the input word embedding $w_j$ and previous hidden layer $h_{j-1}$,

$$\begin{pmatrix} i_j \\ f_j \\ o_j \\ \hat{c}_j \end{pmatrix} = \begin{pmatrix} \mathrm{sigmoid} \\ \mathrm{sigmoid} \\ \mathrm{sigmoid} \\ \tanh \end{pmatrix} W_{4n,3n} \begin{pmatrix} m_t \\ w_j \\ h_{j-1} \end{pmatrix}, \qquad c_j = f_j \odot c_{j-1} + i_j \odot \hat{c}_j, \qquad h_j = o_j \odot \tanh(c_j),$$

where index $j$ is the generation step, $n$ is the hidden layer size, $i_j, f_j, o_j \in [0, 1]^n$ are input, forget, and output gates respectively, $\hat{c}_j$ and $c_j$ are proposed cell value and true cell value at step $j$, and $W_{4n,3n}$ are the model parameters.

Memory Type The memory type network is shown in Figure 1b, in which the conditioning vector $m_t$ is governed by a standalone reading gate $r_j$. This reading gate decides how much information should be read from the conditioning vector and directly writes it into the memory cell $c_j$; $W_c$ is another weight matrix to learn. The idea behind this is that the model isolates the conditioning vector from the LM so that the model has more flexibility to learn to trade off between the two.

Hybrid Type Continuing with the same idea as the memory type network, a complete separation of conditioning vector and LM (except for the gate controlling the signals) is provided by the hybrid type network shown in Figure 1c. This model was motivated by the fact that long-term dependency is not needed for the conditioning vector because we apply this information at every step $j$ anyway. The decoupling of the conditioning vector and the LM is attractive because it leads to better interpretability of the results and provides the potential to learn a better conditioning vector and LM.
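The three variants differ mainly in where the conditioning vector enters the recurrence. The Python sketch below illustrates that difference for a single step, taking the gate activations as given. It is a sketch under assumptions rather than the exact formulation: the function name `conditional_lstm_step` and the variant labels are hypothetical, the memory type is assumed to add the gated conditioning vector to the cell, and the hybrid type is assumed to add it to the hidden output.

```python
import math
from typing import List, Tuple

Vector = List[float]


def _mul(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def _add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def conditional_lstm_step(variant: str, i: Vector, f: Vector, o: Vector,
                          r: Vector, c_hat: Vector, c_prev: Vector,
                          m: Vector) -> Tuple[Vector, Vector]:
    """One cell/hidden update given precomputed gates; returns (c_j, h_j)."""
    if variant not in ("lm", "mem", "hybrid"):
        raise ValueError("unknown variant: " + variant)

    # Standard LSTM cell update shared by all variants.
    c = _add(_mul(f, c_prev), _mul(i, c_hat))
    if variant == "mem":
        # Assumed memory type: the reading gate writes m into the cell.
        c = _add(c, _mul(r, m))

    h = _mul(o, [math.tanh(x) for x in c])
    if variant == "hybrid":
        # Assumed hybrid type: m bypasses the cell and is added to the output.
        h = _add(h, _mul(r, m))

    # In the "lm" variant, m influences the step only through the gate and
    # proposed cell inputs, so it does not appear in this function body.
    return c, h
```

With all gates open and an empty cell, the `lm` call leaves the state at zero, the `mem` call stores `m` in the cell, and the `hybrid` call passes `m` straight to the hidden output, which is the separation the text describes.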

Attention and Belief Representation
Attention An attention-based mechanism provides an effective approach for aggregating multiple information sources for prediction tasks. Like Wen et al. (2016a), we explore the use of an attention mechanism to combine the tracker belief states, in which the policy network in Equation 1 is modified as

$$m^j_t = \tanh\Big(W_{zm} z_t + W_{xm} x_t + \sum_{s \in G} \alpha^j_s W^s_{pm} p^s_t\Big),$$

where the attention weights $\alpha^j_s$ are calculated using $v_t = z_t + x_t$, with matrix $W_r$ and vector $r$ as parameters to learn.
Belief Representation The effect of different belief state representations on the end performance is also studied. For user informable slots, the full belief state $p^s_t$ is the original state containing all categorical values; the summary belief state contains only three components: the summed value of all categorical probabilities, the probability that the user said they "don't care" about this slot, and the probability that the slot has not been mentioned. For user requestable slots, on the other hand, the full belief state is the same as the summary belief state because the slot values are binary rather than categorical.
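The reduction from a full to a summary belief state for an informable slot can be sketched as below. The function name `summarise_belief` and the keys "dontcare" and "none" used for the two special values are assumptions made for this example.

```python
from typing import Dict, List


def summarise_belief(full: Dict[str, float]) -> List[float]:
    """Collapse a full informable-slot belief state into three components.

    Returns [sum of categorical value probabilities,
             P(user does not care), P(slot not mentioned)].
    """
    special = ("dontcare", "none")  # assumed key names
    categorical = sum(p for value, p in full.items() if value not in special)
    return [categorical, full.get("dontcare", 0.0), full.get("none", 0.0)]


# Hypothetical belief state for the food slot.
print(summarise_belief({"chinese": 0.5, "indian": 0.25,
                        "dontcare": 0.125, "none": 0.125}))
```

The summary discards which value is most likely, which is harmless here because generation happens in skeletal form and the value is filled in afterwards.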

Snapshot Learning
Learning conditional generation models from sequential supervision signals can be difficult, because it requires the model to learn both long-term word dependencies and potentially distant source encoding functions. To mitigate this difficulty, we introduce a novel method called snapshot learning to create a vector of binary labels $\Upsilon^j_t \in \{0, 1\}^d$, $d < \dim(m^j_t)$, as the snapshot of the remaining part of the output sentence $T_{t,j:|T_t|}$ from generation step $j$. Each element of the snapshot vector is an indicator function of a certain event that will happen in the future, which can be obtained either from the system response or dialogue context at training time. A companion cross entropy error is then computed to force a subset of the conditioning vector $\hat{m}^j_t \subset m^j_t$ to be close to the snapshot vector (Equation 2), where $H(\cdot)$ is the cross entropy function applied to corresponding elements of the vectors $\Upsilon^j_t$ and $\hat{m}^j_t$. In order to make the tanh activations of $\hat{m}^j_t$ compatible with the 0-1 snapshot labels, we squeeze each value of $\hat{m}^j_t$ by adding 1 and dividing by 2 before computing the cost.

Figure 2: The idea of snapshot learning. The snapshot vector was trained with additional supervisions on a set of indicator functions heuristically labelled using the system response.
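As an illustration of the companion cost for one generation step, the following Python sketch squeezes the tanh activations into the unit interval and computes a binary cross entropy against the snapshot labels. The function name `snapshot_cost`, the averaging over elements, and the clipping constant `eps` are assumptions of this sketch and may differ from the normalisation used in Equation 2.

```python
import math
from typing import List


def snapshot_cost(m_hat: List[float], labels: List[int],
                  eps: float = 1e-12) -> float:
    """Mean binary cross entropy between squeezed activations and 0/1 labels.

    m_hat holds tanh activations in (-1, 1) of the supervised subset of the
    conditioning vector; labels is the snapshot vector for the same step.
    """
    if not labels or len(m_hat) != len(labels):
        raise ValueError("m_hat and labels must be non-empty and equal length")

    total = 0.0
    for activation, label in zip(m_hat, labels):
        p = (activation + 1.0) / 2.0     # add 1, divide by 2: maps to (0, 1)
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(labels)
```

An activation of zero corresponds to a probability of one half, so a single neuron at rest incurs a cost of log 2 regardless of its label; the cost falls as the neuron saturates towards the correct side.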
The indicator functions we use in this work have two forms: (1) whether a particular slot value (e.g., [v.food]) is going to occur, and (2) whether the system has offered a venue, as shown in Figure 2. The offer label in the snapshot is produced by checking the delexicalised name token ([v.name]) in the entire dialogue. If it has occurred, every label in subsequent turns is labelled with 1. Otherwise it is labelled with 0. To create snapshot targets for a particular slot value, the output sentence is matched with the corresponding delexicalised token turn by turn, per generation step. At each generation step, the target is labelled with 0 if that delexicalised token has been generated; otherwise it is set to 1. However, for the models without attention, the targets within a turn are set to the same value at every generation step, because the conditioning vector cannot learn the dynamically changing behaviour without attention.
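The labelling of slot-value targets can be sketched as follows. The function name `slot_value_targets` and the `dynamic` flag are hypothetical. The sketch assumes that a token absent from the sentence is labelled 0 throughout, that the label at the step where the token is emitted is still 1, and that models without attention receive the constant label "token occurs somewhere in this turn".

```python
from typing import List


def slot_value_targets(tokens: List[str], slot_token: str,
                       dynamic: bool = True) -> List[int]:
    """Return one snapshot label per generation step for slot_token."""
    if not dynamic:
        # Without attention: the same label at every step of the turn.
        return [int(slot_token in tokens)] * len(tokens)
    # With attention: 1 while the token is still to come, 0 once generated.
    return [int(slot_token in tokens[j:]) for j in range(len(tokens))]


# Hypothetical skeletal response.
response = ["[v.name]", "serves", "[v.food]", "food", "."]
print(slot_value_targets(response, "[v.food]"))
print(slot_value_targets(response, "[v.food]", dynamic=False))
```

The dynamic labels switch off after the token appears, which gives an attention-based model a per-step signal about what remains to be said.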

Experiments
Dataset The dataset used in this work was collected in the Wizard-of-Oz online data collection described by Wen et al. (2016a), in which the task of the system is to assist users to find a restaurant in the Cambridge, UK area. There are three informable slots (food, pricerange, area) that users can use to constrain the search and six requestable slots (address, phone, postcode plus the three informable slots) that the user can ask a value for once a restaurant has been offered. There are 676 dialogues in the dataset (including both finished and unfinished dialogues) and approximately 2750 turns in total. The database contains 99 unique restaurants.

Table 1: Performance comparison of different model architectures, belief state representations, and snapshot learning. The numbers to the left and right of the / sign are learning without and with snapshot, respectively. The model with the best performance on a particular metric (column) is shown in bold face.

Training The training procedure was divided into two stages. Firstly, the belief tracker parameters $\theta_b$ were pre-trained using cross entropy errors between tracker labels and predictions. Having fixed the tracker parameters, the remaining parts of the model $\theta_{\setminus b}$ are trained using the cross entropy errors between the output token targets $y^t_j$ and predictions $p^t_j$ of the generation network LM (at output step $j$ of turn $t$), combined with the snapshot cost $L_{ss}(\cdot)$ from Equation 2 weighted by a trade-off parameter $\lambda$, which we set to 1 for all models trained with snapshot learning. We treated each dialogue as a batch and used stochastic gradient descent with a small l2 regularisation term to train the model. The collected corpus was partitioned into training, validation, and testing sets in the ratio 3:1:1. Early stopping was implemented based on the validation set considering only LM log-likelihoods. Gradient clipping was set to 1. The hidden layer sizes were set to 50, and the weights, including word embeddings, were randomly initialised between -0.3 and 0.3. The vocabulary size is around 500 for both input and output, in which rare words and words that can be delexicalised have been removed.
Decoding In order to compare models trained with different recipes rather than decoding strategies, we decode all the trained models with the average log probability of tokens in the sentence. We applied beam search with a beamwidth equal to 10; the search stops when an end-of-sentence token is generated. In order to consider language variability, we ran decoding until 5 candidates were obtained and performed evaluation on them.
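The scoring criterion can be illustrated with a small Python sketch that ranks finished hypotheses by their average token log probability. It covers only the ranking step, not the beam search itself, and the names `average_log_prob` and `rank_candidates` as well as the hypothesis layout are assumptions of the example.

```python
from typing import List, Tuple

Hypothesis = Tuple[List[str], List[float]]  # (tokens, per-token log probs)


def average_log_prob(log_probs: List[float]) -> float:
    """Length-normalised score of a finished hypothesis."""
    if not log_probs:
        raise ValueError("a hypothesis needs at least one token")
    return sum(log_probs) / len(log_probs)


def rank_candidates(hypotheses: List[Hypothesis], k: int = 5) -> List[Hypothesis]:
    """Return the k best hypotheses, highest average log probability first."""
    ranked = sorted(hypotheses,
                    key=lambda hyp: average_log_prob(hyp[1]),
                    reverse=True)
    return ranked[:k]
```

Averaging rather than summing removes the bias towards short outputs: two hypotheses with the same total log probability are ordered by their per-token score, so the longer one ranks first.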
Metrics We compared models trained with different recipes by performing a corpus-based evaluation in which the model is used to predict each system response in the held-out test set. Three evaluation metrics were used: BLEU score (on top-1 and top-5 candidates) (Papineni et al., 2002), slot matching rate, and objective task success rate (Su et al., 2015). The dialogue is marked as successful if both: (1) the offered entity matches the task that was specified to the user, and (2) the system answered all the associated information requests (e.g. what is the address?) from the user. The slot matching rate is the percentage of delexicalised tokens (e.g. [s.food] and [v.area]) that appear in the candidate and also appear in the reference. We computed the BLEU scores on the skeletal sentence forms before substituting with the actual entity values. All the results were averaged over 10 randomly initialised networks.
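A possible reading of the slot matching rate is sketched below in Python. The function name `slot_matching_rate`, the regular expression used to recognise delexicalised tokens, pooling the counts over the whole corpus, and returning 0.0 when no candidate contains such a token are all assumptions of this sketch.

```python
import re
from typing import List, Tuple

# Assumed token format: [s.slot] for slot names and [v.slot] for slot values.
DELEX_TOKEN = re.compile(r"\[[sv]\.\w+\]")


def slot_matching_rate(pairs: List[Tuple[str, str]]) -> float:
    """Percentage of candidate delexicalised tokens found in the reference.

    pairs is a list of (candidate, reference) skeletal sentences.
    """
    matched = 0
    total = 0
    for candidate, reference in pairs:
        reference_tokens = set(DELEX_TOKEN.findall(reference))
        for token in DELEX_TOKEN.findall(candidate):
            total += 1
            matched += token in reference_tokens
    return 100.0 * matched / total if total else 0.0


# Hypothetical candidate/reference pairs.
pairs = [
    ("[v.name] serves [v.food] food",
     "[v.name] is a [v.food] restaurant in the [v.area]"),
    ("the [s.phone] is [v.phone]", "their [s.address] is [v.address]"),
]
print(slot_matching_rate(pairs))
```

Unlike BLEU, this measure ignores the wording around the tokens, so it isolates whether the generator mentioned the right slots.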
Results Table 1 shows the evaluation results. The numbers to the left and right of each table cell are the same model trained w/o and w/ snapshot learning. The first observation is that snapshot learning consistently improves on most metrics regardless of the model architecture. This is especially true for BLEU scores. We think this may be attributed to the more discriminative conditioning vector learned through the snapshot method, which makes the learning of the conditional LM easier.
In the first block (belief state representation), we compare the effect of two different belief representations. As can be seen, using a succinct representation is better (summary > full) because the identity of each categorical value in the belief state does not help when the generation decisions are made in skeletal form. In fact, the full belief state representation may encourage the model to learn incorrect coadaptation among features when the data is scarce.
In the conditional architecture block, we compare the three different conditional generation architectures as described in section 3.2.1. This result shows that the language model type (lm) and memory type (mem) networks perform better in terms of BLEU score and slot matching rate, while the hybrid type (hybrid) networks achieve higher task success. This is probably due to the degree of separation between the LM and conditioning vector: a coupling approach (lm, mem) sacrifices the conditioning vector but learns a better LM and achieves higher BLEU, while a complete separation (hybrid) learns a better conditioning vector and offers higher task success.
Lastly, in the attention-based model block we train the three architectures with the attention mechanism and compare them again. Firstly, the characteristics of the three models we observed above also hold for attention-based models. Secondly, we found that the attention mechanism improves all three architectures on task success rate but not on BLEU scores. This is probably due to the limitations of using n-gram based metrics like BLEU to evaluate the generation quality (Stent et al., 2005).

Model Analysis
Gate Activations We first studied the average activation of each individual gate in the models by averaging them when running generation on the test set. We analysed the hybrid models because their reading gate to output gate activation ratio ($r_j/o_j$) shows a clear trade-off between the LM and the conditioning vector components. As can be seen in Table 2, we found that the average forget gate activations ($f_j$) and the ratio of the reading gate to the output gate activation ($r_j/o_j$) have strong correlations with performance: a better performance (row 3 > row 2 > row 1) seems to come from models that can learn a longer word dependency (higher forget gate $f_j$ activations) and a better conditioning vector (and therefore a higher reading to output gate ratio $r_j/o_j$).

Learned Attention
We have visualised the learned attention heat map of models trained with and without snapshot learning in Figure 3. The attention is on both the informable slot trackers (first three columns) and the requestable slot trackers (the other columns). We found that the model trained with snapshot learning (Figure 3b) seems to produce a more accurate and discriminative attention heat map compared to the one trained without it (Figure 3a). This may contribute to the better performance achieved by the snapshot learning approach.
Snapshot Neurons As mentioned earlier, snapshot learning forces a subspace of the conditioning vector $\hat{m}^j_t$ to become discriminative and interpretable. Two example generated sentences together with the snapshot neuron activations are shown in Figure 4. As can be seen, when generating words one by one, the neuron activations changed to detect the different events they were assigned by the snapshot training signals: e.g. in Figure 4b the light blue and orange neurons switched their dominant roles when the token [v.address] was generated; the offered neuron is in a high activation state in Figure 4b because the system was offering a venue, while in Figure 4a it is not activated because the system was still helping the user to find a venue. More examples can be found in the Appendix.

Conclusion and Future Work
This paper has investigated different conditional generation architectures and a novel method called snapshot learning to improve response generation in a neural dialogue system framework. The results showed three major findings. Firstly, although the hybrid type model did not rank highest on all metrics, it is nevertheless preferred because it achieved the highest task success and also provided more interpretable results. Secondly, snapshot learning provided gains on virtually all metrics regardless of the architecture used. The analysis suggested that the benefit of snapshot learning mainly comes from the more discriminative and robust subspace representation learned from the heuristically labelled companion signals, which in turn facilitates optimisation of the final target objective. Lastly, the results suggested that making a complex system more interpretable at different levels not only helps our understanding but also leads to the highest success rates.
However, there is still much work left to do. This work focused on conditional generation architectures and snapshot learning in the scenario of generating dialogue responses. It would be very helpful if the same comparison could be conducted in other application domains such as machine translation or image caption generation so that a wider view of the effectiveness of these approaches can be assessed.
Figure 1: Three different conditional generation architectures.
Figure 3: Learned attention heat maps over trackers. The first three columns in each figure are informable slot trackers and the rest are requestable slot trackers. The generation model is the hybrid type LSTM.

Figure 4: Two example responses generated from the hybrid model trained with snapshot and attention. Each line represents a neuron that detects a particular snapshot event.
The language model type network is shown in Figure 1a. Since it does not differ significantly from the original LSTM, we call it the language model type (lm) conditional generation network.

Table 2: Average activation of gates on test set.