A Context-aware Natural Language Generator for Dialogue Systems

We present a novel natural language generation system for spoken dialogue systems capable of entraining (adapting) to users' way of speaking, providing contextually appropriate responses. The generator is based on recurrent neural networks and the sequence-to-sequence approach. It is fully trainable from data which include preceding context along with responses to be generated. We show that the context-aware generator yields significant improvements over the baseline in both automatic metrics and a human pairwise preference test.


Introduction
In a conversation, speakers are influenced by previous utterances of their counterparts and tend to adapt (align, entrain) their way of speaking to each other, reusing lexical items as well as syntactic structure (Reitter et al., 2006). Entrainment occurs naturally and subconsciously, facilitates successful conversations (Friedberg et al., 2012; Nenkova et al., 2008), and forms a natural source of variation in dialogues. In spoken dialogue systems (SDS), users were reported to entrain to system prompts (Parent and Eskenazi, 2010).
The function of natural language generation (NLG) components in task-oriented SDS typically is to produce a natural language sentence from a dialogue act (DA) (Young et al., 2010) representing an action, such as inform or request, along with one or more attributes (slots) and their values (see Fig. 1). NLG is an important component of SDS which has a great impact on the perceived naturalness of the system; its quality can also influence the overall task success (Stoyanchev and Stent, 2009; Lopes et al., 2013). However, NLG systems in SDS usually take only the input DA into account and have no way of adapting to the user's way of speaking. To avoid repetition and add variation to the outputs, they typically alternate between a handful of preset variants (Jurčíček et al., 2014) or use overgeneration and random sampling from a k-best list of outputs (Wen et al., 2015b).
There have been several attempts at introducing entrainment into NLG in SDS, but they are limited to rule-based systems (see Section 4). We present a novel, fully trainable context-aware NLG system for SDS that is able to entrain to the user and provides naturally variable outputs because generation is conditioned not only on the input DA, but also on the preceding user utterance (see Fig. 1). Our system is an extension of Dušek and Jurčíček (2016b)'s generator based on sequence-to-sequence (seq2seq) models with attention (Bahdanau et al., 2015). It is, to our knowledge, the first fully trainable entrainment-enabled NLG system for SDS. We also present our first results on the dataset of Dušek and Jurčíček (2016a), which includes the preceding user utterance along with each data instance (i.e., pair of input meaning representation and output sentence), and we show that our context-aware system outperforms the baseline in both automatic metrics and a human pairwise preference test.
In the following, we first present the architecture of our generator (see Section 2), then give an account of our experiments in Section 3. We include a brief survey of related work in Section 4. Section 5 contains concluding remarks and plans for future work.

Our generator
Our seq2seq generator is an improved version of Dušek and Jurčíček (2016b)'s generator, which itself is based on the seq2seq model with attention (Bahdanau et al., 2015, see Fig. 2) as implemented in the TensorFlow framework (Abadi et al., 2015). We first describe the base model in Section 2.1, then list our context-aware improvements in Section 2.2.

Baseline Seq2seq NLG with Attention
The generation has two stages. The first, encoder stage uses a recurrent neural network (RNN) composed of long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997; Graves, 2013) to encode a sequence of input tokens $x = \{x_1, \dots, x_n\}$ into a sequence of hidden states $h = \{h_1, \dots, h_n\}$:
$$h_t = \mathrm{lstm}(x_t, h_{t-1}) \tag{1}$$
The second, decoder stage then uses the hidden states $h$ to generate the output sequence $y = \{y_1, \dots, y_m\}$. Its main component is a second LSTM-based RNN, which works over its own internal state $s_t$ and the previous output token $y_{t-1}$:
$$s_t = \mathrm{lstm}((y_{t-1} \circ c_t) W_S, s_{t-1}) \tag{2}$$
It is initialized by the last hidden encoder state ($s_0 = h_n$) and a special starting symbol. The generated output token $y_t$ is selected from a softmax distribution:
$$p(y_t \mid y_1 \dots y_{t-1}, x) = \mathrm{softmax}((s_t \circ c_t) W_Y) \tag{3}$$
In (2) and (3), $c_t$ represents the attention model, a sum over all encoder hidden states weighted by a feed-forward network with one tanh hidden layer; $W_S$ and $W_Y$ are linear projection matrices and "$\circ$" denotes concatenation.
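For reference, the attention model described above is commonly written as below. This follows the standard formulation of Bahdanau et al. (2015); the parameter symbols $v_a$, $W_a$, and $U_a$ are notation assumed here for illustration and are not taken from this paper, and the exact parametrization in the implementation may differ.

```latex
e_{t,i} = v_a^{\top} \tanh\left(W_a s_{t-1} + U_a h_i\right), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}, \qquad
c_t = \sum_{i=1}^{n} \alpha_{t,i} \, h_i
```

Here $e_{t,i}$ is the score produced by the feed-forward network with one tanh hidden layer, and $\alpha_{t,i}$ are the normalized weights over the encoder hidden states.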
DAs are represented as sequences on the encoder input: a triple of the structure "DA type, slot, value" is created for each slot in the DA and the triples are concatenated (see Fig. 2). The generator supports greedy decoding as well as beam search, which keeps track of the top k most probable output sequences at each time step (Sutskever et al., 2014; Bahdanau et al., 2015).
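The DA encoding described above can be illustrated with a minimal sketch. The function name, the slot names, and the placeholder values are hypothetical and chosen only for illustration; the actual tokenization in the released generator may differ.

```python
def da_to_tokens(da_type, slots):
    """Flatten a DA into a token sequence of (DA type, slot, value) triples.

    da_type: string such as "inform"
    slots:   list of (slot, value) pairs; names here are hypothetical
    """
    tokens = []
    for slot, value in slots:
        # One triple per slot; triples are simply concatenated.
        tokens.extend([da_type, slot, value])
    return tokens


# Hypothetical delexicalized DA: inform(from_stop=X-from, to_stop=X-to)
encoder_input = da_to_tokens("inform", [("from_stop", "X-from"), ("to_stop", "X-to")])
```

The resulting flat sequence is what the encoder would read token by token.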
The generator further features a simple content classification reranker to penalize irrelevant or missing information on the output. It uses an LSTM-based RNN to encode the generator outputs token-by-token into a fixed-size vector. This is then fed to a sigmoid classification layer that outputs a binary vector indicating the presence of all possible DA types, slots, and values. The vectors for all k-best generator outputs are then compared to the input DA and the number of missing and irrelevant elements is used to rerank them.
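The reranking step can be sketched as follows, assuming the classifier's predictions have already been thresholded into 0/1 vectors. How the penalty is combined with the generator score is an assumption here (the distance times a penalty weight is subtracted from the log probability); all function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ,
    i.e., missing plus irrelevant DA elements."""
    return sum(x != y for x, y in zip(a, b))


def rerank_by_content(kbest, input_vec, penalty=100.0):
    """Rerank k-best outputs by penalizing content mismatches.

    kbest:     list of (tokens, log_prob, predicted_vec) triples
    input_vec: binary vector encoding the input DA
    penalty:   weight per mismatching element (assumed combination rule)
    """
    rescored = [
        (tokens, log_prob - penalty * hamming(pred, input_vec))
        for tokens, log_prob, pred in kbest
    ]
    # Highest rescored log probability first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

With a large penalty weight, an output that covers the input DA exactly outranks a more probable output with a missing or superfluous element.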

Making the Generator Context-aware
We implemented three different modifications to our generator that make its output dependent on the preceding context:
Prepending context. The preceding user utterance is simply prepended to the DA and fed into the encoder (see Fig. 2). The dictionary for context utterances is distinct from the DA tokens dictionary.
Context encoder. We add another, separate encoder for the context utterances. The hidden states of both encoders are concatenated, and the decoder then works with double-sized vectors both on the input and in the attention model (see Fig. 2).
n-gram match reranker. We added a second reranker for the k-best outputs of the generator that promotes outputs that have a word or phrase overlap with the context utterance. We use the geometric mean of modified n-gram precisions (with $n \in \{1, 2\}$) as a measure of context overlap, i.e., BLEU-2 (Papineni et al., 2002) without brevity penalty. The log probability $l$ of an output sequence on the generator k-best list is updated as follows:
$$l = l + w \cdot \sqrt{p_1 p_2} \tag{4}$$
In (4), $p_1$ and $p_2$ are modified unigram and bigram precisions of the output sequence against the context, and $w$ is a preset weight. We believe that any reasonable measure of contextual match would be viable here, and we opted for modified n-gram precisions because of simple computation, well-defined range, and the relation to the de facto standard BLEU metric. We only use unigrams and bigrams to promote especially the reuse of single words or short phrases.
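A minimal sketch of the n-gram match rescoring, assuming the update adds the weight times the geometric mean of the modified unigram and bigram precisions to the log probability. The function names and the example sentences are hypothetical, and the tokenization is a plain whitespace split.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(output, context, n):
    """BLEU-style modified n-gram precision of the output against the context:
    each output n-gram count is clipped by its count in the context."""
    out_counts = Counter(ngrams(output, n))
    if not out_counts:
        return 0.0
    ctx_counts = Counter(ngrams(context, n))
    clipped = sum(min(count, ctx_counts[gram]) for gram, count in out_counts.items())
    return clipped / sum(out_counts.values())


def rescore(log_prob, output, context, w):
    """Add w times the geometric mean of unigram and bigram precision."""
    p1 = modified_precision(output, context, 1)
    p2 = modified_precision(output, context, 2)
    return log_prob + w * math.sqrt(p1 * p2)


context = "is there an option at ten".split()
candidate = "there is an option".split()
new_score = rescore(-2.0, candidate, context, w=1.0)
```

Because both precisions lie in [0, 1], the bonus is bounded by the weight, and an output sharing no unigram or no bigram with the context receives no bonus at all.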
In addition, we combine the n-gram match reranker with each of the two former approaches.
We used gold-standard transcriptions of the immediately preceding user utterance in our experiments in order to test the context-aware capabilities of our system in a stand-alone setting; in a live SDS, 1-best speech recognition hypotheses and longer user utterance history can be used with no modifications to the architecture.

Experiments
We experiment on the publicly available dataset of Dušek and Jurčíček (2016a) for NLG in the public transport information domain, which includes preceding context along with each pair of input DA and target natural language sentence. It contains over 5,500 utterances, i.e., three paraphrases for each of the over 1,800 combinations of input DA and context user utterance. The data concern bus and subway connections in Manhattan and comprise four DA types (iconfirm, inform, inform_no_match, request). They are delexicalized for generation to avoid sparsity, i.e., stop names, vehicles, times, etc., are replaced by placeholders (Wen et al., 2015a). We applied a 3:1:1 split of the set into training, development, and test data. We use the three paraphrases as separate instances in training data, but they serve as three references for a single generated output in validation and evaluation.
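The data preparation can be sketched as follows. This is an illustration under assumptions: we assume the split is made over combinations of context and DA (so that all paraphrases of one combination land in the same portion), and the function name, the tuple layout, and the fixed random seed are hypothetical.

```python
import random
from collections import defaultdict


def split_dataset(instances, ratios=(3, 1, 1), seed=0):
    """Split (context, da, sentence) instances into train/dev/test portions.

    Paraphrases are grouped by (context, da). Training data keeps each
    paraphrase as a separate instance; development and test data keep the
    paraphrases together as multiple references for one input.
    """
    groups = defaultdict(list)
    for context, da, sentence in instances:
        groups[(context, da)].append(sentence)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    total = sum(ratios)
    n_train = len(keys) * ratios[0] // total
    n_dev = len(keys) * ratios[1] // total
    train_keys = keys[:n_train]
    dev_keys = keys[n_train:n_train + n_dev]
    test_keys = keys[n_train + n_dev:]

    train = [(c, d, s) for c, d in train_keys for s in groups[(c, d)]]
    dev = [(c, d, groups[(c, d)]) for c, d in dev_keys]
    test = [(c, d, groups[(c, d)]) for c, d in test_keys]
    return train, dev, test
```

Grouping before splitting keeps paraphrases of a development or test input out of the training portion.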
We test the three context-aware setups described in Section 2.2 and their combinations, and we compare them against the baseline non-context-aware seq2seq generator. As in Dušek and Jurčíček (2016b), we train the seq2seq models by minimizing cross-entropy on the training set using the Adam optimizer (Kingma and Ba, 2015), and we measure BLEU on the development set after each pass over the training data, selecting the best-performing parameters. Based on our preliminary experiments on development data, we use embedding size 50, LSTM cell size 128, learning rate 0.0005, and batch size 20. Training is run for at least 50 and up to 1000 passes, with early stopping if the top 10 validation BLEU scores do not change for 100 passes. The content classification reranker is trained in a similar fashion, measuring misclassification on both training and development set after each pass. We use 5 different random initializations of the networks and average the results.
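The early stopping rule can be sketched as below. This is one possible reading of the criterion (training stops once the set of top 10 validation BLEU scores has stayed unchanged for 100 passes, but never before pass 50); the function name and the `run_pass` callback are hypothetical.

```python
import heapq


def train_with_early_stopping(run_pass, min_passes=50, max_passes=1000,
                              top_k=10, patience=100):
    """Run training passes until the top-k validation scores stop changing.

    run_pass(pass_no) is assumed to train for one pass over the data and
    return the validation BLEU. Returns (best pass, best score, last pass).
    """
    top_scores = []                      # min-heap holding the top_k best scores
    best_score, best_pass = float("-inf"), None
    last_change = 0                      # last pass at which the top-k set changed
    for pass_no in range(1, max_passes + 1):
        bleu = run_pass(pass_no)
        if bleu > best_score:
            best_score, best_pass = bleu, pass_no
        if len(top_scores) < top_k:
            heapq.heappush(top_scores, bleu)
            last_change = pass_no
        elif bleu > top_scores[0]:
            heapq.heapreplace(top_scores, bleu)
            last_change = pass_no
        if pass_no >= min_passes and pass_no - last_change >= patience:
            break
    return best_pass, best_score, pass_no
```

In practice the parameters saved at the best pass would be restored after the loop ends.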
Decoding is run with a beam size of 20 and the penalty weight for the content classification reranker set to 100. We set the n-gram match reranker weight based on experiments on development data.

Evaluation Using Automatic Metrics
Table 1 lists our results on the test data in terms of the BLEU and NIST metrics (Papineni et al., 2002; Doddington, 2002). We can see that while the n-gram match reranker brings a BLEU score improvement, using context prepending or a separate encoder results in scores lower than the baseline. However, using the n-gram match reranker together with context prepending or a separate encoder brings significant improvements of about 2.8 BLEU points in both cases, better than using the n-gram match reranker alone. We believe that adding the context information into the decoder does increase the chances of contextually appropriate outputs appearing on the decoder k-best lists, but it also introduces a lot more uncertainty, and the appropriate outputs therefore may not end up on top of the list based on decoder scores alone. The n-gram match reranker is then able to promote the relevant outputs to the top of the k-best list. However, if the generator itself does not have access to context information, the n-gram match reranker has a smaller effect as contextually appropriate outputs may not appear on the k-best lists at all. A closer look at the generated outputs confirms that entrainment is present in sentences generated by the context-aware setups (see Table 2).
In addition to BLEU and NIST scores, we measured the slot error rate ERR (Wen et al., 2015b), i.e., the proportion of missing or superfluous slot placeholders in the delexicalized generated outputs.For all our setups, ERR stayed around 3%.
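The slot error rate can be computed along the following lines. This sketch assumes the normalization of Wen et al. (2015b), i.e., missing plus superfluous placeholders divided by the number of placeholders required by the input DAs; the `X-` placeholder prefix and the function name are hypothetical.

```python
from collections import Counter


def slot_error_rate(instances, prefix="X-"):
    """Proportion of missing or superfluous slot placeholders.

    instances: iterable of (required_placeholders, output_tokens) pairs,
               where output tokens starting with `prefix` count as placeholders.
    """
    errors = 0
    total = 0
    for required, output in instances:
        req = Counter(required)
        gen = Counter(tok for tok in output if tok.startswith(prefix))
        missing = sum((req - gen).values())      # required but not generated
        superfluous = sum((gen - req).values())  # generated but not required
        errors += missing + superfluous
        total += sum(req.values())
    return errors / total if total else 0.0
```

Counter subtraction keeps only positive counts, so a placeholder generated twice but required once contributes exactly one superfluous slot.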

Human Evaluation
We evaluated the best-performing setting based on BLEU/NIST scores, i.e., prepending context with n-gram match reranker, in a blind pairwise preference test with untrained judges recruited on the CrowdFlower crowdsourcing platform. The judges were given the context and the system output for the baseline and the context-aware system, and they were asked to pick the variant that sounds more natural. We used a random sample of 1,000 pairs of different system outputs over all 5 random initializations of the networks, and collected 3 judgments for each of them. The judges preferred the context-aware system output in 52.5% of cases, significantly more than the baseline.
We examined the judgments in more detail and found three probable causes for the rather small difference between the setups. First, both setups' outputs fit the context relatively well in many cases and the judges tend to prefer the overall more frequent variant (e.g., for the context "starting from Park Place", the output "Where do you want to go?" is preferred over "Where are you going to?"). Second, the context-aware setup often selects a shorter response that fits the context well (e.g., "Is there an option at 10:00 am?" is confirmed simply with "At 10:00 am."), but the judges seem to prefer the more eloquent variant. And third, both setups occasionally produce non-fluent outputs, which introduces a certain amount of noise.

Related Work
Our system is an evolutionary improvement over the LSTM seq2seq system of Dušek and Jurčíček (2016b) and as such, it is most related in terms of architecture to other recent RNN-based approaches to NLG, which are not context-aware: RNN generation with a convolutional reranker by Wen et al. (2015a) and an improved LSTM-based version (Wen et al., 2015b), as well as the LSTM encoder-aligner-decoder NLG system of Mei et al. (2015). The recent end-to-end trainable SDS of Wen et al. (2016) does have implicit access to previous context, but the authors do not focus on its influence on the generated responses.
There have been several attempts at modelling entrainment in dialogue (Brockmann et al., 2005; Reitter et al., 2006; Buschmeier et al., 2010) and even successful implementations of entrainment models in NLG systems for SDS, where entrainment caused an increase in perceived naturalness of the system responses (Hu et al., 2014) or increased naturalness and task success (Lopes et al., 2013; Lopes et al., 2015). However, all of the previous approaches are completely or partially rule-based. Most of them attempt to model entrainment explicitly, focus on specific entrainment phenomena only, and/or require manually selected lists of variant expressions, while our system learns synonyms and entrainment rules implicitly from the corpus. A direct comparison with previous entrainment-capable NLG systems for SDS is not possible in our stand-alone setting since their rules involve the history of the whole dialogue whereas we focus on the preceding utterance in our experiments.

Conclusions and Further Work
We presented an improvement to our natural language generator based on the sequence-to-sequence approach (Dušek and Jurčíček, 2016b), allowing it to exploit preceding context user utterances to adapt (entrain) to the user's way of speaking and provide more contextually accurate and less repetitive responses. We used two different ways of feeding previous context into the generator and a reranker based on n-gram match against the context. Evaluation on our context-aware dataset (Dušek and Jurčíček, 2016a) showed a significant BLEU score improvement for the combination of the two approaches, which was confirmed in a subsequent human pairwise preference test. Our generator is available on GitHub at the following URL: https://github.com/UFAL-DSG/tgen
In future work, we plan on improving the n-gram matching metric to allow fuzzy matching (e.g., capturing different forms of the same word), experimenting with more ways of incorporating context into the generator, controlling the output eloquence and fluency, and most importantly, evaluating our generator in a live dialogue system. We also intend to evaluate the generator with automatic speech recognition hypotheses as context and modify it to allow n-best hypotheses as contexts. Using our system in a live SDS will also allow a comparison against previous handcrafted entrainment-capable NLG systems.

Figure 1: An example of NLG input and output, with context-aware additions.

Table 1: BLEU and NIST scores of different generator setups on the test data.

Table 2: Example outputs of the different setups of our generator (with entrainment highlighted).