Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings

We present a natural language generator based on the sequence-to-sequence approach that can be trained to produce natural language strings as well as deep syntax dependency trees from input dialogue acts, and we use it to directly compare two-step generation with separate sentence planning and surface realization stages to a joint, one-step approach. We were able to train both setups successfully using very little training data. The joint setup offers better performance, surpassing the state of the art with regard to n-gram-based scores while providing more relevant outputs.


Introduction
In spoken dialogue systems (SDS), the task of natural language generation (NLG) is to convert a meaning representation (MR) produced by the dialogue manager into one or more sentences in a natural language. It is traditionally divided into two subtasks: sentence planning, which decides on the overall sentence structure, and surface realization, which determines the exact word forms and linearizes the structure into a string (Reiter and Dale, 2000). While some generators keep this division and use a two-step pipeline (Walker et al., 2001; Rieser et al., 2010; Dethlefs et al., 2013), others apply a joint model for both tasks (Wong and Mooney, 2007; Konstas and Lapata, 2013).
We present a new, conceptually simple NLG system for SDS that is able to operate in both modes: it either produces natural language strings or generates deep syntax dependency trees, which are subsequently processed by an external surface realizer (Dušek et al., 2015). This allows us to show a direct comparison of two-step generation, where sentence planning and surface realization are separated, with a joint, one-step approach.
Our generator is based on the sequence-to-sequence (seq2seq) generation technique (Cho et al., 2014; Sutskever et al., 2014), combined with beam search and an n-best list reranker to suppress irrelevant information in the outputs. Unlike most previous NLG systems for SDS (e.g., Stent et al., 2004; Raux et al., 2005; Mairesse et al., 2010), it is trainable from unaligned pairs of MR and sentences alone. We experiment with using much less training data than recent systems based on recurrent neural networks (RNN) (Wen et al., 2015b; Mei et al., 2015), and we find that our generator learns successfully to produce both strings and deep syntax trees on the BAGEL restaurant information dataset (Mairesse et al., 2010). It is able to surpass n-gram-based scores achieved previously by Dušek and Jurčíček (2015), offering a simpler setup and more relevant outputs.
We introduce the generation setting in Section 2 and describe our generator architecture in Section 3. Section 4 details our experiments, and Section 5 analyzes the results. We summarize related work in Section 6 and offer conclusions in Section 7.

Generator Setting
The inputs to our generator are dialogue acts (DA) (Young et al., 2010), each representing an action, such as inform or request, along with one or more attributes (slots) and their values. Our generator operates in two modes, producing either deep syntax trees (Dušek et al., 2012) or natural language strings (see Fig. 1). The first mode corresponds to the sentence planning NLG stage as it decides the syntactic shape of the output sentence; the resulting deep syntax tree involves content words (lemmas) and their syntactic form (formemes, purple in Fig. 1). The trees are linearized to strings using a surface realizer from the TectoMT translation system (Dušek et al., 2015). The second generator mode joins sentence planning and surface realization into one step, producing natural language sentences directly.
Both modes offer their advantages: the two-step mode simplifies generation by abstracting away from complex surface syntax and morphology, which can be handled by a handcrafted, domain-independent module to ensure grammatical correctness at all times (Dušek and Jurčíček, 2015), whereas the joint mode does not need to model structure explicitly and avoids accumulating errors along the pipeline (Konstas and Lapata, 2013).

The Seq2seq Generation Model
Our generator is based on the seq2seq approach (Cho et al., 2014; Sutskever et al., 2014), a type of encoder-decoder RNN architecture operating on variable-length sequences of tokens. We address the necessary conversion of input DA and output trees/sentences into sequences in Section 3.1 and then describe the main seq2seq component in Section 3.2. It is supplemented by a reranker, as explained in Section 3.3.

Sequence Representation of DA, Trees, and Sentences
We represent DA, deep syntax trees, and sentences as sequences of tokens to enable their usage in the sequence-based RNN components of our generator (see Sections 3.2 and 3.3). Each token is represented by its embedding, a vector of floating-point numbers (Bengio et al., 2003).
To form a sequence representation of a DA, we create a triple of the structure "DA type, slot, value" for each slot in the DA and concatenate the triples (see Fig. 3). The deep syntax tree output from the seq2seq generator is represented in a bracketed notation similar to the one used by Vinyals et al. (2015, see Fig. 2). The inputs to the reranker are always a sequence of tokens; structure is disregarded in trees, resulting in a list of lemma-formeme pairs (see Fig. 2).
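The two conversions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `da_to_tokens` and `tree_to_reranker_tokens` are hypothetical, the bracketed tree is assumed to be a whitespace-separated string as in the Figure 2 example, and dropping the technical `<root>` node for the reranker is inferred from that example rather than taken from the released code.

```python
def da_to_tokens(da_type, slots):
    """Flatten a DA into concatenated (DA type, slot, value) triples.

    `slots` is a list of (slot, value) pairs; one triple is emitted per slot.
    """
    tokens = []
    for slot, value in slots:
        tokens.extend([da_type, slot, value])
    return tokens


def tree_to_reranker_tokens(bracketed):
    """Discard tree structure from a bracketed tree string.

    Brackets and the technical <root> node are dropped (an assumption based
    on the Figure 2 example), leaving lemma-formeme pairs in their order.
    """
    tokens = [tok for tok in bracketed.split() if tok not in ("(", ")")]
    return [tok for tok in tokens if tok != "<root>"]


da_tokens = da_to_tokens(
    "inform",
    [("name", "X-name"), ("eattype", "restaurant"), ("food", "French")],
)

tree = ("( <root> <root> ( ( X-name n:subj ) be v:fin "
        "( ( Italian adj:attr ) restaurant n:obj ( river n:near+X ) ) ) )")
flat_tokens = tree_to_reranker_tokens(tree)
```

In the actual generator, each of the resulting tokens would then be mapped to its embedding vector before entering the RNN.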
The decoder stage then uses the hidden states to generate a sequence $y = \{y_1, \dots, y_m\}$ with a second LSTM-based RNN. The probability of each output token is defined as:

$$p(y_t \mid y_1, \dots, y_{t-1}, x) = \mathrm{softmax}((s_t \circ c_t) W_Y)$$

Here, $x$ is the input sequence and $s_t$ is the decoder state, where $s_0 = h_n$ and $s_t = \mathrm{lstm}((y_{t-1} \circ c_t) W_S, s_{t-1})$, i.e., the decoder is initialized by the last encoder hidden state and uses the previous output token at each step. $W_Y$ and $W_S$ are learned linear projection matrices and "$\circ$" denotes concatenation. $c_t$ is the context vector, a weighted sum of the encoder hidden states $c_t = \sum_{i=1}^{n} \alpha_{ti} h_i$, where $\alpha_{ti}$ corresponds to an alignment model, represented by a feed-forward network with a single tanh hidden layer.
On top of this basic seq2seq model, we implemented a simple beam search for decoding (Sutskever et al., 2014; Bahdanau et al., 2015). It proceeds left-to-right and keeps track of log probabilities of the top n possible output sequences, expanding them one token at a time.
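To make the context vector concrete, the following minimal Python sketch computes a weighted sum of encoder hidden states. It assumes that the raw alignment scores have already been produced by the alignment network and that they are normalized with a softmax, as is usual in attention models of this kind; the normalization step and the names `softmax` and `context_vector` are assumptions for illustration, not details confirmed by the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(encoder_states, alignment_scores):
    """Weighted sum of encoder hidden states.

    encoder_states: list of n hidden-state vectors (lists of floats).
    alignment_scores: n raw scores, assumed to come from the alignment
    network; they are turned into attention weights by a softmax.
    """
    alphas = softmax(alignment_scores)
    dim = len(encoder_states[0])
    return [
        sum(alpha * h[d] for alpha, h in zip(alphas, encoder_states))
        for d in range(dim)
    ]


# Two encoder states with equal scores: the context is their average.
c_t = context_vector([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With strongly unequal scores, the context vector approaches the hidden state of the highest-scoring input position, which is what lets the decoder focus on one part of the DA at a time.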
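A left-to-right beam search of this kind can be sketched as follows. The decoder is replaced by a hypothetical callback `next_log_probs` that returns log probabilities of candidate next tokens for a given prefix; `toy_model`, the end-of-sequence symbol `</s>`, and the length limit are likewise assumptions made only to keep the example self-contained.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Keep the beam_size best partial sequences, one token at a time.

    next_log_probs(prefix) is a stand-in for the decoder: it returns a dict
    mapping each candidate next token to its log probability.
    """
    beam = [(0.0, [])]  # (cumulative log probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, seq in candidates[:beam_size]:
            # Completed hypotheses leave the beam and are stored separately.
            (finished if seq[-1] == eos else beam).append((score, seq))
        if not beam:
            break
    finished.extend(beam)  # hypotheses cut off at max_len, if any
    return sorted(finished, key=lambda c: c[0], reverse=True)


def toy_model(prefix):
    """Tiny hand-written distribution where greedy decoding is suboptimal."""
    if len(prefix) >= 2:
        return {"</s>": 0.0}
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return {tok: math.log(p) for tok, p in table[tuple(prefix)].items()}


greedy_best = beam_search(toy_model, beam_size=1, max_len=5)[0]
beam_best = beam_search(toy_model, beam_size=2, max_len=5)[0]
```

In the toy distribution, a beam of size 1 (greedy search) commits to the locally best first token, while a beam of size 2 recovers the sequence with the higher overall probability. The returned list is the n-best list that a reranker can then rescore.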

Reranker
To ensure that the output trees/strings correspond semantically to the input DA, we implemented a classifier to rerank the n-best beam search outputs and penalize those missing required information and/or adding irrelevant information. Similarly to Wen et al. (2015a), the classifier provides a binary decision for an output tree/string on the presence of all dialogue act types and slot-value combinations seen in the training data, producing a 1-hot vector.
[Figure 2 content. Seq2seq generator encoding: ( <root> <root> ( ( X-name n:subj ) be v:fin ( ( Italian adj:attr ) restaurant n:obj ( river n:near+X ) ) ) ). Reranker encoding: X-name n:subj be v:fin Italian adj:attr restaurant n:obj river n:near+X]

The input DA is converted to a similar 1-hot vector and the reranking penalty of the sentence is the Hamming distance between the two vectors (see Fig. 4). Weighted penalties for all sentences are subtracted from their n-best list log probabilities.
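The reranking step can be illustrated with a short Python sketch. It assumes a Hamming distance between the two binary vectors as the penalty. The keyword-spotting `toy_classify` function is a hypothetical stand-in for the trained RNN classifier, and the penalty weight of 10 is an arbitrary illustrative value, not one reported here.

```python
def hamming(a, b):
    """Number of positions in which two equally long binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def rerank(nbest, da_vector, classify, weight):
    """Subtract a weighted penalty from each n-best log probability.

    nbest: list of (log_prob, output) pairs from beam search.
    classify(output): returns a binary vector of the facts found in output.
    """
    rescored = [
        (logp - weight * hamming(classify(output), da_vector), output)
        for logp, output in nbest
    ]
    return sorted(rescored, key=lambda r: r[0], reverse=True)


# Hypothetical fact inventory and a keyword-based stand-in classifier.
FACTS = ["french", "italian", "riverside"]


def toy_classify(text):
    return [int(fact in text) for fact in FACTS]


nbest = [
    (-1.0, "x is an italian restaurant in the riverside area"),
    (-1.5, "x is a french restaurant in the riverside area"),
]
da_vector = [1, 0, 1]  # the input DA asks for food=French, area=riverside
reranked = rerank(nbest, da_vector, toy_classify, weight=10.0)
```

The first hypothesis is more probable according to the generator but differs from the DA in two positions (a superfluous fact and a missing one), so after the penalty is subtracted the semantically correct hypothesis moves to the top.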
We employ a similar architecture for the classifier as in our seq2seq generator encoder (see Section 3.2), with an RNN encoder operating on the output trees/strings and a single logistic layer for classification over the last encoder hidden state. Given an output sequence representing a string or a tree $y = \{y_1, \dots, y_n\}$ (cf. Section 3.1), the encoder again produces a sequence of hidden states $h = \{h_1, \dots, h_n\}$ where $h_t = \mathrm{lstm}(y_t, h_{t-1})$. The output binary vector $o$ is computed as:

$$o_i = \mathrm{round}(\mathrm{sigmoid}((h_n W_R + b)_i))$$

Here, $W_R$ is a learned projection matrix and $b$ is a corresponding bias term.
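A logistic layer over the last encoder hidden state could look like the following sketch. It assumes that the hidden state is a row vector multiplied by the projection matrix and that the sigmoid output is binarized at a threshold of 0.5. The function name `classify_last_state` and the toy weights are hypothetical.

```python
import math


def classify_last_state(h_last, W_R, b):
    """Binary decisions from a logistic layer over the last hidden state.

    h_last: last encoder hidden state (list of floats, length d).
    W_R: projection matrix as d rows of k floats; b: bias of length k.
    Each output is 1 if sigmoid(h_last . W_R + b) >= 0.5 (assumed threshold).
    """
    outputs = []
    for j in range(len(b)):
        z = sum(h_last[i] * W_R[i][j] for i in range(len(h_last))) + b[j]
        prob = 1.0 / (1.0 + math.exp(-z))
        outputs.append(1 if prob >= 0.5 else 0)
    return outputs


# Toy example with a 2-dimensional hidden state and 2 output decisions.
o = classify_last_state(
    h_last=[1.0, -1.0],
    W_R=[[2.0, -2.0], [0.0, 0.0]],
    b=[0.0, 0.5],
)
```

Each position of the resulting vector corresponds to one DA type or slot-value combination, so the vector can be compared directly with the vector derived from the input DA.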

Experiments
We perform our experiments on the BAGEL data set of Mairesse et al. (2010), which contains 202 DA from the restaurant information domain with two natural language paraphrases each, describing restaurant locations, price ranges, food types, etc. Some properties such as restaurant names or phone numbers are delexicalized (replaced with "X" symbols) to avoid data sparsity. Unlike Mairesse et al. (2010), we do not use manually annotated alignment of slots and values in the input DA to target words and phrases; we let the generator learn it from data, which simplifies training data preparation but makes our task harder. We lowercase the data and treat plural -s as separate tokens for generating into strings, and we apply automatic analysis from the Treex NLP toolkit (Popel and Žabokrtský, 2010) to obtain deep syntax trees for training tree-based generator setups. As in Mairesse et al. (2010), we apply 10-fold cross-validation, with 181 training DA and 21 testing DA. In addition, we reserve 10 DA from the training set for validation.

To train our seq2seq generator, we use the Adam optimizer (Kingma and Ba, 2015) to minimize unweighted sequence cross-entropy. We perform 10 runs with different random initialization of the network and up to 1,000 passes over the training data, validating after each pass and selecting the parameters that yield the highest BLEU score on the validation set. Neither beam search nor the reranker is used for validation.
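The data split described above can be checked with a small Python sketch. The partitioning scheme here (every tenth DA goes to the same fold, the first 10 remaining DAs are held out for validation) is purely an assumption for illustration; the actual folds follow Mairesse et al. (2010), and the function name `cross_validation_folds` is hypothetical.

```python
def cross_validation_folds(items, n_folds=10, n_validation=10):
    """Yield (train, validation, test) splits for n-fold cross-validation.

    The validation items are taken out of the non-test portion, so the
    training portion shrinks accordingly.
    """
    folds = [items[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield rest[n_validation:], rest[:n_validation], test


das = list(range(202))  # stand-ins for the 202 DAs in the data set
splits = list(cross_validation_folds(das))
train, validation, test = splits[0]
```

For the first fold this gives 21 test DAs and 181 remaining DAs, of which 10 are held out for validation and 171 are left for parameter updates. Because 202 is not divisible by 10, some folds in this sketch contain 20 test DAs instead of 21.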
We use the Adam optimizer minimizing cross-entropy to train the reranker as well. We perform a single run of up to 100 passes over the data, and we also validate after each pass and select the parameters giving minimal Hamming distance on both the validation and training sets.

Note to Table 1: Mairesse et al. (2010) use manual alignments in their work, so their result is not directly comparable to ours. The zero semantic error is implied by the manual alignments and the architecture of their system.

Results
The results of our experiments and a comparison to previous work on this dataset are shown in Table 1. We include BLEU and NIST scores and the number of semantic errors (incorrect, missing, and repeated information), which we assessed manually on a sample of 42 output sentences (outputs of two randomly selected cross-validation runs).
The outputs of direct string generation show that the models learn to produce fluent sentences in the domain style; incoherent sentences are rare, but semantic errors are very frequent in the greedy search. Most errors involve confusion of semantically close items, e.g., Italian instead of French or riverside area instead of city centre (see Table 2); items occurring more frequently are preferred regardless of their relevance. The beam search brings a BLEU improvement but keeps most semantic errors in place. The reranker is able to reduce the number of semantic errors while increasing automatic scores considerably. Using a larger beam increases the effect of the reranker as expected, resulting in slightly improved outputs.
Models generating deep syntax trees are also able to learn the domain style, and they have virtually no problems producing valid trees. The surface realizer works almost flawlessly on this limited domain (Dušek and Jurčíček, 2015), leaving the seq2seq generator as the major error source. The syntax-generating models tend to make different kinds of errors than the string-based models: some outputs are valid trees but not entirely syntactically fluent; missing, incorrect, or repeated information is more frequent than a confusion of semantically similar items (see Table 2). Semantic error rates of greedy and beam-search decoding are lower than for string-based models, partly because confusion of two similar items counts as two errors. The beam search brings an increase in BLEU but also in the number of semantic errors. The reranker is able to reduce the number of errors and improve automatic scores slightly. A larger beam leads to a small BLEU decrease even though the sentences contain fewer errors; here, NIST reflects the situation more accurately.
A comparison of the two approaches goes in favor of the joint setup: without the reranker, models generating trees produce fewer semantic errors and gain higher BLEU/NIST scores. However, with the reranker, the string-based model is able to reduce the number of semantic errors while producing outputs significantly better in terms of BLEU/NIST. In addition, the joint setup does not need an external surface realizer. The best results of both setups surpass the best results on this dataset using training data without manual alignments (Dušek and Jurčíček, 2015) in both automatic metrics and the number of semantic errors.

Related Work
While most recent NLG systems attempt to learn generation from data, the choice of a particular approach (pipeline or joint) is often arbitrary and depends on system architecture or particular generation domain. Works using the pipeline approach in SDS tend to focus on sentence planning, improving a handcrafted generator (Walker et al., 2001; Stent et al., 2004; Paiva and Evans, 2005) or using perceptron-guided A* search (Dušek and Jurčíček, 2015). Generators taking the joint approach employ various methods, e.g., factored language models (Mairesse et al., 2010), inverted parsing (Wong and Mooney, 2007; Konstas and Lapata, 2013), or a pipeline of discriminative classifiers (Angeli et al., 2010). Unlike most previous [...] Wen et al. (2015b) later replaced basic sigmoid cells with an LSTM. Mei et al. (2015) present the only seq2seq-based NLG system known to us. We extend the previous works by generating deep syntax trees as well as strings and directly comparing pipeline and joint generation. In addition, we experiment with an order-of-magnitude smaller dataset than other RNN-based systems.

[Table 2 content]

Input DA: inform(name=X-name, type=placetoeat, eattype=restaurant, area=citycentre, near=X-near, food="Chinese takeaway", food=Japanese)
Reference: X is a Chinese takeaway and Japanese restaurant in the city centre near X.
Greedy with trees: X is a restaurant offering chinese takeaway in the centre of town near X. [Japanese]
+ Beam search: X is a restaurant and japanese food and chinese takeaway.
+ Reranker: X is a restaurant serving japanese food in the centre of the city that offers chinese takeaway.
Greedy into strings: X is a restaurant offering italian and indian takeaway in the city centre area near X. [Japanese, Chinese]
+ Beam search: X is a restaurant that serves fusion chinese takeaway in the riverside area near X. [Japanese, citycentre]
+ Reranker: X is a japanese restaurant in the city centre near X providing chinese food. [takeaway]

Input DA: inform(name=X-name, type=placetoeat, eattype=restaurant, area=riverside, food=French)
Reference: X is a French restaurant on the riverside.
Greedy with trees: X is a restaurant providing french and continental and by the river.
+ Beam search: X is a restaurant that serves french takeaway. [riverside]
+ Reranker: X is a french restaurant in the riverside area.
Greedy into strings: X is a restaurant in the riverside that serves italian food.

Conclusions and Future Work
We have presented a direct comparison of two-step generation via deep syntax trees with direct generation into strings, both using the same NLG system based on the seq2seq approach. While both approaches offer decent performance, their outputs are quite different. The results show the direct approach as more favorable, with significantly higher n-gram-based scores and a similar number of semantic errors in the output.
We also showed that our generator can learn to produce meaningful utterances using a much smaller amount of training data than what is typically used for RNN-based approaches. The resulting models had virtually no problems with producing fluent, coherent sentences or with generating valid structure of bracketed deep syntax trees. Our generator was able to surpass the best BLEU/NIST scores on the same dataset previously achieved by the perceptron-based generator of Dušek and Jurčíček (2015) while reducing the amount of irrelevant information in the output.
Our generator is released on GitHub at the following URL: https://github.com/UFAL-DSG/tgen. We intend to apply it to other datasets for a broader comparison, and we plan further improvements, such as enhancing the reranker or including a bidirectional encoder (Bahdanau et al., 2015; Mei et al., 2015; Jean et al., 2015) and sequence-level training (Ranzato et al., 2015).

Figure 1: Example DA (top) with the corresponding deep syntax tree (middle) and natural language string (bottom)

Figure 2: Trees encoded as sequences for the seq2seq generator (top) and the reranker (bottom)

Table 1: Results on the BAGEL data set: NIST, BLEU, and semantic errors in a sample of the output.

Table 2: Example outputs of different generator setups (beam size 100 is used). Errors are marked in color (missing, superfluous, repeated information, disfluency).