A Tree-to-Sequence Model for Neural NLG in Task-Oriented Dialog

Generating fluent natural language responses from structured semantic representations is a critical step in task-oriented conversational systems. Sequence-to-sequence models on flat meaning representations (MR) have been dominant in this task, for example in the E2E NLG Challenge. Previous work has shown that a tree-structured MR can improve the model for better discourse-level structuring and sentence-level planning. In this work, we propose a tree-to-sequence model that uses a tree-LSTM encoder to leverage the tree structures in the input MR, and further enhances decoding with a structure-enhanced attention mechanism. In addition, we explore combining these enhancements with constrained decoding to improve semantic correctness. Our experiments show that the proposed model not only yields significant improvements over standard seq2seq baselines, but is also more data-efficient and generalizes better to hard scenarios.


Introduction
Generating fluent natural language responses from structured semantic representations is crucial to building engaging and effective task-oriented dialog systems. Neural approaches for natural language generation (NNLG), particularly sequence-to-sequence approaches, have achieved promising results and were dominant in the recent E2E Challenge. Most of these approaches are built on flat meaning representations (MR) that use key-value pairs to capture attributes to be conveyed in responses. However, coupled with such flat MRs, current NNLG methods still struggle with 1) reliably performing sentence-level planning and discourse-level structuring (Reed et al., 2018); 2) avoiding generating semantic errors like hallucinated content (Dušek et al., 2018, 2019); and 3) generalizing to hard inputs (Wiseman et al., 2017).
To help overcome these drawbacks, Balakrishnan et al. (2019) propose a novel tree-structured meaning representation to gain better control of the discourse structure and content in generated utterances. Their proposed tree-structured MRs consist of three sets of non-terminal tokens: argument, dialog act and discourse act. A dialog act is a minimum atomic unit that contains a few arguments to be expressed in an utterance, while discourse acts define the relationship between dialog acts.
An example of their tree-structured MR for the weather domain is provided in Table 1, along with a flat MR and human reference. We also add a reference annotated with the tree-structured MR in the last row. The tree-structured MR provides much better controllability to a live task-oriented dialog system: developers can easily inject external knowledge into a rule-based response planner to specify the relationship between multiple dialog acts (e.g., rainy is the opposite of sunny) and to control how arguments are grouped within a dialog act. These considerations have been shown to be critical to user perceptions of quality and naturalness (Lemon et al., 2004; Carenini and Moore, 2006; Walker et al., 2007; White et al., 2010; Demberg et al., 2011).
In their Seq2Seq model, Balakrishnan et al. (2019) treat the tree-structured MR as just a sequence of tokens, ignoring the inherent tree structure (though this structure is taken into account in constrained decoding). We aim to examine the hypothesis that a better representation of the input tree structures could lead to better generalizability of the model and enhance semantic correctness. Therefore, we propose a tree-to-sequence model that uses a tree-based encoder to better represent the tree-structured MRs, and a structure-enhanced decoder to further incorporate contextual information in decoding.
Table 1 (excerpt). Reference: It'll be sunny throughout this weekend. The high will be in the 60s, but expect temperatures to drop as low as 43 degrees by Sunday evening. There's also a chance of strong winds on Saturday morning.

Our contributions are summarized as follows:
• We propose a tree-to-sequence (tree2seq) model to better leverage the inherent structures in the tree-based MRs. Coupled with the constrained decoding technique from Balakrishnan et al. (2019), we explore whether combining better learning and decoding methods yields the best performance.
• Extensive evaluations on conversational weather and E2E datasets (Dušek et al., 2019) show that the tree2seq model can significantly improve semantic correctness. Analysis further shows that tree2seq is more data-efficient and generalizes better to hard scenarios.

Related Work
Several previous works have focused on adding planning steps to neural NLG architectures or employed non-sequential encoders. Puduppully et al. (2019) add a content planning step where a set of input database records are mapped to an ordered list of selected records; however, their approach does not employ hierarchical content plans as in our approach. Moryossef et al. (2019) add a symbolic text planning step where facts are grouped and ordered in the input; in contrast to our work though, their approach uses standard Seq2Seq models for realization and leaves no ordering choices to the model. Previous work on AMR and WebNLG (Beck et al., 2018; Song et al., 2018; Marcheggiani and Perez-Beltrachini, 2018) has demonstrated improvements over Seq2Seq models by using graph-to-sequence models; while similar in principle, these works do not explore the use of hierarchical content plans as intermediate structures and do not experiment with constrained decoding. Elder et al. (2019) propose using an intermediate representation motivated by a universal dependency tree, and find that this greatly improves performance. However, their approach is still Seq2Seq-based and can't explicitly model the tree structures. Similar to our approach, Eriguchi et al. (2016) use a tree-to-sequence model for machine translation, but here we focus on NLG and use a different tree encoder and constrained decoding techniques.
3 Tree-to-Sequence Model

Tree-Based Encoder
The input to our model is a tree-structured MR, and the output is an annotated reference, e.g., the last row in Table 1. Having annotated non-terminal tokens in the output allows us to check whether all arguments are expressed in the output in accordance with the input tree structure.
We represent each token in the input MR as a tree node, using the tree structure to compute the hidden state of the k-th parent node $h^p_k$ as a function of its child states $\{h^{c_1}_k, \dots, h^{c_N}_k\}$:

$$h^p_k = f_{\text{tree}}\left(h^{c_1}_k, \dots, h^{c_N}_k\right)$$

where N is the number of children of the k-th node and $f_{\text{tree}}$ is a non-linear function. We implemented a variant of the N-ary TreeLSTM of Tai et al. (2015) as our tree encoder.
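For orientation, the standard N-ary Tree-LSTM transition of Tai et al. (2015) is reproduced below as one possible instance of $f_{\text{tree}}$. This is the original formulation, not necessarily the variant used here, whose exact modifications are not specified. The symbols follow Tai et al. rather than this paper: $x_j$ is the input embedding of node $j$, and $h_{j\ell}$ and $c_{j\ell}$ are the hidden and cell states of its $\ell$-th child.

```latex
\begin{align*}
i_j    &= \sigma\Big(W^{(i)} x_j + \sum_{\ell=1}^{N} U^{(i)}_{\ell} h_{j\ell} + b^{(i)}\Big) \\
f_{jk} &= \sigma\Big(W^{(f)} x_j + \sum_{\ell=1}^{N} U^{(f)}_{k\ell} h_{j\ell} + b^{(f)}\Big), \quad k = 1, \dots, N \\
o_j    &= \sigma\Big(W^{(o)} x_j + \sum_{\ell=1}^{N} U^{(o)}_{\ell} h_{j\ell} + b^{(o)}\Big) \\
u_j    &= \tanh\Big(W^{(u)} x_j + \sum_{\ell=1}^{N} U^{(u)}_{\ell} h_{j\ell} + b^{(u)}\Big) \\
c_j    &= i_j \odot u_j + \sum_{\ell=1}^{N} f_{j\ell} \odot c_{j\ell} \\
h_j    &= o_j \odot \tanh(c_j)
\end{align*}
```

Each child has its own forget gate $f_{jk}$, so the parent can selectively keep information from individual children. Because the parameter matrices are indexed by child position, the formulation assumes at most $N$ ordered children per node.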
Since trees can have completely different layouts, it's hard to train and do inference with tree inputs in parallel. We propose an iterative bottom-up traversal algorithm to support batched forward and backward passes with tree inputs. Given a batch of trees, we first extract all the leaf nodes and update their states in a batch. Then we iteratively update the states of the non-leaf nodes whose children have all been processed. As nodes can have different numbers of children, we pad non-leaf nodes to the same number of children (i.e., N) for batch processing. Overall, the batch calculation yields a 5-10x speedup compared to single-tree forward passes, allowing us to train on large datasets.
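The traversal order described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the children-list encoding of trees and the `pad_id` value are all assumptions made for the example. The sketch only computes which nodes can be updated together at each step and leaves out the actual TreeLSTM update.

```python
from typing import Dict, List, Tuple

NodeId = Tuple[int, int]  # (tree index in batch, node index in tree)


def bottom_up_schedule(trees: List[List[List[int]]]) -> List[List[NodeId]]:
    """Group the nodes of a batch of trees into bottom-up processing steps.

    trees[t][n] is the list of child node indices of node n in tree t
    (hypothetical encoding). Step 0 holds all leaves; every later step
    holds the nodes whose children were all processed in earlier steps.
    """
    remaining: Dict[NodeId, int] = {}  # number of unprocessed children
    parent: Dict[NodeId, NodeId] = {}
    for t, children in enumerate(trees):
        for n, kids in enumerate(children):
            remaining[(t, n)] = len(kids)
            for c in kids:
                parent[(t, c)] = (t, n)

    frontier = [node for node, count in remaining.items() if count == 0]
    steps: List[List[NodeId]] = []
    while frontier:
        steps.append(frontier)
        next_frontier: List[NodeId] = []
        for node in frontier:
            p = parent.get(node)
            if p is not None:
                remaining[p] -= 1
                if remaining[p] == 0:  # all children done: parent is ready
                    next_frontier.append(p)
        frontier = next_frontier
    return steps


def pad_children(kids: List[int], max_children: int, pad_id: int = -1) -> List[int]:
    """Pad a child list to a fixed width so that a step can be batched."""
    return kids + [pad_id] * (max_children - len(kids))


# Two trees: the first has root 0 with children 1 and 2, and node 1 has child 3.
batch = [[[1, 2], [3], [], []], [[1], []]]
steps = bottom_up_schedule(batch)
# steps[0] are the leaves of both trees; steps[-1] is the root of the deeper tree.
```

In a real encoder, each step would gather the (padded) child states of the scheduled nodes into one tensor and apply the tree cell once per step, so the number of sequential operations depends on the maximum tree depth in the batch and not on the total number of nodes.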

Structure-Enhanced Decoder
The tree-based encoder returns a list of hidden states $\{h_1, \dots, h_K\}$, where K is the length of the source sequence. We first initialize the decoder state $s_1$ as the hidden state of the root node. In a standard attentional seq2seq model (Bahdanau et al., 2014), $\alpha_j(k)$ denotes the attention score between the j-th target state $s_j$ and the k-th source state $h_k$. A weighted sum over the source hidden states is then calculated as $d_j = \sum_k \alpha_j(k) h_k$ and used to update the context state $\hat{s}_j$ (equation (1)) from $[s_j; d_j]$, the concatenation of the hidden state $s_j$ and $d_j$. Next, $\hat{s}_j$ is used for predicting the j-th target token.
However, the above decoding procedure doesn't take the tree structures into account. We adopted the input feeding approach (Eriguchi et al., 2016) by modifying equation (1) to also feed the previous unit $\hat{s}_{j-1}$ when updating the j-th context state. The input feeding approach allows us to enrich the contextual information when predicting the current token, in particular because $\hat{s}_{j-1}$ is often the parent state of the j-th node (given that the output tree structures are linearized to a sequence of words).
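The decoding step can be illustrated with a small, dependency-free sketch. Several details are assumptions made only for this example and are not taken from the paper: attention scores are computed with a plain dot product (Bahdanau et al. use a learned additive scorer), the context update has the form tanh(W [s_j; d_j; ŝ_{j-1}] + b), and the names `attention_context`, `context_state`, `W` and `b` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(s_j: Vector, encoder_states: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (alpha_j, d_j) with d_j = sum_k alpha_j(k) * h_k.

    Assumption: dot-product scoring stands in for the learned scorer.
    """
    alpha = softmax([dot(s_j, h_k) for h_k in encoder_states])
    dim = len(encoder_states[0])
    d_j = [sum(a * h[i] for a, h in zip(alpha, encoder_states)) for i in range(dim)]
    return alpha, d_j


def context_state(s_j: Vector, d_j: Vector, prev_context: Vector,
                  W: List[Vector], b: Vector) -> Vector:
    """Assumed input-feeding update: tanh(W [s_j; d_j; s_hat_{j-1}] + b)."""
    x = s_j + d_j + prev_context  # list concatenation of the three vectors
    return [math.tanh(dot(row, x) + b_i) for row, b_i in zip(W, b)]


# Toy example with two source states and hidden size 2.
H = [[1.0, 0.0], [0.0, 1.0]]
alpha, d = attention_context([1.0, 0.0], H)
# W copies the previous context state through, to make input feeding visible.
W = [[0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]
s_hat = context_state([1.0, 0.0], d, [0.5, -0.5], W, [0.0, 0.0])
```

Dropping `prev_context` from the concatenation recovers the standard update over $[s_j; d_j]$. Keeping it gives the decoder direct access to the previous context state, which in a linearized tree often belongs to the parent of the node being generated.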

Constrained Decoding
Balakrishnan et al. (2019) propose a constrained decoding approach that derives constraints from the input tree structure to be enforced during decoding. In the beam search process, if a predicted non-terminal token violates the input MR structure, then the token is rejected. This allows beam search to explore more valid hypotheses with the same beam size. Their experiments show that constrained decoding can significantly improve the semantic correctness of generated responses by avoiding missing/repeating arguments and reducing hallucinated content while also enforcing desired groupings. (See their paper for further details on how the constraints are enforced.) Though constrained decoding yields promising results in Balakrishnan et al.'s experiments, it's worth observing that constrained decoding does not affect the training process, which means that it doesn't help with generalization and relies on a strong base model. Therefore, we experiment with combining our tree-to-sequence model with constrained decoding, in order to determine whether the two methods work better in combination.
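To make the idea of rejecting structure-violating tokens concrete, the sketch below tracks which non-terminals of the input MR are still allowed at each point of a partial hypothesis. It is a simplified illustration under stated assumptions, not the algorithm of Balakrishnan et al. (2019): the `[LABEL` and `]` token format, the `Node` and `TreeConstraint` classes, the order-insensitive matching of siblings and the rule that a node may only close once all of its children have been generated are all choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


class TreeConstraint:
    """Tracks the open non-terminals of one partial hypothesis."""

    def __init__(self, root: Node) -> None:
        top = Node("<top>", [root])  # virtual parent of the MR root
        # Each stack entry: (open MR node, indices of children already used).
        self.stack: List[Tuple[Node, Set[int]]] = [(top, set())]

    def _match(self, label: str) -> Optional[int]:
        node, used = self.stack[-1]
        for i, child in enumerate(node.children):
            if i not in used and child.label == label:
                return i
        return None

    def accepts(self, token: str) -> bool:
        node, used = self.stack[-1]
        if token.startswith("["):  # opening non-terminal, e.g. "[ARG_TEMP"
            return self._match(token[1:]) is not None
        if token == "]":  # may close only when no child is missing
            return len(self.stack) > 1 and len(used) == len(node.children)
        return len(self.stack) > 1  # plain words need an open non-terminal

    def advance(self, token: str) -> None:
        if not self.accepts(token):
            raise ValueError(f"token violates the input MR: {token}")
        node, used = self.stack[-1]
        if token.startswith("["):
            i = self._match(token[1:])
            used.add(i)
            self.stack.append((node.children[i], set()))
        elif token == "]":
            self.stack.pop()

    def complete(self) -> bool:
        return len(self.stack) == 1 and len(self.stack[0][1]) == 1


def filter_candidates(constraint: TreeConstraint, candidates: List[str]) -> List[str]:
    """Keep only the candidate tokens that respect the input tree."""
    return [tok for tok in candidates if constraint.accepts(tok)]


mr = Node("INFORM", [Node("ARG_TEMP"), Node("ARG_COND")])
state = TreeConstraint(mr)
allowed = filter_candidates(state, ["[INFORM", "[ARG_TEMP", "]", "sunny"])
```

In a beam search, each hypothesis would carry its own copy of the constraint state (for example via `copy.deepcopy`), and `filter_candidates` would be applied before the top-scoring extensions are selected, so that beam slots are not spent on hypotheses with missing, repeated or hallucinated arguments.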

Setup
Datasets: We conducted experiments on both the enriched E2E dataset and the weather dataset from Balakrishnan et al. (2019).
Models: We consider both Seq2Seq-based models and our proposed Tree2Seq models in our experiments. All Seq2Seq models use an LSTM-based encoder and decoder with attention, while the Tree2Seq models have the architecture described in Section 3.
• S2S: The standard S2S-TREE model proposed by Balakrishnan et al. (2019). This is a Seq2Seq model in which the input is a linearized text representation of the MR, while the output is an annotated response (example in Table 1).
• S2S-CONSTR: The S2S-CONSTR model proposed by Balakrishnan et al. (2019). It is identical in architecture to the S2S model and differs only in the decoding step, where constrained decoding is applied to ensure semantic correctness.
• T2S: Our proposed model, with tree-based encoding and structure-enhanced decoding.
• T2S-CONSTR: Has the same architecture as T2S, but with constrained decoding applied to the decoder to ensure semantic correctness.
Metrics: We consider both automatic metrics and human evaluation results. For automatic evaluation, we report the following metrics: 1) BLEU (Papineni et al., 2002); 2) Tree accuracy (Balakrishnan et al., 2019), which is a binary metric that indicates whether the tree structure in the prediction matches that of the input MR exactly.
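A simplified version of the tree accuracy check can be sketched as follows. The bracketed token format (`[LABEL ... ]`), the function names and the `ordered` flag are assumptions for illustration. The sketch compares only non-terminal labels and ignores the words inside them, including argument values, and by default it ignores sibling order. The metric of Balakrishnan et al. (2019) may differ on these points.

```python
from typing import List, Tuple

Tree = Tuple[str, list]  # (label, children); children is a list of Tree


def parse_structure(tokens: List[str]) -> Tree:
    """Build the non-terminal skeleton of a bracketed sequence, dropping words."""
    root: Tree = ("<top>", [])
    stack = [root]
    for tok in tokens:
        if tok.startswith("["):
            node: Tree = (tok[1:], [])
            stack[-1][1].append(node)
            stack.append(node)
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced closing bracket")
            stack.pop()
        # any other token is a plain word and is ignored
    if len(stack) != 1:
        raise ValueError("unclosed non-terminal")
    return root


def canonical(node: Tree, ordered: bool) -> tuple:
    """Convert a tree into nested tuples, optionally sorting siblings."""
    label, children = node
    kids = [canonical(child, ordered) for child in children]
    if not ordered:
        kids.sort()
    return (label, tuple(kids))


def tree_accuracy(mr: str, prediction: str, ordered: bool = False) -> int:
    """Return 1 if the predicted tree structure matches the input MR, else 0."""
    gold = parse_structure(mr.split())
    try:
        pred = parse_structure(prediction.split())
    except ValueError:
        return 0  # malformed predictions count as incorrect
    return int(canonical(gold, ordered) == canonical(pred, ordered))


mr = "[INFORM [COND sunny ] [TEMP 60 ] ]"
hyp = "[INFORM it will be [TEMP 60 ] and [COND sunny ] ]"
score = tree_accuracy(mr, hyp)
```

Averaging this binary score over a test set gives a corpus-level accuracy. Setting `ordered=True` additionally requires siblings to appear in the same order as in the MR.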
For human evaluation, annotators rate model responses on a binary scale along two dimensions:
• Grammaticality (Gram): Our evaluation guidelines included considerations for proper subject-verb agreement, word order, grammatical completeness, etc.
• Correctness (Corr): Measures semantic correctness of the responses. Our guidelines considered sentence structure, contrast, hallucinations (incorrectly included attributes), and missing attributes. We asked annotators to evaluate model predictions against the reference rather than the MR.
Our human evaluation was conducted in a double-blind setting, in which two annotators independently provide ratings for each response, and a third annotator resolves any disagreements between the two. The disagreement rate is 20.7% for the weather dataset and 23.2% for the E2E dataset.

Main Results
Table 2 shows the main results. For the tree accuracy metric, we report the numbers on two disjoint subsets: the discourse subset (column DISC), which contains inputs with one or more discourse acts, and the no-discourse subset (column NODISC), which includes inputs without any discourse acts. The discourse subset is expected to be more challenging as it contains longer and more complex inputs. From the table, we can see that all approaches are roughly comparable on BLEU scores. With tree accuracy, T2S consistently outperforms S2S on both the discourse and no-discourse subsets, with the exception of the NODISC subset of the E2E data, where all models are close to 100% accuracy. The margins of improvement from T2S are higher on the discourse subset, suggesting T2S is more effective on hard inputs. S2S-Constr consistently outperforms S2S and T2S, affirming the effectiveness of constrained decoding. Overall, combining the enhanced encoding and decoding methods, T2S-Constr achieves the best performance on all subsets (again, except with NODISC tree accuracy for E2E, where ceiling performance is effectively reached).
For grammaticality, we see that all approaches are comparable on both E2E and Weather. Our analysis shows that over 90% of grammatical errors occur because models tend to generate run-on responses that group many arguments in one sentence without appropriate punctuation (e.g., commas). Consistent with Balakrishnan et al. (2019), we found that higher tree accuracy usually corresponds to higher human judgements of semantic correctness (except with S2S-Constr and T2S-Constr for E2E). We also note that there is a noticeable gap in the E2E dataset where the tree accuracy doesn't align with the correctness numbers from human evaluation (the gap on weather is smaller). Our analysis shows that most correctness errors are due to: 1) references for compositional MR inputs that are missing information, which was caused by the noisy dataset creation by Balakrishnan et al. (2019); 2) attributes that confused human annotators, e.g., "20-30 pounds" can imply "a midpriced restaurant"; 3) genuine content hallucinations, especially for hard inputs and unseen attributes.
We also plot the tree accuracy distribution against the number of dialog acts in Figure 1. (We skip this figure for the E2E dataset for space reasons, as it shows a similar pattern to the weather dataset.) Clearly, for smaller numbers of dialog acts (#DialogActs <= 3), all models perform roughly the same and almost hit 100% accuracy. But the gains of T2S are much clearer when the number of dialog acts is larger than 3. T2S-Constr is also generally better than S2S-Constr in most cases, and both are more effective for complex MRs (7 or 8+ dialog acts), where there are very few MRs (less than 0.5%) in the training set.
Data Efficiency. We set up a data efficiency experiment, in which we trained each model on increasingly larger subsets of our training set (while keeping the test set constant). Figure 2 shows the results of this experiment. Overall, T2S consistently outperforms S2S, and the difference is larger with fewer training samples. This suggests that structure awareness leads to better representations and improves data efficiency.

Sample Analysis
We also provide some sample responses on E2E in Table 3. Column 'G/T/C' gives the grammaticality, tree accuracy and correctness values of each model prediction. We omit the S2S-Constr response here as it is similar to that of T2S-Constr.
From the example, we can see that T2S mistakenly contrasted the family friendly and customer rating attributes, largely due to the overwhelming contrast patterns between family friendly and customer rating in the training data. In addition to the contrast mistake, S2S completely ignores the attribute of serving Italian food, suggesting poor generalization to rare arguments (i.e., food italian). T2S-Constr shared the first sentence with the T2S approach, but was able to correct the contrast mistake by adding structure constraints during beam search.

Conclusion
In this paper, we have demonstrated via experiments on two datasets that a tree-to-sequence model that leverages the inherent tree structures in input MRs can improve semantic correctness over a sequence-to-sequence model and is more data-efficient. Moreover, we have shown that the tree-to-sequence model can be coupled with constrained decoding to achieve better semantic correctness than either method alone.