Recursive Template-based Frame Generation for Task Oriented Dialog

The Natural Language Understanding (NLU) component in task-oriented dialog systems processes a user's request and converts it into structured information that can be consumed by downstream components such as the Dialog State Tracker (DST). This information is typically represented as a semantic frame that captures the intent and slot labels provided by the user. We first show that such a shallow representation is insufficient for complex dialog scenarios, because it does not capture the recursive nature inherent in many domains. We propose a recursive, hierarchical frame-based representation and show how to learn it from data. We formulate the frame generation task as a template-based tree decoding task, where the decoder recursively generates a template and then fills slot values into the template. We extend local tree-based loss functions with terms that provide global supervision and show how to optimize them end-to-end. We achieve a small improvement on the widely used ATIS dataset and a much larger improvement on a more complex dataset we describe here.


Introduction
The output of an NLU component is called a semantic or dialog frame (Hakkani-Tür et al., 2016). The frame consists of intents, which capture information about the goal of the user, and slot labels, which capture constraints that must be satisfied to fulfill the user's request. For example, in Figure 1, the intent is to book a flight (atis_flight) and the slot labels are the from location, the to location and the date. Intent detection can be modeled as a classification problem and slot labeling as a sequential labeling problem.
The ATIS (Airline Travel Information System) dataset (Hakkani-Tür et al., 2010) is widely used for evaluating the NLU component. We focus on complex aspects of dialog that occur in real-world scenarios but are not captured in ATIS or in alternatives such as DSTC (Henderson et al., 2014) or SNIPS. As an example, consider a reasonable user utterance, "can i get two medium veggie pizza and one small lemonade" (Figure 2A). The intent is OrderItems. There are two items mentioned, each with three properties: the name of the item (veggie pizza, lemonade), the quantity of the item (two, one) and the size of the item (medium, small). These properties need to be grouped together accurately to successfully fulfill the customer's request: the customer would not be happy with one small veggie pizza. This structure occurs to a limited extent in the ATIS dataset (Figure 2B), which has specific forms such as fromloc.city_name and toloc.city_name that must be distinguished. However, the scale is small enough that these can be treated as separate labels, and multi-class slot-labeling approaches that predict each specific form as a separate class (Figure 1) have had success. In more open domains, this hierarchy-to-multi-class conversion increases the number of classes exponentially compared to an approach that appropriately uses the available structure. Further, hierarchical relationships, e.g. between fromloc and city_name, are ignored, which limits the sharing of data and statistical strength across labels.
The contributions of this paper are as follows: • We propose a recursive, hierarchical frame-based representation that captures complex relationships between slot labels, and show how to learn this representation from raw user text. This enables sharing statistical strength across labels. Such a representation (Figure 3) also allows us to include multiple intents in a single utterance (Gangadharaiah and Narayanaswamy, 2019; Kim et al., 2017; Xu and Sarikaya, 2013).
• We formulate frame generation as a template-based tree-decoding task (Section 3). The value or positional information at each terminal (represented by a $) in the template generated by the tree decoder is predicted (or filled in) using a pointer to the tokens in the input sentence (Vinyals et al., 2015; Jia and Liang, 2016). This allows the system to copy slot values directly from the input utterance.
• We extend (local) tree-based loss functions with global supervision (Section 3.5), optimize jointly for all loss functions end-to-end and show that this improves performance (Section 4).

Related Work
Encoder-Decoder architectures, e.g. Seq2Seq models (Sutskever et al., 2014), are a popular class of approaches to the problem of mapping source sequences (here words) to target sequences (here slot labels) of variable length. Seq2Seq models have been used to generate agent responses without the need for intermediate dialog components such as the DST or the Natural Language Generator (Gangadharaiah et al., 2018). However, there has not been much work that uses deeper knowledge of semantic representations in task-oriented dialog. A notable exception is recent work by Gupta et al. (2018), who used a hierarchical representation for dialog that can be easily parsed by off-the-shelf constituency-based parsers. Neural constituency parsers (Socher et al., 2011; Shen et al., 2018) work directly off the input sentence, and as a result, different sentences with the same meaning can end up with different syntactic structures.

Proposed Approach
We learn to map a user's utterance x = {x_1, x_2, ..., x_n} to a template-based tree representation (Figure 2), specifically the bracketed representation in Figure 3. We denote the symbols in the bracketed representation by y = {y_1, y_2, ..., y_m}. The translation from x to y is performed using four components that are jointly trained end-to-end: (1) an encoder, (2) a slot decoder, (3) a tree decoder (Figure 4) and (4) a pointer network. Each of these components is briefly explained below.

Encoder:
We use BERT (Devlin et al., 2019) as the encoder to obtain token embeddings which are fine-tuned during the end-to-end learning. This can be replaced with any other choice of embedding.

Slot Decoder:
The slot decoder accepts embeddings from the encoder, is deep, and has a dense final layer that predicts the slot label for each token position, â = â_1, â_2, ..., â_n. The true slot labels a = a_1, a_2, ..., a_n are the general forms of the labels. For example, city_name, month_name and day_name are the general forms obtained from fromloc.city_name, toloc.city_name, depart_date.month_name and depart_date.day_name.
The decoder learns to predict Begin-Inside-Outside (BIO) tags, since this allows the tree decoder to focus on producing a tree form and requires the slot decoder to perform boundary detection. The slot decoder is trained to minimize a supervised loss,

loss_SL = − Σ_{i=1}^{n} log π_SL(a_i | â_{<i}, x)    (1)

where π_SL is the output of the softmax layer at output position i, and â_{<i} represents the slot labels predicted up to position i − 1.
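As an illustration, the specific-to-general label mapping can be sketched as a simple string operation (the helper name and BIO-prefix convention here are our assumptions, not the paper's code):

```python
def generalize(label):
    """Map a specific BIO slot label to its general form,
    e.g. "B-fromloc.city_name" -> "B-city_name"."""
    if label == "O":
        return label
    prefix, name = label.split("-", 1)         # split off the B/I prefix
    return prefix + "-" + name.split(".")[-1]  # keep only the last component
```

Training the slot decoder on these general forms lets labels such as fromloc.city_name and toloc.city_name share statistical strength.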

Template-based Tree Decoder
The tree decoder works top down as shown in Figure 4. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) models are used to generate tokens and symbols. In the example shown in Figure 4, the decoder generates atis_flight NT. Here, the NT symbol stands for a non-terminal. When a non-terminal is predicted, the subsequent symbol or token is predicted by applying the decoder to the hidden vector representation of the non-terminal. Table 1 walks through this process with an example. Each of the predicted NTs enters a queue and is expanded when popped from the queue. This process continues until no more NTs are left to expand. The loss function is

loss_T = − Σ_{s=1}^{S} Σ_{t=1}^{T_s} log π_TD(z^s_t | z^s_{<t}, z^s, x)    (2)

where S refers to the size of the queue for a given training example, T_s refers to the number of nodes (or children) to be generated for a non-terminal z^s in the queue, z^s_t represents the t-th child of z^s, and z^s_{<t} refers to the left siblings of z^s_t. Children of z^s are generated conditioned on the hidden vector of z^s and the left siblings of that child.
The tree decoder is initialized with the [CLS] representation of the BERT encoder. The tree decoder generates templates which are then filled with slot values from the user's utterance. In the example, atlanta and pittsburgh are replaced by $city_name, september is replaced by $month_name and fourth is replaced by $day_name during training. The $ symbol indicates a terminal.
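The queue-driven expansion can be sketched in pure Python, with a toy lookup table standing in for the LSTM's conditional predictions (all names here are illustrative):

```python
from collections import deque

def generate_template(predict_children):
    # predict_children(label) -> list of (child_label, is_nonterminal) pairs,
    # a stand-in for the LSTM applied to a non-terminal's hidden vector.
    root = {"label": "ROOT", "children": []}
    queue = deque([root])
    while queue:                      # expand non-terminals breadth-first
        node = queue.popleft()
        for child_label, is_nt in predict_children(node["label"]):
            child = {"label": child_label, "children": []}
            node["children"].append(child)
            if is_nt:
                queue.append(child)   # predicted NTs enter the queue
    return root

def bracket(node):
    # Serialize the tree into a bracketed string.
    if not node["children"]:
        return node["label"]
    inner = " ".join(bracket(c) for c in node["children"])
    return node["label"] + " ( " + inner + " )"

toy_grammar = {
    "ROOT": [("atis_flight", True)],
    "atis_flight": [("fromloc", True), ("toloc", True)],
    "fromloc": [("city_name", True)],
    "toloc": [("city_name", True)],
    "city_name": [("$city_name", False)],   # $ marks a terminal
}
tree = generate_template(lambda label: toy_grammar.get(label, []))
```

Unlike the real decoder, this sketch conditions only on the node label rather than on hidden states and left siblings, but it shows how the queue drives top-down generation until no non-terminals remain.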

Pointer Network:
We predict a position for every terminal, pointing to a specific token in the user's utterance. We perform element-wise multiplication between the terminal node's hidden representation (h) and the encoder representations (e) obtained from the encoder. This is followed by a feed-forward layer (g) and a dense layer to finally assign probabilities to each position (p) in the input utterance. That is,

p_t = softmax(dense(g(h ⊙ e)))    (3)

The pointer network loss, loss_PT, is the categorical cross-entropy loss between p_t and the true positions. The four components are trained jointly end-to-end to minimize a total loss,

loss_−G = loss_SL + loss_T + loss_PT    (4)
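The pointer computation can be sketched in pure Python; for brevity the feed-forward and dense layers are collapsed into a single weight vector w, which is our simplification rather than the paper's architecture:

```python
import math

def pointer_probs(h, enc, w):
    # h: terminal node's hidden vector; enc: one encoder vector per input token;
    # w: a single weight vector standing in for the feed-forward + dense layers.
    scores = []
    for e in enc:
        g = [hi * ei for hi, ei in zip(h, e)]                 # element-wise product
        scores.append(sum(wi * gi for wi, gi in zip(w, g)))   # linear scoring
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]                  # stable softmax
    z = sum(exps)
    return [x / z for x in exps]

# Toy example: the terminal's hidden state overlaps token 0 most strongly.
h = [1.0, 0.0]
enc = [[1.0, 1.0], [0.0, 1.0], [0.9, 0.2]]
w = [1.0, 1.0]
p = pointer_probs(h, enc, w)
```

The resulting distribution over input positions is what the categorical cross-entropy in loss_PT is computed against.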

Global Context
We found that the tree decoder tends to repeat nodes, since representations may remain similar from parent to child. We overcome this by providing global supervision. This global supervision does not consider the order of nodes, but rather rewards predictions based on whether a specific node is present in the final tree. If the model fails to predict that a node is present, the model is penalized based on the number of times it appears in the reference (or ground-truth) tree. Say z_1, ..., z_K is the unique set of nodes present in the reference tree and N(z_k) is the number of times node z_k occurs in the reference. The representation of the [CLS] token is used to predict the presence of these nodes with the loss function

loss_G = − Σ_{k=1}^{K} N(z_k) log π_G(z_k | x)    (5)

Table 1: Actions taken to generate the frame representation of the sentence "from pittsburgh i'd like to travel to atlanta on september fourth". "NT" refers to non-terminals. Each row shows the popped non-terminal, its predicted expansion, the resulting queue and the tree built so far:

NT5 → month_name (NT8), day_name (NT9); queue [NT6, NT7, NT8, NT9]; ROOT ( atis_flight ( fromloc ( city_name ( ) ) toloc ( city_name ( ) ) depart_date ( month_name ( ) day_name ( ) ) ) )
NT6 → $city_name; queue [NT7, NT8, NT9]; ROOT ( atis_flight ( fromloc ( city_name ( $city_name ) ) toloc ( city_name ( ) ) depart_date ( month_name ( ) day_name ( ) ) ) )
NT7 → $city_name; queue [NT8, NT9]; ROOT ( atis_flight ( fromloc ( city_name ( $city_name ) ) toloc ( city_name ( $city_name ) ) depart_date ( month_name ( ) day_name ( ) ) ) )
NT8 → $month_name; queue [NT9]; ROOT ( atis_flight ( fromloc ( city_name ( $city_name ) ) toloc ( city_name ( $city_name ) ) depart_date ( month_name ( $month_name ) day_name ( ) ) ) )
NT9 → $day_name; queue [∅]; ROOT ( atis_flight ( fromloc ( city_name ( $city_name ) ) toloc ( city_name ( $city_name ) ) depart_date ( month_name ( $month_name ) day_name ( $day_name ) ) ) )
with overall loss

loss = loss_−G + loss_G    (6)
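The weighted node-presence loss can be sketched as follows, with a toy probability table standing in for the [CLS]-based presence predictions (names are illustrative):

```python
import math
from collections import Counter

def global_loss(reference_nodes, presence_prob):
    # reference_nodes: all node labels in the reference tree (with repeats);
    # presence_prob: predicted probability that each unique node is present.
    counts = Counter(reference_nodes)          # N(z_k) for each unique node
    return -sum(n * math.log(presence_prob[z]) for z, n in counts.items())

ref = ["atis_flight", "fromloc", "toloc", "city_name", "city_name"]
probs = {"atis_flight": 0.9, "fromloc": 0.8, "toloc": 0.8, "city_name": 0.7}
loss = global_loss(ref, probs)   # city_name is penalized twice as heavily
```

Nodes that occur several times in the reference tree contribute proportionally more to the loss, which is what discourages the decoder from dropping or repeating them.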

Datasets and Results
We start with ATIS, the only public dataset that has even a shallow hierarchy. The ATIS dataset contains audio recordings of people requesting flight reservations, with 21 intent types and 120 slot labels. There are 4,478 utterances in the training set, 893 in the test set and 500 in the development set. We transform the ATIS dataset to the bracketed tree format (Figure 3). We also evaluate the proposed approach on a simulated ordering dataset (example in Figure 3). This dataset contains 2 intents and 7 slot labels, with 4,767 training examples, 1,362 test examples and 681 development examples. We manually created templates for every intent (i.e., OrderItems, GetTotal). To generate an utterance and its bracketed representation, an intent is randomly sampled, then a template, a number of items and slot values for each of the properties of the items are randomly drawn. The modified ATIS and simulated datasets are available as part of the Supplementary Material.
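A minimal sketch of this generation procedure (the item inventory and phrasing here are invented for illustration; the real templates differ):

```python
import random

def sample_example(rng):
    # Toy generator in the spirit of the simulated ordering dataset:
    # sample items, then emit an utterance plus its bracketed frame.
    items = [("veggie pizza", "medium", "two"), ("lemonade", "small", "one")]
    chosen = rng.sample(items, k=rng.randint(1, 2))
    phrases, frames = [], []
    for name, size, qty in chosen:
        phrases.append(f"{qty} {size} {name}")
        frames.append(f"item ( quantity ( {qty} ) size ( {size} ) name ( {name} ) )")
    utterance = "can i get " + " and ".join(phrases)
    frame = "ROOT ( OrderItems ( " + " ".join(frames) + " ) )"
    return utterance, frame

utterance, frame = sample_example(random.Random(0))
```

Because each item's quantity, size and name are emitted inside the same item subtree, the bracketed frame preserves the grouping that a flat slot-label sequence would lose.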

Evaluating the proposed approach
We evaluate both the generalized and the specific forms generated by the proposed model (Figure 5) in Table 2. The exact-match criterion requires that the predicted tree completely match the reference tree. As this metric does not assign any credit to partial matches, we also compare all parent-child relationships between the reference and the predicted trees and compute micro-f1 scores (Lipton et al., 2014).

Figure 5. Specific: ( atis_flight ( fromloc ( city_name ( $city_name ) ) toloc ( city_name ( $city_name ) ) depart_date ( month_name ( $month_name ) day_name ( $day_name ) ) ) ) Generalized: ( atis_flight ( fromloc ( city_name ( pittsburgh ) ) toloc ( city_name ( atlanta ) ) depart_date ( month_name ( september ) day_name ( fourth ) ) ) )

To measure the benefit of the weighted G loss, we also evaluate an unweighted G loss function,

loss_G = − Σ_{k=1}^{K} log π_G(z_k | x)    (7)

As seen in Table 2, the best performance on both f-measure and accuracy is obtained with the weighted G loss function.
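The parent-child micro-f1 can be computed by flattening each bracketed tree into a multiset of (parent, child) edges; a sketch (helper names are ours):

```python
from collections import Counter

def parent_child_pairs(bracketed):
    # Extract (parent, child) pairs from a bracketed tree string,
    # e.g. "ROOT ( a ( b ( ) ) )" -> [("ROOT", "a"), ("a", "b")].
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    stack, pairs, prev = [], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(prev)     # the label before "(" becomes the parent
        elif tok == ")":
            stack.pop()
        else:
            if stack:
                pairs.append((stack[-1], tok))
            prev = tok
    return pairs

def micro_f1(ref_pairs, pred_pairs):
    # Micro-averaged F1 over parent-child relationships.
    r, p = Counter(ref_pairs), Counter(pred_pairs)
    tp = sum(min(r[k], p[k]) for k in r)
    prec = tp / max(sum(p.values()), 1)
    rec = tp / max(sum(r.values()), 1)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

This gives partial credit for correct subtrees even when the full predicted tree does not exactly match the reference.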

Baseline: Extending flat representations with group information
We also compare with a reasonable baseline that extends the traditional flat structured frame (Figure 1) in a way that captures hierarchies. We learn to predict group information along with the slot labels (Baseline in Table 3) by appending indices to the labels that indicate which group each slot label belongs to. Consider, i want to fly from milwaukee to orlando on either wednesday evening or thursday morning. This example requires capturing two groups of information, as shown in Figure 6. Group0 contains all the necessary pieces of information for traveling on wednesday evening and Group1 contains the information for traveling on thursday morning. As shown, milwaukee and orlando are present in both groups. We can represent the two day names (and periods of day) with B-atis_flight.depart_date.day_name0 and B-atis_flight.depart_date.day_name1.
We can then use B-atis_flight.fromloc.city_name01 and B-atis_flight.toloc.city_name01 to indicate that they belong to both groups. Such an approach increases the number of unique slot labels, resulting in fewer training examples for each slot label, but allows multi-class classification methods from prior work to be used as is.
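The label expansion used by this baseline can be sketched as a one-line helper (the function name is hypothetical):

```python
def group_label(label, groups):
    # Append sorted group indices to a flat slot label, e.g.
    # ("atis_flight.fromloc.city_name", {0, 1})
    #   -> "B-atis_flight.fromloc.city_name01".
    return "B-" + label + "".join(str(g) for g in sorted(groups))
```

Every distinct set of group indices produces a new label, which is exactly why the baseline's label inventory grows quickly as utterances become more compositional.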
We then train and test the model using the approach that provided the highest slot-labeling scores, which used BERT (Chen et al., 2019). We also convert the generated output of the hierarchical method proposed in this paper to the flat format above. Note, the f1 scores we obtain here differ from those reported in Table 2: here we only consider the most specific label (e.g., B-atis_flight.toloc.city_name01) as the true slot label for a token, versus the f1 measure over all parent-child relationships in Table 2. Since adding group information increases the number of unique slot labels, the results reported for the Baseline differ from those reported in (Chen et al., 2019).
We notice a large improvement with the proposed approach on the simulated dataset. This implies that modeling hierarchical relationships between slot labels via a tree decoder is indeed helpful. The small improvement we see on ATIS can be attributed to the fact that only a small fraction of the test data required grouping information (≈ 1.7%).

Conclusion and Future Work
With this preliminary work, we showed cases where traditional flat semantic representations fail to capture slot-label dependencies, and we highlighted the need for deep hierarchical semantic representations for dialog frames. The proposed recursive, hierarchical frame-based representation captures complex relationships between slot labels. We also proposed an approach using a template-based tree decoder to generate these hierarchical representations from users' utterances. Finally, we introduced global supervision by extending the tree-based loss function, and showed that it is possible to learn all of this end-to-end.
As future work, we are extending the proposed approach and testing its efficacy on real human conversations. More broadly, we continue to explore strategies that combine semantic parsing and neural networks for frame generation.

Table 3: Comparing slot-label f1 scores of the Proposed approach and Baseline.