Span-based Hierarchical Semantic Parsing for Task-Oriented Dialog

We propose a semantic parser for parsing compositional utterances into Task Oriented Parse (TOP), a tree representation that has intents and slots as labels of nesting tree nodes. Our parser is span-based: it scores labels of the tree nodes covering each token span independently, but then decodes a valid tree globally. In contrast to previous sequence decoding approaches and other span-based parsers, we (1) improve the training speed by removing the need to run the decoder at training time; and (2) introduce edge scores, which model relations between parent and child labels, to mitigate the independence assumption between node labels and improve accuracy. Our best parser outperforms previous methods on the TOP dataset of mixed-domain task-oriented utterances in both accuracy and training speed.


Introduction
Most commercial conversational AI systems parse task-oriented utterances using intent classification and slot-filling models (He and Young, 2003; Raymond and Riccardi, 2007; Mesnil et al., 2015), where the intent is the task of the utterance (e.g., IN:GET DIRECTION) and the slots are the parameters needed to complete the task (e.g., SL:DESTINATION). This limited representation typically allows only a single intent per utterance and at most one slot label per token. Dialog systems using such a flat representation would struggle to handle compositional tasks that involve invoking multiple backend services (e.g., "direction to John's party": find John's address, and then find the direction to that address).
To support compositional utterances, the hierarchical Task Oriented Parsing (TOP) representation has recently been introduced. As illustrated in Figure 1, the TOP representation is a tree where intents and slots are nested alternately to model composition. The values inside intent and slot subtrees can then be used by downstream dialog modules to invoke appropriate services in a hierarchical fashion.
* Work done while at Facebook Assistant.
We propose a span-based semantic parser for parsing utterances into the TOP representation. In its most basic form, the parser embeds each token span (e.g., $x_{2:5}$ = "John 's party" in Figure 1) as a vector, and then uses it to predict the labels of the tree nodes covering the span (e.g., SL:DESTINATION and IN:FIND EVENT). While the label prediction is done independently for each span, a CKY decoding algorithm is used to decode a valid tree with the maximum tree score.
Our main contributions are twofold. First, we reinterpret the ad-hoc tree score in previous span-based parsing work (Stern et al., 2017; Gaddy et al., 2018; Kitaev and Klein, 2018) as a joint distribution over the labels. Under this new framework, the loss function factors nicely, which allows us to train the model in a highly parallelized fashion instead of having to run the computationally expensive decoder during training. Second, we introduce edge scores to model the relationship between parent and child nodes in the tree. This mitigates the independence assumption and allows the decisions at child nodes to influence the parent node.
We evaluate our models on the TOP dataset of compositional utterances on events and navigation domains. Most utterances have nested intents, with some utterances containing intents from different domains. We demonstrate that our model is fast to train and outperforms previous models on the dataset.

Related work
Neural tree generation. Previous work on syntactic and semantic parsing employs different strategies for generating trees. Approaches such as transition-based parsers (Liu and Zhang, 2017) and top-down tree generation (Vinyals et al., 2015; Choe and Charniak, 2016; Dong and Lapata, 2016; Krishnamurthy et al., 2017; Yin and Neubig, 2018) frame tree generation as predicting a sequence of actions for generating the tree. The decoding process generally consists of local decisions, and beam search is used to retain uncertainty. In contrast, global decoding methods (Durrett and Klein, 2015; Lee et al., 2016) can incorporate non-local features and decode a tree with the maximum global score.
Span-based models. Span-based models use the embeddings of token spans to perform prediction. By capturing the properties of the whole phrase instead of individual words, span embeddings can be suitable for tasks where phrases are the basic unit. Indeed, span-based models have recently shown great promise by achieving state-of-the-art results for segmentation and tagging (Kong et al., 2016), coreference resolution (Lee et al., 2017) and semantic role labeling (He et al., 2018).
For syntactic parsing, span embeddings have been used to score actions in shift-reduce parsing (Cross and Huang, 2016), or to score the existence and labels of tree nodes in bottom-up and top-down parsing (Stern et al., 2017; Kitaev and Klein, 2018). The analysis by Gaddy et al. (2018) shows that span embeddings can learn to capture various kinds of information, such as label correlation and structural agreement, which were traditionally modeled by grammars or lexical features.

Setup
Given an utterance $x = (x_0, \dots, x_{n-1})$ with $n$ tokens, the task is to predict a TOP parse tree, as illustrated in Figure 1. Each leaf node corresponds to a token $x_i$, while each non-terminal node covers some span $(i, j)$ with tokens $x_{i:j} = (x_i, x_{i+1}, \dots, x_{j-1})$. The label $l$ of each non-terminal node is either an intent (prefixed with IN:) or a slot (prefixed with SL:). TOP trees obey two type constraints: (1) intents can only have slots as children and vice versa, and (2) the tree root covering span $(0, n)$ must be an intent.
For our span-based parser, it is helpful to view the tree as a mapping $T$ from each span $(i, j)$ to a chain $c = (l_1, \dots, l_k)$ which lists the $k \geq 0$ labels in the unary chain covering the span $(i, j)$. For instance, the span $(2, 5)$ in Figure 1 has $T[2, 5]$ = (SL:DESTINATION, IN:FIND EVENT). If no non-terminal node covers the span, we denote the empty chain as $c = \emptyset$.
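The span-to-chain view can be illustrated with a short sketch. The nested-tuple tree encoding, the function name `tree_to_chains`, and the simplified example tree are assumptions made for illustration; only the value of $T[2, 5]$ is taken from the text.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]
Chain = Tuple[str, ...]


def tree_to_chains(node, start: int = 0, mapping: Dict[Span, Chain] = None):
    """Convert a nested (label, children) tree into a span -> chain mapping.

    Hypothetical encoding: a child is either a token (str) or another
    (label, children) node. Returns (end_position, mapping).
    """
    if mapping is None:
        mapping = {}
    label, children = node
    pos = start
    for child in children:
        if isinstance(child, str):
            pos += 1  # a token occupies one position
        else:
            pos, _ = tree_to_chains(child, pos, mapping)
    span = (start, pos)
    # Children are visited first, so a child covering the same span is
    # already stored; prepending keeps the topmost label first in the chain.
    mapping[span] = (label,) + mapping.get(span, ())
    return pos, mapping


# Simplified, hypothetical tree for "direction to John 's party".
example = ("IN:GET DIRECTION",
           ["direction", "to",
            ("SL:DESTINATION", [("IN:FIND EVENT", ["John", "'s", "party"])])])
n, T = tree_to_chains(example)
```

Spans that are absent from the resulting dictionary correspond to the empty chain.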
Following previous span-based parser work (Cross and Huang, 2016;Stern et al., 2017), we only consider unary chains that appear in the training data to be valid chains. (In our experiments, the training data has 57 distinct labels and 135 distinct chains. Only a single chain in the test data is not covered in training data.)

Model
Our parser is based on a joint distribution over the chain at each span. Let the probability of span $(i, j)$ having chain $c$ be a softmax over the possible chains:

$$p(T[i, j] = c) = \frac{\exp f_n(x_{i:j}, c)}{\sum_{c'} \exp f_n(x_{i:j}, c')}. \tag{1}$$

(The conditioning on $x$ is omitted.) Here, the node score $f_n(x_{i:j}, c)$ is a real-valued function with $f_n(x_{i:j}, \emptyset)$ fixed as 0. To compute the node scores for a span $x_{i:j}$, we embed the span using an LSTM-based span embedder from Lee et al. (2017), and then apply a feedforward network to produce a real-valued score for each chain $c \neq \emptyset$.
We now make a simplifying assumption that the values of $T[i, j]$ are independent. The log-likelihood of the entire mapping $T$ becomes

$$\log p(T) = \sum_{i<j} f_n(x_{i:j}, T[i, j]) - \sum_{i<j} \log \sum_{c'} \exp f_n(x_{i:j}, c'). \tag{2}$$

The log-sum-exp term $\log \sum_{c'} \exp f_n(x_{i:j}, c')$ does not depend on $T$. During inference, maximizing the log-likelihood $\log p(T)$ is thus equivalent to maximizing the tree score:

$$s(T) = \sum_{i<j} f_n(x_{i:j}, T[i, j]). \tag{3}$$

At test time, we use CKY chart parsing to decode a valid TOP tree $T$ with the maximum score $s(T)$.
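The decoding step can be sketched as a small chart parser. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cky_decode`, the dictionary layout of `node_scores`, and the example scores are hypothetical, and the sketch ignores the intent/slot type constraints and the requirement that the root be an intent. The empty chain is represented by `()` and always scores 0.

```python
from typing import Dict, Optional, Tuple

Span = Tuple[int, int]
Chain = Tuple[str, ...]  # the empty tuple () stands for the empty chain


def cky_decode(n: int, node_scores: Dict[Span, Dict[Chain, float]]):
    """Return (best tree score, span -> chain mapping) for n >= 1 tokens.

    node_scores[(i, j)][c] is the node score of a non-empty chain c;
    missing entries are treated as unavailable chains.
    """
    best: Dict[Span, float] = {}
    back: Dict[Span, Tuple[Chain, Optional[int]]] = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Pick the best chain for this span; the empty chain scores 0.
            chain, chain_score = (), 0.0
            for c, s in node_scores.get((i, j), {}).items():
                if s > chain_score:
                    chain, chain_score = c, s
            # Pick the best split point for spans longer than one token.
            split, split_score = None, 0.0
            if length > 1:
                split = max(range(i + 1, j),
                            key=lambda k: best[(i, k)] + best[(k, j)])
                split_score = best[(i, split)] + best[(split, j)]
            best[(i, j)] = chain_score + split_score
            back[(i, j)] = (chain, split)
    # Follow back-pointers from the root to collect the non-empty chains.
    tree: Dict[Span, Chain] = {}
    stack = [(0, n)]
    while stack:
        i, j = stack.pop()
        chain, split = back[(i, j)]
        if chain:
            tree[(i, j)] = chain
        if split is not None:
            stack += [(i, split), (split, j)]
    return best[(0, n)], tree


# Hypothetical scores: spans (1, 3) and (2, 5) cross, so only one can be kept.
scores = {
    (0, 5): {("IN:GET DIRECTION",): 2.0},
    (2, 5): {("SL:DESTINATION", "IN:FIND EVENT"): 1.5},
    (1, 3): {("SL:OTHER",): 1.0},
    (2, 3): {("SL:OTHER",): -0.5},
}
score, tree = cky_decode(5, scores)
```

Because the chart only combines adjacent sub-spans, crossing spans can never both appear in the output, which is what makes the decoded mapping a valid tree even though each span was scored independently.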
Training. The tree score $s(T)$ turns out to be the same scoring function used in previous span-based constituency parsing models (Stern et al., 2017; Gaddy et al., 2018; Kitaev and Klein, 2018). To train the model, these previous works use a margin loss:

$$\max\Big(0,\ \max_{\hat{T}} \big[ s(\hat{T}) + \Delta(\hat{T}, T^*) \big] - s(T^*) \Big), \tag{4}$$

where $T^*$ is the gold tree, $\hat{T}$ is the predicted tree and $\Delta$ is a distance function. Computing the $\max_{\hat{T}}$ term requires running a cost-augmented CKY decoder, which is computationally expensive and difficult to parallelize, especially when we add edge scores as described in the next section.
Instead, we propose to directly maximize the log-likelihood in Equation 2 of the gold trees in the training data. Concretely, given a gold tree $T^*$, our loss function is the negative log-likelihood $-\log p(T^*)$, which decomposes into a cross-entropy loss for each span $(i, j)$ according to Equation 2. We train our model by directly minimizing these cross-entropy loss terms; in other words, we treat the model as a multiclass classification model where the input is the span $x_{i:j}$ and the classes are the possible chains for the span.
Our training method is faster than using margin loss since it does not require running the CPU-bound CKY decoder during training. Moreover, the cross-entropy loss terms can be computed in parallel for all spans of multiple examples at once, which leads to further speed-up.
Class weight for empty chains. In practice, the number of spans $(i, j)$ with gold chains $T^*[i, j] = \emptyset$ (i.e., no subtree covering the span) is large compared to the total number of spans. To avoid the class imbalance problem, we scale the cross-entropy loss terms for spans with $T^*[i, j] = \emptyset$ by a hyperparameter $\alpha < 1$.
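The per-span loss with the empty-chain weight can be sketched as follows. This is an illustrative sketch in plain Python rather than the batched tensor computation a real implementation would use; the function name `span_loss` and the dictionary layout of `chain_scores` are assumptions.

```python
import math
from typing import Dict, Tuple

Chain = Tuple[str, ...]  # the empty tuple () stands for the empty chain


def span_loss(chain_scores: Dict[Chain, float], gold: Chain, alpha: float) -> float:
    """Cross-entropy loss for one span.

    chain_scores holds node scores for the non-empty chains; the empty
    chain has its score fixed at 0. Spans whose gold chain is empty are
    down-weighted by alpha.
    """
    scores = [0.0] + list(chain_scores.values())
    # Numerically stable log-sum-exp over all chains, including the empty one.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    gold_score = chain_scores[gold] if gold else 0.0
    nll = log_z - gold_score
    return nll if gold else alpha * nll


# Hypothetical span with a single candidate chain scored 0.0.
loss_gold_chain = span_loss({("SL:DESTINATION",): 0.0}, ("SL:DESTINATION",), alpha=0.4)
loss_gold_empty = span_loss({("SL:DESTINATION",): 0.0}, (), alpha=0.4)
```

Each call depends only on the scores of one span, so the terms for all spans of a batch can be evaluated independently and summed.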

Edge scores
The model so far scores each span independently, which can be sub-optimal. For example, the prediction of the top intent only depends on the embedding of $x_{0:n}$, which is equivalent to standard intent classification or the first decision of a top-down tree generation. Similarly, ontology constraints (which intents can take which slots) are also not captured by the model.
To allow child nodes to influence the decision of the parent at inference time, we introduce edge scores. For a span $(i, j)$ with $T[i, j] = c = (l_1, \dots, l_k)$, consider the parent node of the topmost non-terminal $l_1$, and define $\pi[i, j]$ to be its label. We model the conditional distribution of $\pi[i, j]$ as a softmax over all possible labels:

$$p(\pi[i, j] = l \mid T[i, j] = c) = \frac{\exp f_e(x_{i:j}, c, l)}{\sum_{l'} \exp f_e(x_{i:j}, c, l')}, \tag{5}$$

where $f_e(x_{i:j}, c, l)$ is the edge score. Similar to node scores, we compute $f_e(x_{i:j}, c, l)$ by applying a feed-forward network on the concatenation of the embeddings of $x_{i:j}$ and $c$. For convenience of notation, we let $\pi[i, j] = \emptyset$ when the parent does not exist (i.e., when $T[i, j] = \emptyset$ or $(i, j) = (0, n)$). In those situations, we let $f_e(x_{i:j}, c, l)$ be 0 for $l = \emptyset$ and $-\infty$ otherwise.
We define the joint log-likelihood of $T$ and $\pi$:

$$\begin{aligned}
\log p(T, \pi) ={}& \sum_{i<j} f_n(x_{i:j}, T[i, j]) - \sum_{i<j} \log \sum_{c'} \exp f_n(x_{i:j}, c') \\
&+ \sum_{i<j} f_e(x_{i:j}, T[i, j], \pi[i, j]) - \sum_{i<j} \log \sum_{l'} \exp f_e(x_{i:j}, T[i, j], l').
\end{aligned} \tag{6}$$

Unlike in the original model, the last log-sum-exp term in Equation 6 does depend on the tree $T$, and thus has to be included in the tree score. Our new tree score is

$$s(T) = \sum_{i<j} \Big[ f_n(x_{i:j}, T[i, j]) + \log p(\pi[i, j] \mid T[i, j]) \Big], \tag{7}$$

where $p(\pi[i, j] \mid T[i, j])$ is the softmax over edge scores as defined in Equation 5. Note that we cannot replace node scores $f_n(x_{i:j}, T[i, j])$ with their log-softmax this way since we want dummy nodes to contribute a score of 0 to the tree score.
To train the model, we again directly minimize the negative log-likelihood of the gold tree $T^*$. The loss function factors into a cross-entropy loss term for each span $(i, j)$ and for each edge in $T^*$.
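As a rough illustration of how the tree score changes once edge scores are added, the sketch below derives the parent label of each non-empty span and sums node scores with edge log-probabilities. The function names, the callable interfaces `node_score` and `edge_scores`, and the toy scoring functions are all assumptions; the sketch also assumes the tree contains a root chain at span $(0, n)$ and only scores a given tree rather than decoding one.

```python
import math
from typing import Callable, Dict, Optional, Tuple

Span = Tuple[int, int]
Chain = Tuple[str, ...]


def parent_labels(tree: Dict[Span, Chain], n: int) -> Dict[Span, Optional[str]]:
    """Map each non-empty span to the label of the parent of its topmost node."""
    pi: Dict[Span, Optional[str]] = {}
    for (i, j) in tree:
        if (i, j) == (0, n):
            pi[(i, j)] = None  # the root has no parent
            continue
        enclosing = [(a, b) for (a, b) in tree
                     if a <= i and j <= b and (a, b) != (i, j)]
        a, b = min(enclosing, key=lambda s: s[1] - s[0])
        pi[(i, j)] = tree[(a, b)][-1]  # bottom label of the enclosing chain
    return pi


def tree_score_with_edges(
    tree: Dict[Span, Chain],
    n: int,
    node_score: Callable[[Span, Chain], float],
    edge_scores: Callable[[Span, Chain], Dict[str, float]],
) -> float:
    """Sum node scores plus edge log-probabilities over non-empty spans."""
    total = 0.0
    for span, parent in parent_labels(tree, n).items():
        chain = tree[span]
        total += node_score(span, chain)
        if parent is not None:
            scores = edge_scores(span, chain)
            m = max(scores.values())
            log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
            total += scores[parent] - log_z  # log-softmax of the true parent
    return total


# Toy, hypothetical scoring functions: constant node score, uniform edge scores.
toy_tree = {(0, 5): ("IN:GET DIRECTION",),
            (2, 5): ("SL:DESTINATION", "IN:FIND EVENT")}
toy_score = tree_score_with_edges(
    toy_tree, 5,
    node_score=lambda span, chain: 1.0,
    edge_scores=lambda span, chain: {"IN:GET DIRECTION": 0.0, "IN:FIND EVENT": 0.0},
)
```

Spans with the empty chain and the root span contribute no edge term, matching the convention that their only admissible parent label is $\emptyset$ with score 0.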
Discussion. The joint modeling of $T[i, j]$ and $\pi[i, j]$ is closely related to the parent annotation technique in syntactic parsing (Johnson, 1998; Klein and Manning, 2003; Petrov et al., 2006), where certain non-terminal labels are split into multiple labels based on their parents in the grammar rule (e.g., VP with parent S becomes VP^S). In our framework, we can view the label and edge scores (Equation 7) as a score over parent-annotated labels $(T[i, j], \pi[i, j])$:

$$s(T) = \sum_{i<j} f_a(x_{i:j}, T[i, j], \pi[i, j]), \tag{8}$$

where $f_a(x_{i:j}, \emptyset, \emptyset) = 0$. From our preliminary experiments, we found that modeling $f_a$ as a combination of label and edge scores, as in Equation 7, is empirically better than directly applying a softmax over possible pairs $(T[i, j], \pi[i, j])$.
One analysis in Gaddy et al. (2018) shows that the span embedding of $x_{i:j}$ is powerful enough to identify the parent label $\pi[i, j]$ with high accuracy. However, in TOP semantic trees, parent and child nodes can have span boundaries that are far apart (e.g., the top intent covering span $(0, 15)$ might have a child slot covering span $(7, 9)$). As an alternative to increasing the power of the span embedder to handle long-range relations, which could lead to overfitting, adding edge scores is a simpler way to model relations between labels.

Experiments
We evaluate our models on the TOP dataset of hierarchical intent-slot annotations for utterances in navigation and event domains. We use the dataset version with the noisy IN:UNSUPPORTED * intents excluded, which contains 28410 training, 4032 development, and 8241 test examples. The main evaluation metric is exact match accuracy: the fraction of predicted parses that exactly match the annotated parses. We also report the labeled bracket F1 score for the parse tree constituents (Black et al., 1991).
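The two metrics can be sketched over the span-to-chain representation. This is an illustrative sketch, not the official evaluation script: the function names are hypothetical, each labeled bracket is taken to be a (start, end, label) triple for a non-terminal node, and the F1 score is micro-averaged over the corpus, all of which are assumptions about conventions the text does not spell out.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]
Chain = Tuple[str, ...]
Tree = Dict[Span, Chain]


def brackets(tree: Tree) -> Set[Tuple[int, int, str]]:
    """Labeled brackets of a tree: one (start, end, label) per non-terminal."""
    return {(i, j, label) for (i, j), chain in tree.items() for label in chain}


def exact_match(preds: List[Tree], golds: List[Tree]) -> float:
    """Fraction of predicted trees identical to the gold trees."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def bracket_f1(preds: List[Tree], golds: List[Tree]) -> float:
    """Micro-averaged labeled bracket F1 over the corpus."""
    tp = n_pred = n_gold = 0
    for p, g in zip(preds, golds):
        bp, bg = brackets(p), brackets(g)
        tp += len(bp & bg)
        n_pred += len(bp)
        n_gold += len(bg)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction misses the nested intent.
gold_trees = [{(0, 5): ("IN:GET DIRECTION",),
               (2, 5): ("SL:DESTINATION", "IN:FIND EVENT")}]
pred_trees = [{(0, 5): ("IN:GET DIRECTION",),
               (2, 5): ("SL:DESTINATION",)}]
em = exact_match(pred_trees, gold_trees)
f1 = bracket_f1(pred_trees, gold_trees)
```

In this example the prediction recovers two of the three gold brackets with no spurious ones, so the tree counts as wrong under exact match while still earning partial credit under bracket F1.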
Training details. We highlight crucial modeling decisions of our models and defer other details to Appendix B.
• For a fair comparison, we use the same token and sequence embedders as the previous work we compare against: 200-dimensional GloVe embeddings and 2-layer 164-dimensional biLSTMs. We expect contextual embeddings (Peters et al., 2018; Devlin et al., 2018) and a transformer-based sequence embedder (Kitaev and Klein, 2018) to further improve the results, as observed in Einolghozati et al. (2018).
• The class weight α for empty chains is tuned on the development data: α = 0.4 for the basic model and α = 0.2 for the model with edge scores. We will later demonstrate how different choices of α affect the results.
Baselines. We compare our method to existing approaches proposed for this task: a shift-reduce parser based on Recurrent Neural Network Grammars (RNNG), and the best sequence-to-sequence model, which produces a linearization of the trees based on CNN utterance embeddings (Gehring et al., 2017). We also consider the span-based parsers by Stern et al. (2017) with the additional improvements from Gaddy et al. (2018). These parsers were designed for constituency parsing, are trained with a cost-augmented margin loss, and use either a bottom-up CKY decoder (Stern-chart) or a top-down greedy decoder (Stern-greedy).
Accuracy. Table 1 shows the exact match accuracy and bracket scores on the test data. The span-based model without edge scores is comparable to the baseline RNNG model. With edge scores, the span-based model outperforms the baselines in both exact match accuracy and labeled F1 scores.
Training speed. Our span-based models can be trained in a highly parallelized fashion without having to run the computationally expensive decoder. Table 1 shows the average wall-clock time used to train the model for one epoch over the training data.
Error analysis. We compare the errors made by the span-based parsers on the development set. Table 2 provides a breakdown of the error counts of the models on development data.
Recall trade-off. As described in Section 4, the hyperparameter α controls the weight of the loss terms for the class c = ∅ (i.e., not building a subtree for the span). As such, we can use α to trade off two types of errors: missing non-terminals (over-predicting c = ∅) and spurious non-terminals (under-predicting c = ∅). As shown in Figure 2, a higher α encourages the model to predict c = ∅ more frequently, leading to more missing non-terminals (intents and slots) and fewer spurious ones. We can also tune α based on the downstream tasks. For instance, getting a higher slot recall using a small α is arguably better for dialog systems since other downstream modules (e.g., an entity linker) can detect and discard spurious slots.
When α is low, one common type of error is that the model incorrectly splits a non-terminal. This usually happens when the gold slot consists of two sub-phrases that can be interpreted as slots on their own (e.g., a SL:CATEGORY EVENT slot "holiday concerts" is split into a SL:DATE TIME slot "holiday" and a SL:CATEGORY EVENT slot "concerts"). As the tree score is the sum of span scores, the parser is biased toward creating two non-terminal nodes instead of one. Fortunately, in the context of task-oriented dialogs, this type of error tends to have a small effect on the semantic interpretation, and sometimes even provides more useful information for downstream modules.

Conclusion
We presented the first span-based parser for parsing utterances into the hierarchical intent-slot representation. The log-likelihood objective allows us to train the model in a highly parallelized fashion without having to decode a tree, while edge scores explicitly capture the relationship between parent and child nodes even when their span boundaries are far apart.
Apart from standard accuracy improvement techniques such as better token embeddings and ensembling, possible future directions include a more fine-grained control of the recall trade-off, modeling the tokens outside the non-terminals instead of ignoring them, incorporating the parent's embedding in edge scores, and a more efficient or approximate decoder similar to the greedy decoder from Stern et al. (2017).