Conversational Semantic Parsing

The structured representation for semantic parsing in task-oriented assistant systems is geared towards simple understanding of one-turn queries. Due to the limitations of the representation, session-based properties such as co-reference resolution and context carryover are processed downstream in a pipelined system. In this paper, we propose a semantic representation for such task-oriented conversational systems which can represent concepts such as co-reference and context carryover, enabling comprehensive understanding of queries in a session. We release a new session-based, compositional task-oriented parsing dataset of 20k sessions consisting of 60k utterances. Unlike the Dialog State Tracking Challenges, the queries in the dataset have compositional forms. We propose a new family of Seq2Seq models for this session-based parsing, which achieve better or comparable performance to the current state of the art on ATIS, SNIPS, TOP and DSTC2. Notably, we improve the best known results on DSTC2 by up to 5 points for slot carryover.


Introduction
At the core of conversational assistants lies the semantic representation, which provides a structured description of the tasks supported by the assistant. Traditional dialog systems operate over a flat representation, usually composed of a single intent and a list of slots with non-overlapping content from the utterance (Bapna et al., 2017). Although flat representations are trivial to model with standard intent/slot tagging models, the semantic representation is fundamentally limiting. Prior work explored the limitations of flat representations and proposed a compositional generalization which allowed slots to contain nested intents, while allowing easy modeling through neural shift-reduce parsers such as RNNG (Dyer et al., 2016).
Our contributions are the following:
• We explore the limitations of this compositional form and propose an extension, which we call the decoupled representation, that overcomes them.
• To parse this more complicated representation, we propose a family of Seq2Seq models based on the Pointer-Generator architecture (See et al., 2017) that set a new state of the art on multiple semantic parsing and dialog tasks.
• To further advance session-based task-oriented semantic parsing, we release a publicly available dataset of 60k utterances constituting roughly 20k sessions.
An implicit constraint of the compositional form is that an in-order traversal of the leaves of the compositional semantic representation must reconstruct the utterance. Following this constraint, it is possible to use discriminative neural shift-reduce parsers such as RNNG to parse into this form (Dyer et al., 2016). Although at face value this constraint seems reasonable, it has non-trivial implications both for the semantic parsing component (NLU) and for downstream components in conversational assistants.

Surpassing Utterance Level Limitations with Decoupled Form
First, let us consider the space of utterances that can be covered by the compositional representation. One fundamental problem with the in-order constraint is that it disallows long-distance dependencies within the semantic representation. For example, the utterance On Monday, set an alarm for 8am. would optimally have a single date-time slot: [SL:DATETIME 8am on Monday]. But because 8am and on Monday are at opposite ends of the utterance, there is no way to construct a semantic parse tree with a single datetime slot. Prior work mentioned this problem, presenting empirical data showing that utterances with long-distance dependencies are rare in English. Although this may be true, having fundamental limitations on what types of utterances can be supported, even with a complete ontology, is concerning. In English, discontinuities are restricted in occurrence, despite emerging naturally within certain patterns, because English is a configurational language, which uses strongly marked word order to impart some level of semantic information (Chomsky, 1981). Beyond English, however, there are numerous world languages that are non-configurational and have much freer, or potentially completely free, word order. Non-configurational languages often present the same semantic information through the use of case markers, declensions, or other systems. The relatively free word order this allows places much less emphasis on the collocation of a semantic unit's tokens. Therefore, as conversational assistants progress toward multiple languages, it is important to consider that constraints that are acceptable when only English is considered will not scale analogously to other languages.
A simple solution is to convert a standard compositional intent-slot parse into a logical form containing two label types (slot and intent), with no constraints over intent spans. This is trivially accomplished by removing all text in the compositional semantic parse that does not appear in a leaf slot. We call this form the decoupled semantic representation, as it is no longer tightly coupled to the original utterance. Figure 1 shows a side-by-side example of the compositional and decoupled semantic representations for the utterance Please remind me to call John.
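As a rough sketch, the conversion above can be applied directly to a bracketed linearization of the compositional parse (ontology symbols such as [IN:CREATE_REMINDER and [SL:CONTACT, with ] closing the innermost label); the SL:METHOD slot name below is our illustrative guess, not necessarily from the released ontology:

```python
def to_decoupled(linearized: str) -> str:
    """Drop utterance tokens that are not inside a leaf slot.

    A plain token is kept only when the innermost open label is a
    slot ("[SL:..."); tokens sitting directly under an intent are
    removed, which yields the decoupled representation.
    """
    out, stack = [], []
    for tok in linearized.split():
        if tok.startswith("["):           # opening ontology symbol
            stack.append(tok)
            out.append(tok)
        elif tok == "]":                  # close innermost label
            stack.pop()
            out.append(tok)
        elif stack and stack[-1].startswith("[SL:"):
            out.append(tok)               # leaf-slot token: keep
        # else: token directly under an intent is dropped
    return " ".join(out)
```

Applied to a compositional parse of Please remind me to call John, this removes the carrier phrase tokens (Please, remind, to) while keeping the slot fillers me, call, and John.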

Session Based Limitations
Because traditional conversational systems have historically had a clear separation between utterance-level semantic parsing and dialog systems (which stitch together utterance-level information into sessions), semantic representations have not focused on session-based representations. Integrating session information into semantic parsers has been limited to refinement-based approaches. Figure 2 shows an example of refinement-based and informationally-complete approaches to semantic parsing. The refinement approach delegates the responsibility of session-based semantic parsing to a separate dialog component. Consequently, refinement approaches tend to have a very limited ontology, due to the semantic parser operating over a fixed input (non-session utterances).
Predicting which slot to use for refinement works for flat semantic representations, but it is non-trivial to extend to compositional or decoupled representations. The position of a slot in a flat semantic representation is not meaningful, so it is sufficient to predict only the slot without specifying its position in the parse. But in both the compositional and decoupled extensions to intent-slot parsing, the meaning varies with the position of the slot (or nested intent).
We present an example in Figure 3. Given the follow-up utterance remind me to call, a classical system would need to carry over the whole CONTACT slot, but the question is: to where? The semantic parse is not flat. The slot could be carried over to the CREATE_REMINDER intent or the nested GET_CONTACT intent. So, if we were to extend classical slot carryover, we would need to predict not only which slot to carry over from the conversation, but also which intent within the current semantic parse to place it under. We instead propose a new paradigm that performs classical semantic parsing jointly with co-reference resolution and slot carryover.

Session Based Semantic Parsing
We present a simple extension to the decoupled paradigm of intent-slot semantic parsing by introducing a new reference (REF) label type. The REF label type has two subtypes, representing co-references and slot carryover as separate operations. Co-references can be seen as explicit references, namely references conditioned on an explicit word, while slot carryover is treated as an implicit reference (conditioned on relevant contextual information).
As an example, refer to the sample session with decoupled semantic parses in Figure 4.


Model

Sequence-to-Sequence Architecture
The decoupled semantic parsing model is an extension of the very common sequence-to-sequence learning approach (Sutskever et al., 2014), with the source sequence being the utterance and the target sequence being a linearized version of the target tree. Trees are linearized by bracketing them, using the same approach as Vinyals et al. (2015); the decoupled tree in Fig. 1b, for example, is linearized into its bracketed string form. After tokenization, an encoder processes the source tokens w_i and produces corresponding encoder hidden states e_i:

e_1, ..., e_n = Encoder(w_1, ..., w_n),

where the encoder, in our experiments, is either a standard bidirectional LSTM or a transformer.
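The bracketed linearization can be sketched as follows, assuming trees are represented as (label, children) tuples with plain strings as leaves; this convention, and the SL:METHOD slot name, are ours for illustration:

```python
def linearize(node):
    """Bracketed linearization of a semantic tree.

    A leaf is a plain utterance token (str); an internal node is a
    (ontology_label, children) tuple. Each subtree becomes
    "[LABEL child ... ]", matching the target-sequence format.
    """
    if isinstance(node, str):
        return node
    label, children = node
    inner = " ".join(linearize(child) for child in children)
    return f"[{label} {inner} ]"
```

For the decoupled tree of Fig. 1b, this produces a flat token sequence mixing ontology symbols and the utterance tokens me, call, and John, which is exactly what the decoder is trained to emit.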
In spite of its drawbacks, the rigid structure of the compositional semantic trees (Fig. 1a) has the advantage of readily mapping to the RNNG formalism and its inductive biases. The decoupled semantic representation, being more flexible, does not have such an easily exploitable form, but we can still exploit whatever structure exists. The tokens of the linearized decoupled representation (the target sequence) can always be divided into two classes: utterance tokens that are already present in the source sequence, which form the leaves of the tree, and ontology symbols. Taking again the example tree of Fig. 1b, me, call, and John are all tokens from the utterance, while [IN:CREATE_REMINDER, [SL:PERSON_REMINDED, ], etc., are ontology symbols. This partition is reflected in the structure of the decoder: at every decoding step, the model can either generate an element from the ontology, or copy a token from the source sequence via a mechanism analogous to the pointer-generator network of See et al. (2017). At decoding time step t, the decoder is fed the encoder's outputs and produces a vector of features x_t, which is used to compute the ontology generation distribution p^g_t:

x_t = Decoder(e_1, ..., e_n; d_{t-1}; s_{t-1}),
p^g_t = Softmax(Linear_g[x_t]),

where d_{t-1} is the previous output of the decoder, s_{t-1} is the decoder's incremental state, and Linear_θ[x] is shorthand for an affine transformation with parameters θ, i.e. W_θ x + b_θ. The decoder's features are also used to calculate the attention distribution, using multi-head attention (Vaswani et al., 2017), which then serves to produce the utterance copy distribution p^c_t:

p^c_t, ω_t = MhAttention(e_1, ..., e_n; Linear_c[x_t]),

where MhAttention indicates multi-head attention, which returns, respectively, the attention distribution and its weights. Below, σ(x) = 1/(1 + e^{-x}) denotes the standard sigmoid function and ⊕ denotes concatenation.
Finally, the extended probability distribution over ontology symbols and source-utterance tokens is computed as a mixture of the ontology generation and utterance copy distributions, with a sigmoid gate predicted from the decoder features and attention weights:

g_t = σ(Linear_s[x_t ⊕ ω_t]),
p_t = [g_t · p^g_t ⊕ (1 - g_t) · p^c_t].
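A minimal numeric sketch of this generate-vs-copy mixture (pure Python; the logits stand in for the model's outputs, and the function and variable names are ours, not from any released code):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_distribution(gen_logits, copy_logits, gate_logit):
    """Pointer-generator style mixture of two distributions.

    gen_logits:  scores over ontology symbols  (-> p^g_t)
    copy_logits: attention scores over source tokens (-> p^c_t)
    gate_logit:  scalar deciding how much mass goes to generation
    Returns one distribution over [ontology symbols ++ source tokens].
    """
    p_gen = softmax(gen_logits)
    p_copy = softmax(copy_logits)
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid mixing weight
    return [gate * p for p in p_gen] + [(1.0 - gate) * p for p in p_copy]
```

With a gate logit of 0, probability mass is split evenly between generating an ontology symbol and copying a source token; in the model itself the gate is predicted from x_t and the attention weights ω_t.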

Encoder and Decoder
We experiment with two main variants of the decoupled model: one based on recurrent neural networks, and one based on the transformer architecture (Vaswani et al., 2017). RNN Our base model uses two distinct stacked bidirectional LSTMs as the encoder and stacked unidirectional LSTMs as the decoder. Both consist of two layers of size 512, with randomly initialized embeddings of size 300. The base model is optimized with LAMB, while the others are optimized with Adam, using parameters β_1 = 0.9, β_2 = 0.999, ε = 10^-8, and L2 penalty 10^-5 (Kingma and Ba, 2014). The learning rate is found separately for each experiment via hyperparameter search. We also use stochastic weight averaging (Izmailov et al., 2018) and exponential learning rate decay. For an extended version of this model, we also try incorporating contextualized word vectors by augmenting the input with ELMo embeddings (Peters et al., 2018).
Transformer We also experiment with two further variants of the model that replace the encoder and decoder with transformers. In the first variant, the encoder is initialized with RoBERTa (Liu et al., 2019), a pretrained language model. The decoder is a randomly initialized 3-layer transformer with hidden size 512 and 4 attention heads. In the second variant, we initialize both encoder and decoder with BART (Lewis et al., 2019), a sequence-to-sequence pretrained model. Both encoder and decoder consist of 12 layers with hidden size 1024. We train these with stochastic weight averaging (Izmailov et al., 2018) and determine optimal hyperparameters on the validation sets.

Session Based Task Oriented Parsing
To incentivize further research into session-based semantic parsing through the decoupled intent-slot paradigm, we are releasing 20 thousand annotated sessions in 4 domains: calling, weather, music, and reminder. We also allow for mixtures of domains within a session. The data was collected in two stages. First, we asked crowdsourced workers to write sessions (both from the user's perspective and the Assistant's output) tied to certain domains. Once we vetted the sessions, we asked a second group of annotators to annotate the user input per session. Each session was given to three separate annotators. We used majority voting to automatically resolve the correct parse when possible. In the cases where there was no agreement, we selected the most informative parse that abided by the label representation's semantic constraints. The annotator agreement rate was 55%, while our final chosen semantic parses were correct 94% of the time. The large delta between the two numbers is due to multiple correct semantic parses existing for the same session.
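The automatic part of this resolution step can be sketched as follows (the no-agreement fallback, selecting the most informative valid parse, was a manual judgment and is only signaled here):

```python
from collections import Counter

def resolve_parse(annotations):
    """Majority vote over the three independent annotations of a
    session. Returns the winning parse, or None when all three
    annotators disagree, in which case manual resolution is needed."""
    (parse, count), = Counter(annotations).most_common(1)
    return parse if count >= 2 else None
```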
We open-source SB-TOP at the following link: http://dl.fbaipublicfiles.com/sbtop/SBTOP.zip. More information about the dataset can be found in Table 4 in the Appendix.

Semantic Parsing
We evaluate the decoupled model on five semantic parsing datasets, three public and two internal. Two are annotated with compositional semantic representations and the others with the standard flat intent-slot representation. In order to apply the decoupled models to them, we follow a mechanical procedure to transform the annotations into decoupled representations: all utterance tokens which are not part of a leaf slot are stripped. This procedure effectively turns the tree of Fig. 1a into the tree of Fig. 1b. We note that this procedure is reversible for all compositional and flat intent-slot data available, so we can convert from decoupled back to the source representation.
The first public dataset is TOP, which consists of over 31k training utterances covering the navigation, events, and navigation-to-events domains. The first internal dataset we use contains over 170k training utterances annotated with flat representations, covering over 140 distinct intents from a variety of domains including weather, communication, music, and device control. The second internal dataset contains over 67k training utterances with fully hierarchical representations, and covers over 60 intents, all in the communication domain.
The second and third public datasets are the SNIPS Natural Language Understanding benchmark (SNIPS-NLU) and the Airline Travel Information Systems (ATIS) dataset (Hemphill et al., 1990). We follow the same procedure mentioned above to prepare the decoupled data for both of these datasets.
As can be seen from Table 1b, our proposed approach outperforms the previous state-of-the-art results on the ATIS and TOP semantic parsing tasks, and is comparable to the state of the art on SNIPS; the previous best results had been obtained with the Seq2SeqPtr model of Rongali et al. (2020). Comparing the decoupled model to RNNGs, we note that a single decoupled model, using either biLSTMs or transformers (with RoBERTa or BART pretraining), is able to outperform the RNNG. In fact, the decoupled model even outperforms an ensemble of seven RNNGs. The decoupled biLSTM extended with ELMo inputs is able to outperform the transformer model initialized with RoBERTa pretraining. However, the best performance is achieved by the transformer model with BART-large pretraining, with the decoupled model fine-tuned jointly on top of it. In order to understand how much of these gains are due to the semantic representation, we perform an ablation study by evaluating the biLSTM and RoBERTa-based models on TOP data using the standard logical form representation, and find a drop in frame accuracy of 0.32 and 0.55 respectively.
The TOP dataset contains on the order of 30k examples in its training set. In order to further tease out the differences between the biLSTM and transformer approaches, and to see how they compare when more training data is available, we also evaluate these models on our two larger internal datasets. Table 1c shows that the RoBERTa-based model does indeed benefit from the extra training data, outperforming the biLSTM-based model on the two datasets. In both cases, the decoupled model with BART pretraining achieves the top performance.
The same procedure was used for our SB-TOP dataset, the only difference being that we concatenated SB-TOP and TOP and jointly trained over both datasets. Table 2 shows the test results.

Slot carryover
To evaluate the ability of the decoupled models to work on session-based data, we evaluate them on a task which requires drawing information from multiple utterances. The DSTC2 dataset (Henderson et al., 2014) contains a number of dialogues annotated with dialogue state, slightly over 2k sessions in the training set. They involve users searching for restaurants by specifying constraints such as cuisine type and price range. Given that users will often take multiple turns to specify all constraints, determining the correct dialogue state requires the model to consider all past turns too. Consider the example of the two-turn DSTC2 session shown in Figure 5: the [SL:AREA south ] slot, introduced in the first turn, is said to carry over to the second turn, as it still applies to the dialogue state despite not being explicitly mentioned. To make previous utterances available to the model, we use a simple approach: all utterances are concatenated, with a separator token, and fed to the encoder.
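This input construction amounts to a one-line preprocessing step; the sketch below uses a hypothetical separator token, since the exact token is not specified here:

```python
SEP = "<sep>"  # hypothetical separator token

def build_session_input(turns):
    """Concatenate all turns of a session, separated by SEP, so the
    encoder can attend over the full dialogue history at once."""
    return f" {SEP} ".join(turns)
```

The resulting string is tokenized and fed to the encoder exactly like a single utterance, letting the copy mechanism point at tokens from any past turn.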
The decoupled models are evaluated on frame accuracy and slot carryover -the fraction of slots correctly carried over from one turn to the next. Carryover figures are split by slot distance: how many turns prior to the current one the slot under consideration first appeared. As shown in Table 3, the RoBERTa decoupled model outperforms the biLSTM model on frame accuracy, while the biLSTM model takes the lead in terms of raw slot carryover performance. BART outperforms both, achieving the best overall performance.
For informative purposes, we also include results from standard dialogue state tracking models. The results show that the decoupled models, despite not being specifically designed for the task of dialogue state tracking, compare favorably to other approaches in the literature. While our models outperform them on most metrics, it should be noted that they

Related Work
Traditional work on semantic parsing, either for the purposes of question answering or task-oriented request understanding, has focused on mapping utterances to logical form representations (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010; Liang, 2016; van Noord et al., 2018). Logical forms, while very expressive, are also complex. Highly trained annotators are required for the creation of training data, and as a result there is a lack of large-scale datasets that make use of these formalisms. Intent-slot representations, such as those used for the ATIS dataset (Price, 1990) or the datasets released as part of the DSTC challenges (Henderson et al., 2014; Rastogi et al., 2019), have less expressive power, but have the major advantage of being simple enough to enable the creation of large-scale datasets. More recent work introduces a hierarchical intent-slot representation and shows that it is expressive enough to capture the majority of user-generated queries in two domains.

Conclusions
We started this paper by exploring the limitations of compositional intent-slot representations for semantic parsing. Due to the constraints it imposes, the compositional representation cannot represent certain utterances with long-distance dependencies, and it is unsuitable for semantic parsing at the session (multi-utterance) level. To overcome these limitations, we propose an extension of this representation, the decoupled representation. We propose a family of sequence-to-sequence models based on the pointer-generator architecture, using both recurrent neural network and transformer architectures, and show that they achieve top performance on several semantic parsing tasks. Further, to advance session-based task-oriented semantic parsing, we release to the public a new dataset of roughly 20k sessions (over 60k utterances).