Identifying Domain Independent Update Intents in Task Based Dialogs

One important problem in task-based conversations is that of effectively updating the belief estimates of user-mentioned slot-value pairs. Given a user utterance, the intent of a slot-value pair is captured using dialog acts (DA) expressed in that utterance. However, in certain cases, DA’s fail to capture the actual update intent of the user. In this paper, we describe such cases and propose a new type of semantic class for user intents. This new type, Update Intents (UI), is directly related to the type of update a user intends to perform for a slot-value pair. We define five types of UI’s, which are independent of the domain of the conversation. We build a multi-class classification model using LSTM’s to identify the type of UI in user utterances in the Restaurant and Shopping domains. Experimental results show that our models achieve strong classification performance in terms of F-1 score.


Introduction
An important part of dialog management in dialog systems is to detect the type of update to be performed for a slot after every turn in order to keep track of the dialog state. (The dialog state reflects the user goals specified as slot-value pairs.) User dialog acts (Young, 2007) express the user's intents towards slots mentioned in the conversation. They are extracted in the spoken language understanding (SLU) module and are utilized by the downstream state tracking systems to update belief estimates (Williams et al., 2013; Lee and Stent, 2016; Henderson et al., 2014c). However, currently used dialog acts do not capture the update intended by the user in the following cases:
1. Implicit denials: User denials for slot-values are expressed using the "deny" and "negate" dialog acts. However, these acts only address explicit negations/denials such as "no" and "I do not want slot-value". A user may also express denial for a value implicitly. Consider utterances 8 and 9 in Table 1, where a user adds and removes people from a slot, PNAMES, which contains names of people going to an event. Current SLU systems would detect the "inform" dialog act in both utterances and, hence, would miss the (implicit denial) "remove" update.
2. Updates to numeric slots: Numeric slots are the slots whose values can be increased and decreased in addition to being set/replaced. Since dialog acts do not capture the "increase" and "decrease" intents towards a numeric value, such updates cannot be handled using dialog acts alone. For example, consider utterances 4, 5 and 6 in Table 1, where the value of a numeric slot, NGUEST (number of guests in an invite), is set, increased and decreased respectively. The dialog act expressed in these utterances is "inform", which does not convey the update type.
3. Preference for slot values: The "inform" dialog act specifies values for slots but does not take into account the preferences for any particular slot value(s). Consider utterances 1, 2 and 3 in Table 1, where the location slot (LOC) is referred to. In utterance 2, the user is equally interested in the three locations ("Ross", "Napa" and "San Jose"). However, in utterance 3 the user prefers "Gilroy" over other values and intends to replace the old values with "Gilroy". Clearly, the SLU output does not capture this change in the user intent.
* The work was done when the author was at Yahoo Research, Oath Inc.
We posit that identifying the above intents in user utterances as a part of SLU would improve estimation of user goals in task-based dialogs. To address these cases, we propose a new semantic class of slot-specific user intents called update intents (UI's). Table 2 defines the five UI's. We model the problem of identifying UI's as a multi-class classification. For a user utterance, we classify UI's for all the slot-values present in the utterance into one of the five classes. We treat an utterance as a sequence of tokens and slot-values, and perform sequence labeling using LSTM's for the classification. It should be noted that the focus of this work is on identifying the UI's in user utterances and not on investigating the mechanisms of using them for belief tracking, which is part of our larger goal.
UI's are generic in nature and independent of the dialog domain. Given a slot type (such as numeric), they can be applied to any slot of that type. This enables transfer learning across similar slots in different domains. To demonstrate this, we experiment with two domains (shopping and restaurants) and define three types of slots: 1. Numeric slots, 2. Conjunctive multi-value (CMV) slots, and 3. Disjunctive multi-value (DMV) slots (explained in Section 3.1.1). We then delexicalize slot-values in user utterances with the corresponding slot type (not slot name) and conduct cross-domain training and testing experiments. Experimental results demonstrate strong classification performance in individual domains as well as across domains.
Contributions: 1) We propose a new semantic class of slot-specific user intents (UI's) which are directly related to the update a user intends to perform for a slot. 2) The proposed UI's enable effective updates to slots. 3) Our models predict UI's with high accuracy. 4) We present a novel delexicalization approach which enables transfer learning of UI's across domains.

Related Work
Although we are not aware of any prior work on identifying update intents, our current work is related to dialog act identification and dialog state tracking. Here, we review works in these two areas and contrast them against ours.
Dialog act identification: Dialog acts (DA) in an utterance express the intention of their speaker/writer. Identifying DA types has been found to be useful in many natural language processing tasks such as question answering, summarization, and spoken language understanding (SLU) in dialog systems. A variety of DA's have been proposed for specific application tasks and domains, such as email conversations (Cohen et al., 2004), online forum discussions (Bhatia et al., 2012; Kim et al., 2010), and dialog systems (Young, 2007). The last of these is relevant to this work. In dialog systems, DA's are used to infer a user's intention towards either the slots or the conversation in general. Some of the DA's used in dialog systems are inform, confirm, deny, and negate. Previous works on DA identification in dialog systems have used a range of approaches, such as an n-gram based utterance-level SVM classifier (Mairesse et al., 2009), an SVM classifier built on weighted n-grams using word confusion networks incorporating ASR uncertainties and dialog context (Henderson et al., 2012), log-linear models (Chen et al., 2013), and recurrent neural networks (Hori et al., 2015, 2016; Ushio et al., 2016). This work is similar to DA identification in the sense that both the UI's and the DA's express certain semantics in the utterance and are independent of the dialog domain. However, there are important differences: 1) DA's mainly reflect the intent towards the conversation, whereas UI's convey the type of update a user wants to make to a particular slot. 2) DA's can be slot-independent (such as hello, negate, etc.) whereas UI's are always defined with respect to a slot.
Dialog State Tracking: Dialog state tracking (DST) entails updating the conversation state (also known as belief state) after every dialog turn. A conversation state is a probability distribution over competing user goals which are expressed in the form of slot-value pairs. For a user utterance, DST relies on SLU to get a list of k-best hypotheses of DA's and slot-value pairs expressed in the utterance. To update the belief state, DST approaches utilize DA's by using their SLU confidence scores as features (Ren et al., 2013; Kim and Banchs, 2014), encoding the DA's using n-gram vectors weighted by the SLU confidence scores (Henderson et al., 2014c; Mrkšić et al., 2015), and using rule-based updates (Lee and Stent, 2016). Recently, efforts have been made to bypass the SLU output and learn update mechanisms directly from the user utterance (Mrkšić et al., 2017). Though DA's are important for updating the belief state, as explained in Section 1, certain updates like implicit denials, numeric updates, and slot preferences are not handled by the DA's used in the dialog systems literature. UI's are proposed to address this problem. The work by Hakkani-Tür et al. (2012) on identifying action updates in a multi-domain dialog system is closely related to the current work. Some of their action updates are similar to UI's. However, unlike the current work, they did not deal with numeric updates and did not distinguish between types of multi-value slots (explained in Section 3.1.1).

Approach
In task-based dialogs, users complete a task by giving sequences of utterances in which they specify slot-values with corresponding intents. Dialog systems extract this information using dialog act detection and slot-filling as part of SLU. The most common and helpful intents for completing a task are setting a value for a slot and denying a particular value for a slot. Traditionally, these two intents are determined by the inform and deny dialog acts. However, as explained in Section 1, a user may not always set and deny a value explicitly. While denials can be implicit, relative preferences can also be provided for slot-value(s). In case of numeric slots, a user can set a value by incrementing or decrementing the previous values of slots. All these common scenarios are not handled by the inform and deny dialog acts.

Table 2: UI types and their definitions.
UI Type | Definition
Append | Append a specified value to the slot.
Remove | Remove a specified value from the slot.
IncreaseBy | Increase a value of a slot by a specified amount.
DecreaseBy | Decrease a value of a slot by a specified amount.
Replace | Replace the value of a slot by a specified value.

In this work, we propose a new set of slot-specific intents which are directly related to the type of update expressed towards the slot. We call these intents update intents or UI's. The UI's express five common types of updates:
1. Append: A user specifies a value or multiple values for a multi-value slot. This is equivalent to "appending" the specified value(s) to a multi-value slot. (Refer to Section 3.1.1 for the definition of multi-value and numeric slots.)
2. Remove: A user denies a value or multiple values for a multi-value slot implicitly or explicitly. This is equivalent to "removing" the specified value(s) from a multi-value slot.
3. Replace: A user specifies a preference for a slot value in case of multi-value slots. In case of numeric (single value) slots, this intent expresses setting and re-setting of a slot value (utterances 4 and 7 in Table 1). This UI is defined with respect to the slot-value for which the preference is expressed. For example, in the utterance "I would prefer San Jose over Gilroy" the UI for San Jose is replace, whereas for Gilroy it is remove. Note that in case of multi-value slots, replace cannot be decomposed into a combination of an "append" and a "remove" update when the "remove" intent is not specified. For example, in "I would prefer San Jose" there is no "remove" intent and, hence, simply using the "append" intent for San Jose would not extract the preference for San Jose.
4. IncreaseBy: A user specifies a value by which a particular numeric slot's value is to be increased.
5. DecreaseBy: A user specifies a value by which a particular numeric slot's value is to be decreased.
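To make the five definitions concrete, the following minimal sketch applies each UI to a slot state. It is an illustration only and not the belief-tracking mechanism, which this work leaves open: the function name `apply_update`, the list/integer state representation, and the deterministic semantics (for example, Replace on a multi-value slot keeping only the preferred value) are assumptions made for the example.

```python
def apply_update(state, intent, value):
    """Return the new slot state after one update intent (illustrative semantics).

    Assumption: multi-value slots are Python lists, numeric slots are ints.
    """
    if intent == "Append":
        # Add the value to a multi-value slot, avoiding duplicates.
        return list(state) if value in state else state + [value]
    if intent == "Remove":
        # Drop the value from a multi-value slot.
        return [v for v in state if v != value]
    if intent == "Replace":
        # Multi-value slot: keep only the preferred value.
        # Numeric slot: set (or re-set) the value.
        return [value] if isinstance(state, list) else value
    if intent == "IncreaseBy":
        return state + value
    if intent == "DecreaseBy":
        return state - value
    raise ValueError(f"unknown update intent: {intent}")


names = apply_update(["John"], "Append", "Mary")
names = apply_update(names, "Remove", "John")
guests = apply_update(apply_update(4, "IncreaseBy", 2), "DecreaseBy", 1)
locations = apply_update(["Ross", "Napa", "San Jose"], "Replace", "Gilroy")
```

Under these assumptions, the same "inform" utterance can lead to quite different slot states depending on the UI, which is the distinction the dialog acts alone do not carry.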

Modeling
Given a user utterance, the goal is to determine UI's for all the slot-values present in it. We formulate this task as a classification problem: given a user utterance and the mentioned slot-values, classify the update intents for all the slot-values into one of the five classes: Append, Remove, Replace, IncreaseBy and DecreaseBy.
We model the above problem as a sequence labeling task. We treat a user utterance as a sequence of words and slot-values. The labels for slot-values are the corresponding UI's whereas for words (which are not slot-values), we define a generic label "TOKEN". For model optimization and error computation, we do not consider the "TOKEN" labels. Figure 1 describes our model architecture. We used Bidirectional LSTMs (Graves and Schmidhuber, 2005) for sequence labeling. For input representation, we used GloVe word embeddings (Pennington et al., 2014). For regularization, we used dropout and early stopping. We give more details about model parameters in Section 5.
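The labeling scheme described above can be sketched as follows. This is a hedged illustration, not the authors' code: the helper `build_labels` and the use of 0/1 weights to exclude "TOKEN" positions from the loss are assumptions about one reasonable way to realize "we do not consider the TOKEN labels".

```python
UI_LABELS = ["Append", "Remove", "Replace", "IncreaseBy", "DecreaseBy"]
TOKEN = "TOKEN"  # generic label for words that are not slot-values


def build_labels(tokens, slot_intents):
    """Build one label and one loss weight per position.

    tokens: list of words, with slot-values already delexicalized.
    slot_intents: dict mapping a token index to its UI label (assumed input format).
    """
    labels, weights = [], []
    for i in range(len(tokens)):
        label = slot_intents.get(i, TOKEN)
        if label != TOKEN and label not in UI_LABELS:
            raise ValueError(f"unknown UI label: {label}")
        labels.append(label)
        # Positions labeled TOKEN get weight 0 so they do not contribute
        # to the training loss or to the reported error.
        weights.append(0.0 if label == TOKEN else 1.0)
    return labels, weights


tokens = ["add", "PNAMES", "and", "remove", "PNAMES"]
labels, weights = build_labels(tokens, {1: "Append", 4: "Remove"})
```

The per-position weights could then be used as a mask by whichever sequence model is trained on the labels.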

Learning Across Domains
In many cases, it is not possible to list all the values of a slot in the ontology. Even if the values are listed, it may not be practical to generate training data containing all the values if there are too many values for the slot. In such cases, it is beneficial to unlink the learning from particular slot-values and link it, instead, to the slot itself. This is because the word patterns used to refer to values of the same slot are similar and hence can be shared across the values. For example, a user would use similar word patterns to refer to values of slot "LOCATION". One way to do this is by replacing slot-values in utterances with the name of the slot. This is also called delexicalization and has been used successfully in many previous works (Henderson et al., 2014c; Mrkšić et al., 2015). In our model, we also delexicalize slot-values with the name of the slot as shown in Figure 1.
Delexicalization with slot names is helpful in generalizing to slot-values not seen in the training data in one domain. However, it cannot be used for cross-domain generalization as different domains may not share the same slots. To address this problem, we define three generic slot types depending upon the values (numeric/non-numeric) a slot can take and whether a slot can take multiple values simultaneously (list-based) or not:
1. Numeric slot: Slots whose values can be increased and decreased. NGUEST in Table 1 is a numeric slot. A numeric slot is a single value slot, i.e., "appending" and "removing" (multiple) values are not allowed for numeric slots. This slot supports IncreaseBy, DecreaseBy and Replace UI's.
2. Disjunctive multi-value (DMV) slots: These are slots which can take multiple values in disjunction. Slot LOC in Table 1 is a DMV slot.
3. Conjunctive multi-value (CMV) slots: These are list type slots which can take multiple values in conjunction. Examples are slots containing names of people going to an event, items in a shopping list, etc. Slot PNAMES in Table 1 is a CMV slot.
Both CMV and DMV slots support Append, Remove and Replace UI's.
Different domains may not share the same slots, but they often share slots of the same type. For example, the list of groceries in the shopping domain and the list of people in a dinner invite in the restaurant domain are of type CMV. Similarly, the number of guests in a dinner reservation and the number of items of a particular grocery are of type numeric. If we delexicalize slot-values with slot types, we can transfer learning for a slot type in one domain to the same slot type in another domain.
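The difference between delexicalizing with slot names and with slot types can be sketched as below. The type tags ("NUMERIC", "CMV", "DMV"), the `SLOT_TYPES` mapping, the `delexicalize` helper, and the single-token slot-values are assumptions made for illustration; only the slot names NGUEST, QTY, PNAMES and LOC come from the paper.

```python
# Assumed mapping from slot names to the three generic slot types.
SLOT_TYPES = {
    "NGUEST": "NUMERIC",  # restaurants: number of guests
    "QTY": "NUMERIC",     # shopping: quantity of an item
    "PNAMES": "CMV",      # restaurants: names of people in an invite
    "LOC": "DMV",         # restaurants: location
}


def delexicalize(tokens, annotations, use_slot_type=False):
    """Replace annotated slot-values with the slot name or the slot type.

    annotations: dict mapping a token index to a slot name (assumed format,
    single-token values only).
    """
    output = []
    for i, token in enumerate(tokens):
        slot = annotations.get(i)
        if slot is None:
            output.append(token)
        elif use_slot_type:
            output.append(SLOT_TYPES[slot])
        else:
            output.append(slot)
    return output


restaurant = delexicalize(["add", "2", "more", "guests"], {1: "NGUEST"}, use_slot_type=True)
shopping = delexicalize(["add", "2", "more", "apples"], {1: "QTY"}, use_slot_type=True)
by_name = delexicalize(["add", "2", "more", "guests"], {1: "NGUEST"})
```

With slot names, the two domains produce different placeholders (NGUEST vs. QTY); with slot types, both utterances share the pattern "add NUMERIC more", which is what makes the learned patterns transferable.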
There can be cases where there are different ways (word patterns) to specify updates to two slots even if they are of the same type, because of differences in the corresponding domains or some other reason. For example, let's say slots S1 and S2 are in different domains but share the same slot-type S, and we have training data for slot S1. S1 and S2 have similarities owing to their common slot-type but have certain differences in the ways users can express update intents for them. In such a case, to generate training data for S2, we would only need data capturing the differences between the two slots, because the examples with common features are already contained in S1's training data. Generating this additional data is easier than generating the full data for S2. The amount of additional data required will depend upon the degree to which the slots (S1 and S2) differ. When applied to a large number of slots and domains, this strategy would significantly reduce the time and effort that goes into data generation. To demonstrate this, we conduct training and testing experiments on two domains, restaurants and online shopping, and report results in Section 5.2.

Data Preparation
To evaluate our approach, we needed dialogs containing numeric, CMV, and DMV slots in the domain ontology along with the proposed update intents expressed in user utterances. Existing datasets with annotated dialog acts such as WOZ 2.0 (Wen et al., 2017), ATIS (Dahl et al., 1994), the Switchboard DA corpus, the Dialog State Tracking Challenge (DSTC) datasets (Henderson et al., 2014a; Williams et al., 2013; Henderson et al., 2014b) and the ICSI meeting recorder DA corpus (Shriberg et al., 2004) did not satisfy these requirements. The DSTC 2 and DSTC 3 datasets contained DMV slots but not CMV (list-based) slots or numeric slots. DSTC 4 (Kim et al., 2015), DSTC 5 (Kim et al., 2016) and DSTC 6 (Boureau et al., 2017) introduced a new set of speech acts which contains the "HOW MUCH" act for the numeric price range and time slots. However, the act only supports the Replace UI and not the IncreaseBy and DecreaseBy UI's. Moreover, the three datasets are not public. Therefore, we generated our own datasets.
We generated user utterances in two domains: restaurants and online shopping. In each domain, eight human editors generated user utterances independently of each other. The sets of editors were different across the two domains. Table 3 explains the slots used in the two domains. For each domain, we defined certain tasks, which are listed in Table 4. Editors wrote conversations to complete those tasks. Since this was not a real dialog system, editors were asked to assume appropriate bot responses to their requests, such as "Okay", "Added", "Removed", and "Done", during the conversation. Also, since the focus was on identifying update intents and not on the overall SLU (dialog act detection, slot-filling, etc.), we did not build our own custom slot-tagger and, instead, asked the editors to annotate the slot-values with the corresponding slot name in addition to the update intents. Here is a sample annotation for the task "restaurant reservation". For the shopping domain, 305 conversations with 1308 user utterances were generated. For the restaurant domain, 280 conversations with 1323 user utterances were generated. The distribution of utterances among editors is 96, 110, 212, 79, 176, 258, 211

Experimental Setting
We implement the architecture proposed in Section 3 using Keras (Chollet et al., 2015), a high-level neural networks API, with the TensorFlow (Abadi et al., 2015) backend. Training is done by mini-batch RMSProp (Hinton et al., 2012) with a fixed learning rate. In all our experiments, the mini-batch size is fixed to 64. Training and inference are done on a per-utterance level. The embedding layer in the model is initialized with 300-dimensional GloVe word vectors obtained from Common Crawl (Pennington et al., 2014). Embeddings for missing words are initialized randomly with values between −0.5 and 0.5.
Evaluation: Using a random split of train and test sets would put examples from the same editor in both sets, which would bias the performance estimate. Therefore, we split our data into eight folds corresponding to the eight editors, i.e., each fold contains examples from only one of the editors. To evaluate our models, we train and validate on the data from seven folds and test the performance on the held-out (eighth) fold. We run this experiment for each editor, i.e., eight times, and average results across the eight folds. For validation, we use 15% of the training data. We use precision, recall and F-1 score to report the performance of our classifiers. Overall classification performance metrics are computed by taking the weighted average of the metrics for individual classes. A class's weight is the ratio of the number of instances in it to the total number of instances.
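The evaluation protocol above can be sketched in plain Python. This is an assumed reconstruction for illustration: the function names and the `(editor_id, example)` input format are hypothetical, and the weighting follows the stated rule that a class's weight is its share of the gold instances.

```python
from collections import Counter


def leave_one_editor_out(examples):
    """Yield (held_out_editor, train, test), one fold per editor.

    examples: list of (editor_id, example) pairs (assumed format).
    """
    for held_out in sorted({editor for editor, _ in examples}):
        train = [x for editor, x in examples if editor != held_out]
        test = [x for editor, x in examples if editor == held_out]
        yield held_out, train, test


def per_class_scores(gold, pred, cls):
    """Precision, recall and F-1 for a single class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_scores(gold, pred):
    """Average per-class metrics, weighting each class by its gold frequency."""
    totals = [0.0, 0.0, 0.0]
    for cls, count in Counter(gold).items():
        weight = count / len(gold)
        for i, metric in enumerate(per_class_scores(gold, pred, cls)):
            totals[i] += weight * metric
    return tuple(totals)


gold = ["Append", "Append", "Remove", "Replace"]
pred = ["Append", "Remove", "Remove", "Replace"]
precision, recall, f1 = weighted_scores(gold, pred)
folds = list(leave_one_editor_out([(1, "a"), (2, "b"), (1, "c")]))
```

In the full protocol, the weighted scores from the eight folds would then be averaged, with 15% of each training portion set aside for validation.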
Parameter tuning: In each experiment, 15% of the current training set is utilized as a development set for hyper-parameter tuning, and the model with the best setting is applied to the test set to report the results. We tune the learning rate and dropout via grid search on the development set. In addition, we utilize early stopping to avoid over-fitting. The optimal hyper-parameter settings for our classification experiments (reported in Table 5) are dropout = 0.3, learning rate = 0.001 for the restaurants domain and dropout = 0.25, learning rate = 0.001 for the shopping domain.
Baseline: We used n-gram based multinomial logistic regression as a baseline. N-gram based models have been extensively used in text classification (Biyani et al., 2016, 2013, 2012). Such models have also been found to be effective as semantic tuple classifiers for dialog act detection and slot filling tasks (Chen et al., 2013; Henderson et al., 2012). Since there can be multiple slot-values and, hence, multiple UI's expressed in a user utterance, the entire utterance cannot be used to extract n-grams for all the expressed UI's. Therefore, we segment user utterances into relevant contexts for the slot-values and classify the contexts into one of the five UI classes. A context for a value is an ordered list of words which are indicative of the update to be performed for the value. We use two approaches for segmentation based on the k-word window approach: a) hard segmentation, b) soft segmentation. In the first approach, we assign the words around the value to its context based on the following constraints:
1. If an utterance contains only one value, the entire utterance is taken as the context for the value.
2. If there are n words (s.t. n < 2k) between two slot values, then preference is given to the right value. That is, k words are assigned to the context of the right value and n − k words are assigned to the context of the left value.
3. All the words to the left of the first value (in the utterance) are added to the value's context. Similarly, all the words to the right of the last value are added to its context.
In soft segmentation, we do not perform a hard assignment of the words between the two values to the context of one of the values. Instead, we encode each word into one of the following categories based on its position with respect to the value and on whether it lies between two values (and let the model learn weights for words in each category): 1) towards left of a value and between two values, 2) towards right of a value and between two values, 3) towards left of a value, 4) towards right of a value.
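The three hard-segmentation constraints can be sketched as follows. Several details are not specified in the text and are assumptions here: the value token itself is kept in its context, a gap shorter than k words goes entirely to the right value, and when the gap has 2k or more words each neighbouring value takes the k nearest words. The function name `hard_segment` is hypothetical.

```python
def hard_segment(tokens, value_positions, k=2):
    """Return one context (a list of tokens) per slot-value, in utterance order."""
    positions = sorted(value_positions)
    if len(positions) == 1:
        # Constraint 1: a single value takes the entire utterance.
        return [list(tokens)]
    contexts = []
    for idx, pos in enumerate(positions):
        if idx == 0:
            # Constraint 3: everything left of the first value belongs to it.
            start = 0
        else:
            gap = pos - positions[idx - 1] - 1
            # Constraint 2: the right value takes up to k words before it.
            start = pos - min(k, gap)
        if idx == len(positions) - 1:
            # Constraint 3: everything right of the last value belongs to it.
            end = len(tokens)
        else:
            gap = positions[idx + 1] - pos - 1
            # Constraint 2: the left value keeps the remaining n - k words;
            # for gaps of 2k or more words it takes k words (assumption).
            take = max(gap - k, 0) if gap < 2 * k else k
            end = pos + 1 + take
        contexts.append(tokens[start:end])
    return contexts


tokens = ["add", "PNAMES", "and", "then", "remove", "PNAMES", "please"]
contexts = hard_segment(tokens, [1, 5], k=2)
single = hard_segment(["add", "PNAMES"], [1], k=2)
```

In this example the three words between the two values are split so that "then remove" goes to the second value and "and" to the first, which keeps the cue word "remove" with the value it modifies.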
We extracted unigrams and bigrams from the context of slot-values. We experimented with different window sizes and k=2 gave the best results.

Table 5: Classification results on the two domains.

Results
In this section, we present the results of our classification and domain-independence experiments.

Classification Results
Table 5 presents the classification results on the two domains. The IncreaseBy and DecreaseBy classes perform very well. One of the reasons for this behavior could be that after delexicalization, for these two classes, there is only one slot (QTY in shopping and NGUEST in restaurants) for which the model learns the patterns. Other than these two classes, this slot is shared by the Replace class. Hence, given a delexicalized numeric slot-value, the model needs to discriminate between these three classes, whose relative distribution is much smoother than the overall distribution of the five classes. For the other delexicalized slot-values, the model discriminates between Append, Remove and Replace, where the majority class has a much higher number of examples than the minority Remove class. Hence, we see that the Append class performs significantly better than the Remove class.
We also compare our model with the two baselines explained in Section 5.1. Table 6 presents these results. We see that the proposed model significantly outperforms the two baselines. This shows that for UI classification, contextual information around a slot-value is captured much more effectively using sequence models than static classifiers. We also experimented with our model without delexicalization to verify the gains it brings. As can be seen, delexicalization does improve the performance in both domains.

Domain Independence Results
We conducted experiments to explore if learning of UI's in a domain can be used to predict UI's in a different domain. We use one of the domains as the "in-domain" (where learning is transferred to) and the other as the "out-domain" (where learning is transferred from). For this experiment, we set aside 20% of the in-domain data as the test set. At each step, we use 15% of the training data as the validation set. We explored two settings:
1. Combined-training: In this setting, we start by training our model on the entire out-domain data and then, incrementally, add a fraction (10%) of the in-domain data (left after taking out the test data) to the current training data and retrain the model (from scratch) on the combined data.
2. Pre-training: Here, we train a model on the out-domain data and fine-tune it with the in-domain data. At each step, we add a fraction (10%) of the in-domain data to the current training data and refit the pre-trained out-domain model on it by initializing the model weights to the weights of the model trained on the out-domain data.
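The data schedule behind the two settings can be sketched as below. This only constructs the training sets at each step and does not train anything; the function names are hypothetical, the in-domain list is assumed to already exclude the 20% test split, and using only the in-domain subset for fine-tuning in the pre-training setting is one reading of the description rather than a confirmed detail.

```python
def in_domain_subsets(in_domain_train, step_percent=10):
    """Yield (percent, subset) for 0%, 10%, ..., 100% of the in-domain data."""
    n = len(in_domain_train)
    for percent in range(0, 101, step_percent):
        cutoff = n * percent // 100
        yield percent, in_domain_train[:cutoff]


def combined_training_sets(out_domain, in_domain_train, step_percent=10):
    """Combined-training: retrain from scratch on out-domain plus in-domain subset."""
    for percent, subset in in_domain_subsets(in_domain_train, step_percent):
        yield percent, out_domain + subset


def fine_tuning_sets(in_domain_train, step_percent=10):
    """Pre-training: fine-tune the out-domain model on the in-domain subset.

    Assumption: each step restarts from the out-domain weights and uses only
    in-domain examples; the 0% step is the out-domain model itself.
    """
    for percent, subset in in_domain_subsets(in_domain_train, step_percent):
        if percent > 0:
            yield percent, subset


out_domain = ["o1", "o2"]
in_domain = list(range(10))
combined = list(combined_training_sets(out_domain, in_domain))
fine_tune = list(fine_tuning_sets(in_domain))
```

At each step, 15% of the resulting training set would additionally be held out for validation.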

[Figure: x-axis is the percentage of in-domain data, from 0% to 100%.]
In Figure 2, the model trained only on the out-domain (restaurant) data achieves an F-1 score of more than 80% on the in-domain test set. As we add more in-domain data, the F-1 score increases. With only 30% of the in-domain data, we get an 89% F-1 score. Also, we see that pre-training and combined-training have similar performances.
In Figure 3, the out-domain model achieves a much lower F-1 score on the in-domain data. This shows that the transfer is not symmetric. This could be due to the PNAMES slot, which has no similar slots in the shopping domain. There is also a difference between the performance curves of pre-training and combined-training. This indicates that fine-tuning a pre-trained model is harder than combined training when patterns are not covered by the out-domain data.

Conclusions and Future Work
We proposed a new type of slot-specific user intents, update intents (UI's), that are directly related to the type of update a user intends for a slot. The UI's address user intents containing implicit denials, numeric updates and preferences for slot-values, which are not handled by the currently used dialog acts. We presented a sequence labeling model for classifying UI's. We also proposed a method to transfer learning of UI's across domains by delexicalizing slot-values with their slot types. For that, we defined three generic slot types. Experimental results showed strong performance for UI classification and promising results for the domain independence experiments. In the future, we plan to explore mechanisms to utilize the UI's in belief tracking.