Memory Consolidation for Contextual Spoken Language Understanding with Dialogue Logistic Inference

Dialogue contexts are proven helpful in the spoken language understanding (SLU) system and they are typically encoded with explicit memory representations. However, most of the previous models learn the context memory with only one objective to maximizing the SLU performance, leaving the context memory under-exploited. In this paper, we propose a new dialogue logistic inference (DLI) task to consolidate the context memory jointly with SLU in the multi-task framework. DLI is defined as sorting a shuffled dialogue session into its original logical order and shares the same memory encoder and retrieval mechanism as the SLU model. Our experimental results show that various popular contextual SLU models can benefit from our approach, and improvements are quite impressive, especially in slot filling.


Introduction
Spoken language understanding (SLU) is a key technique in today's conversational systems such as Apple Siri, Amazon Alexa, and Microsoft Cortana. A typical pipeline of SLU includes domain classification, intent detection, and slot filling (Tur and De Mori, 2011), to parse user utterances into semantic frames. Example semantic frames  are shown in Figure 1 for a restaurant reservation.
Traditionally, domain classification and intent detection are treated as classification tasks with popular classifiers such as support vector machine and deep neural network (Haffner et al., 2003;Sarikaya et al., 2011). They can also be combined into one task if there are not many intents of each domain (Bai et al., 2018). Slot filling task is usually treated as a sequence labeling task. Popular approaches for slot filling include conditional random fields (CRF) and recurrent neural network (RNN) (Raymond and Riccardi, 2007; B-time  Yao et al., 2014). Considering that pipeline approaches usually suffer from error propagation, the joint model for slot filling and intent detection has been proposed to improve sentence-level semantics via mutual enhancement between two tasks (Xu and Sarikaya, 2013;Hakkani-Tür et al., 2016;Zhang and Wang, 2016;Goo et al., 2018), which is a direction we follow. To create a more effective SLU system, the contextual information has been shown useful (Bhargava et al., 2013;Xu and Sarikaya, 2014), as natural language utterances are often ambiguous. For example, the number 6 of utterance u 2 in Figure 1 may refer to either B-time or B-people without considering the context. Popular contextual SLU models Bapna et al., 2017) exploit the dialogue history with the memory network (Weston et al., 2014), which covers all three main stages of memory process: encoding (write), storage (save) and retrieval (read) (Baddeley, 1976).
With such a memory mechanism, SLU model can retrieve context knowledge to reduce the ambiguity of the current utterance, contributing to a stronger SLU model. However, the memory consolidation, a well-recognized operation for main-

Sentence Encoder
Current Utterance taining and updating memory in cognitive psychology (Sternberg and Sternberg, 2016), is underestimated in previous models. They update memory with only one objective to maximizing the SLU performance, leaving the context memory under-exploited.
In this paper, we propose a multi-task learning approach for multi-turn SLU by consolidating context memory with an additional task: dialogue logistic inference (DLI), defined as sorting a shuffled dialogue session into its original logical order. DLI can be trained with contextual SLU jointly if utterances are sorted one by one: selecting the right utterance from remaining candidates based on previously sorted context. In other words, given a response and its context, the DLI task requires our model to infer whether the response is the right one that matches the dialogue context, similar to the next sentence prediction task (Logeswaran and Lee, 2018). We conduct our experiments on the public multi-turn dialogue dataset KVRET (Eric and Manning, 2017), with two popular memory based contextual SLU models. According to our experimental results, noticeable improvements are observed, especially on slot filling.

Model Architecture
This section first explains the memory mechanism for contextual SLU, including memory encoding and memory retrieval. Then we introduce the SLU tagger with context knowledge, the definition of DLI and how to optimize the SLU and DLI jointly.

Memory Encoding
To represent and store dialogue history (Chung et al., 2014) layer and then encode the current utterance x k+1 into sentence embedding c with another BiGRU:

Memory Retrieval
Memory retrieval refers to formulating contextual knowledge of the user's current utterance x k+1 by recalling dialogue history. There are two popular memory retrieval methods: The attention based ) method first calculates the attention distribution of c over memories M by taking the inner product followed by a softmax function. Then the context can be represented with a weighted sum over M by the attention distribution: where p i is the attention weight of m i . In Chen et al., they sum m ws with utterance embedding c, then multiplied with a weight matrix W o to generate an output knowledge encoding vector h: The sequential encoder based (Bapna et al., 2017) method shows another way to calculate h: where the function FF() is a fully connected forward layer.
where O where U and V are weight matrices of output layers and t is the index of each word in utterance x k+1 .

Dialogue Logistic Inference
As described above, the memory mechanism holds the key to contextual SLU. However, context memory learned only with SLU objective is underexploited. Thus, we design a dialogue logistic inference (DLI) task that can consolidate the context memory by sharing encoding and retrieval components with SLU. DLI is introduced below: Given a dialogue session X = {x 1 , x 2 , ...x n }, where x i is the ith sentence in this conversation, we can shuffle X into a random order set X . It is not hard for human to restore X to X by determining which is the first sentence then the second and so on. This is the basic idea of DLI: choosing the right response given a context and all candidates. For each integer j in range k + 1 to n, training data of DLI can be labelled automatically by: where k+1 is the index of the current utterance. In this work, we calculate the above probability with a 2-dimension softmax layer: where W d is a weight matrix for dimension transformation.

Joint Optimization
As we depict in Figure 2, we train DLI and SLU jointly in order to benefit the memory encoder and memory retrieval components. Loss functions of SLU and DLI are as follows.
L SLU = log(p(y I |x 1 , ..., x k+1 )) + t log(p(y S t |x 1 , ..., x k+1 )) (11) where x j is a candidate of the current response, y I , y S t and y D are training targets of intent, slot and DLI respectively. Finally, the overall multitask loss function is formulated as where λ is a hyper parameter.

Experiments
In this section, we first introduce datasets we used, then present our experimental setup and results on these datasets.

Datasets
KVRET (Eric and Manning, 2017) is a multi-turn task-oriented dialogue dataset for an in-car assistant. This dataset was collected with the Wizardof-Oz scheme (Wen et al., 2017)

Experimental Setup
We conduct extensive experiments on intent detection and slot filling with datasets described above. The domain classification is skipped because intents and domains are the same for KVRET. For training model, our training batch size is 64, and we train all models with Adam optimizer with default parameters (Kingma and Ba, 2014). For each model, we conduct training up to 30 epochs with five epochs' early stop on validation loss. The word embedding size is 100, and the hidden size of all RNN layer is 64. The λ is set to be 0.3. The dropout rate is set to be 0.3 to avoid over-fitting.

Results
The following methods are investigated and their results are shown in Table 2: NoMem: A single-turn SLU model without memory mechanism. MemNet: The model described in Chen et al. , with attention based memory retrieval. SDEN: The model described in Bapna et al. , with sequential encoder based memory retrieval.
SDEN † : Similar with SDEN, but the usage of h is modified with Eq.6.
As we can see from Table 2, all contextual SLU models with memory mechanism can benefit from our dialogue logistic dependent multi-task framework, especially on the slot filling task. We also note that the improvement on intent detection is trivial, as single turn information has already trained satisfying intent classifiers according to results of NoMem in Table 2. Thus, we mainly analyze DLI's impact on slot filling task and the prime metric is the F1 score.
In Table 2, the poorest contextual model is the SDEN, as its usage of the vector h is too weak: simply initializes the BiLSTM tagger's hidden state with h, while other models concatenate h with BiLSTM's input during each time step. The more the contextual model is dependent on h, the more obvious the improvement of the DLI task is. Comparing the performance of MemNet with SDEN † on these two datasets, we can find that our SDEN † is stronger than MemNet after the dialogue length increased. Finally, we can see that improvements on KVRET* are higher than KVRET. This is because retrieving context knowledge from long-distance memory is challenging and our proposed DLI can help to consolidate the context memory and improve memory retrieval ability significantly in such a situation.
We further analyze the training process of SDEN † on KVRET* to figure out what happens to our model with DLI training, which is shown in Figure 3(a). We can see that the validation loss of SDEN † + DLI falls quickly and its slot F1 score is relatively higher than the model without DLI training, indicating the potential of our proposed method.
To present the influence of hyper-parameter λ, we show SLU results with λ ranging from 0.1 to 0.9 in Figure 3(b). In this figure, we find that the improvements of our proposed method are relatively steady when λ is less than 0.8, and 0.3 is the best one. When λ is higher than 0.8, our model tends to pay much attention to the DLI task, overlook detail information within sentences, leading the SLU model to perform better on the intent detection but failing in slot filling.

Conclusions
In this work, we propose a novel dialogue logistic inference task for contextual SLU, with which memory encoding and retrieval components can be consolidated and further enhances the SLU model through multi-task learning. This DLI task needs no extra labeled data and consumes no extra inference time. Experiments on two datasets show that various contextual SLU model can benefit from our proposed method and improvements are quite impressive, especially on the slot filling task. Also, DLI is robust to different loss weight during multi-task training. In future work, we would like to explore more memory consolidation approaches for SLU and other memory related tasks.