How Time Matters: Learning Time-Decay Attention for Contextual Spoken Language Understanding in Dialogues

Spoken language understanding (SLU) is an essential component in conversational systems. Most SLU components treat each utterance independently, and the following components then aggregate the multi-turn information in separate phases. To avoid error propagation and effectively utilize contexts, prior work leveraged history for contextual SLU. However, most previous models only paid attention to the related content in history utterances, ignoring their temporal information. In dialogues, it is intuitive that the most recent utterances are more important than the least recent ones; in other words, time-aware attention should decay with distance. Therefore, this paper designs and investigates various types of time-decay attention at the sentence level and speaker level, and further proposes a flexible universal time-decay attention mechanism. Experiments on the benchmark Dialogue State Tracking Challenge (DSTC4) dataset show that the proposed time-decay attention mechanisms significantly improve the state-of-the-art contextual understanding performance.


Introduction
Spoken dialogue systems that can help users solve complex tasks, such as booking a movie ticket, have become an emerging research topic in artificial intelligence and natural language processing. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily via natural language interactions. Today, there are several virtual intelligent assistants, such as Apple's Siri, Google's Home, Microsoft's Cortana, and Amazon's Echo. Recent advances in deep learning have inspired many applications of neural models to dialogue systems (Wen et al., 2017; Bordes et al., 2017; Dhingra et al., 2017). A key component of a dialogue system is a spoken language understanding (SLU) module, which parses user utterances into semantic frames that capture the core meaning (Tur and De Mori, 2011). A typical SLU pipeline first decides the domain given the input utterance and, based on the domain, predicts the intent and fills the associated slots corresponding to a domain-specific semantic template, where each utterance is treated independently (Hakkani-Tür et al., 2016; Chen et al., 2016b,a). To overcome error propagation and further improve understanding performance, contextual information has been shown useful (Bhargava et al., 2013; Xu and Sarikaya, 2014; Sun et al., 2016). Prior work incorporated the dialogue history into recurrent neural networks (RNNs) for improving domain classification, intent prediction, and slot filling (Xu and Sarikaya, 2014; Shi et al., 2015; Chen et al., 2016c). Recently, Zhang et al. (2018) demonstrated that modeling speaker role information can capture the notable variance in speaking habits during conversations in order to benefit understanding.
In addition, neural models incorporating attention mechanisms have had great success in machine translation (Bahdanau et al., 2014), image captioning (Xu et al., 2015), and various other tasks. Attentional models have been successful because they separate two different concerns: 1) deciding which input contexts are most relevant to the output and 2) actually predicting an output given the most relevant inputs. For example, the highlighted current utterance from the tourist in the conversation of Figure 1, "uh on august", responds to a question about WHEN, and the content-aware contexts that can help current understanding are the first two utterances from the guide: "and you were saying that you wanted to come to singapore" and "uh maybe can i have a little bit more details like uh when will you be coming". Previous work proposed an end-to-end time-aware attention network that leverages both contextual and temporal information for spoken language understanding and achieved significant improvement, showing that temporal information can guide the attention effectively. However, that time-aware attention function is an inflexible hand-crafted setting: a fixed function of time for assessing the attention.

[Figure 1: A dialogue excerpt. Guide: "and you were saying that you wanted to come to singapore" / "uh maybe can i have a little bit more details like uh when will you be coming" / "and like who will you be coming with". Tourist: "uh yes" / "um i'm actually planning to visit" / "uh on august".]
This paper focuses on investigating various flexible time-aware attention mechanisms in neural models with contextual information and speaker role modeling for language understanding. The contributions are three-fold:
• This paper investigates different time-aware attention mechanisms and provides guidance for future research on designing time-aware attention functions.
• This paper proposes an end-to-end learnable universal time-decay mechanism with great flexibility for modeling temporal information in diverse dialogue contexts.
• The proposed model achieves state-of-the-art understanding performance on the benchmark DSTC dataset.

The Proposed Framework
The model architecture is illustrated in Figure 2. First, the previous utterances are fed into the contextual model and encoded into a history summary, and then the summary vector and the current utterance are integrated for helping understanding. The contextual model leverages the attention mechanisms highlighted in red, which implement different attention functions at the sentence and speaker role levels. The whole model is trained in an end-to-end fashion, where the history summary vector and the attention weights are automatically learned based on the downstream SLU task. The objective of the proposed model is to optimize the conditional probability of the intents given the current utterance, p(ŷ | x), by minimizing the cross-entropy loss.

Speaker Role Contextual Language Understanding
Given the current utterance x = {w_t}_{t=1}^{T}, the goal is to predict the user intents of x, which include the speech acts and associated attributes. We apply a bidirectional long short-term memory (BLSTM) model (Schuster and Paliwal, 1997) to history encoding in order to learn the probability distribution of the user intents.
Here, W_his is a weight matrix, v_his is the history summary vector, v_cur is the context-aware vector of the current utterance encoded by the BLSTM, and o is the intent distribution. Note that this is a multi-label, multi-class classification, so the sigmoid function is employed to model the distribution after a dense layer. The user intent labels are decided based on whether each value is higher than a threshold tuned on the development set.
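To make the prediction step concrete, the following minimal sketch (not the authors' code; the weight layout and the 0.5 default threshold are illustrative assumptions) shows a dense layer over the concatenated history summary and current-utterance vectors, followed by element-wise sigmoids for multi-label intent prediction:

```python
import math

def sigmoid(z):
    """Logistic function applied to a single score."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_intents(v_his, v_cur, weight_rows, bias, threshold=0.5):
    """Multi-label intent prediction: a dense layer over the concatenated
    history summary (v_his) and current-utterance vector (v_cur), with one
    sigmoid score per intent label; labels whose score exceeds the tuned
    threshold are predicted."""
    x = list(v_his) + list(v_cur)  # concatenate the two vectors
    scores = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(weight_rows, bias)]
    labels = [i for i, s in enumerate(scores) if s > threshold]
    return labels, scores
```

Because a single utterance may carry several speech acts at once, independent sigmoids (rather than a softmax) match the multi-label formulation above.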
Considering that speaker role information has been shown useful for better understanding in complex dialogues, we follow the prior work in utilizing the contexts from two roles to learn history summary representations, v_his in (1), in order to leverage role-specific contextual information. Each role-dependent recurrent unit BLSTM_role_i receives corresponding inputs, x_{t,role_i}, which include the multiple utterances u_i (i = [1, ..., t − 1]) preceding the current utterance u_t from the specific role, role_i, processed by an encoder model, where x_{t,role_i} are one-hot encoded vectors that represent the annotated intent and attribute features. Note that this model requires the ground-truth annotations of history utterances for training and testing. Therefore, each role-based contextual module focuses on modeling role-dependent goals and speaking style, and v_cur from (1) would contain role-based contextual information.

Neural Attention Mechanism
One of the earliest works applying a memory component to language processing is memory networks (Sukhbaatar et al., 2015), which encode mentioned facts into vectors and store them in a memory for question answering. The idea is to encode important knowledge and store it in memory for future usage with attention mechanisms. Attention mechanisms allow neural network models to selectively pay attention to specific parts of the input. Various tasks have shown the effectiveness of attention mechanisms (Xiong et al., 2016; Chen et al., 2016c). Recent work showed that two attention types (content-aware and time-aware) and two attention levels (sentence-level and role-level) significantly improve the understanding performance for complex dialogues. This paper focuses on expanding the time-aware attention through an investigation of different time-decay functions, and further on learning a universal time-decay function automatically. We apply the time-aware attention mechanisms at two levels, sentence-level and role-level structures, and Section 3 details the design and analysis of time-aware attention.
For the sentence-level attention, before feeding into the contextual module, each history vector is weighted by its time-aware attention α_{u_j}, replacing (3). For the role-level attention, a dialogue is disassembled from a different perspective: which speaker's information is more important. The role-level attention decides how much to attend to each speaker role's contexts (v_{his,role}) in order to better understand the current utterance. The importance of a speaker given the contexts is approximated by the maximum attention value among the speaker's utterances, α_role = max_j α_{u_j}, where u_j ranges over all contextual utterances from that speaker. With the role-level attention, the sentence-level history from (3) can be rewritten to combine role-dependent history vectors with their attention weights.
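The max-over-utterances rule above can be sketched directly (the function and variable names are ours, not from the paper):

```python
def role_level_attention(sentence_alphas, speaker_of):
    """Approximate each speaker role's importance by the maximum
    sentence-level attention weight among that speaker's utterances:
    alpha_role = max_j alpha_{u_j}."""
    alpha_role = {}
    for j, alpha in enumerate(sentence_alphas):
        role = speaker_of[j]
        alpha_role[role] = max(alpha_role.get(role, float("-inf")), alpha)
    return alpha_role
```

For example, if the tourist's single context utterance carries more weight than any of the guide's, the tourist role receives the larger role-level weight.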

End-to-End Training
The objective is to optimize SLU performance, predicting the multiple speech acts and attributes described in Section 2.1. In the proposed model, all encoders, prediction models, and attention weights can be automatically learned in an end-to-end manner.

Time-Decay Attention Learning
The decay-function curves can be separated into three types: convex, linear, and concave, illustrated in the top-right part of Figure 2; each type of time-decay function expresses a different time-aware perspective on dialogue contexts. Note that all attention weights are normalized such that they sum to 1.

Convex Time-Decay Attention
In a 2D Cartesian coordinate system (x, y), a convex curve f(x), also known as "concave upward", is one whose slope f′(x) increases as x grows. Intuitively, recent utterances contain more salient information, and the salience decreases very quickly as the distance increases; therefore we introduce a time-aware attention mechanism that computes attention weights explicitly according to the time of utterance occurrence. We first define the time difference between the current utterance and a preceding sentence u_i as d(u_i), and then simply use its reciprocal to formulate a convex time-decay function, in which a and b are scalar parameters. The increasing slopes of the decay curve assert that the importance of utterances should attenuate rapidly, and the importance of an earlier history sentence would be considerably compressed. Note that Chen et al. used a fixed convex time-decay function (a = 1, b = 1).
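As a sketch, the convex curve can be written as alpha(d) = 1 / (a * d**b); the exact placement of a and b around the reciprocal is our assumption (the text only states that a and b are scalar parameters), and a = b = 1 recovers the fixed 1/d used in prior work:

```python
def convex_decay(d, a=1.0, b=1.0):
    """Convex time-decay: the reciprocal of the scaled, powered distance.
    Assumed form alpha(d) = 1 / (a * d**b); with a = b = 1 this reduces
    to the fixed 1/d curve. d is the distance d(u_i) >= 1."""
    return 1.0 / (a * d ** b)
```

With b > 1 the decay steepens, so earlier utterances are compressed even more aggressively.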

Linear Time-Decay Attention
A linearly decaying time-aware attention function should also be taken into consideration. In a 2D Cartesian coordinate system (x, y), the slope of a linear function remains constant as x changes. That is, the importance of preceding utterances declines linearly as the distance between the previous utterance and the target utterance becomes larger.
In the linear function, e is the slope and f is the α-intercept. Note that when the distance d(u_i) is larger than −f/e, we assign the attention value 0.
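A minimal sketch of the clipped linear curve (the default e = −0.125 and f = 1 follow the hand-crafted setting reported in the experiments):

```python
def linear_decay(d, e=-0.125, f=1.0):
    """Linear time-decay: slope e and alpha-intercept f, clipped to zero
    once the distance exceeds -f/e (with the defaults, -1.0/-0.125 = 8,
    so utterances more than 8 turns back receive no attention)."""
    return max(e * d + f, 0.0)
```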

Concave Time-Decay Attention
In contrast to a convex curve, a concave curve f(x), also called "concave downward", is one whose slope f′(x) decreases as x grows in a 2D Cartesian coordinate system (x, y). Intuitively, the attention weight decreases relatively slowly as the distance increases. To implement this idea, we design a Butterworth-filter-like low-distance-pass filter (Butterworth, 1930) that resembles a concave time-decay function at the beginning of the curve.
Here D_0 is the cut-off distance and n is the order of the filter. The decreasing slopes of the decay curve assert that the importance of utterances should weaken gradually, while the importance of an earlier history sentence would still be considerably compressed. Moreover, this curve is more likely to preserve the information in the multiple recent utterances instead of focusing only on the most recent one.
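A sketch of the filter-shaped curve; the squared-magnitude form 1 / (1 + (d/D_0)^(2n)) is our assumption based on the standard Butterworth response, with D_0 = 5 and n = 3 as in the hand-crafted setting:

```python
def concave_decay(d, d0=5.0, n=3):
    """Concave time-decay modeled after a Butterworth low-pass response:
    nearly flat while d << d0 (the cut-off distance), then attenuating
    quickly; n is the assumed filter order."""
    return 1.0 / (1.0 + (d / d0) ** (2 * n))
```

Near d = 0 the curve stays close to 1, so several recent utterances keep nearly full weight before the cut-off kicks in.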

Universal Time-Decay Attention
As mentioned previously, there are three types of decaying curves: convex, linear, and concave; each type represents a different perspective on dialogue contexts and models different contextual patterns. However, because the contextual patterns may be diverse, a single type of function may not fit the complex behavior well. Hence, we propose a flexible and universal time-decay attention function by composing the three types of attentional curves, weighting each component with a learnable coefficient w_i.
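A sketch of the composition (the component forms repeat the assumptions made for the individual curves; w_1, w_2, w_3 and all curve parameters would be trained end-to-end, and the resulting weights normalized to sum to one):

```python
def universal_decay(d, w=(1.0, 1.0, 1.0),
                    a=1.0, b=1.0, e=-0.125, f=1.0, d0=5.0, n=3):
    """Universal time-decay: a weighted combination of the convex, linear,
    and concave curves; every parameter here is trainable end-to-end."""
    convex = 1.0 / (a * d ** b)
    linear = max(e * d + f, 0.0)
    concave = 1.0 / (1.0 + (d / d0) ** (2 * n))
    return w[0] * convex + w[1] * linear + w[2] * concave

def normalize(alphas):
    """Normalize attention weights so they sum to one."""
    total = sum(alphas)
    return [x / total for x in alphas]
```

Setting one w_i to zero disables the corresponding component, so each individual curve is a special case of the universal function.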

Experiments
To evaluate the proposed model, we conduct language understanding experiments on human-human conversational data.

Setup
The experiments are conducted on the DSTC4 dataset, which consists of 35 dialogue sessions on touristic information for Singapore, collected from Skype calls between 3 tour guides and 35 tourists; these 35 dialogues sum to 31,034 utterances and 273,580 words (Kim et al., 2016). All recorded dialogues, with a total length of 21 hours, have been manually transcribed and annotated with speech acts and semantic labels at each turn. The speaker information (guide and tourist) is also provided. Unlike the previous DSTC series, which collected human-computer dialogues, human-human dialogues contain rich and complex human behaviors and bring much difficulty to all the tasks. Given the complex dialogue patterns and longer contexts, DSTC4 is a suitable benchmark dataset for evaluation. We randomly selected 28 dialogues as the training set, 5 dialogues as the testing set, and 2 dialogues as the validation set. We choose mini-batch Adam as the optimizer with a batch size of 256 examples. The size of each hidden recurrent layer is 128. We use pre-trained 200-dimensional GloVe word embeddings (Pennington et al., 2014). We apply only 30 training epochs without any early-stopping approach. We focus on predicting multiple labels, including intents and attributes, so the evaluation metric is an average F1 score balancing recall and precision for each utterance. The experiments are shown in Table 1, where we report the average results over five runs. We include the best understanding performance (row (a)) from the participants of DSTC4 at IWSDS 2016 for reference (Kim et al., 2016). A one-tailed t-test is performed to validate the significance of improvement, and numbers with markers indicate significant improvement with p < 0.05.

Effectiveness of Time-Decay Attention
To evaluate the proposed time-decay attention, we compare the performance with the naïve LU model without any contextual information (row (b)), the contextual model without any attention mechanism (row (c)), and the one using the content-aware attention mechanism (row (d)), where the attention can be learned at sentence and role levels. Row (a) is the performance reported in the DSTC challenge. It is intuitive that the model without considering contexts (row (b)) performs much worse than the contextual ones for dialogue modeling. The proposed time-aware results are shown in rows (e)-(h), where rows (e)-(f) use only the time-aware attention while rows (g)-(h) model both content-aware and time-aware attention mechanisms together. It is obvious that almost all time-aware results are better than the three baselines.
In order to investigate the performance of various time-decay attention functions, for each curve we apply two settings: 1) Hand: hand-crafted hyper-parameters (rows (e) and (g)) and 2) E2E: end-to-end training of the parameters (rows (f) and (h)). In the hand-crafted setting, the hyper-parameters a = 1, b = 1, e = −0.125, f = 1, D_0 = 5, n = 3 are adopted. Table 1 shows that among the three types of sentence-level time-decay attention, only the convex time-decay attention significantly outperforms the baselines, indicating that an unsuitable time-decay attention function is barely useful. For both settings, the convex functions perform best among the three types of time-decay functions. Also, the end-to-end trainable setting yields better performance in most cases.
For our proposed universal time-decay attention mechanism, the same two settings are used: 1) composing fixed versions of the three time-decay functions weighted by learned parameters w_i and 2) fully trainable parameters for all time-decay functions. These two settings provide different levels of flexibility in fitting dialogue contextual attention, and the experimental results show that both settings outperform all other time-decay attention functions.
For sentence-level attention, the end-to-end trainable universal time-decay attention achieves the best performance (rows (f) and (h)), obtaining a 2.9% relative improvement over the model without any attention mechanism (row (c)) and the model using content-aware attention only (row (d)). For role-level attention, all types of time-decay functions significantly improve the results. The probable reason is that modeling temporal importance for each sentence is more difficult and less accurate, while speaker roles in dialogues provide informative cues for connecting the temporal importance of utterances from the same speaker; the conversational patterns can therefore be leveraged to additionally improve the understanding results. Further analysis is discussed in Section 4.3. Similarly, the best results also come from the end-to-end trainable universal time-decay function. The significant improvement achieved by the universal functions indicates that our model can effectively learn a suitable attention function through this flexible setting and derive a proper curve that fits the temporal tendency, helping the model preserve the essence and drop the unimportant parts of the dialogue contexts. To further investigate what the universal time-decay attention learns, we inspect the learned weights w_i and find that the convex attention function almost dominates the whole function. In other words, our model automatically learns that the convex time-decay attention is more suitable for modeling contexts in the dialogue data than the other two types. Therefore, we conclude that in complex dialogues, the recent utterances contain the majority of the salient information for spoken language understanding, and the attention decay trend follows a convex curve.
We analyze the impact of content-aware attention by comparing the results between time-aware only (rows (e)-(f)) and content- and time-aware jointly (rows (g)-(h)). The content-aware attention (row (d)) fails to focus on the important contexts for improving understanding performance in complex dialogues and even performs slightly worse than the contextual model without attention (row (c)). Without a delicately designed attention mechanism, it is not guaranteed that incorporating an additional content-aware attention brings better performance; the experimental results show that a simple and coarse content-aware attention barely provides any usable information given the complex dialogues. Therefore, we focus on whether our time-aware attention mechanisms can compensate for the poor attention learned by the content-aware model. In other words, we are not verifying whether our time-aware attention mechanisms can collaborate with the content-aware attention mechanism; instead, we examine how much our proposed time-aware attention mitigates the detriment of the content-aware attention. Comparing time-aware only (rows (e)-(f)) with content- and time-aware jointly (rows (g)-(h)), we find that our universal time-decay attention keeps the improvement without much performance drop when involving the learned temporal attention. Namely, our proposed attention mechanism captures temporal information precisely and can therefore counteract the harmful impact of inaccurate content-aware attention.

Effectiveness of Role-Level Attention
For role-level attention, Table 1 also demonstrates the effectiveness of considering speaker interactions for better understanding performance. By introducing role-level attention, the sentence-level attentional weights can be smoothed to avoid inappropriate values. Surprisingly, even though learning sentence-level temporal attention is difficult, our proposed universal time-decay attention achieves similar performance for sentence-level and role-level attention (76.67% and 76.75% in row (f)), further demonstrating its strong adaptability to diverse dialogue contexts and its capability of capturing salient information.

Robustness to Context Lengths
It is intuitive that longer contexts bring richer information; however, they may obstruct attention learning and result in poor performance, because more information must be modeled and accurate estimation is not trivial. Because when modeling dialogues we do not know in advance how much context is enough for better understanding, robustness to varying context lengths is important for contextual model design. Here, we compare the results using different context lengths (3, 5, 7) for detailed analysis in Table 2, where the number is per speaker. The models without attention and the content-aware models become slightly worse with increasing context lengths. However, our proposed universal time-decay attention model mostly achieves better performance when including longer contexts, demonstrating not only the flexibility of adapting to diverse contextual patterns but also robustness to varying context lengths.

Universal Time-Decay Attention Analysis
This paper proposes a flexible time-decay attention mechanism by composing three types of time-aware attention functions with different decaying tendencies, where each decaying curve reflects a specific perspective on the distribution of salient information in dialogue contexts. The proposed universal time-decay attention shows great capability of modeling diverse dialogue patterns in the experiments, suggesting that our method is a general design of time-decay attention. In our design, we endow the attention function with flexibility by employing trainable parameters, so it can automatically learn a properly decaying curve that fits the dialogue contexts better.
To further analyze the combination of different time-decay attention functions, we inspect the converged values of the trainable parameters of the proposed universal time-decay attention models in Table 3. Under the end-to-end trainable setting, the trainable parameters are initialized to the same values as the hand-crafted ones (w_i = 1, a = 1, b = 1, e = −0.125, f = 1, D_0 = 5, n = 3). The converged weights show that the convex component dominates in both the sentence-level and role-level models (w_1 > w_2 and w_1 > w_3). Namely, in dialogue contexts, the recent utterances contain most of the information related to the current utterance, which aligns with our intuition.

Qualitative Analysis
From the above experiments, the proposed time-decay attention mechanisms significantly improve the performance at both sentence and role levels. To further understand how the time-decay attention changes the content-aware attention, we dig deeper into the learned attentional values of sentences and illustrate the visualization in Figure 3. The figure shows a partial dialogue between the tourist (left) and the guide (right), where the color shades indicate the learned attention intensities of the sentences. The learned content-aware attention (red; row (c)) focuses on the incorrect sentence ("so we can eat there" (FOL-EXPLAIN)) and hence predicts the wrong label, FOL-INFO. The reason may be that, with a coarse and simple design of the content-aware attention mechanism, the attention function cannot provide additional benefit. By additionally leveraging our proposed universal time-decay attention, the result (blue; row (g)) shows that the adjusted attention pays the highest attention to the most recent utterance and thereby predicts the correct intent, RES-RECOMMEND. Our proposed time-decay attention can thus effectively turn the attention to the correct contexts in order to correctly predict the dialogue act and attribute. Therefore, the proposed attention mechanisms are demonstrated to be effective for improving understanding performance in such complex human-human conversations.

Conclusion
This paper designs and investigates various time-decay attention functions based on an end-to-end contextual language understanding model, where different perspectives on dialogue contexts are analyzed and a flexible, universal time-decay attention mechanism is proposed. The experiments on a benchmark human-human dialogue dataset show that understanding performance can be boosted by simply introducing the proposed time-decay attention mechanisms, which guide the model to focus on the salient contexts following a convex curve. Moreover, the proposed universal time-decay mechanism is easily extensible to multi-party conversations, showing the potential of leveraging temporal information in NLP tasks on dialogues.