Neural-based Context Representation Learning for Dialog Act Classification

We explore context representation learning methods in neural-based models for dialog act classification. We propose and extensively compare different methods that combine recurrent neural network architectures and attention mechanisms (AMs) at different context levels. Our experimental results on two benchmark datasets show consistent improvements compared to models without contextual information and reveal that the most suitable AM in the architecture depends on the nature of the dataset.


Introduction
The study of spoken dialogs between two or more speakers can be approached by analyzing dialog acts (DAs), i.e., the intention of the speaker behind every utterance in a conversation. Table 1 shows a fragment of a conversation from the Switchboard (SwDA) dataset with DA annotations. Automatic DA classification is an important pre-processing step in natural language understanding tasks and spoken dialog systems. This classification task has been approached with traditional statistical methods such as hidden Markov models (HMMs) (Stolcke et al., 2000), conditional random fields (CRFs) (Zimmermann, 2009) and support vector machines (SVMs) (Henderson et al., 2012). However, recent work with deep learning (DL) techniques has produced state-of-the-art models for DA classification, such as convolutional neural networks (CNNs) (Kalchbrenner and Blunsom, 2013; Lee and Dernoncourt, 2016), recurrent neural networks (RNNs) (Lee and Dernoncourt, 2016; Ji et al., 2016) and long short-term memory (LSTM) models (Shen and Lee, 2016).

Utterance                          Dialog act
A: Are you a musician yourself?    Yes-no-question
B: Uh, well, I sing.               Affirmative non-yes answer
A: Uh-huh.                         Acknowledge (Backchannel)
B: I don't play an instrument.     Statement-non-opinion

Table 1: Examples from the SwDA dataset.
Given an utterance in a dialog without any previous context, it is not always obvious, even for humans, to identify the corresponding dialog act. In many cases the utterances are too short to classify in isolation: for example, the utterance 'Right' can be either an Agreement or a Backchannel signaling the interlocutor to go on talking; in such cases the context plays a key role in disambiguation. Therefore, using context information from the previous utterances in a dialog flow is a crucial step for improving DA classification. Few papers in the literature have suggested utilizing context as a potential knowledge source for DA classification (Lee and Dernoncourt, 2016; Shen and Lee, 2016). Recently, Ribeiro et al. (2015) presented an extensive analysis of the influence of context on DA recognition, concluding that contextual information from preceding utterances helps to improve classification performance. Nonetheless, such information should be differentiated from the current utterance information; otherwise, the contextual information can have a negative impact.
Attention mechanisms (AMs) have contributed to significant improvements in many natural language processing tasks, for instance machine translation, sentence classification (Shen and Lee, 2016), summarization (Rush et al., 2015), uncertainty detection (Adel and Schütze, 2017), speech recognition (Chorowski et al., 2015), sentence pair modeling (Yin et al., 2015), question answering (Golub and He, 2016), document classification (Yang et al., 2016) and entailment (Rocktäschel et al., 2015). AMs let the model decide which parts of the input to pay attention to according to their relevance for the task.
In this paper, we explore the use of AMs to learn the context representation, as a means to differentiate the current utterance from its context as well as a mechanism to highlight the most relevant information, while ignoring unimportant parts, for DA classification. We propose and extensively compare different neural-based methods for context representation learning by leveraging a recurrent neural network architecture with LSTM (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Chung et al., 2014) in combination with AMs.

Model
The model architecture, shown on the left side of Figure 1, contains two main parts: the CNN-based utterance representation and the attention mechanism for context representation learning. Finally, the context representation is fed into a softmax layer which outputs the posterior of each predefined DA given the current dialog utterance.

CNN-based Dialog Utterance Representation
We used CNNs for the representation of each utterance. CNNs perform a discrete convolution on an input matrix with a set of different filters. For the DA classification task, the input matrix represents a dialog utterance and its context, i.e., the n previous utterances: each column of the matrix stores the word embedding of the corresponding word. We use 2D filters f (with width |f|) spanning all embedding dimensions d, as described by the following equation:

(w * f)(x, y) = Σ_{i=1}^{d} Σ_{j=1}^{|f|} w(i, j) · f(x − i, y − j)    (1)

After convolution, a max pooling operation is applied that stores only the highest activation of each filter. Furthermore, we apply filters with different window sizes 3-5 (multi-windows), i.e., spanning a different number of input words. Then, all feature maps are concatenated into one vector which represents the current utterance and its context.
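The multi-window convolution and 1-max pooling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the embedding and filter values are random placeholders, and the filter counts and window sizes (100 feature maps, widths 3-5) merely follow the hyperparameters mentioned later in the text.

```python
import numpy as np

def cnn_utterance_repr(embeddings, filters):
    """Multi-window CNN encoder: 2D convolution over the word-embedding
    matrix followed by 1-max pooling per filter.

    embeddings: (n_words, d) matrix, one row per word embedding.
    filters: dict mapping window width |f| -> (n_filters, width, d) weights.
    Returns the concatenation of all max-pooled feature maps.
    """
    pooled = []
    for width, w in filters.items():
        n_filters = w.shape[0]
        n_pos = embeddings.shape[0] - width + 1
        feats = np.empty((n_filters, n_pos))
        for i in range(n_pos):
            window = embeddings[i:i + width]              # (width, d)
            # each filter spans all d embedding dimensions (Eq. 1 style)
            feats[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
        feats = np.maximum(feats, 0.0)                    # ReLU activation
        pooled.append(feats.max(axis=1))                  # 1-max pooling
    return np.concatenate(pooled)                         # utterance vector

rng = np.random.default_rng(0)
emb = rng.standard_normal((10, 50))                       # 10 words, d = 50
filters = {w: rng.standard_normal((100, w, 50)) * 0.1 for w in (3, 4, 5)}
u = cnn_utterance_repr(emb, filters)
print(u.shape)  # (300,): 3 window sizes x 100 feature maps
```

The resulting 300-dimensional vector plays the role of one utterance representation in the architecture above.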

Internal Attention Mechanism
Attention mechanisms can be applied to different sequences of input vectors, e.g., representations of consecutive dialog utterances. For each input vector u(t − i) at time step t − i in a dialog, where t is the current time step, the attention weights α_i are computed as

α_i = exp(f(u(t − i))) / Σ_j exp(f(u(t − j)))

where f is the scoring function. In this work, f is a linear function of the input, f(u) = W^T u, where W is a trainable parameter. The output attentive_u after the attention layer is the weighted sum of the input sequence:

attentive_u = Σ_i α_i · u(t − i)
Another option (order-preserved attention, as proposed in Adel and Schütze (2017)) is to store the weighted inputs in a vector sequence attentive_v which preserves the order information: attentive_v_i = α_i · u(t − i).
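Both attention variants reduce to a softmax over linear scores followed by a weighting step. The sketch below, with a random context and a random scoring vector W standing in for trained parameters, shows the two outputs side by side:

```python
import numpy as np

def attention(context, W, order_preserved=False):
    """Internal attention over a sequence of utterance vectors u(t-i).

    context: (n, k) matrix of n utterance representations of size k.
    W: (k,) trainable scoring vector; f(u) = W . u (linear scoring).
    Returns the weighted sum (attentive_u) or, if order_preserved,
    the sequence of weighted vectors alpha_i * u(t-i) (attentive_v).
    """
    scores = context @ W                            # f(u(t-i)) for each i
    scores -= scores.max()                          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights
    if order_preserved:
        return alpha[:, None] * context             # keeps order for an RNN
    return alpha @ context                          # weighted sum

rng = np.random.default_rng(1)
ctx = rng.standard_normal((4, 8))                   # 4 context utterances, k=8
W = rng.standard_normal(8)
att_u = attention(ctx, W)                           # attentive_u: (8,)
att_v = attention(ctx, W, order_preserved=True)     # attentive_v: (4, 8)
print(att_u.shape, att_v.shape)
```

Note that summing attentive_v over the sequence axis recovers attentive_u, which is why the order-preserved variant carries strictly more information for a downstream RNN.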

Neural-based Context Modeling
In this subsection, we present different methods, depicted on the right side of Figure 1, to learn the context representation.
a Max We apply max-pooling on top of the dialog utterance representations which spans all the contexts and the vector dimension.
b Attention We apply the attention mechanism directly on the dialog utterance representations. The weighted sum of all the dialog utterances represents the context information.
c RNN We introduce a recurrent architecture with LSTM or GRU cells on top of the dialog utterance representations to model the relation between the context and the current utterance over time. The output of the hidden layer of the last state is the context representation.
d RNN-Output-Attention We apply the attention mechanism on the hidden states of the RNN; the weighted sum of the RNN outputs is the context representation.
e RNN-Input-Attention We first apply the order-preserved attention mechanism on the dialog utterance representations to obtain a sequence of weighted inputs. Afterwards, an RNN with LSTM or GRU cells is introduced to model the relation of the weighted context.
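The context-modeling variants above can be summarized in one dispatch function. The following is a NumPy sketch under simplifying assumptions: a minimal GRU stands in for the LSTM/GRU layer, all parameter names (Wz, Uz, W_att, W_att_h, ...) are hypothetical placeholders with random values, and Max is taken per dimension across the context utterances.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(seq, P):
    """Minimal GRU scan; returns the hidden state at every step."""
    h = np.zeros(P["Wz"].shape[0])
    states = []
    for x in seq:
        z = sigmoid(P["Wz"] @ x + P["Uz"] @ h)       # update gate
        r = sigmoid(P["Wr"] @ x + P["Ur"] @ h)       # reset gate
        h_new = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h))
        h = (1 - z) * h + z * h_new
        states.append(h)
    return np.array(states)                          # (n, m)

def context_repr(utts, method, P):
    """utts: (n, k) utterance vectors, current utterance last."""
    alpha = softmax(utts @ P["W_att"])               # input-level weights
    if method == "max":                  # (a) max over the context axis
        return utts.max(axis=0)
    if method == "attention":            # (b) weighted sum of utterances
        return alpha @ utts
    H = gru(utts, P)
    if method == "rnn":                  # (c) last hidden state
        return H[-1]
    if method == "rnn-output-attention": # (d) attend over hidden states
        return softmax(H @ P["W_att_h"]) @ H
    if method == "rnn-input-attention":  # (e) weight inputs, then run RNN
        return gru(alpha[:, None] * utts, P)[-1]
    raise ValueError(method)

rng = np.random.default_rng(2)
k, m, n = 8, 6, 4                        # input dim, hidden dim, #utterances
P = {"Wz": rng.standard_normal((m, k)), "Uz": rng.standard_normal((m, m)),
     "Wr": rng.standard_normal((m, k)), "Ur": rng.standard_normal((m, m)),
     "Wh": rng.standard_normal((m, k)), "Uh": rng.standard_normal((m, m)),
     "W_att": rng.standard_normal(k), "W_att_h": rng.standard_normal(m)}
utts = rng.standard_normal((n, k))
for method in ("max", "attention", "rnn",
               "rnn-output-attention", "rnn-input-attention"):
    print(method, context_repr(utts, method, P).shape)
```

In each case the returned vector is what would be fed into the softmax output layer of the full model.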
Data

Train, validation and test splits on both datasets were taken as defined in Lee and Dernoncourt (2016); summary statistics are shown in Table 2. In both datasets the classes are highly unbalanced: the majority class covers 59.1% of MRDA and 33.7% of SwDA.

Hyperparameters and Training
The hyperparameters for both datasets are summarized in Table 3; they were tuned by varying one hyperparameter at a time while keeping the others fixed. The filter widths and numbers of feature maps were taken from the CNN architecture for sentence classification in Kim (2014). A dropout rate of 0.5 was found to be the most effective in the range [0-0.9]. The rectified linear unit (ReLU) was used as the non-linear activation function, and 1-max as the pooling operation at the utterance level, as suggested in Zhang and Wallace (2015). The only dataset-specific hyperparameter is the minibatch size: 150 and 50 for SwDA and MRDA, respectively. Word2vec (Mikolov et al., 2013) was used for the word vector representation. Training was done for 30 epochs with averaged stochastic gradient descent (Polyak and Juditsky, 1992) over minibatches. The learning rate was initialized at 0.1 and reduced by 10% every 2000 parameter updates. We kept the word vectors unchanged during training. The context length was optimized on the development set in the range 1-5. Our best results were obtained with three context utterances for MRDA and two for SwDA.
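The learning-rate schedule described above can be written as a small step-decay function. This is an illustrative sketch assuming the 10% reduction is multiplicative (rate multiplied by 0.9 after each interval), which the text does not state explicitly:

```python
def learning_rate(update_step, base_lr=0.1, decay=0.9, interval=2000):
    """Step-decay schedule: start at 0.1 and reduce the rate by 10%
    (here: multiply by 0.9) after every 2000 parameter updates."""
    return base_lr * decay ** (update_step // interval)

print(learning_rate(0))      # -> 0.1
print(learning_rate(1999))   # -> 0.1 (still in the first interval)
print(learning_rate(4000))   # ~0.081 after two reductions
```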

Baseline Models
We define two models as baselines; both are one-layer CNNs for sentence classification based on Kim (2014), but with an input variation: a) Baseline I: the input is a single utterance at a time without any contextual information, and b) Baseline II: the input is the concatenation of the current utterance and its previous utterances. Table 4 summarizes the results of all models. The results of Baseline I and Baseline II on both datasets show that a simple context concatenation is not enough to model the context information for this task: while on SwDA the accuracy improves by 1.3%, it slightly drops on MRDA.

Results
Other simple methods such as Max and Attention do not improve the results over the baseline either.
Our results consistently improve on both datasets after introducing an RNN architecture to model the relation between the contexts, indicating that a hierarchical structure is crucial for learning the context representation. Attention mechanisms contribute to the overall improvements. On MRDA, the AM was more useful when applied to the inputs of the RNN, whereas on SwDA it was more useful when applied to the outputs. Our intuition is that in multiparty dialogs the dependency between the utterances should be weighted before being processed by the RNN.

Table 4: Accuracy (%) of baselines and models with different context processing methods.

Impact of Context Length
Our experiments revealed that context length plays an important role in DA classification and that the best length is corpus dependent. Experimenting with context lengths of 0-5 utterances, we found that the best context length is three utterances for MRDA and two for SwDA. Table 5 shows the results at different context lengths.

(Ji and Bilmes, 2006). LV-RNN: latent variable RNN with conditional training. HCNN: hierarchical CNN (Kalchbrenner and Blunsom, 2013). CA-LSTM: contextual attentive LSTM (Shen and Lee, 2016). HMM: Stolcke et al. (2000).

Conclusions
We explored different neural-based context representation learning methods for dialog act classification which combine RNN architectures with attention mechanisms at different context levels.
Our results on two benchmark datasets reveal that using an RNN architecture is important for learning the context representation. Moreover, attention mechanisms contribute to the overall improvements; however, where the AM should be applied depends on the nature of the dataset.