Preserving Distributional Information in Dialogue Act Classification

This paper introduces a novel training/decoding strategy for sequence labeling. Instead of greedily choosing a label at each time step, and using it for the next prediction, we retain the probability distribution over the current label, and pass this distribution to the next prediction. This approach allows us to avoid the effect of label bias and error propagation in sequence learning/decoding. Our experiments on dialogue act classification demonstrate the effectiveness of this approach. Even though our underlying neural network model is relatively simple, it outperforms more complex neural models, achieving state-of-the-art results on the MapTask and Switchboard corpora.


Introduction
Dialogue Act (DA) classification is a sequence-labeling task, where a sequence of utterances is mapped into a sequence of DAs. The DAs are semantic classifications of the utterances, and different corpora usually have their own DA labels.
Two of the most popular DA classification datasets are Switchboard (Godfrey et al., 1992; Jurafsky et al., 1997) and MapTask (Anderson et al., 1991). Many studies have addressed DA classification on these two datasets; some focus on textual data (Kalchbrenner and Blunsom, 2013; Stolcke et al., 2000), while others explore speech data (Julia et al., 2010). The classification methods used can be broadly divided into instance-based methods (Julia et al., 2010; Gambäck et al., 2011) and sequence-labeling methods (Stolcke et al., 2000; Kalchbrenner and Blunsom, 2013; Ji et al., 2016; Shen and Lee, 2016; Tran et al., 2017). Instance-based methods treat each utterance as an independent data point, which allows the application of general machine learning models, such as Support Vector Machines. Sequence-labeling methods include methods based on Hidden Markov Models (HMMs) (Stolcke et al., 2000) and neural networks (Kalchbrenner and Blunsom, 2013; Ji et al., 2016; Shen and Lee, 2016; Tran et al., 2017). Stolcke et al. employed an HMM, using a language model to produce emission probabilities. The neural models are particularly successful, posting a higher accuracy on Switchboard than the HMM. Specifically, Kalchbrenner and Blunsom (2013) model a DA sequence with a recurrent neural network (RNN) where sentence representations are constructed by means of a convolutional neural network (CNN); Ji et al. (2016) treat the labels as latent variables in a generative RNN; Shen and Lee (2016) employ attentional RNNs for the independent prediction of DAs; and Tran et al. (2017) model the DAs in a conversation by means of a hierarchical RNN. In this paper, we also rely on RNNs, but our architecture is much simpler than the above neural models, while posting competitive results.
Most neural network models for DA classification employ greedy decoding (Tran et al., 2017; Ji et al., 2016), as its speed and simplicity support an on-line decoding process (i.e., producing a label immediately after receiving an utterance). For sequential labeling, the DA label in the current time step is very important (Tran et al., 2017). However, using a greedy approach to connect the current label directly to the next label may degrade performance, because the current predicted label may be noisy, which in turn leads to the propagation of errors through the sequence (Tran et al., 2017; Ranzato et al., 2015).
Recently, Bengio et al. (2015) proposed a technique called Scheduled Sampling that tries to solve the label-bias problem by alternating between the predicted label and the correct label during training. This makes the model gradually adapt to the noisiness of the predicted label. However, this method still relies upon a single current label, and, by omitting the distribution over the possible labels, this model loses information about the current stage. In contrast, we propose to condition the next label on a predicted distribution of the current label. Specifically, we introduce two variants of this idea: the Uncertainty Propagation model and the Average Embedding model.

Sequential DA Prediction
We are interested in predicting DAs $\{z_1, \ldots, z_t\}$ in a conversation as we receive textual utterances $\{\mathbf{x}_1, \ldots, \mathbf{x}_t\}$ sequentially. Importantly, we do not have access to future utterances when predicting a DA at time $t$.
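Model. We propose a discriminative model, where the probability of DAs conditioned on utterances is decomposed as follows (Figure 1):
$$p_{\boldsymbol{\theta}}(\mathbf{z}_{1:t} \mid \mathbf{x}_{1:t}) = \prod_{i=1}^{t} p_{\boldsymbol{\theta}}(z_i \mid z_{i-1}, \mathbf{x}_i) \tag{1}$$
where $\mathbf{z}_{1:t}$ and $\mathbf{x}_{1:t}$ respectively denote the sequence of DAs and utterances up to time step $t$. Our model resembles a maximum entropy Markov model, as it conditions the label of the next time step on the label of the current step and the next received utterance. The conditional distribution term $p_{\boldsymbol{\theta}}(z_i \mid z_{i-1}, \mathbf{x}_i)$ is realised by neural models as follows:
$$p_{\boldsymbol{\theta}}(z_t \mid z_{t-1}, \mathbf{x}_t) = \mathrm{softmax}\big(\mathbf{W}^{(z_{t-1})} \mathbf{c}(\mathbf{x}_t) + \mathbf{b}^{(z_{t-1})}\big) \tag{2}$$
where $\mathbf{W}$ and $\mathbf{b}$ are softmax parameters gated on the previous DA $z_{t-1}$, and $\mathbf{c}(\mathbf{x}_t)$ is the encoding of the utterance $\mathbf{x}_t$. The encoding function for an utterance is an RNN with long short-term memory (LSTM) units (Graves, 2013; Hochreiter and Schmidhuber, 1997), where the final hidden state of the RNN is taken as the representation of the whole sequence of text:
$$\mathbf{c}(\mathbf{x}_t) = \mathrm{RNN}_{\boldsymbol{\phi}}(x_{t,1}, \ldots, x_{t,N_t}) \tag{3}$$
where $x_{t,n}$ is the $n$-th token in the $t$-th utterance, and $N_t$ is the length of the utterance.
The parameter set of our model $\boldsymbol{\theta}$ includes $\{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^{L}$ for the gating component (where $L$ is the number of DAs), as well as the LSTM parameters $\boldsymbol{\phi}$ and the word-embedding table $\{\mathbf{e}(w)\}_{w \in \mathcal{W}}$, where $\mathcal{W}$ is the dictionary.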
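As a rough illustration of a softmax whose parameters are gated by the previous label, the following Python sketch computes a distribution over the next DA from a fixed utterance vector. It assumes that the gating component holds one weight matrix and one bias vector per previous label; the names `local_distribution`, `W`, `b` and all toy numbers are hypothetical, and the utterance vector merely stands in for the final LSTM hidden state.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_distribution(prev_label, utterance_vec, W, b):
    """Distribution over the next label given the previous label and an utterance encoding.

    Assumption: W[prev_label] is an L x D weight matrix and b[prev_label] a
    length-L bias vector, i.e. the previous label selects the softmax parameters.
    """
    weights, biases = W[prev_label], b[prev_label]
    scores = [
        sum(w * c for w, c in zip(row, utterance_vec)) + bias
        for row, bias in zip(weights, biases)
    ]
    return softmax(scores)

# Toy setup (hypothetical): L = 2 dialogue acts, utterance encoding of dimension 2.
W = [
    [[2.0, 0.0], [0.0, 1.0]],  # parameters used when the previous label is 0
    [[0.0, 1.0], [3.0, 0.0]],  # parameters used when the previous label is 1
]
b = [[0.0, 0.0], [0.0, 0.0]]
c = [1.0, 0.5]                 # stand-in for the final LSTM hidden state

dist_after_0 = local_distribution(0, c, W, b)
dist_after_1 = local_distribution(1, c, W, b)
```

With these toy parameters the same utterance vector favours label 0 after a previous label of 0 and label 1 after a previous label of 1, which is the kind of dependence on the current label that the gating is meant to capture.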
Uncertainty Propagation. In this model, the distribution over the labels at the current time step is passed to the next time step. Specifically, the quantity of interest is the posterior probability of the DA of the next time step given all the utterances observed so far. This posterior probability can be rewritten as
$$p_{\boldsymbol{\theta}}(z_t \mid \mathbf{x}_{1:t}) = \sum_{z_{t-1}} p_{\boldsymbol{\theta}}(z_t \mid z_{t-1}, \mathbf{x}_t)\, p_{\boldsymbol{\theta}}(z_{t-1} \mid \mathbf{x}_{1:t-1}) \tag{4}$$
According to Equation 4, the label uncertainty at the next time step $t$ can be computed by a dynamic programming algorithm based on the label uncertainty of the current time step combined with the local potentials $p_{\boldsymbol{\theta}}(z_t \mid z_{t-1}, \mathbf{x}_t)$.
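The dynamic programming step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `propagate` and `greedy_step` are hypothetical names, and the toy table `local` plays the role of the local potentials for one fixed utterance. The greedy variant is included only to show what is lost when a single label is passed forward.

```python
def propagate(prev_posterior, local):
    """One step of uncertainty propagation.

    prev_posterior[i] is the posterior of label i at the current step;
    local[i][j] is the local potential of next label j given current label i.
    Returns the posterior over the next label, marginalising the current one.
    """
    num_labels = len(local[0])
    return [
        sum(prev_posterior[i] * local[i][j] for i in range(len(prev_posterior)))
        for j in range(num_labels)
    ]

def greedy_step(prev_posterior, local):
    """Greedy alternative: commit to the single most probable current label."""
    best_prev = max(range(len(prev_posterior)), key=prev_posterior.__getitem__)
    return list(local[best_prev])

# Toy numbers (hypothetical): two labels, an uncertain current step.
prev_posterior = [0.6, 0.4]
local = [[0.9, 0.1], [0.2, 0.8]]

posterior = propagate(prev_posterior, local)  # approximately [0.62, 0.38]
greedy = greedy_step(prev_posterior, local)   # [0.9, 0.1]
```

The greedy step is overconfident here because it discards the 0.4 probability mass on the second label, whereas the propagated posterior keeps it.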
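The use of posterior probability for prediction is also motivated by minimum Bayes risk (MBR) decoding. In the sequential setting, we are interested in predicting the next DA that minimizes the expected loss
$$\arg\min_{\hat{z}_t} \mathbb{E}_{p_{\boldsymbol{\theta}}(z_t \mid \mathbf{x}_{1:t})}\big[\mathrm{loss}(z_t, \hat{z}_t)\big] \tag{5}$$
where $\hat{z}_t$ is the predicted label, $z_t$ is the actual label, and $\mathrm{loss}(z_t, \hat{z}_t) = \mathbb{1}_{z_t \neq \hat{z}_t}$. In addition to decoding, we use posterior probability when training the model. That is, our training objective is
$$\arg\min_{\boldsymbol{\theta}} \; - \sum_{(\mathbf{x}_{1:T}, \mathbf{z}_{1:T}) \in D} \; \sum_{t=1}^{T} \log p_{\boldsymbol{\theta}}(z_t \mid \mathbf{x}_{1:t}) \tag{6}$$
where $D$ is the set of conversations in the training set, each consisting of a sequence of utterances $\mathbf{x}_{1:T}$ annotated with its gold sequence of DAs $\mathbf{z}_{1:T}$.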
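A short worked step, assuming the 0-1 loss described above, suggests why minimizing the expected loss amounts to choosing the label with the highest posterior probability:

```latex
\begin{aligned}
\mathbb{E}_{p_{\boldsymbol{\theta}}(z_t \mid \mathbf{x}_{1:t})}\big[\mathbb{1}_{z_t \neq \hat{z}_t}\big]
  &= \sum_{z_t} p_{\boldsymbol{\theta}}(z_t \mid \mathbf{x}_{1:t})\,\mathbb{1}_{z_t \neq \hat{z}_t}
   = 1 - p_{\boldsymbol{\theta}}(\hat{z}_t \mid \mathbf{x}_{1:t}), \\
\arg\min_{\hat{z}_t}\,\big(1 - p_{\boldsymbol{\theta}}(\hat{z}_t \mid \mathbf{x}_{1:t})\big)
  &= \arg\max_{\hat{z}_t}\, p_{\boldsymbol{\theta}}(\hat{z}_t \mid \mathbf{x}_{1:t}).
\end{aligned}
```

Under this loss, the decoder therefore only needs the posterior over the current label, which is exactly the quantity that is propagated from step to step.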
Average Embedding. This model offers a new perspective where a neural net combines an inference machine and a model (rather than simply encoding a model). Specifically, this model represents in its architecture, through a weighted sum of embeddings, the inference procedure encoded in Equation 4 for the Uncertainty Propagation model:
$$q(z_t) := p_{\boldsymbol{\theta}}\big(z_t \mid \mathbb{E}_{q(z_{t-1})}[z_{t-1}], \mathbf{x}_t\big) \tag{7}$$
where $q(z_t)$ is an embedding that represents the uncertainty at time step $t$, and the expectation inside the conditioning denotes the weighted sum of label embeddings under $q(z_{t-1})$. $q(z_t)$ is computed sequentially as new utterances are received, and used in both decoding and training.
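This formulation contrasts with Uncertainty Propagation, where the expectation is over the distributions:
$$p_{\boldsymbol{\theta}}(z_t \mid \mathbf{x}_{1:t}) = \mathbb{E}_{p_{\boldsymbol{\theta}}(z_{t-1} \mid \mathbf{x}_{1:t-1})}\big[p_{\boldsymbol{\theta}}(z_t \mid z_{t-1}, \mathbf{x}_t)\big] \tag{8}$$
It is worth noting that Equations 7 and 8 yield the same result if the distributions involved in calculating the expectations are point-mass distributions and they are equal.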
Although we could have used a more elaborate neural architecture as the inference machine for the Average Embedding model, we employed a simple softmax architecture to make this model comparable with the principled inference algorithm for our Uncertainty Propagation model, which is based on Equation 4.
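The difference between averaging distributions and averaging embeddings can be illustrated with a small Python sketch. It rests on a loudly stated assumption: the "embedding" of a label is taken here to be its label-specific softmax scores (equivalently, its gated weights and biases, since the scores are linear in them). All function names and numbers are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_scores(label, c, W, b):
    """Pre-softmax scores when the current label is `label` (assumed gating)."""
    return [sum(w * x for w, x in zip(row, c)) + bias
            for row, bias in zip(W[label], b[label])]

def uncertainty_propagation_step(prev, c, W, b):
    """Expectation over distributions: average the per-label softmax outputs."""
    out = [0.0] * len(b[0])
    for label, weight in enumerate(prev):
        dist = softmax(label_scores(label, c, W, b))
        out = [o + weight * d for o, d in zip(out, dist)]
    return out

def average_embedding_step(prev, c, W, b):
    """Expectation over embeddings: average the per-label scores, then one softmax."""
    avg = [0.0] * len(b[0])
    for label, weight in enumerate(prev):
        scores = label_scores(label, c, W, b)
        avg = [a + weight * s for a, s in zip(avg, scores)]
    return softmax(avg)

# Toy parameters (hypothetical): two labels, 2-dimensional utterance encoding.
W = [[[2.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [3.0, 0.0]]]
b = [[0.0, 0.0], [0.0, 0.0]]
c = [1.0, 0.5]

point_mass = [1.0, 0.0]
uncertain = [0.5, 0.5]

up_point = uncertainty_propagation_step(point_mass, c, W, b)
ae_point = average_embedding_step(point_mass, c, W, b)
up_mixed = uncertainty_propagation_step(uncertain, c, W, b)
ae_mixed = average_embedding_step(uncertain, c, W, b)
```

In this toy setting the two steps agree for the point-mass input and diverge for the uncertain one, because the softmax is nonlinear and the expectation cannot be moved through it.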

Comparison to traditional graphical models
Our models have several similarities with the traditional HMM and its inference algorithms, such as Forward-Backward decoding and the Viterbi algorithm. However, there are some key differences. Firstly, our model is discriminative, whereas an HMM is generative. Secondly, our method is designed for online decoding (the future inputs to a specific classification decision are unknown), whereas both Forward-Backward decoding and Viterbi require access to the whole sequence. Thirdly, Viterbi's objective is to decode the most probable sequence of labels, whereas our decoding algorithm's objective is to find the sequence of most probable labels (each conditioned on the inputs observed so far). Lastly, our Uncertainty Propagation model is not only a basis for decoding, but also for training (the training objective in Equation 6 requires the calculation of the posterior probability in Equation 4). Overall, the closest analogue of our Uncertainty Propagation model among the methods used for HMMs and other graphical models is the forward message calculation in the Forward-Backward algorithm.
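The distinction between the most probable sequence and the sequence of most probable labels can be made concrete with a two-step toy example in Python. The probabilities below are invented for illustration, and brute-force enumeration is used in place of the Viterbi algorithm because the example has only four candidate sequences.

```python
from itertools import product

# Hypothetical two-label example for a fixed pair of utterances.
p_first = [0.6, 0.4]                # distribution over the first label
p_next = [[0.5, 0.5], [1.0, 0.0]]   # p_next[i][j]: second label j given first label i

def most_probable_sequence(p_first, p_next):
    """Joint argmax over whole label sequences (what Viterbi would return)."""
    labels = range(len(p_first))
    return max(product(labels, repeat=2),
               key=lambda seq: p_first[seq[0]] * p_next[seq[0]][seq[1]])

def sequence_of_most_probable_labels(p_first, p_next):
    """Per-step argmax of the label posterior, computed online."""
    labels = range(len(p_first))
    z1 = max(labels, key=lambda i: p_first[i])
    marginal = [sum(p_first[i] * p_next[i][j] for i in labels) for j in labels]
    z2 = max(labels, key=lambda j: marginal[j])
    return (z1, z2)

joint_best = most_probable_sequence(p_first, p_next)
stepwise_best = sequence_of_most_probable_labels(p_first, p_next)
```

Here the joint argmax is the sequence (1, 0) with probability 0.4, while the per-step posteriors pick label 0 at both steps, so the two objectives can disagree even on very small problems.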

Data sets
For our experiments, we use the MapTask and Switchboard corpora.
The MapTask Dialog Act corpus (Anderson et al., 1991) consists of 128 conversations tagged with 13 DAs. The MapTask conversations focus on instructions and clarifications: in the MapTask experiment, there is one instruction giver and one instruction follower. The task of the instruction giver is to guide the instruction follower along a pre-determined path, which the instruction follower must draw on his/her map. We use 12 conversations for validation, 13 for testing, and the rest for training.
The Switchboard Dialog Act corpus (Godfrey et al., 1992; Jurafsky et al., 1997) consists of 1155 transcribed telephone conversations about general topics, encoded into 42 DAs. We use the experimental setup proposed by Stolcke et al. (2000): 1115 conversations for training and 19 for testing.

Baselines
Our first baseline is the model without any current label information. Next, we compare our models with other strategies for incorporating the current label, viz. those that use the predicted label during training and those that use the correct label. These models simply employ the predicted or correct label to gate the parameters in Equation 2 during training. During testing, both models can only use the predicted label.
Another baseline is Bengio et al.'s (2015) Scheduled Sampling technique, where the training model uses the current correct label with probability p and the predicted label with probability 1 − p, and p is scheduled to decrease over time. This strategy tries to solve the label-bias problem by making the model gradually adapt to the noisy predicted current label.
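The label-selection step of Scheduled Sampling can be sketched as follows. This is a minimal Python illustration, not the baseline's actual code: the function names are hypothetical, and the linear decay is only one possible schedule, chosen here as an assumption because the text states only that p decreases over time.

```python
import random

def choose_conditioning_label(gold_label, predicted_label, p_gold, rng):
    """Return the gold label with probability p_gold, else the predicted label."""
    return gold_label if rng.random() < p_gold else predicted_label

def linear_decay(step, total_steps, floor=0.0):
    """Assumed schedule: p falls linearly from 1.0 to `floor` over training."""
    return max(floor, 1.0 - step / total_steps)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Early in training the gold label dominates; late in training the prediction does.
early = [choose_conditioning_label("gold", "pred", linear_decay(0, 10), rng)
         for _ in range(5)]
late = [choose_conditioning_label("gold", "pred", linear_decay(10, 10), rng)
        for _ in range(5)]
```

At step 0 the schedule yields p = 1, so every draw returns the gold label, and at the final step it yields p = 0, so every draw returns the predicted label. In both cases a single label is still passed forward, which is the limitation the distribution-based models aim to remove.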

Results
Table 1 compares our results with those obtained by the baselines. Our two models, Uncertainty Propagation and Average Embedding, outperform all the baselines. Of these two models, Uncertainty Propagation, which is more analytically grounded, outperforms Average Embedding. Using the true current label during training seems to degrade performance compared to using the predicted label, which is expected, since the true label is not available during testing. The Scheduled Sampling method performs similarly to the predicted-label method on the MapTask corpus, and outperforms it on the Switchboard corpus.
Tables 2 and 3 compare our models' performance on the MapTask and Switchboard corpora respectively with that of several strong baselines. On MapTask, we achieved the best results for textual input, using the four-fold cross-validation setup used by Surendran and Levow (2006) and Julia et al. (2010). On Switchboard, we also obtained the best results among the systems with the same experimental setting. It is worth noting that Ji et al. (2016) reported a higher accuracy of 77.0%, but the paper does not provide enough information about the experimental setup to replicate this result, and we only got 72.5% accuracy using the paper's publicly available code.

Table 2: Results on MapTask data.

Model                             Accuracy
Baseline models:
  Julia et al. (2010)             55.4%
  Surendran and Levow (2006)      59.1%
  Tran et al. (2017)              61.6%
Our models:
  Average Embedding               62.6%
  Uncertainty Propagation         62.9%

Analysis
To quantify the effectiveness of the different models on reducing the label-bias problem, we calculate the probability of the models making a correct prediction after they have made a sequence of n mistakes. We expect our models, Uncertainty Propagation and Average Embedding, to be more robust than the label-sensitive baselines in recovering from errors. The results in Table 4 confirm our expectations. The simple model with no current label, while performing worse than all other models in accuracy, does not suffer from the label-bias problem. Among the models with current label information, Uncertainty Propagation suffers the least from label bias. It even outperforms the model with no current label on Switchboard for all values of n, and on MapTask for n = 2. Interestingly, Average Embedding performs quite well for n = 1, but its ability to recover from errors drops quickly as the length of the erroneous conditioning sequence increases, especially on Switchboard, where the number of labels is higher. This may explain its slightly lower accuracy compared to the Uncertainty Propagation model. However, in general, the difference in accuracy between these two models is small, because they are rather unlikely to make several consecutive errors.

Table 4: Probability that the models recover from a sequence of n prediction mistakes (reported for MapTask and Switchboard at n = 1, 2, 3; the model with no current label is grouped as not affected by label bias).
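One way to compute such a recovery probability is sketched below in Python. The exact counting convention is an assumption: a position is counted whenever the n immediately preceding predictions are all wrong, regardless of what came before them. The function name and the toy label sequences are hypothetical.

```python
def recovery_probability(gold, pred, n):
    """Fraction of positions predicted correctly right after n consecutive mistakes.

    Assumption: a position t qualifies when predictions t-n .. t-1 are all wrong.
    Returns None when no position qualifies.
    """
    hits = total = 0
    for t in range(n, len(gold)):
        if all(gold[k] != pred[k] for k in range(t - n, t)):
            total += 1
            hits += int(gold[t] == pred[t])
    return hits / total if total else None

# Toy conversation (hypothetical labels): mistakes at positions 1, 3 and 4.
gold = [0, 1, 2, 1, 0, 2]
pred = [0, 2, 2, 0, 1, 2]

after_one = recovery_probability(gold, pred, 1)
after_two = recovery_probability(gold, pred, 2)
after_three = recovery_probability(gold, pred, 3)
```

In this toy sequence, three positions follow a single mistake and two of them are predicted correctly, one position follows two consecutive mistakes and is predicted correctly, and no position follows three consecutive mistakes.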

Conclusion
In this paper, we proposed two strategies to encode current label uncertainty in sequence-labeling RNN models. The experimental results show that our models achieve a very strong performance on the MapTask and Switchboard corpora using a simple underlying RNN architecture.
Although we experimented with DA classification, the idea presented in this paper is general, and can be applied to many sequence-labeling tasks. Our approach is particularly suitable for tasks involving streaming data where the model only has access to current and previous observations.
In the future, we plan to combine our strategies with more complex neural architectures, and explore their application to other sequence-labeling problems.