Unsupervised Dialogue Act Induction using Gaussian Mixtures

This paper introduces a new unsupervised approach to dialogue act induction. Given a sequence of dialogue utterances, the task is to assign them labels representing their function in the dialogue. Utterances are represented as real-valued vectors encoding their meaning. We model the dialogue as a Hidden Markov model with emission probabilities estimated by Gaussian mixtures, and we use Gibbs sampling for posterior inference. We present results on the standard Switchboard-DAMSL corpus. Our algorithm achieves promising results compared with strong supervised baselines and outperforms other unsupervised algorithms.


Introduction
Modeling the discourse structure is an important step toward understanding a dialogue. A full description of discourse structure is still an open issue; however, some low-level characteristics have already been clearly identified, e.g. dialogue acts (DAs) (Jurafsky and Martin, 2009). A DA represents the meaning of an utterance in the context of the full dialogue.
Automatic DA recognition is fundamental for many applications, starting with dialogue systems (Allen et al., 2007). The expansion of social media in recent years has led to many other interesting applications, e.g. thread discourse structure prediction (Wang et al., 2011), forum search (Seo et al., 2009), and interpersonal relationship identification (Diehl et al., 2007).
Supervised approaches to DA recognition have been successfully investigated by many authors (Stolcke et al., 2000; Klüwer et al., 2010; Kalchbrenner and Blunsom, 2013). However, annotating training data is a slow and expensive process, and the expense grows if we consider different languages and different methods of communication (e.g. telephone conversations, e-mails, chats, forums, Facebook, Twitter, etc.). As social media and other communication channels grow, it has become crucial to investigate unsupervised models. There are, however, only very few related works. Crook et al. (2009) use the Chinese restaurant process and Gibbs sampling to cluster utterances into a flexible number of groups representing DAs in a travel-planning domain. Their model lacks structural information (dependencies between DAs) and works only on the surface level (it represents an utterance as a word-frequency histogram).
The sequential behavior of DAs is examined in (Ritter et al., 2010), where a block Hidden Markov model (HMM) is applied to model conversations on Twitter. The authors incorporate a topic model on top of the HMM to distinguish DAs from topical clusters. They do not directly compare the resulting DAs to gold data; instead, they measure the ability of the model to predict the order of tweets in a conversation. Joty et al. (2011) extend this work by enriching the emission distribution of the HMM to also include information about the speaker and their relative position. A similar approach is investigated by Paul (2012), who uses a mixed-membership Markov model that includes the functionality of topic models and assigns a latent class to each individual token in an utterance. They evaluate on a thread reconstruction task and on the DA induction task, outperforming the method of Ritter et al. (2010).
In this paper, we introduce a new approach to unsupervised DA induction. Like previous work, it is based on HMMs to model the structural dependencies between utterances. The main novelty is the use of multivariate Gaussian distributions for the emissions (utterances) in the HMM. Our approach allows us to represent utterances as real-valued vectors, which opens up opportunities to design various features encoding the properties of each utterance without any modification of the proposed model. We evaluate our model together with several baselines (both with and without supervision) on the standard Switchboard-DAMSL corpus (Jurafsky et al., 1997) and compare them directly with human annotations.
The rest of the paper is organized as follows. We start with the definition of our model (Sections 2, 3, and 4). We present experimental results in Section 5. We conclude in Section 6 and offer some directions for future work.

Proposed Model
Assume we have a set of dialogues D. Each dialogue d_j ∈ D is a sequence of DA utterances d_{j,1}, ..., d_{j,N_j}, where N_j denotes the length of the sequence d_j. Let N denote the length of the corpus, N = Σ_{d_j ∈ D} N_j. We model a dialogue by an HMM with K discrete states representing DAs (see Figure 1). The observation on a state is a feature vector v_{j,i} ∈ R^M representing the DA utterance d_{j,i} (the feature representation is described in Section 4). The HMM thus defines the following joint distribution over observations v_{j,i} and states d_{j,i}:

P(v, d) = ∏_{d_j ∈ D} ∏_{i=1}^{N_j} P(v_{j,i} | d_{j,i}) P(d_{j,i} | d_{j,i-1}).   (1)

We represent the dependency between consecutive HMM states with a set of K multinomial distributions θ over the K states, such that P(d_{j,i} | d_{j,i-1}) = θ_{d_{j,i-1}, d_{j,i}}. We assume the probabilities P(v_{j,i} | d_{j,i}) have the form of a multivariate Gaussian distribution with mean µ_{d_{j,i}} and covariance matrix Σ_{d_{j,i}}. We place conjugate priors on the parameters µ_{d_{j,i}}, Σ_{d_{j,i}}, and θ_{d_{j,i-1}}: a multivariate Gaussian centered at zero for the mean, an inverse-Wishart distribution for the covariance matrix, and a symmetric Dirichlet prior for the multinomials. We place no assumption on the length of a dialogue N_j. The full generative process can thus be summarized as follows:

1. For each DA k ∈ {1, ..., K}, draw the transition distribution θ_k ~ Dir(α), the covariance matrix Σ_k ~ W^{-1}(Ψ, ν), and the mean µ_k ~ N(0, Σ_k / κ).
2. For each dialogue d_j ∈ D and for each position i ∈ {1, ..., N_j}, draw the DA d_{j,i} ~ θ_{d_{j,i-1}} and then the feature vector v_{j,i} ~ N(µ_{d_{j,i}}, Σ_{d_{j,i}}).

Note that κ and ν represent the strength of the prior for the mean and the covariance, respectively, and Ψ is the scale matrix of the inverse-Wishart distribution.
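The two-step generative process above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the function name `generate_dialogue`, the toy sizes, and the use of identity covariance matrices in place of true inverse-Wishart draws are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 4, 3          # toy sizes: K dialogue acts, M-dimensional utterance vectors
alpha, kappa = 1.0, 1.0

# Step 1: per-DA parameters under the conjugate priors.
theta = rng.dirichlet(np.full(K, alpha), size=K)     # theta_k ~ Dir(alpha)
Sigma = [np.eye(M) for _ in range(K)]                # stand-in for inverse-Wishart draws
mu = [rng.multivariate_normal(np.zeros(M), S / kappa) for S in Sigma]

# Step 2: walk the Markov chain over DAs; emit one Gaussian vector per utterance.
def generate_dialogue(n_utterances, start_state=0):
    states, vectors = [], []
    d_prev = start_state
    for _ in range(n_utterances):
        d = rng.choice(K, p=theta[d_prev])               # d_{j,i} ~ theta_{d_{j,i-1}}
        v = rng.multivariate_normal(mu[d], Sigma[d])     # v_{j,i} ~ N(mu_d, Sigma_d)
        states.append(int(d))
        vectors.append(v)
        d_prev = d
    return states, np.array(vectors)

states, vectors = generate_dialogue(10)
```

Sampling a synthetic dialogue this way is also a convenient sanity check for an inference implementation: the sampler should recover clusters close to the generating states.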

Posterior Inference
Our goal is to estimate the parameters of the model in a way that maximizes the joint probability in Equation 1. We apply Gibbs sampling and gradually resample the DA assignments of individual DA utterances. To do so, we need to determine the posterior predictive distribution.
The predictive distribution of the Dirichlet-multinomial has the form of additive smoothing, well known in the context of language modeling. The hyper-parameter of the Dirichlet prior determines how much the predictive distribution is smoothed. Note that we use a symmetric Dirichlet prior, so α in the following equations is a scalar. The predictive distribution for transitions in the HMM can be expressed as

P(d_{j,i} | d_{j,i-1}, d^{\j,i}, α) = ( n^{\j,i}_{(d_{j,i} | d_{j,i-1})} + α ) / ( n^{\j,i}_{(• | d_{j,i-1})} + Kα ),   (2)

where n^{\j,i}_{(d_{j,i} | d_{j,i-1})} is the number of times DA d_{j,i} followed DA d_{j,i-1}. The notation \j,i means that position i in the j-th dialogue is excluded from the counts. The symbol • represents any DA, so that n^{\j,i}_{(• | d_{j,i-1})} = Σ_k n^{\j,i}_{(k | d_{j,i-1})}.

The predictive distribution of the Normal-inverse-Wishart distribution has the form of a multivariate Student t-distribution t_{ν'}(v | µ', Σ') with ν' degrees of freedom, mean vector µ', and covariance matrix Σ'. Following Murphy (2012), for the n feature vectors currently assigned to a given DA, with sample mean v̄ and scatter matrix S, the parameters of the posterior predictive distribution can be estimated as

ν' = ν + n − M + 1,
µ' = (κµ + n v̄) / (κ + n),
Σ' = ( Ψ + S + (κn / (κ + n)) (v̄ − µ)(v̄ − µ)^T ) · (κ + n + 1) / ( (κ + n)(ν + n − M + 1) ),   (3)

i.e. Σ' is a scaled form of the covariance of these vectors. Note that κ, ν, µ, and Ψ are hyper-parameters which need to be set in advance.

Now we can construct the final posterior predictive distribution used for sampling DA assignments:

P(d_{j,i} | d^{\j,i}, v) ∝ P(d_{j,i} | d_{j,i-1}, d^{\j,i}, α) × P(d_{j,i+1} | d_{j,i}, d^{\j,i}, α) × t_{ν'}(v_{j,i} | µ', Σ').   (4)

The product of the first two parts of the equation expresses a score proportional to the probability of the DA at position i in the j-th dialogue given the surrounding HMM states. The third part expresses the probability of the DA assignment given the current feature vector v_{j,i} and all other DA assignments.

We also present a simplified version of the model, which is in fact the standard Gaussian mixture model (GMM). This model does not capture the dependencies between surrounding DAs in the dialogue. Its posterior predictive distribution is

P(d_{j,i} | d^{\j,i}, v) ∝ ( n^{\j,i}_{d_{j,i}} + α ) / ( N − 1 + Kα ) × t_{ν'}(v_{j,i} | µ', Σ'),   (5)

where n^{\j,i}_{d_{j,i}} is the number of utterances currently assigned to DA d_{j,i}. In Section 5 we compare both models to see the strengths of using DA context.
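The additively smoothed transition predictive for the Dirichlet-multinomial can be sketched directly from the transition counts. This is a minimal numpy illustration under assumed names (`transition_predictive`, the toy count matrix); in a full sampler the counts would exclude the position currently being resampled, and the result would be multiplied by the right-context term and the Student-t likelihood before normalizing and sampling.

```python
import numpy as np

def transition_predictive(counts, d_prev, alpha, K):
    """P(d | d_prev) under a symmetric Dirichlet prior: additive smoothing
    of the transition counts n_(k|d_prev)."""
    row = counts[d_prev]                                  # n_(k|d_prev) for all k
    return (row + alpha) / (row.sum() + K * alpha)        # (n + alpha) / (n_total + K*alpha)

K, alpha = 3, 0.5
# toy transition counts: counts[a, b] = number of times DA b followed DA a
counts = np.array([[4, 1, 0],
                   [2, 2, 2],
                   [0, 0, 3]], dtype=float)
p = transition_predictive(counts, d_prev=0, alpha=alpha, K=K)
```

With these counts the predictive after DA 0 is (4.5, 1.5, 0.5)/6.5: the prior keeps unseen transitions (e.g. 0 → 2) at a small but nonzero probability, which is exactly what allows the sampler to escape empty clusters.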

DA Feature Vector
The real-valued vectors v_{j,i} are expected to represent the meaning of d_{j,i}. We use a semantic composition approach based on Frege's principle of compositionality (Pelletier, 1994), which states that the meaning of a complex expression is determined by the composition of its parts, i.e. words.
We use a linear combination of word vectors, where the weights are the inverse-document-frequency (IDF) values of the words. For the word vector representation we use Global Vectors (GloVe) (Pennington et al., 2014), with vectors pre-trained on 6B tokens from Wikipedia 2014 and Gigaword 5. Brychcín and Svoboda (2016) showed that this approach leads to a very good representation of short sentences.
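The IDF-weighted composition can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names (`idf_weights`, `utterance_vector`) and the tiny two-word embedding table are assumptions; in practice the table would be loaded from the pre-trained GloVe files.

```python
import math
import numpy as np

def idf_weights(corpus):
    """IDF values computed from a list of tokenized utterances."""
    n_docs = len(corpus)
    df = {}
    for utterance in corpus:
        for w in set(utterance):          # count each word once per utterance
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n_docs / d) for w, d in df.items()}

def utterance_vector(tokens, embeddings, idf, dim):
    """IDF-weighted linear combination of word vectors (the composition step)."""
    vec = np.zeros(dim)
    for w in tokens:
        if w in embeddings:
            vec += idf.get(w, 0.0) * embeddings[w]
    return vec

# toy 2-d "GloVe" table (placeholder for the real pre-trained vectors)
emb = {"book": np.array([1.0, 0.0]), "flight": np.array([0.0, 1.0])}
corpus = [["book", "flight"], ["book"]]
idf = idf_weights(corpus)
v = utterance_vector(["book", "flight"], emb, idf, dim=2)
```

Note how the weighting behaves: "book" occurs in every toy utterance, so its IDF (and hence its contribution) is zero, while the rarer "flight" dominates the resulting vector.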
For the supervised approaches we also use a bag-of-words (BoW) representation of an utterance, i.e. a separate binary feature representing the occurrence of each word in the utterance.

Experimental Results and Discussion
We use the Switchboard-DAMSL corpus (Jurafsky et al., 1997) to evaluate the proposed methods. The corpus contains transcriptions of telephone conversations between speakers who do not know each other and are given a topic for discussion. We adopt the same set of 42 DA labels and the same train/test data split as suggested in (Stolcke et al., 2000).
In our experiments we set κ = 0, µ = 0, ν = K, Ψ = 1, and α = 50/K. These settings are recommended by (Griffiths and Steyvers, 2004; Murphy, 2012) and we also confirmed them empirically. We always perform 1000 iterations of Gibbs sampling. The number of clusters (the mixture size) is K = 42, and the dimension of the GloVe vectors ranges between M = 50 and M = 300.
The DA induction task is in fact a clustering problem. We cluster the DA utterances and assign the same label to all utterances within one cluster. Standard metrics for evaluating the quality of clusters are purity (PU), collocation (CO), and their harmonic mean (F1). In recent years, v-measure (V1) has also become popular. This entropy-based measure is defined as the harmonic mean of homogeneity (HO, the precision analogue) and completeness (CM, the recall analogue). Rosenberg and Hirschberg (2007) present definitions and a comparison of all these metrics. Note that the same evaluation procedure is often used for other clustering tasks, e.g. unsupervised part-of-speech induction (Christodoulopoulos et al., 2010) or unsupervised semantic role labeling (Woodsend and Lapata, 2015).

Table 1: Accuracy (AC), purity (PU), collocation (CO), f-measure (F1), homogeneity (HO), completeness (CM), and v-measure (V1) for the proposed models, expressed in percents.

Table 1 presents the results of our experiments. We compare both supervised and unsupervised approaches. Models incorporating information about the surrounding DAs (context) are denoted by the prefix ctx. We show the results of three unsupervised approaches: K-means clustering, GMM without context (Eq. 5), and context-dependent GMM (Eq. 4). For the supervised approach we use a Maximum Entropy (ME) classifier (Berger et al., 1996). For the context-dependent version we perform two-round classification: first without the context information, and second incorporating the output of the previous round.
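Purity and the entropy-based v-measure can be computed from the cluster/class contingency table. The sketch below (the function name `cluster_metrics` is ours) follows the Rosenberg and Hirschberg (2007) definitions: homogeneity is 1 − H(C|K)/H(C) and completeness is 1 − H(K|C)/H(K).

```python
import numpy as np

def cluster_metrics(gold, pred):
    """Purity, homogeneity, completeness, and v-measure from the
    cluster/class contingency table."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    classes, clusters = np.unique(gold), np.unique(pred)
    n = len(gold)
    # cont[k, c] = number of items in cluster k with gold class c
    cont = np.array([[np.sum((pred == k) & (gold == c)) for c in classes]
                     for k in clusters], dtype=float)
    purity = cont.max(axis=1).sum() / n

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    h_c = entropy(cont.sum(axis=0) / n)    # class entropy H(C)
    h_k = entropy(cont.sum(axis=1) / n)    # cluster entropy H(K)
    h_c_given_k = sum(entropy(row / row.sum()) * row.sum() / n for row in cont)
    h_k_given_c = sum(entropy(col / col.sum()) * col.sum() / n for col in cont.T)
    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return purity, homogeneity, completeness, v
```

A perfect clustering scores 1.0 on all four metrics regardless of how the cluster labels are permuted, which is exactly the property needed when comparing induced DA clusters against gold labels.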
In addition, Table 1 provides results for three extreme cases: a random label, the majority label, and a distinct label for each utterance (a single utterance per cluster). Note that the last of these achieves a v-measure of 42.4%. In this case, however, completeness approaches 0% as the size of the test data grows (and so, therefore, does v-measure), so this number cannot be taken into account.
To the best of our knowledge, the best-performing supervised system on the Switchboard-DAMSL corpus is presented in (Kalchbrenner and Blunsom, 2013) and achieves an accuracy of 73.9%. Our best supervised baseline is approximately 1% worse. In all experiments the context information proved to be very useful. The best result among the unsupervised models is achieved with 300-dimensional GloVe vectors (F1 score 65.7% and v-measure 41.2%). We outperform both the block HMM (BHMM) (Ritter et al., 2010), which achieves an F1 score of 41.1% and a v-measure of 34.7%, and the mixed-membership HMM (M4) (Paul, 2012), which achieves an F1 score of 45.1% and a v-measure of 18.0%. Comparing our method with its supervised version (F1 score 74.5% and v-measure 58.6%), we can state that an HMM with GMMs is a very promising direction for the unsupervised DA induction task.

Conclusion and Future Work
We introduced an HMM-based model for unsupervised DA induction. We represent each utterance as a real-valued vector encoding its meaning, and our model predicts these vectors in the context of the surrounding DA utterances. We compared our model with several strong baselines and showed its strengths.
As the main direction for future work, we plan to experiment with more languages and more corpora. A more thorough study of the feature vector representation should also be done.
We also plan to investigate the learning process in much more depth. It was beyond the scope of this paper to evaluate the time requirements of the algorithm. Moreover, there are several ways to speed up parameter estimation, e.g. by the Cholesky decomposition of the covariance matrix, as described in (Das et al., 2015). In our current implementation the number of DAs is set in advance. It would be very interesting to use a non-parametric version of the GMM, i.e. to change the sampling scheme to estimate the number of DAs by a Chinese restaurant process.