Hidden Softmax Sequence Model for Dialogue Structure Analysis

We propose a new unsupervised learning model, the hidden softmax sequence model (HSSM), based on the Boltzmann machine for dialogue structure analysis. The model employs three types of units in the hidden layer to discover latent dialogue structures: softmax units, which represent latent states of utterances; binary units, which represent latent topics specific to individual dialogues; and a binary unit that represents the global general topic shared across the whole dialogue corpus. In addition, the model contains extra connections between adjacent hidden softmax units to formulate the dependency between latent states. Two different real-world dialogue corpora, Twitter-Post and AirTicketBooking, are used for extensive comparative experiments, and the results show that the proposed model outperforms popular state-of-the-art approaches.


Introduction
Dialogue structure analysis is an important and fundamental task in natural language processing. The technology provides essential clues for solving real-world problems, such as producing dialogue summaries (Murray et al., 2006; Liu et al., 2010), controlling conversational agents (Wilks, 2006), and designing interactive dialogue systems (Young, 2006; Allen et al., 2007). Studies of dialogue modeling generally assume that each dialogue has a unique latent structure (namely, the dialogue structure), which consists of a series of latent states. Some past work relies mainly on supervised or semi-supervised learning, which involves extensive human effort to manually construct a latent state inventory and to label training samples. Cohen et al. (2004) developed an inventory of latent states specific to e-mail in an office domain by inspecting a large corpus of e-mail. Jeong et al. (2009) employed semi-supervised learning to transfer latent states from labeled speech corpora to Internet media and e-mail. The need for extensive human effort limits both the size of the training set (which is essential to supervised learning) and the range of application domains.
In recent years, there has been some work on modeling dialogues with unsupervised learning methods, which operate only on unlabeled observed data. Crook et al. (2009) employed Dirichlet process mixture clustering models to recognize latent states for each utterance in dialogues from a travel-planning domain, but they did not examine the sequential structure of dialogues. Chotimongkol (2008) proposed a hidden Markov model (HMM) based dialogue analysis model to study the structure of task-oriented conversations from an in-domain dialogue corpus. More recently, Ritter et al. (2010) extended the HMM-based conversation model by introducing additional word sources for the topic learning process. Zhai and Williams (2014) assumed that words in an utterance are emitted from topic models under the HMM framework, with topics shared across all latent states. All these dialogue structure analysis models are directed generative models, in which HMMs, language models and topic models are combined.
In this study, we attempt to develop a Boltzmann machine based undirected generative model for dialogue structure analysis. For document modeling with undirected generative models, Hinton and Salakhutdinov (2009) proposed a general framework, the replicated softmax model (RSM), for topic modeling based on the restricted Boltzmann machine (RBM). That model focuses on document-level topic analysis and cannot be applied to structure analysis. We propose a hidden softmax sequence model (HSSM) for dialogue modeling and structure analysis. HSSM is a special two-layer Boltzmann machine. The visible layer contains softmax units used to model the words in a dialogue, the same as the visible layer in RSM (Hinton and Salakhutdinov, 2009). However, the hidden layer has a completely different design. There are three kinds of hidden units: softmax hidden units, which represent the latent states of dialogues; binary units, which represent dialogue specific topics; and a special binary unit, which represents the general topic of the dialogue corpus. Moreover, unlike RSM, whose hidden binary units are conditionally independent given the visible units, HSSM has extra connections that formulate the dependency between adjacent softmax units in the hidden layer. These connections link the latent states of two adjacent utterances. Therefore, HSSM can be considered a special Boltzmann machine.
The remainder of this paper is organized as follows. Section 2 introduces two real world dialogue corpora utilized in our experiments. Section 3 describes the proposed hidden softmax sequence model. Experimental results and discussions are presented in Section 4. Finally, Section 5 presents our conclusions.

Data Set
Two different datasets are utilized to test the effectiveness of our proposed model: a corpus of post conversations drawn from Twitter (Twitter-Post), and a corpus of task-oriented human-human dialogues in the airline ticket booking domain (AirTicketBooking).

Twitter-Post
Conversations on Twitter are carried out by replying or responding to specific posts with short 140-character messages. The post length restriction gives Twitter more chat-like interactions than blog posts. The style of writing used on Twitter is widely varied, highly ungrammatical, and often contains spelling errors. For example, the terms "be4", "b4", and "bef4" often appear in Twitter posts to represent the word "before".
In total, we collected about 900,000 raw Twitter dialogue sessions. The majority of conversation sessions are very short, and the frequencies of conversation session lengths follow a power-law relationship, as described in (Ritter et al., 2010). For simplicity, non-English sentences were dropped in the data preprocessing stage, and non-English characters, punctuation marks, and some meaningless tokens (such as "&") were also filtered from the dialogues. We filtered out short Twitter dialogue sessions and randomly sampled 5,000 dialogues (with the number of utterances per dialogue ranging from 5 to 25) to build the Twitter-Post dataset.
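The preprocessing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `clean_utterance` and `build_dataset` are hypothetical, lowercasing is an added assumption, and "non-English characters" is approximated by keeping only ASCII letters, digits and whitespace. Language identification for dropping non-English sentences is not shown.

```python
import random
import re


def clean_utterance(text):
    """Replace every character that is not an ASCII letter, digit or whitespace
    with a space, then lowercase (an assumption) and collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return " ".join(text.lower().split())


def build_dataset(dialogues, min_len=5, max_len=25, sample_size=5000, seed=0):
    """Keep dialogues with min_len..max_len non-empty utterances, then sample.

    dialogues: list of dialogues, each a list of raw utterance strings.
    """
    kept = []
    for dialogue in dialogues:
        utterances = [clean_utterance(u) for u in dialogue]
        utterances = [u for u in utterances if u]  # drop utterances emptied by cleaning
        if min_len <= len(utterances) <= max_len:
            kept.append(utterances)
    if len(kept) > sample_size:
        kept = random.Random(seed).sample(kept, sample_size)
    return kept
```

The length bounds and the sample size default to the values reported for Twitter-Post; everything else is a placeholder choice.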

AirTicketBooking
The AirTicketBooking corpus consists of a set of task-oriented human-human Mandarin dialogues from an airline ticket booking service center. The manual transcripts of the speech dialogues are used in our experiments. In this dataset, each dialogue has a relatively clear underlying structure. A dialogue often begins with a customer's request about airline ticket issues. The service agent then usually checks the client's personal information, such as name, phone number and credit card number, and afterwards starts to deal with the client's request. In total, we collected 1,890 text-based dialogue sessions containing about 40,000 conversation utterances, with lengths ranging from 15 to 100.

We design an undirected generative model based on the Boltzmann machine. Dialogue structure analysis models are generally based on an underlying assumption: each utterance in a dialogue is generated from one latent state, which has a causal effect on the words. For instance, the utterance "Tomorrow afternoon, about 3 o'clock" in the AirTicketBooking dataset corresponds to the latent state "Time Information". However, by carefully examining the words in dialogues we can observe that not all words are generated from the latent states (Ritter et al., 2010; Zhai and Williams, 2014). Some words are relevant to a global or background topic shared across dialogues. For example, "about" and "that" belong to a global (general English) topic. Other words in a dialogue may be strongly related to the dialogue specific topic. For example, "cake", "toast" and "pizza" may appear in a Twitter dialogue about a specific topic, "food". From the perspective of a generative model, we can also consider the words in a dialogue to be generated by a mixture of latent states, a global/background topic, and a dialogue specific topic. Therefore, there are three kinds of units in the hidden layer of our proposed model, which are displayed in Figure 1.
$h^{\phi}$ is a softmax unit, which indicates the latent state of an utterance. $h^{\psi}$ and $h^{\xi}$ represent the general topic and the dialogue specific topic, respectively. For the visible layer, we use softmax units to model the words in each utterance, which is the same approach as in RSM (Hinton and Salakhutdinov, 2009). In Section 3.2, we propose a basic model based on the Boltzmann machine to formulate each word in the utterances of dialogues.
A dialogue can be viewed abstractly as a sequence of latent states in a certain reasonable order. Therefore, formulating the dependency between latent states is another important issue for dialogue structure analysis. In our model, we assume that each utterance's latent state depends on its two neighbours, so there are connections between each pair of adjacent softmax units in the hidden layer. The details of the model are presented in Section 3.3.

Figure 1: The hidden layer consists of three types of hidden units: softmax hidden units used for representing latent states, a binary stochastic hidden unit used for representing the dialogue specific topic, and a special binary stochastic hidden unit used for representing the corpus general topic.

Figure 2: Hidden Softmax Model. Upper: The model for a dialogue session containing three utterances. Connection lines in the same color related to a latent state represent the same weight matrix. Lower: A different interpretation of the Hidden Softmax Model, in which the $D_r$ visible softmax units in the $r$-th utterance are replaced by one single multinomial unit which is sampled $D_r$ times.

Table 1 summarizes the important notation used in this paper.

HSM: Hidden Softmax Model
Before introducing the full learning model for dialogue structure analysis, we first discuss a simplified version, the Hidden Softmax Model (HSM), which is based on the Boltzmann machine and assumes that the latent variables are independent given the visible units. HSM has a two-layer architecture, as shown in Figure 2. The energy of the state $\{V, h^{\phi}, h^{\psi}, h^{\xi}\}$ is defined as follows:

$$E(V, h^{\phi}, h^{\psi}, h^{\xi}) = \bar{E}_{\phi}(V, h^{\phi}) + \bar{E}_{\psi}(V, h^{\psi}) + \bar{E}_{\xi}(V, h^{\xi}) + C(V) \tag{1}$$

where $\bar{E}_{\phi}(V, h^{\phi})$, $\bar{E}_{\psi}(V, h^{\psi})$ and $\bar{E}_{\xi}(V, h^{\xi})$ are sub-energy functions related to the hidden variables $h^{\phi}$, $h^{\psi}$ and $h^{\xi}$, respectively, and $C(V)$ is the shared visible-unit bias term. Suppose $K$ is the dictionary size, $D_r$ is the size of the $r$-th utterance (i.e. the number of words in the $r$-th utterance), and $R$ is the number of utterances in a dialogue. For each utterance $v_r$ ($r = 1, \ldots, R$) in the dialogue session we have a hidden variable vector $h^{\phi}_r$ (of size $J$) as the latent state of the utterance, and the sub-energy function $\bar{E}_{\phi}(V, h^{\phi})$ is defined by

$$\bar{E}_{\phi}(V, h^{\phi}) = -\sum_{r=1}^{R}\sum_{j=1}^{J}\sum_{i=1}^{D_r}\sum_{k=1}^{K} h^{\phi}_{rj} W^{\phi}_{rjik} v_{rik} - \sum_{r=1}^{R}\sum_{j=1}^{J} h^{\phi}_{rj} a^{\phi}_{rj} \tag{2}$$

where $v_{rik} = 1$ means that the $i$-th visible unit $v_{ri}$ in the $r$-th utterance takes on the $k$-th value, $h^{\phi}_{rj} = 1$ means that the $r$-th softmax hidden unit takes on the $j$-th value, and $a^{\phi}_{rj}$ is the corresponding bias. $W^{\phi}_{rjik}$ is a symmetric interaction term between the visible unit $v_{ri}$ taking on the $k$-th value and the hidden variable $h^{\phi}_r$ taking on the $j$-th value.
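The sub-energy function $\bar{E}_{\psi}(V, h^{\psi})$, related to the global general topic of the corpus, is defined by

$$\bar{E}_{\psi}(V, h^{\psi}) = -\sum_{r=1}^{R}\sum_{i=1}^{D_r}\sum_{k=1}^{K} h^{\psi} W^{\psi}_{rik} v_{rik} - h^{\psi} a^{\psi} \tag{3}$$

The sub-energy function $\bar{E}_{\xi}(V, h^{\xi})$ corresponds to the dialogue specific topic, and is defined by

$$\bar{E}_{\xi}(V, h^{\xi}) = -\sum_{r=1}^{R}\sum_{i=1}^{D_r}\sum_{k=1}^{K} h^{\xi} W^{\xi}_{rik} v_{rik} - h^{\xi} a^{\xi} \tag{4}$$

$W^{\psi}_{rik}$ in Eq. (3) and $W^{\xi}_{rik}$ in Eq. (4) are two symmetric interaction terms between the visible units and the corresponding hidden units, similar to $W^{\phi}_{rjik}$ in Eq. (2); $a^{\psi}$ and $a^{\xi}$ are the corresponding biases. $C(V)$ is defined by

$$C(V) = -\sum_{r=1}^{R}\sum_{i=1}^{D_r}\sum_{k=1}^{K} v_{rik} b_{rik} \tag{5}$$

where $b_{rik}$ is the corresponding bias.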
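The probability that the model assigns to a visible configuration $V$ is

$$P(V) = \frac{1}{Z}\sum_{h^{\phi}, h^{\psi}, h^{\xi}} \exp\left(-E(V, h^{\phi}, h^{\psi}, h^{\xi})\right), \qquad Z = \sum_{V}\sum_{h^{\phi}, h^{\psi}, h^{\xi}} \exp\left(-E(V, h^{\phi}, h^{\psi}, h^{\xi})\right) \tag{6}$$

where $Z$ is known as the partition function or normalizing constant.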
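In our proposed model, each word in a dialogue is represented by a softmax unit. For the sake of simplicity, we assume that the order of words in an utterance is ignored. Therefore, all of these softmax units can share the same set of weights connecting them to the hidden units, and the sub-energy functions in Eq. (1) and the visible bias term $C(V)$ can be redefined as follows:

$$\bar{E}_{\phi}(V, h^{\phi}) = -\sum_{r=1}^{R}\sum_{j=1}^{J}\sum_{k=1}^{K} h^{\phi}_{rj} W^{\phi}_{jk} \hat{v}_{rk} - \sum_{r=1}^{R} D_r \sum_{j=1}^{J} h^{\phi}_{rj} a^{\phi}_{j} \tag{7}$$

$$\bar{E}_{\psi}(V, h^{\psi}) = -\sum_{k=1}^{K} h^{\psi} W^{\psi}_{k} \hat{v}_{k} - D\, h^{\psi} a^{\psi} \tag{8}$$

$$\bar{E}_{\xi}(V, h^{\xi}) = -\sum_{k=1}^{K} h^{\xi} W^{\xi}_{k} \hat{v}_{k} - D\, h^{\xi} a^{\xi} \tag{9}$$

$$C(V) = -\sum_{k=1}^{K} \hat{v}_{k} b_{k} \tag{10}$$

where $\hat{v}_{rk} = \sum_{i=1}^{D_r} v_{rik}$ denotes the count of the $k$-th word in the $r$-th utterance of the dialogue, and $\hat{v}_{k} = \sum_{r=1}^{R} \hat{v}_{rk}$ is the count of the $k$-th word in the whole dialogue session. $D_r$ and $D$ ($D = \sum_{r=1}^{R} D_r$) are employed as scaling parameters, which make the hidden units behave sensibly when dealing with dialogues of different lengths (Hinton and Salakhutdinov, 2009).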
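To make the count-based energy concrete, the following minimal sketch evaluates it for one dialogue. It assumes the RSM-style form in which each hidden bias is scaled by the utterance length $D_r$ (for latent states) or the dialogue length $D$ (for the two topic units). The function names and the parameter dictionary layout (`W_phi`, `a_phi`, `W_psi`, `a_psi`, `W_xi`, `a_xi`, `b`) are hypothetical and chosen only for illustration.

```python
def word_counts(word_ids, vocab_size):
    """Bag-of-words count vector for a list of word ids."""
    counts = [0] * vocab_size
    for k in word_ids:
        counts[k] += 1
    return counts


def hsm_energy(dialogue, states, h_psi, h_xi, p):
    """Energy of one configuration under the shared-weight (count-based) form.

    dialogue: list of utterances, each a list of word ids in [0, K)
    states:   latent state index j for each utterance (one-hot h_phi implied)
    h_psi, h_xi: 0/1 values of the general-topic and dialogue-topic units
    p: dict with W_phi[j][k], a_phi[j], W_psi[k], a_psi, W_xi[k], a_xi, b[k]
    """
    vocab_size = len(p["b"])
    total = [0] * vocab_size
    energy = 0.0
    for utterance, j in zip(dialogue, states):
        counts = word_counts(utterance, vocab_size)
        total = [t + c for t, c in zip(total, counts)]
        d_r = len(utterance)  # utterance length scales the state bias
        energy -= sum(w * c for w, c in zip(p["W_phi"][j], counts)) + d_r * p["a_phi"][j]
    d = sum(total)  # dialogue length scales the topic biases
    energy -= h_psi * (sum(w * c for w, c in zip(p["W_psi"], total)) + d * p["a_psi"])
    energy -= h_xi * (sum(w * c for w, c in zip(p["W_xi"], total)) + d * p["a_xi"])
    energy -= sum(b * c for b, c in zip(p["b"], total))
    return energy
```

Representing each softmax state by its index rather than a one-hot vector keeps the sketch short; a vectorized implementation would be preferable for real corpora.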

HSSM: Hidden Softmax Sequence Model
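In this section, we consider the dependency between the latent states of adjacent utterances, and extend HSM to the hidden softmax sequence model (HSSM), which is displayed in Figure 3. We define the energy of the state $\{V, h^{\phi}, h^{\psi}, h^{\xi}\}$ in HSSM as follows:

$$E(V, h^{\phi}, h^{\psi}, h^{\xi}) = \bar{E}_{\phi}(V, h^{\phi}) + \bar{E}_{\psi}(V, h^{\psi}) + \bar{E}_{\xi}(V, h^{\xi}) + C(V) + E_{\Phi}(h^{\phi}, h^{\phi})$$

where $E_{\Phi}(h^{\phi}, h^{\phi})$ is used to formulate the dependency between the latent variables $h^{\phi}$, and is defined as follows:

$$E_{\Phi}(h^{\phi}, h^{\phi}) = -\sum_{q=1}^{J} h^{\phi}_{s} F^{s}_{q} h^{\phi}_{1q} - \sum_{r=1}^{R-1}\sum_{j=1}^{J}\sum_{q=1}^{J} h^{\phi}_{rj} F_{jq} h^{\phi}_{r+1,q} - \sum_{j=1}^{J} h^{\phi}_{Rj} F^{e}_{j} h^{\phi}_{e} \tag{16}$$

where $h^{\phi}_{s}$ and $h^{\phi}_{e}$ are two constant scalar variables ($h^{\phi}_{s} \equiv 1$, $h^{\phi}_{e} \equiv 1$), which represent the virtual beginning state unit and ending state unit of a dialogue. $F^{s}$ is a vector of size $J$ whose elements measure the dependency between $h^{\phi}_{s}$ and the latent softmax units of the first utterance. $F^{e}$ also contains $J$ elements and, in contrast to $F^{s}$, measures the dependency between $h^{\phi}_{e}$ and the latent softmax units of the last utterance. $F$ is a symmetric matrix that formulates the dependency between each pair of adjacent hidden units $(h^{\phi}_{r}, h^{\phi}_{r+1})$, $r = 1, \ldots, R-1$.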
Figure 3: Hidden softmax sequence model. A connection between each pair of adjacent hidden softmax units is added to formulate the dependency between the two corresponding latent states.
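As an illustration of the hidden-to-hidden term, the sketch below computes the sequence energy for a given assignment of latent states. It assumes the usual negative bilinear form, with the virtual beginning and ending units fixed to 1 so that they contribute only the entries of F_s and F_e selected by the first and last states. The function name `chain_energy` and the argument layout are hypothetical.

```python
def chain_energy(states, F, F_s, F_e):
    """Energy contributed by dependencies between adjacent latent states.

    states: latent state index for each utterance, in dialogue order
    F:      J x J matrix of adjacent-state interaction weights
    F_s:    length-J weights between the virtual start unit and the first state
    F_e:    length-J weights between the last state and the virtual end unit
    """
    energy = -F_s[states[0]] - F_e[states[-1]]
    for prev_state, next_state in zip(states, states[1:]):
        energy -= F[prev_state][next_state]
    return energy
```

A lower energy means the state sequence is more compatible with the learned transition structure, so frequently observed transitions should end up with larger entries in F.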

Parameter Learning
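Exact maximum likelihood learning in the proposed model is intractable. Contrastive divergence (Hinton, 2002) can be used to train HSM; however, it cannot be used for HSSM, because the hidden-to-hidden interaction terms, $\{F, F^{s}, F^{e}\}$, make it intractable to obtain exact samples from the conditional distribution $P(h^{\phi}_{rj} = 1 \mid V)$, $r \in [1, R]$, $j \in [1, J]$. We use mean-field variational inference (Hinton and Zemel, 1994; Neal and Hinton, 1998; Jordan et al., 1999) and a stochastic approximation procedure (SAP) (Tieleman, 2008) to estimate HSSM's parameters. Variational learning is used to obtain the data-dependent expectations, and SAP is used to estimate the model's expectations. The log-likelihood of HSSM has the following variational lower bound:

$$\log P(V; \theta) \geq \sum_{h} Q(h) \log P(V, h; \theta) + H(Q) \tag{17}$$

In theory, $Q(h)$ can be any distribution over $h$. $\theta = \{W^{\phi}, W^{\psi}, W^{\xi}, F, F^{s}, F^{e}\}$ (the bias terms are omitted for clarity) are the model parameters, $h = \{h^{\phi}, h^{\psi}, h^{\xi}\}$ represents all the hidden variables, and $H(\cdot)$ is the entropy functional. In variational learning, we try to find parameters that minimize the Kullback-Leibler divergence between $Q(h)$ and the true posterior $P(h \mid V; \theta)$. A naive mean-field approach yields a fully factorized distribution for $Q(h)$:

$$Q(h) = \left[\prod_{r=1}^{R} q(h^{\phi}_{r})\right] q(h^{\psi})\, q(h^{\xi}) \tag{18}$$

where $q(h^{\phi}_{rj} = 1) = \mu^{\phi}_{rj}$, $q(h^{\psi} = 1) = \mu^{\psi}$, and $q(h^{\xi} = 1) = \mu^{\xi}$. $\mu = \{\mu^{\phi}, \mu^{\psi}, \mu^{\xi}\}$ are the parameters of $Q(h)$. The lower bound on the log-probability $\log P(V; \theta)$ then has the form:

$$\log P(V; \theta) \geq -\bar{E}_{\phi}(V, \mu^{\phi}) - \bar{E}_{\psi}(V, \mu^{\psi}) - \bar{E}_{\xi}(V, \mu^{\xi}) - C(V) - E_{\Phi}(\mu^{\phi}, \mu^{\phi}) - \log Z + H(Q) \tag{19}$$

where $\bar{E}_{\phi}(V, \mu^{\phi})$, $\bar{E}_{\psi}(V, \mu^{\psi})$, $\bar{E}_{\xi}(V, \mu^{\xi})$, and $E_{\Phi}(\mu^{\phi}, \mu^{\phi})$ have the same forms as Eqs. (7), (8), (9), and (16), respectively, with $h$ replaced by $\mu$. We can maximize this lower bound with respect to the parameters $\mu$ for fixed $\theta$, and obtain the mean-field fixed-point equations:

$$\mu^{\phi}_{rj} = \frac{\exp\left(\sum_{k=1}^{K} W^{\phi}_{jk} \hat{v}_{rk} + D_r a^{\phi}_{j} + D^{j}_{prev} + D^{j}_{next}\right)}{\sum_{j'=1}^{J} \exp\left(\sum_{k=1}^{K} W^{\phi}_{j'k} \hat{v}_{rk} + D_r a^{\phi}_{j'} + D^{j'}_{prev} + D^{j'}_{next}\right)}$$

$$\mu^{\psi} = \sigma\left(\sum_{k=1}^{K} W^{\psi}_{k} \hat{v}_{k} + D a^{\psi}\right), \qquad \mu^{\xi} = \sigma\left(\sum_{k=1}^{K} W^{\xi}_{k} \hat{v}_{k} + D a^{\xi}\right)$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic function, and $D^{j}_{prev}$ and $D^{j}_{next}$ are two terms arising from the derivative of the RHS of Eq. (19) with respect to $\mu^{\phi}_{rj}$, defined by

$$D^{j}_{prev} = \begin{cases} F^{s}_{j}, & r = 1 \\ \sum_{q=1}^{J} \mu^{\phi}_{r-1,q} F_{qj}, & r > 1 \end{cases} \qquad D^{j}_{next} = \begin{cases} \sum_{q=1}^{J} F_{jq}\, \mu^{\phi}_{r+1,q}, & r < R \\ F^{e}_{j}, & r = R \end{cases}$$

The update of $\mu$ can be carried out iteratively until convergence.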
Then, (V, µ) can be considered as a special "state" of HSSM, thus the SAP can be applied to update the model's parameters, θ, for fixed (V, µ).
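The iterative mean-field update for the latent-state parameters can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes that the word-dependent part of each score (weights times counts, plus the scaled bias) has been precomputed into `unary[r][j]`, and that each row is updated as a softmax of that score plus messages from the two neighbouring utterances (or from the virtual start and end units at the boundaries). All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_field_states(unary, F, F_s, F_e, n_iters=50, tol=1e-8):
    """Coordinate-wise mean-field updates for the softmax state units.

    unary[r][j]: data-dependent score of state j for utterance r
    F, F_s, F_e: adjacent-state, start and end interaction weights
    Returns mu, where mu[r][j] approximates P(state of utterance r is j | V).
    """
    R, J = len(unary), len(unary[0])
    mu = [[1.0 / J] * J for _ in range(R)]  # uniform initialization
    for _ in range(n_iters):
        max_change = 0.0
        for r in range(R):
            scores = []
            for j in range(J):
                if r == 0:
                    prev_term = F_s[j]
                else:
                    prev_term = sum(mu[r - 1][q] * F[q][j] for q in range(J))
                if r == R - 1:
                    next_term = F_e[j]
                else:
                    next_term = sum(F[j][q] * mu[r + 1][q] for q in range(J))
                scores.append(unary[r][j] + prev_term + next_term)
            new_row = softmax(scores)
            max_change = max(max_change, max(abs(a - b) for a, b in zip(new_row, mu[r])))
            mu[r] = new_row
        if max_change < tol:  # stop once no entry moves noticeably
            break
    return mu
```

With all interaction weights set to zero the update reduces to an independent softmax per utterance, which corresponds to the HSM case where latent states are conditionally independent.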

Experiments and Discussions
It is not easy to evaluate the performance of a dialogue structure analysis model. In this study, we examined our model via qualitative visualization and quantitative analysis, as done in (Ritter et al., 2010; Zhai and Williams, 2014). We implemented five conventional models for an extensive comparative study on the two corpora, Twitter-Post and AirTicketBooking. The conventional models are LMHMM (Chotimongkol, 2008), LMHMMS (Ritter et al., 2010), and TMHMM, TMHMMS, and TMHMMSS (Zhai and Williams, 2014). In our experiments, for each corpus we randomly select 80% of the dialogues for training and use the remaining 20% for testing. We evaluate all the models with three different numbers of latent states (10, 20 and 30). In TMHMM, TMHMMS and TMHMMSS, the number of "topics" in the latent states and in a dialogue is a hyper-parameter. We conducted a series of experiments with varying numbers of topics, and the results showed that 20 is the best choice on the two corpora. Therefore, for all the following experimental results of TMHMM, TMHMMS and TMHMMSS, the number of topics is set to 20.
The number of estimation iterations for all the models on the training sets is set to 10,000, and the number of inference iterations on the held-out test sets is set to 1,000. To speed up the learning of HSSM, the datasets are divided into mini-batches of 15 dialogues each. In addition, the learning rate and momentum are set to 0.1 and 0.9, respectively.
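The data split and mini-batch setup described above could look like the following sketch. The helper names are hypothetical, and the momentum rule shown is the classical form (a velocity accumulated from scaled gradients, used for gradient ascent on the log-likelihood); the exact update rule used in the experiments is not stated here, so this is an assumption.

```python
import random


def split_and_batch(dialogues, train_fraction=0.8, batch_size=15, seed=0):
    """Randomly split dialogues into train/test and cut the train set into mini-batches."""
    shuffled = list(dialogues)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    train, test = shuffled[:cut], shuffled[cut:]
    batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
    return batches, test


def momentum_step(param, grad, velocity, learning_rate=0.1, momentum=0.9):
    """One classical-momentum ascent step for a scalar parameter (assumed form)."""
    velocity = momentum * velocity + learning_rate * grad
    return param + velocity, velocity
```

The defaults mirror the reported settings: an 80/20 split, 15 dialogues per mini-batch, a learning rate of 0.1 and a momentum of 0.9.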

Qualitative Evaluation
Dialogues in Twitter-Post usually begin with one of three latent states: broadcasting what the Twitter users are doing now ("Status"), broadcasting an interesting link or quote to their followers ("Reference Broadcast"), or asking their followers a question ("Question to Followers"). We find that the structures discovered by HSSM and LMHMMS with 10 latent states are the most reasonable to interpret. For example, the initiating state ("Status", "Reference Broadcast", or "Question to Followers") is often followed by a "Reaction" to a "Reference Broadcast" (or "Status"), a "Comment" on a "Status", or a "Question" about a "Status" ("Reference Broadcast", or "Question to Followers"). Compared with LMHMMS, besides obtaining similar latent states, HSSM shows a strong ability to learn the sequential dependency between latent states. Take the following simple Twitter dialogue session as an example:
: rt i like katy perry lt lt we see tht lol
LMHMMS labelled the second utterance ("lol gd morning") and the third utterance ("lol good morning how u") with the same latent state, while HSSM treats them as two different latent states (though they contain almost the same words). The result is reasonable: the first "gd morning" is a greeting, while the second "gd morning" is a response.
For the AirTicketBooking dataset, the state-transition diagram generated by our model with 10 latent states is presented in Figure 4, and several example utterances corresponding to the latent states are shown in Table 2. In general, conversations begin with the service agent's short greeting, such as "Hi, very glad to be of service.", and then move on to checking the passenger's identity information or inquiring about the passenger's air ticket demand; alternatively, the greeting is directly interrupted by the passenger's booking demand, which is usually associated with place information. After that, conversations continue with other booking-related issues, such as checking the ticket price or flight time.
The flowchart produced by HSSM can be reasonably interpreted with knowledge of the air ticket booking domain, and compared with other models it is the most consistent with the agents' real workflow at the Ticket Booking Corporation (see footnote 3). We notice that conventional models cannot clearly distinguish some related latent states from each other. For example, these baseline models often confuse the latent state "Price Info" with the latent state "Reservation", because certain words are assigned large weights in both states, such as "打折 (discount)" and "信用卡 (credit card)". Furthermore, only HSSM and LMHMMS have dialogue specific topics, and the experimental results show that HSSM learns them much better than LMHMMS, which often misrecognizes corpus general words as belonging to the dialogue specific topic (an example is presented in Table 3).

Quantitative Evaluation
For quantitative evaluation, we examine HSSM and the traditional models with log likelihood and an ordering task on the held-out test sets of Twitter-Post and AirTicketBooking.

Footnote 3: We hide the corporation's real name for privacy reasons.

Log Likelihood. The likelihood metric measures the probability of generating the test set with a specified model. The likelihood of LMHMM and TMHMM can be computed directly with the forward algorithm. However, since the likelihoods of LMHMMS, TMHMMS and TMHMMSS are intractable to compute due to the local dependencies with respect to certain latent variables, Chib-style estimating algorithms (Wallach et al., 2009) are employed in our experiments. For HSSM, the partition function is the key problem in calculating the likelihood, and it can be effectively estimated by Annealed Importance Sampling (AIS) (Neal, 2001; Salakhutdinov and Murray, 2008). Figure 5 presents the likelihood of the different models on the two held-out datasets. We can observe that HSSM achieves a better likelihood than all the other models for every number of latent states. On the Twitter-Post dataset our model slightly surpasses LMHMMS, and on the AirTicketBooking dataset it performs much better than all the traditional models.
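For reference, the forward algorithm mentioned above can be sketched in log space as follows. The sketch treats each utterance as a single observation whose per-state emission log-probability has already been computed (for example by a state-specific language model); the function names and argument layout are hypothetical and not taken from any of the compared systems.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of log-domain values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def hmm_log_likelihood(log_init, log_trans, log_emit):
    """Log-likelihood of one dialogue under an HMM via the forward algorithm.

    log_init[j]:     log P(first state = j)
    log_trans[i][j]: log P(next state = j | current state = i)
    log_emit[t][j]:  log P(utterance t | state j)
    """
    J = len(log_init)
    alpha = [log_init[j] + log_emit[0][j] for j in range(J)]
    for t in range(1, len(log_emit)):
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(J)]) + log_emit[t][j]
            for j in range(J)
        ]
    return logsumexp(alpha)
```

Working in the log domain avoids underflow for long dialogues, where the product of many small emission probabilities would otherwise vanish.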
Ordering Test. Following previous work (Barzilay and Lee, 2004; Ritter et al., 2010; Zhai and Williams, 2014), we use Kendall's τ (Kendall, 1938) as the evaluation metric, which measures the similarity between two orderings and ranges from −1 (indicating a reverse ordering) to +1 (indicating an identical ordering). The basic idea is as follows: for each dialogue session with n utterances in the test set, we first generate all n! permutations of the utterances; we then evaluate the probability of each permutation and measure the similarity, i.e. Kendall's τ, between the maximum-probability permutation and the original order; finally, we average the τ values over all dialogue sessions to obtain the model's ordering test score. As pointed out by Zhai and Williams (2014), however, it is infeasible to enumerate all possible permutations of a dialogue session when the number of utterances is large. In our experiments, we employ the incrementally adding permutation strategy used by Zhai and Williams (2014) to build up the permutation set. The results of the ordering test are presented in Figure 6. We can see that HSSM performs better than all the other models. Among the conventional models, it is interesting that LMHMMS, TMHMMS and TMHMMSS perform worse than LMHMM and TMHMM. This is likely because the latter two models allow words to be emitted only from latent states (Zhai and Williams, 2014), while the former three models allow words to be generated from additional sources. This also indicates HSSM's effectiveness in modeling the distinct sources of information underlying dialogues.
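The ordering score for a single dialogue can be illustrated with the brute-force sketch below, which enumerates every permutation and is therefore only practical for very short dialogues; it does not implement the incremental permutation strategy used in the experiments. The names `kendall_tau` and `ordering_score` are hypothetical, and `log_prob` stands for any model-specific scoring function supplied by the caller.

```python
from itertools import combinations, permutations


def kendall_tau(order):
    """Kendall's tau between `order` (a permutation of 0..n-1, n >= 2)
    and the original order 0, 1, ..., n-1."""
    pairs = list(combinations(range(len(order)), 2))
    discordant = sum(1 for i, j in pairs if order[i] > order[j])
    return 1.0 - 2.0 * discordant / len(pairs)


def ordering_score(n, log_prob):
    """Tau of the highest-scoring permutation of n utterances.

    log_prob: function mapping a tuple of original utterance indices
              (one candidate ordering) to a model score.
    """
    best = max(permutations(range(n)), key=log_prob)
    return kendall_tau(best)
```

Averaging `ordering_score` over all test dialogues would give the corpus-level score; with ties in the model score, `max` simply keeps the first maximal permutation it encounters.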

Discussion
The experimental results illustrate the effectiveness of the proposed undirected dialogue structure analysis model based on the Boltzmann machine.
The experiments also demonstrate that undirected models have three main merits for text modeling, which were also demonstrated by Hinton and Salakhutdinov (2009) and Srivastava et al. (2013) on other tasks. First, Boltzmann machine based undirected models are able to generalize much better than traditional directed generative models. Second, model learning is more stable. Third, an undirected model is more suitable for describing complex dependencies between different kinds of variables.
We also notice that all the models can, to some degree, capture the sequential structure in dialogues; however, each model has particular characteristics that make it fit a certain kind of dataset better. HSSM and LMHMMS are more appropriate for modeling open-domain datasets, such as the Twitter-Post dataset used in this paper, and task-oriented datasets with one relatively concentrated topic in the corpus and special information for each dialogue, such as AirTicketBooking. Dialogue specific topics in HSSM or LMHMMS are used and trained only within the corresponding dialogues. They are crucial for absorbing words that carry important meaning but do not belong to latent states. In addition, dialogue specific topics may play different roles on different datasets. In Twitter-Post, for example, dialogue specific topics capture the actual themes of dialogues, such as a pop song or a piece of sports news. In the AirTicketBooking dataset, dialogue specific topics usually represent special information, such as personal information including name, phone number, and birthday. In summary, each dialogue specific topic reflects special information that differs from other dialogues. The three models TMHMM, TMHMMS and TMHMMSS, which do not include dialogue specific topics, are better suited to task-oriented datasets in which each dialogue has little special or personal information. For example, the three models perform well on the BusTime and TechSupport datasets (Zhai and Williams, 2014), in which named entities are all replaced by semantic types (e.g. phone numbers are replaced by "<phone>", e-mail addresses are replaced by "<email>", etc.).

Conclusions
We develop an undirected generative model, HSSM, for dialogue structure analysis, and examine its effectiveness on two different datasets: open-domain Twitter post conversations and task-oriented dialogues from the airline ticket booking domain. Qualitative evaluations and quantitative experimental results demonstrate that the proposed model achieves better performance than state-of-the-art approaches. Compared with traditional models, the proposed HSSM is better at discovering the structure of latent states and at modeling different word sources, including latent states, dialogue specific topics and the global general topic.
According to a recent study (Srivastava et al., 2013), deep network models offer substantial benefits for latent variable learning. A dialogue may actually have a hierarchical structure of latent states; therefore, the proposed model could be extended to a deep model to capture more complex structures. Another possible extension is to model long-distance dependencies between latent states, which may further improve the model's performance.