Dialog Intent Induction with Deep Multi-View Clustering

We introduce the dialog intent induction task and present a novel deep multi-view clustering approach to tackle the problem. Dialog intent induction aims at discovering user intents from user query utterances in human-human conversations such as dialogs between customer support agents and customers. Motivated by the intuition that a dialog intent is not only expressed in the user query utterance but also captured in the rest of the dialog, we split a conversation into two independent views and exploit multi-view clustering techniques for inducing the dialog intent. In particular, we propose alternating-view k-means (AV-KMEANS) for joint multi-view representation learning and clustering analysis. The key innovation is that the instance-view representations are updated iteratively by predicting the cluster assignment obtained from the alternative view, so that the multi-view representations of the instances lead to similar cluster assignments. Experiments on two public datasets show that AV-KMEANS can induce better dialog intent clusters than state-of-the-art unsupervised representation learning methods and standard multi-view clustering approaches.


Introduction
Goal-oriented dialog systems help users accomplish well-defined tasks with clear intents within a limited number of dialog turns. They have been adopted in a wide range of applications, including booking flights and restaurants (Hemphill et al., 1990; Williams, 2012), providing tourist information (Kim et al., 2016), aiding in the customer support domain, and powering intelligent virtual assistants such as Apple Siri, Amazon Alexa, or Google Assistant. The first step towards building such systems is to determine the target tasks and construct corresponding ontologies to define the constrained set of dialog states and actions (Henderson et al., 2014b; Mrkšić et al., 2015).
Existing work assumes the target tasks are given and excludes dialog intent discovery from the dialog system design pipeline. Because of this, most existing work focuses on a few simple dialog intents and fails to explore the realistic complexity of the user intent space (Williams et al., 2013; Budzianowski et al., 2018). This assumption severely limits the adaptation of goal-oriented dialog systems to important but complex domains like customer support and healthcare, where having a complete view of user intents in advance is impossible. For example, as shown in Fig. 1, it is non-trivial to predict user intents for troubleshooting a newly released product ahead of time. To address this problem, we propose to employ data-driven approaches to automatically discover user intents in dialogs from human-human conversations. Follow-up analysis can then be performed to identify the most valuable dialog intents and design dialog systems to automate the conversations accordingly.
Similar to previous work on user question/query intent induction (Sadikov et al., 2010; Haponchyk et al., 2018), we can induce dialog intents by clustering user query utterances in human-human conversations. The key is to learn discriminative query utterance representations in the user intent semantic space. Unsupervised learning of such representations is challenging due to the semantic shift across different domains (Nida, 2015). We propose to overcome this difficulty by leveraging the rest of a conversation, in addition to the user query utterance, as a weak supervision signal. Consider the two dialogs presented in Fig. 1, where both users are looking for how to find their AirPods. Although the user query utterances vary in the choice of lexical items and syntactic structures, the human agents follow the same workflow to assist the users, resulting in similar conversation structures. We present a deep multi-view clustering approach, alternating-view k-means (AV-KMEANS), to leverage this weak supervision for the semantic clustering problem. To this end, we partition a dialog into two independent views: the user query utterance and the rest of the conversation. AV-KMEANS uses different neural encoders to embed the inputs corresponding to the two views and encourages the representations learned by the encoders to yield similar cluster assignments. Specifically, we alternately perform k-means-style updates to compute the cluster assignment on one view and then train the encoder of the other view by predicting the assignment using a metric learning algorithm (Snell et al., 2017). Our method diverges from previous work on multi-view clustering (Bickel and Scheffer, 2004; Chaudhuri et al., 2009; Kumar et al., 2011), as it is able to learn robust representations via neural networks that lie in clustering-analysis-friendly geometric spaces.
Experimental results on a dialog intent induction dataset and a question intent clustering dataset show that AV-KMEANS significantly outperforms multi-view clustering algorithms without joint representation learning by 6-20% absolute F1 score. It also gives rise to better F1 scores than quick thoughts (Logeswaran and Lee, 2018), a state-of-the-art unsupervised representation learning method. Throughout, we treat the initial user utterances of the dialogs as user query utterances. Note also that the content-view supervision is not always reliable: for the same dialog intent, the agent treatments may differ depending on the user profiles, and a user may change intent in the middle of a conversation. Thus, the supervision is often very noisy.
Our contributions are summarized as follows:
• We introduce the dialog intent induction task and present a multi-view clustering formulation to solve the problem.
• We propose a novel deep multi-view clustering approach that jointly learns cluster-discriminative representations and cluster assignments.
• We derive and annotate a dialog intent induction dataset obtained from a public Twitter corpus and process a duplicate question detection dataset into a question intent clustering dataset.
• The presented algorithm, AV-KMEANS, significantly outperforms previous state-of-the-art multi-view clustering algorithms as well as two unsupervised representation learning methods on the two datasets.

Deep Multi-View Clustering
In this section, we present a novel method for joint multi-view representation learning and clustering analysis. We consider the case of two independent views, in which the first view corresponds to the user query utterance (query view) and the second one corresponds to the rest of the conversation (content view). Formally, given a set of n instances {x_i}, we assume that each data point x_i can be naturally partitioned into two independent views x_i^(1) and x_i^(2). We further use two neural network encoders f_φ1 and f_φ2 to transform the two views into vector representations f_φ1(x_i^(1)) and f_φ2(x_i^(2)). We are interested in grouping the data points into K clusters using the multi-view feature representations. In particular, the neural encoders corresponding to the two views are jointly optimized so that they commit to similar cluster assignments for the same instances.
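To make the two-view setup concrete, here is a minimal sketch (the data structure and function name are ours, not from the paper) that partitions a dialog, given as a list of (speaker, utterance) turns, into the query view and the content view:

```python
def split_views(dialog_turns):
    """Partition a dialog into the two independent views used by AV-KMEANS.

    dialog_turns: list of (speaker, utterance) tuples in temporal order,
    where the first turn is the user query utterance.
    Returns (query_view, content_view).
    """
    if not dialog_turns:
        raise ValueError("empty dialog")
    query_view = dialog_turns[0][1]                   # initial user query utterance
    content_view = [u for _, u in dialog_turns[1:]]   # the rest of the conversation
    return query_view, content_view
```

Each view is then fed to its own encoder (f_φ1 for the query view, f_φ2 for the content view).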
In this work, we implement the query-view encoder f_φ1 with a bi-directional LSTM (BiLSTM) network (Hochreiter and Schmidhuber, 1997) and the content-view encoder f_φ2 with a hierarchical BiLSTM model that consists of an utterance-level BiLSTM encoder and a content-level BiLSTM encoder. The concatenations of the hidden representations from the last time steps are adopted as the query or content embeddings.

Alternating-view k-means clustering
In this work, we propose alternating-view k-means (AV-KMEANS) clustering, a novel method for deep multi-view clustering that iteratively updates neural encoders corresponding to the two views by encouraging them to yield similar cluster assignments for the same instances. In each semi-iteration, we perform k-means-style updates to compute a cluster assignment and centroids on feature representations corresponding to one view, and then project the cluster assignment to the other view, where the assignment is used to train the view encoder in a supervised learning fashion.
Algorithm 1: alternating-view k-means. (The listing alternates between the two views; e.g., it projects the cluster assignment from view 2 to view 1 and updates f_φ1 with the pseudo training instances.)

The full training algorithm is presented in Algorithm 1. KMEANS({x_i}, K; M, {z_i}) is a function that runs k-means clustering on inputs {x_i}. K is the number of clusters. M and {z_i} are optional arguments that represent the number of k-means iterations and the initial cluster assignment. The function returns the cluster assignment {z_i}. A visual demonstration of one semi-iteration of AV-KMEANS is also available in Fig. 2.
In particular, we initialize the encoders randomly or by using pretrained encoders (§2.3). Then, we obtain the initial cluster assignment by performing k-means clustering on vector representations encoded by f_φ1. During each AV-KMEANS iteration, we first project the cluster assignment from view 1 to view 2 and update the neural encoder for view 2 by formulating a supervised learning problem (§2.2). Then we perform M vanilla k-means steps to adjust the cluster assignment in view 2 based on the updated encoder. We repeat the procedure for view 2 in the same iteration. Note that in each semi-iteration, the initial centroids corresponding to a view are calculated based on the cluster assignment obtained from the other view. The algorithm runs for a total of T iterations.

Figure 2: A depiction of a semi-iteration of the alternating-view k-means algorithm. k-means clustering and prototypical classification are performed for view 1 and view 2 respectively. The view 1 encoder is frozen and the view 2 encoder is updated in this semi-iteration.
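One semi-iteration can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: fixed low-dimensional features stand in for learned encoders, and the prototypical-network encoder update is replaced by directly computing the other view's centroids from the projected assignment.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroids_from_assignment(points, labels, K):
    # one centroid per cluster; empty clusters are not handled in this sketch
    cents = []
    for k in range(K):
        members = [p for p, z in zip(points, labels) if z == k]
        cents.append([sum(c) / len(members) for c in zip(*members)])
    return cents

def assign(points, cents):
    return [min(range(len(cents)), key=lambda k: dist2(p, cents[k]))
            for p in points]

def kmeans(points, K, iters=10, init_labels=None):
    # deterministic seeding (first K points) keeps the sketch reproducible
    if init_labels is None:
        cents = [list(p) for p in points[:K]]
    else:
        cents = centroids_from_assignment(points, init_labels, K)
    labels = assign(points, cents)
    for _ in range(iters):
        cents = centroids_from_assignment(points, labels, K)
        labels = assign(points, cents)
    return labels

def semi_iteration(view1, view2, K, M=10):
    # 1) k-means on view 1; 2) project the assignment to view 2 by computing
    # view-2 centroids from the view-1 labels; 3) M vanilla k-means steps
    # refine the view-2 assignment.
    z1 = kmeans(view1, K)
    z2 = assign(view2, centroids_from_assignment(view2, z1, K))
    return kmeans(view2, K, iters=M, init_labels=z2)
```

In the full algorithm this semi-iteration is run in both directions for T iterations, with the projected assignment also used to train the receiving view's encoder.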

Prototypical episode training
In each AV-KMEANS iteration, we need to solve two supervised classification problems using the pseudo training datasets {(x_i^(1), z_i)} and {(x_i^(2), z_i)} respectively, where the labels z_i are the cluster assignments projected from the other view. A simple way to do so is to put a softmax classification layer on top of each encoder network. However, we find that it is beneficial to directly perform classification in the k-means clustering space. To this end, we adopt prototypical networks (Snell et al., 2017), a metric learning approach, which relies solely on the encoders to form the classifiers instead of introducing additional classification layers.
Given input data {(x_i, z_i)} and a neural network encoder f_φ, prototypical networks compute a D-dimensional representation c_k, or prototype, of each class by averaging the vectors of the embedded support points belonging to the class:

c_k = (1 / |S_k|) Σ_{x_i ∈ S_k} f_φ(x_i),

where S_k denotes the support set of class k; here we drop the view superscripts for simplicity. Conceptually, the prototypes {c_k} are similar to the centroids in the k-means algorithm, except that a prototype is computed on a subset of the instances of a class (the support set) while a centroid is computed based on all instances of a class. Given a sampled query data point x, prototypical networks produce a distribution over classes based on a softmax over distances to the prototypes in the embedding space:

p(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k'} exp(−d(f_φ(x), c_{k'})),

where the distance function d is the squared Euclidean distance. The model minimizes the negative log-likelihood of the data: L(φ) = − log p(y = k | x). Training episodes are formed by randomly selecting a subset of classes from the training set, then choosing a subset of examples within each class to act as the support set and a subset of the remainder to serve as query points. We refer to the original paper (Snell et al., 2017) for a more detailed description of the model.
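The prototype computation and the softmax over distances can be sketched in pure Python (a minimal illustration with toy fixed embeddings standing in for f_φ):

```python
import math

def prototypes(embeddings, labels, K):
    # class prototype = mean of the embedded support points of that class
    protos = []
    for k in range(K):
        members = [e for e, y in zip(embeddings, labels) if y == k]
        protos.append([sum(c) / len(members) for c in zip(*members)])
    return protos

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_distribution(query, protos):
    # softmax over negative squared Euclidean distances to the prototypes
    logits = [-sq_euclidean(query, c) for c in protos]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll(query, label, protos):
    # per-query negative log-likelihood, the episode training loss
    return -math.log(class_distribution(query, protos)[label])
```

In an actual episode, the support set and query points are sampled per class and the loss is backpropagated through the encoder; the sketch only shows the forward computation.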

Parameter initialization
Although AV-KMEANS can effectively work with random parameter initializations, we do expect that it will benefit from initializations obtained from pretrained models with some well-studied unsupervised learning objectives. We present two methods to initialize the utterance encoders for both the query and content views. The first approach is based on recurrent autoencoders. We embed an utterance using a BiLSTM encoder. The utterance embedding is then concatenated with every word vector corresponding to the decoder inputs that are fed into a uni-directional LSTM decoder. We use the neural encoder trained with the autoencoding objective to initialize the two utterance encoders in AV-KMEANS.
Recurrent autoencoders independently reconstruct an input utterance without capturing semantic dependencies across consecutive utterances. We consider a second initialization method, quick thoughts (Logeswaran and Lee, 2018), that addresses the problem by predicting a context utterance from a set of candidates given a target utterance. Here, the target utterances are sampled randomly from the corpus, and the context utterances are sampled from within each pair of adjacent utterances. We use two separate BiLSTM encoders to encode utterances, named the target encoder f and the context encoder g. To score the compatibility of a target utterance s and a candidate context utterance t, we simply use the inner product of the two utterance vectors f(s) · g(t). The training objective maximizes the log-likelihood of the context utterance given the target utterance and the candidate utterance set. After pretraining, we adopt the target encoder to initialize the two utterance encoders in AV-KMEANS.
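The quick thoughts scoring function and training objective can be sketched as follows (toy fixed vectors stand in for the BiLSTM encoders f and g; the candidate list plays the role of the sampled candidate utterance set):

```python
import math

def score(f_s, g_t):
    # compatibility of a target utterance and a candidate context utterance:
    # inner product of their embeddings, f(s) . g(t)
    return sum(a * b for a, b in zip(f_s, g_t))

def context_log_likelihood(f_target, g_candidates, true_idx):
    # log-likelihood of the true context utterance under a softmax over the
    # candidate set; the training objective maximizes this quantity
    logits = [score(f_target, g) for g in g_candidates]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[true_idx] - log_z
```

Maximizing this log-likelihood over the corpus trains f and g jointly; only the target encoder f is then used to initialize the AV-KMEANS utterance encoders.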

Data
As discussed in the introduction, existing goal-oriented dialog datasets mostly concern predefined dialog intents in some narrow domains such as restaurant or travel booking (Henderson et al., 2014a; Budzianowski et al., 2018; Serban et al., 2018). To carry out this study, we adopt a more challenging corpus that consists of human-human conversations for customer service and manually annotate the user intents of a small number of dialogs. We also build a question intent clustering dataset to assess the generalization ability of the proposed method on the related problem.

Twitter airline customer support
We consider the customer support on Twitter corpus released by Kaggle, which contains more than three million tweets and replies in the customer support domain. The tweets constitute conversations between customer support agents of some large companies and their customers. As the conversations cover a variety of dynamic topics, they serve as an ideal testbed for the dialog intent induction task. In the customer service domain, different industries generally address unrelated topics and concerns. We focus on dialogs in the airline industry, as they represent the largest number of conversations in the corpus. We name the resulting dataset the Twitter airline customer support (TwACS) corpus. We rejected any conversation that redirects the customer to a URL or another communication channel, e.g., direct messages. We ended up with a dataset of 43,072 dialogs. The total numbers of dialog turns and tokens are 63,147 and 2,717,295 respectively. After investigating 500 randomly sampled conversations from TwACS, we established an annotation task with 14 dialog intents and hired two annotators to label the sampled dialogs based on the user query utterances. The Cohen's kappa coefficient was 0.75, indicating substantial agreement between the annotators. Disagreements were resolved by a third annotator. To our knowledge, this is the first dialog intent induction dataset. The data statistics and user query utterance examples corresponding to different dialog intents are presented in Table 1.

AskUbuntu
AskUbuntu is a dataset collected and processed by Shah et al. (2018) for the duplicate question detection task. The dataset consists of technical support questions posted by users on the AskUbuntu website, with annotations indicating that two questions are semantically equivalent. For instance, q1: "how to install ubuntu w/o removing windows" and q2: "installing ubuntu over windows 8.1" are duplicates that can be resolved with similar answers. A total of 257,173 questions are included in the dataset, and 27,289 pairs of questions are labeled as duplicates. In addition, we obtain the top-rated answer for each question from the AskUbuntu website dump. In this work, we reprocess the data and build a question intent clustering dataset using an automatic procedure. Following Haponchyk et al. (2018), we transform the duplicate question annotations into question intent cluster annotations with a simple heuristic: for each question pair q1, q2 annotated as a duplicate, we assign q1 and q2 to the same cluster. As a result, the question intent clusters correspond to the connected components in the duplicate question graph. There are 7,654 such connected components. However, most of the clusters are very small: 91.7% of the clusters contain only 2-5 questions. Therefore, we experiment with the largest 20 clusters, which contain 4,692 questions, in this work. The sizes of the largest and the smallest clusters considered in this study are 1,364 and 71 respectively.
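The connected-component heuristic can be implemented with a standard union-find structure; the sketch below (function name ours) merges every annotated duplicate pair and reads off the components as intent clusters:

```python
def question_intent_clusters(num_questions, duplicate_pairs):
    # Union-find: every duplicate pair is merged, so intent clusters are the
    # connected components of the duplicate-question graph.
    parent = list(range(num_questions))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for q in range(num_questions):
        clusters.setdefault(find(q), []).append(q)
    return list(clusters.values())
```

Singleton components (questions with no duplicate annotation) also come out as size-1 clusters, which is why the resulting clusters are then filtered by size.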

Experiments
In this section, we evaluate AV-KMEANS on the TwACS and AskUbuntu datasets as described in § 3. We compare AV-KMEANS with competitive systems for representation learning or multi-view clustering and present our main findings in § 4.2. In addition, we examine the output clusters obtained from AV-KMEANS on the TwACS dataset to perform a thorough error analysis.

Experimental settings
We train the models on all the instances of a dataset and evaluate on the labeled instances. We employ the publicly available 300-dimensional GloVe vectors (Pennington et al., 2014), pretrained on 840 billion tokens, to initialize the word embeddings for all the models.
Competitive systems We consider state-of-the-art methods for representation learning and/or multi-view clustering as our baseline systems. We formulate the dialog intent induction task as an unsupervised clustering task and include two popular clustering algorithms, k-means and spectral clustering. Multi-view spectral clustering (MVSC) (Kanaan-Izquierdo et al., 2018) is a competitive standard multi-view clustering approach. In particular, we carry out clustering using the query-view and content-view representations learned by the representation learning methods (k-means only requires query-view representations). In the case where a content-view input corresponds to multiple utterances, we take the average of the utterance vectors as the content-view output representation for autoencoders and quick thoughts. AV-KMEANS is a joint representation learning and multi-view clustering method. Therefore, we compare with state-of-the-art representation learning methods: autoencoders and quick thoughts (Logeswaran and Lee, 2018). Quick thoughts is a strong representation learning baseline that is adopted in BERT (Devlin et al., 2019). We also include principal component analysis (PCA), a classic representation learning and dimensionality reduction method, since bag-of-words representations are too expensive to work with for clustering analysis.
We compare three variants of AV-KMEANS that differ in the pretraining strategies. In addition to the AV-KMEANS systems pretrained with autoencoders and quick thoughts, we also consider a system whose encoder parameters are randomly initialized (no pretraining).
Metrics We compare the competitive approaches on a number of standard evaluation measures for clustering analysis. Following prior work (Kumar et al., 2011;Haponchyk et al., 2018;Xie et al., 2016), we set the number of clusters to the number of ground truth categories and report precision, recall, F1 score, and unsupervised clustering accuracy (ACC). To compute precision or recall, we assign each predicted cluster to the most frequent gold cluster or assign each gold cluster to the most frequent predicted cluster respectively. The F1 score is the harmonic average of the precision and recall. ACC uses a one-to-one assignment between the gold standard clusters and the predicted clusters.
The assignment can be efficiently computed by the Hungarian algorithm (Kuhn, 1955).
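These measures can be sketched as follows. This is an illustrative implementation: precision and recall use the most-frequent-cluster mappings described above, while ACC brute-forces the one-to-one mapping over permutations for small K (the Hungarian algorithm computes the same mapping in polynomial time):

```python
from collections import Counter
from itertools import permutations

def f1_score(gold, pred):
    # precision: map each predicted cluster to its most frequent gold cluster;
    # recall: map each gold cluster to its most frequent predicted cluster
    n = len(gold)

    def mapped_accuracy(src, tgt):
        hits = 0
        for s in set(src):
            members = [tgt[i] for i in range(n) if src[i] == s]
            hits += Counter(members).most_common(1)[0][1]
        return hits / n

    p = mapped_accuracy(pred, gold)
    r = mapped_accuracy(gold, pred)
    return 2 * p * r / (p + r)

def clustering_accuracy(gold, pred):
    # ACC: best one-to-one mapping between predicted and gold clusters.
    # Brute force is exponential in K; assumes #pred clusters <= #gold clusters
    # (they are equal in our setting).
    gold_ids, pred_ids = sorted(set(gold)), sorted(set(pred))
    best = 0
    for perm in permutations(gold_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for g, p in zip(gold, pred) if mapping[p] == g)
        best = max(best, hits)
    return best / len(gold)
```

Note that precision and recall can each exceed ACC, since many-to-one mappings are allowed for them but not for ACC.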
Parameter tuning We empirically set both the dimension of the LSTM hidden state and the number of principal components in PCA to 300. The number of AV-KMEANS iterations T and the number of k-means steps in an AV-KMEANS semi-iteration M are set to 50 and 10 respectively, as we find that more iterations lead to similar cluster assignments. We adopt the same set of hyperparameter values as used by Snell et al. (2017) for training the prototypical networks. Specifically, we fix the number of query examples and the number of support examples to 15 and 5. The networks are trained for 100 episodes per AV-KMEANS semi-iteration. The number of sampled classes per episode is chosen to be 10, as it has to be smaller than the number of ground truth clusters. Adam (Kingma and Ba, 2015) is used to optimize the models, with an initial learning rate of 0.001. During autoencoder or quick thoughts pretraining, we check the performance on the development set after each epoch to perform early stopping, where we randomly sample 10% of the unlabeled instances as the development data.

Results
Our main empirical findings are presented in Table 2, in which we compare AV-KMEANS with standard single-view and multi-view clustering algorithms. We also evaluate classic and neural approaches for representation learning, where the pretrained representations are fixed during k-means and MVSC clustering but fine-tuned during AV-KMEANS clustering. We analyze the empirical results in detail in the following paragraphs.
Utilizing multi-view information Among all the systems, k-means clustering on representations trained with PCA or autoencoders only employs the single-view information encoded in user query utterances. These systems clearly underperform the rest, which leverage multi-view information from the entire conversations. Quick thoughts infuses the multi-view knowledge through the learning of query-view vectors that are aware of the content-view semantics. In contrast, multi-view spectral clustering can work with representations that are separately learned for the individual views, where the multi-view information is aggregated using the common eigenvectors of the data similarity Laplacian matrices. As shown, k-means clustering on quick thoughts vectors gives superior results to MVSC pretrained with PCA or autoencoders by more than 10% F1 or ACC, which indicates that multi-view representation learning is effective for problems beyond simple supervised learning tasks. Combining representation learning and multi-view clustering in a static way seems to be less than ideal: MVSC performs worse than k-means using the quick thoughts vectors as clustering inputs. Multi-view representation learning breaks the independent-view assumption that is critical for classic multi-view clustering algorithms.

Joint representation learning and clustering
We now investigate whether joint representation learning and clustering can reconcile the conflict between cross-view representation learning and classic multi-view clustering. AV-KMEANS outperforms k-means and MVSC baselines by considerable margins. It achieves 46% and 53% F1 scores and 40% and 44% ACC scores on the TwACS and AskUbuntu datasets, which are 5-30 percent higher than competitive systems. Compared to alternative methods, AV-KMEANS is able to effectively seek clustering-friendly representations that also encourage similar cluster assignments for different views of the same instances. With the help of quick thoughts pretraining, AV-KMEANS improves upon the strongest baseline, k-means clustering on quick thoughts vectors, by 4.5% ACC on the TwACS dataset and 12.2% F1 on the AskUbuntu dataset.
Model pretraining for AV-KMEANS Evaluation results for AV-KMEANS with different parameter initialization strategies are available in Table 2. As the results suggest, pretraining the neural encoders is important for obtaining competitive results on the TwACS dataset, while its impact on the AskUbuntu dataset is less pronounced. AskUbuntu is six times larger than TwACS, and models trained on AskUbuntu are less sensitive to their parameter initializations. This observation is consistent with early research on unsupervised pretraining, where Schmidhuber et al. (2012) argue that unsupervised initialization/pretraining is not necessary if a large amount of training data is available. Between the two pretraining methods, quick thoughts is much more effective than autoencoders: it improves upon no pretraining and autoencoder pretraining by 10.4% and 8.3% ACC on the TwACS dataset.

Error analysis
Our best-performing system still fails to reach a 50% F1 or ACC score on the TwACS dataset. We examine the outputs of the quick thoughts pretrained AV-KMEANS on TwACS, focusing on the most frequent errors made by the system. To this end, we compute the confusion matrix based on the one-to-one assignment between the gold clusters and the predicted clusters used by ACC. Sometimes, a user may express more than one intent in a single query utterance. For example, in the following query utterance, the user complains about a delay and requests an alternative flight: q: "why is ba flight 82 from abuja to london delayed almost 24 hours? and are you offering any alternatives?"
We leave multi-intent induction to future work.

Related Work
User intent clustering Automatic discovery of user intents by clustering user utterances is a critical task in understanding the dynamics of a domain with user-generated content. Previous work focuses on grouping similar web queries or user questions together using supervised or unsupervised clustering techniques. Kathuria et al. (2010) perform simple k-means clustering on a variety of query traits to understand user intents. Cheung and Li (2012) present an unsupervised method for query intent clustering that produces a pattern consisting of a sequence of semantic concepts and/or lexical items for each intent. Jeon et al. (2005) use machine translation to estimate word translation probabilities and retrieve similar questions from question archives. A variation of the k-means algorithm, MiXKmeans, is presented by Deepak (2016) to cluster threads on forums and Community Question Answering websites. Haponchyk et al. (2018) propose to cluster questions into intents using a supervised learning method that yields better semantic similarity modeling. Our work focuses on a related but different task that automatically induces user intents for building dialog systems, where two sources of information are naturally available for exploring our deep multi-view clustering approach.
Multi-view clustering Multi-view clustering (MVC) aims at grouping similar subjects into the same cluster by combining the available multi-view feature information to search for consistent cluster assignments across different views (Chao et al., 2017). Generative MVC approaches assume that the data is drawn from a mixture model and the membership information can be inferred using the multi-view EM algorithm (Bickel and Scheffer, 2004). Most work on MVC employs discriminative approaches that directly optimize an objective function involving pairwise similarities, so that the average similarity within clusters is maximized and the average similarity between clusters is minimized. In particular, Chaudhuri et al. (2009) propose to exploit canonical correlation analysis to learn multi-view representations that are then used for downstream clustering. Multi-view spectral clustering (Kumar et al., 2011; Kanaan-Izquierdo et al., 2018) constructs a similarity matrix for each view and then iteratively updates a matrix using the eigenvectors of the similarity matrix computed on another view. Standard MVC algorithms expect multi-view feature inputs that are fixed during unsupervised clustering. AV-KMEANS works with raw multi-view text inputs and learns representations that are particularly suitable for clustering.

Joint representation learning and clustering
Several recent works propose to jointly learn feature representations and clustering via neural networks. Xie et al. (2016) present the deep embedded clustering (DEC) method that learns a mapping from the data space to a lower-dimensional feature space, where it iteratively optimizes a KL divergence based clustering objective. Deep clustering network (DCN) (Yang et al., 2016) is a joint dimensionality reduction and k-means clustering framework, in which the dimensionality reduction model is implemented with a deep neural network. These methods focus on the learning of single-view representations, and the multi-view information is under-explored. Lin et al. (2018) present a joint framework for deep multi-view clustering (DMJC) that is the closest work to ours. However, DMJC only works with single-view inputs, and the feature representations are learned using a multi-view fusion mechanism. In contrast, AV-KMEANS assumes that the inputs can be naturally partitioned into multiple views and carries out learning with the multi-view inputs directly.

Conclusion
We introduce the novel task of dialog intent induction, which concerns automatic discovery of dialog intents from user query utterances in human-human conversations. The resulting dialog intents provide valuable insights for designing goal-oriented dialog systems. We propose to leverage the dialog structure to divide a dialog into two independent views and then present AV-KMEANS, a deep multi-view clustering algorithm, to jointly perform multi-view representation learning and clustering on the views. We conduct extensive experiments on a Twitter conversation dataset and a question intent clustering dataset. The results demonstrate the superiority of AV-KMEANS over competitive representation learning and multi-view clustering baselines. In the future, we would like to extract multi-view data from multi-lingual and multi-modal sources and investigate the effectiveness of AV-KMEANS on a wider range of tasks in the multi-lingual or multi-modal settings.