Semi-supervised Interactive Intent Labeling

Building the Natural Language Understanding (NLU) modules of task-oriented Spoken Dialogue Systems (SDS) involves defining intents and entities, collecting task-relevant data, annotating the data with intents and entities, and then repeating the same process for every functionality or enhancement added to the SDS. In this work, we showcase an Intent Bulk Labeling system where SDS developers can interactively label and augment training data from unlabeled utterance corpora using advanced clustering and visual labeling methods. We extend the Deep Aligned Clustering work with a better backbone BERT model, explore techniques to select the seed data for labeling, and develop a data balancing method using an oversampling technique that utilizes paraphrasing models. We also look at the effect of data augmentation on the clustering process. Our results show that we can achieve over 10% gain in clustering accuracy on some datasets using the combination of the above techniques. Finally, we extract utterance embeddings from the clustering model and plot the data to interactively bulk label the samples, significantly reducing the time and effort required to label the whole dataset.


Introduction
Acquiring an accurately labeled corpus is necessary for training machine learning (ML) models in various classification applications. Labeling is an expensive and labor-intensive activity requiring annotators to understand the domain well and to label the instances one at a time. In this work, we explore the task of labeling multiple intents visually with the help of a semi-supervised clustering algorithm. The clustering algorithm helps learn an embedding representation of the training data that is well-suited for downstream labeling. In order to label, we further reduce the high-dimensional representation using UMAP (McInnes et al., 2018). Since utterances are short, uncovering their semantic meaning to group them together is very challenging. SBERT (Reimers and Gurevych, 2019) showed that out-of-the-box BERT (Devlin et al., 2018) maps sentences to a vector space that is not very suitable for common measures like cosine similarity and Euclidean distance. This happens because the BERT network computes no independent sentence embedding, which makes it difficult to derive sentence embeddings. Researchers often utilize the mean pooling of word embeddings as an approximate sentence embedding. However, results show that this practice yields inappropriate sentence embeddings that are often worse than averaging GloVe embeddings (Pennington et al., 2014; Reimers and Gurevych, 2019). Many researchers have developed sentence embedding methods: Skip-Thought (Kiros et al., 2015), InferSent (Conneau et al., 2017), USE (Cer et al., 2018), and SBERT (Reimers and Gurevych, 2019). State-of-the-art SBERT adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding and fine-tunes a Siamese network on the sentence pairs from the NLI (Bowman et al., 2015; Williams et al., 2017) and STSb (Cer et al., 2017) datasets.
Deep Aligned Clustering (DAC) (Zhang et al., 2021) introduced an effective method for clustering and discovering new intents. DAC transfers the prior knowledge of a limited number of known intents and incorporates a technique to align cluster centroids in successive training epochs. The limited known intents are used to pre-train the model. The authors use the pre-trained BERT model (Devlin et al., 2018) to extract deep intent features, then pre-train the model with a randomly selected subset of labeled data. The pre-trained parameters are used to obtain well-initialized intent representations. K-Means clustering is performed on the extracted intent features, along with a method to estimate the number of clusters and an alignment strategy to obtain the final cluster assignments. The K-Means algorithm selects cluster centroids that minimize the Euclidean distance within the cluster. Due to this Euclidean distance optimization, clustering with feature embeddings extracted by the SBERT model naturally outperforms other embedding methods. In our work, we have extended the DAC algorithm with SBERT as the embedding backbone for clustering utterances.
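As a rough sketch of this clustering stage (with a random stand-in `embed` function replacing the actual SBERT/BERT feature extractor, which we do not reproduce here):

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(utterances, dim=32, seed=0):
    """Stand-in for the SBERT feature extractor: one fixed-size
    vector per utterance (random here, purely for illustration)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(utterances), dim))

utterances = ["book a flight", "check my balance",
              "play some music", "transfer money"]
features = embed(utterances)

# K-Means minimizes within-cluster Euclidean distance, which is why
# embeddings trained with cosine/Euclidean objectives (SBERT) suit it.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
assignments = km.labels_         # cluster id per utterance
centroids = km.cluster_centers_  # one centroid per cluster
```

In the actual pipeline, these cluster assignments provide the self-supervised signal that is aligned across successive training epochs.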
In semi-supervised learning, the seed set is selected using a sampling strategy: "A simple random sample of size n consists of n individuals from the population chosen such that every set of n individuals has an equal chance to be the sample actually selected." (Moore and McCabe, 1989). However, these sample subsets may not represent the original data adequately because randomization methods do not exploit the correlations in the original population. In a stratified random sample, the population is first classified into groups (called strata) with similar characteristics. Then a simple random sample is chosen from each stratum separately, and these simple random samples are combined to form the overall sample. Stratified sampling can help ensure that there are enough observations within each stratum to make meaningful inferences. DAC uses random sampling for seed selection. In this work, we have explored a couple of stratified sampling approaches for seed selection in the hope of mitigating the limitations of random sampling and improving the clustering outcome.
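To illustrate the difference, a minimal stratified seed sampler might look as follows (the function name and `ratio` parameter are our own illustration, not part of DAC):

```python
import random
from collections import defaultdict

def stratified_seed(utterances, strata, ratio, seed=0):
    """Sample `ratio` of each stratum so every group is represented,
    unlike simple random sampling over the whole pool."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for utt, stratum in zip(utterances, strata):
        groups[stratum].append(utt)
    sample = []
    for stratum, members in groups.items():
        # Take at least one instance per stratum.
        k = max(1, round(ratio * len(members)))
        sample.extend(rng.sample(members, k))
    return sample
```

In our setting, the strata are not given a priori; they come from predicted clusters, as described in the Seed Selection section.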
Another issue we address in this work is class imbalance. Seed selection generally yields an imbalanced dataset, which in turn impairs the predictive capability of classification algorithms (Douzas et al., 2018). Some methods manipulate the training data, aiming to change the class distribution towards a more balanced one by undersampling or oversampling (Kotsiantis et al., 2006; Galar et al., 2011). SMOTE (Chawla et al., 2002) is a popular oversampling technique proposed to improve on random oversampling. In one variant of SMOTE, borderline minority instances are heuristically selected and linearly interpolated to create synthetic samples. In this work, we take inspiration from the SMOTE method: we choose borderline minority instances and paraphrase them using a sequence-to-sequence paraphrasing model. The paraphrases provide natural and meaningful augmentations of the dataset that are not synthetic.
Previous work has shown that data augmentation can boost performance on text classification tasks (Barzilay and McKeown, 2001; Dolan and Brockett, 2005; Lan et al., 2017; Hu et al., 2019). Wieting et al. (2017) used Neural Machine Translation (NMT) (Sutskever et al., 2014) to translate the non-English side of parallel text to get English-English paraphrase pairs. This method has been scaled to generate large paraphrase corpora (Wieting and Gimpel, 2018). Prior work in learning paraphrases has used autoencoders (Socher et al., 2011), encoder-decoder architectures as in BART, and other learning frameworks such as NMT (Sokolov and Filimonov, 2020). Data augmentation using paraphrasing is a simple yet effective strategy that we explore in this work to improve the clustering.
For interactive visual labeling of utterances, we build on the learnt embedding representation of the data and fine-tune it using the clustering. DAC learns to cluster with a weak self-supervised signal to update its representation and to optimize both local (via K-Means) and global (via cluster alignment) information. This results in an optimized intent-level feature representation. This high-dimensional latent representation can be reduced to 2-3 dimensions using Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018). We use the Rasa WhatLies1 library to extract the UMAP embeddings. For interactive labeling, we utilize an interactive visualization library called Human Learn2 (Warmerdam et al., 2021) that allows us to draw decision boundaries on a plot. Building on top of the Rasa Bulk Labelling3 UI (Bokeh Development Team, 2018), we augment the interface with our learnt representation for interactive labeling. Although we focus on NLU, other studies like 'Conversation Learner' (Shukla et al., 2020) focus on interactive dialogue managers (DM) with human-in-the-loop annotations of dialogue data via machine teaching. Note also that although the majority of task-oriented SDS still involve defining intents/entities, there are recent examples that argue for a richer target representation than the classical intent/entity model, such as SMCalFlow (Andreas et al., 2020). Figure 1 describes the semi-supervised labeling process. We start with the unlabeled utterance corpus and apply seed sampling methods to select a small subset of the corpus. Once the selected subset is manually labeled, we address the data imbalance with our paraphrase-based minority oversampling method. We can also augment the labeled corpus with paraphrasing to provide more data for the clustering process. The DAC algorithm is applied with improved embeddings to extract the utterance representation for interactive labeling.

Sentence Representation
For sentence representation, we use the HuggingFace Transformers model bert-base-nli-stsb-mean-tokens4. This model was first fine-tuned on a combination of the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) (570K sentence pairs labeled contradiction, entailment, or neutral) and Multi-Genre Natural Language Inference (Williams et al., 2017) (430K diverse sentence pairs with the same labels as SNLI) datasets, then on the Semantic Textual Similarity benchmark (STSb) (Cer et al., 2017) (labels between 0 and 5 rating the semantic relatedness of sentence pairs) training set. This model achieves a performance of 85.14 (Spearman's rank correlation between the cosine similarity of the sentence embeddings and the gold labels) on the STSb regression evaluation. For context, average BERT embeddings achieve a performance of 46.35 on this evaluation (Reimers and Gurevych, 2019).
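The evaluation metric cited above can be sketched on toy data (the embeddings and gold scores below are made up for illustration; they stand in for SBERT outputs and STSb labels):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy sentence-pair embeddings and gold similarity labels (0-5 scale).
pairs = [(np.array([1.0, 0.0]), np.array([1.0, 0.1])),
         (np.array([1.0, 0.0]), np.array([0.5, 0.5])),
         (np.array([1.0, 0.0]), np.array([0.0, 1.0]))]
gold = [4.8, 2.5, 0.2]

pred = [cosine(a, b) for a, b in pairs]
# Spearman's rho compares the *rankings* of predicted similarities
# and gold labels; here both decrease together, so rho is 1.0.
rho, _ = spearmanr(pred, gold)
```

The reported 85.14 corresponds to 100 x rho on the full STSb evaluation set.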

Seed Selection
We explore two selection and sampling strategies for seed selection as follows: • Cluster-based Selection (CB): In this method, we apply K-Means clustering on the N utterances to partition the data into n subsets, where n is the seed set size. For example, if 10% of the data amounts to 100 utterances, this method creates 100 clusters from the dataset. We then pick each centroid's nearest neighbor as part of the seed set. The naive intuition behind this strategy is that it creates a large number of clusters spread all over the data distribution (N/n instances per cluster on average for uniformly distributed instances).
• Predicted Cluster Sampling (PCS): This is a stratified sampling method where we first predict the number of clusters and then sample instances from each cluster. We use the cluster size estimation method from the DAC work as follows: K-Means is performed with a large K' (initialized to twice the ground-truth number of classes). The assumption is that real clusters tend to be dense, and the cluster mean size threshold is assumed to be t = N/K'. The number of clusters K is then estimated as

K = Σ_{i=1}^{K'} δ(|S_i| ≥ t)

where |S_i| is the size of the i-th produced cluster, and δ(condition) is an indicator function: it outputs 1 if the condition is satisfied, and 0 otherwise. The method performs well as reported in the DAC work.
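A minimal sketch of this cluster-number estimate (function name and demo data are our own):

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_num_clusters(features, known_k):
    """DAC-style estimate: run K-Means with a large K' (twice the
    known number of classes) and count clusters whose size reaches
    the mean-size threshold t = N / K'."""
    n = len(features)
    big_k = 2 * known_k
    labels = KMeans(n_clusters=big_k, n_init=10,
                    random_state=0).fit_predict(features)
    sizes = np.bincount(labels, minlength=big_k)
    threshold = n / big_k
    # K = sum_i delta(|S_i| >= t)
    return int((sizes >= threshold).sum())

# Two tight, well-separated blobs; with known_k=1 we get K' = 2, and
# both dense clusters meet the threshold, so the estimate is 2.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                   rng.normal(100.0, 0.1, (50, 2))])
k_est = estimate_num_clusters(feats, known_k=1)
```

Sampling then proceeds stratum-by-stratum over the surviving clusters.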

Data Balancing and Augmentation
For handling data imbalance, we propose a paraphrasing-based method to over-sample the minority classes. The method is described as follows: 1. For every instance p_i (i = 1, 2, ..., p_num) in the minority class P, we calculate its m nearest neighbors from the whole training set T. The number of majority examples among the m nearest neighbors is denoted by m' (0 ≤ m' ≤ m).
2. If m' = m, i.e., all the m nearest neighbors of p_i are majority examples, p_i is considered to be noise and does not participate in the following steps. If m/2 ≤ m' < m, namely the number of p_i's majority nearest neighbors is larger than the number of its minority ones, p_i is considered to be easily misclassified and is put into a set DANGER. If 0 ≤ m' < m/2, p_i is safe and does not need to participate in the following steps. 3. For each instance in DANGER, we generate paraphrases using the sequence-to-sequence paraphrasing model. 4. The generated paraphrases are added to the minority class as oversampled instances. 5. We classify each paraphrased sample with a RoBERTa-based classifier fine-tuned on the labeled data and only add the instance if the classifier predicts the same label as the minority instance. We call this the 'ParaMote' method in our experiments. Without this last step (5), we call the overall approach our 'Paraphrasing' method.
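The neighbor-based selection in steps 1-2 can be sketched with a hypothetical helper (1-D toy features here; the real pipeline operates on utterance embeddings):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_danger(features, labels, minority_label, m=5):
    """Return indices of borderline ('DANGER') minority instances:
    those whose m nearest neighbors are mostly, but not all,
    majority examples (m/2 <= m' < m)."""
    nn = NearestNeighbors(n_neighbors=m + 1).fit(features)  # +1: self included
    danger = []
    for i in np.where(labels == minority_label)[0]:
        _, idx = nn.kneighbors(features[i:i + 1])
        neighbors = idx[0][1:]  # drop the point itself
        m_prime = int((labels[neighbors] != minority_label).sum())
        if m / 2 <= m_prime < m:  # noise if m' == m, safe if m' < m/2
            danger.append(int(i))
    return danger

# Toy data: majority class at 0..5; a minority cluster around 10;
# one minority point deep inside the majority (noise, index 6) and
# one near the boundary (DANGER, index 7).
features = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0],
                     [2.5], [6.2],
                     [10.0], [10.1], [10.2], [10.3], [10.4], [10.5]])
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
danger = find_danger(features, labels, minority_label=1, m=5)
```

Only the returned DANGER instances are handed to the paraphraser for oversampling.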
We also use the paraphrasing model and the classifier as a data augmentation method to augment the labeled training data (referred to as 'Aug' in our experiments). Note that in 'ParaMote' we only add a paraphrased sample if it belongs to the same minority class, as we do not want to inject noise while solving the data imbalance problem. The opposite is also possible for other purposes, such as generating semantically similar adversaries (Ribeiro et al., 2018).
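The label-consistency filter of step 5 can be sketched as below, with toy stubs standing in for the sequence-to-sequence paraphraser and the fine-tuned RoBERTa classifier (all names here are illustrative):

```python
def consistent_paraphrases(instance, label, paraphrase, classify, n=3):
    """Keep only paraphrases whose predicted label matches the source
    instance's (minority) label; mismatches are dropped as noise."""
    kept = []
    for candidate in paraphrase(instance, n):
        if classify(candidate) == label:
            kept.append(candidate)
    return kept

# Stubs standing in for the seq2seq paraphraser and RoBERTa classifier.
def toy_paraphrase(text, n):
    return [f"{text} (variant {i})" for i in range(n)]

def toy_classify(text):
    return "balance" if "balance" in text else "other"

augmented = consistent_paraphrases("check my balance", "balance",
                                   toy_paraphrase, toy_classify)
```

A mismatching prediction (e.g. a paraphrase that drifts to another intent) would simply be filtered out rather than added to the training set.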

Experimental Results
To conduct our experiments, we use the BANKING (Casanueva et al., 2020) and CLINC (Larson et al., 2019) datasets, similar to the DAC work (Zhang et al., 2021). We also use another dataset called KidSpace, which includes utterances from a multimodal learning application for 5-to-8-year-old children (Sahay et al., 2019; Anderson et al., 2018). We hope to utilize this system to label future utterances into relevant intents. Table 1 shows the statistics of the 3 datasets, where 25% of the classes, chosen at random, are kept unseen during pre-training.

Sentence Representation
The choice of pre-trained embeddings has the largest impact on the clustering results. We observe huge performance gains for the single-domain KidSpace and BANKING datasets. For the multi-domain and diverse CLINC dataset with the largest number of intents, we observe a slight degradation in performance. While this needs further investigation, we believe the dataset is diverse enough and already has very high clustering scores, so the improved sentence representations may not help further.

Seed Selection
Seed selection is an important problem for limited data tasks. The law of large numbers does not hold, and a random sampling strategy may lead to larger variance in outcomes. We explored Cluster-based Selection (CB) and Predicted Cluster Sampling (PCS) besides other techniques (see detailed results in Appendix A.1). Our results trend towards smaller standard deviations and similar performance for the BANKING and CLINC datasets with the PCS method. Surprisingly, this does not hold for the KidSpace dataset, which needs further investigation. Figure 2 shows the KidSpace data visualized with various colored clusters and centroids. While we choose seed data non-randomly, we still hide 25% of the classes at random (to enable unknown intent discovery). Our recommendation is to use PCS in situations where one cannot run the training multiple times, as it yields less variance in results. Figure 3 shows the histogram for the seed data, which is highly imbalanced and may adversely impact the clustering performance. We apply the Paraphrasing and ParaMote methods to balance the data. Paraphrasing almost always improves the performance, while the additional classifier check for class-label consistency (ParaMote) does not help.

Data Augmentation
We augmented the entire labeled data, including the majority class, by 3x using Paraphrasing (with class-label consistency) in our experiments. We aimed to understand whether this could yield a better pre-trained model that could eventually improve the clustering outcome. We do not observe any performance gains with the augmentation process.

Interactive Data Labeling
Our goal in this work is to develop a well-segmented learnt representation of the data with deep clustering and then to use the learnt representation to enable fast visual labeling. Figure 4 shows two clustered representations: one using BERT-base embeddings without pre-training, the other using a fine-tuned sentence BERT representation with pre-training. We obtain well-separated visual clusters using the latter approach. We use the drawing library human-learn to visually label the data. Figure 5 shows a selected region of the data with various labels and class confusion. We notice that this representation not only helps with the labeling but also helps with correcting labels and identifying utterances that belong to multiple classes and cannot be easily segmented. For example, 'children-valid-answer' and 'children-invalid-grow' (invalid answers) contain semantically similar content, depending on the game logic of the interaction. We perhaps need to group these together and use an alternative logic for implementing the game semantics.

Conclusion
In this exploration, we have used a fine-tuned sentence BERT model to significantly improve the clustering performance. The Predicted Cluster Sampling strategy for seed data selection seems to be a promising approach, with possibly lower variance in clustering performance for smaller data labeling tasks. Paraphrasing-based data imbalance handling slightly improves the clustering performance as well. Finally, we have utilized the learnt representation to develop a visual intent labeling system.

A.1 Additional Experimental Results
In addition to the Cluster-based Selection (CB) and Predicted Cluster Sampling (PCS) methods, we have explored other seed selection techniques and compared them with Random Sampling. These are the Known Cluster-based Selection (KCB) and Cluster-based Sentence Embedding (CSE) methods. KCB is a variation of CB where we cluster into a number of known labels' subsets (based on the known class ratio) and pick a certain percentage of data (based on the labeled ratio) from each cluster's data points. CSE, on the other hand, is another variation of CB where, instead of BERT word embeddings as the pre-trained representations, we use the sentence embedding model before running K-Means (the rest is the same as the CB method). Table 3 presents detailed clustering performance results on the three datasets using all five seed selection methods we explored, with varying labeled ratios and BERT embeddings (standard/BERT-base vs. sentence/SBERT models). In Table 4, we expand our analysis on the KidSpace dataset with data balancing/augmentation approaches on top of these five seed selection methods, once again with standard/sentence BERT embeddings. Table 5 presents additional results on the BANKING dataset comparing data balancing/augmentation methods on top of standard vs. sentence BERT representations.