Probing Task-Oriented Dialogue Representation from Language Models

This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks. We approach the problem from two aspects: a supervised classifier probe and an unsupervised mutual information probe. For the classifier probe, we fine-tune a feed-forward layer on top of a fixed pre-trained language model, using annotated labels in a supervised way. Meanwhile, we propose an unsupervised mutual information probe to evaluate the mutual dependence between a real clustering and a representation clustering. The goals of this empirical paper are to 1) investigate probing techniques, especially from the unsupervised mutual information aspect, 2) provide guidelines for pre-trained language model selection for the dialogue research community, and 3) gain insights into the pre-training factors that may be key to success in dialogue applications.


Introduction
Task-oriented dialogue systems achieve specific user goals within a limited number of dialogue turns via natural language. They have been used in a wide range of applications, such as booking restaurants (Wen et al., 2017), providing tourist information (Budzianowski et al., 2018; Wu et al., 2019b), ordering tickets (Schulz et al., 2017), and healthcare consultation. They are also crucial components of intelligent virtual assistants like Siri, Alexa, and Google Assistant.
Most task-oriented dialogue systems today benefit from transfer learning (Wu et al., 2019a; Lin et al., 2020), especially from pre-trained language models trained on general text, such as BERT (Devlin et al., 2018) and GPT2 (Radford et al., 2019). However, previous work claims that linguistic patterns can differ between written text and human conversation, resulting in a large gap between data distributions (Bao et al., 2019; Wolf et al., 2019b). Recently, several approaches have leveraged open-domain data (Henderson et al., 2019; Zhang et al., 2019) or aggregated task-oriented data (Wu et al., 2020) to pre-train language models. In this paper, we are interested in answering these questions: Which language model has the most informative representations, and for which task-oriented dialogue tasks? Does pre-training with dialogue-specific data or different objectives make any difference? We investigate how good these pre-trained representations are for a task-oriented dialogue system, setting aside model architectures and training strategies by probing only their final representations while keeping the language models fixed. A good representation implies better knowledge transfer and domain generalization ability, making downstream applications easier and cheaper to improve.
We tackle this problem with two probing solutions: a supervised classifier probe and an unsupervised mutual information probe. Classifier probes are commonly used in NLP to study properties such as morphology (Belinkov et al., 2017), sentence length (Adi et al., 2016), or linguistic structure (Hewitt and Manning, 2019). In this setting, we fine-tune a simple classifier for a specific task (e.g., intent identification) on a fixed pre-trained language model. The probe uses supervision to find the best transformation for each sub-task.
In addition, we present a mutual information probe that investigates these language models by directly clustering their output representations, as a recent study (Pimentel et al., 2020) suggests that a simple classifier may not achieve the best estimate of the mutual information between features and the downstream task. We apply two clustering techniques, K-means (Lloyd, 1982) and the Gaussian mixture model (Reynolds, 2009), and calculate the adjusted normalized mutual information (ANMI) (Vinh et al., 2010) between the predicted clustering and the true task-specific clustering.
We investigate 12 language models, as shown in Table 1, five of which have been pre-trained with dialogue data. We evaluate four core task-oriented dialogue tasks: domain identification, intent detection, slot tagging, and dialogue act prediction. They correspond to the commonly defined natural language understanding, dialogue state tracking, and dialogue management modules (Wen et al., 2017). We hope our probing analysis can provide insights to facilitate future task-oriented dialogue research. Some of the key observations in this work are summarized here (more discussion in Section 4.4):
• Whether open-domain or closed-domain, pre-training with dialogue data helps learn better representations for task-oriented dialogue.
• Pre-trained language models intrinsically contain more information about intents and dialogue acts but less for slots.
• ConveRT (Henderson et al., 2019) and TOD-BERT-jnt (Wu et al., 2020) have the highest classification accuracy and mutual information score, suggesting that response selection is useful for dialogue pre-training, especially when we compare TOD-BERT-jnt to TOD-BERT-mlm.
• Top models also include TOD-GPT2 and DistilBERT (Sanh et al., 2019). The distilled version of BERT surprisingly outperforms BERT and other strong baselines such as RoBERTa.
• DialoGPT and GPT2 do not perform well on mutual information evaluation but have a middle-ranking classification accuracy, implying that their representations are informative but not suitable for unsupervised clustering.
• Models such as AlBERT (Lan et al., 2019) and ELECTRA (Clark et al., 2020) have low classification accuracy and mutual information, showing the least useful information on task-oriented dialogue tasks.
Pre-Trained Language Models
We can roughly divide pre-trained language models into two categories: uni-directional and bi-directional. BERT-based systems are bi-directional language models, usually trained with the masked language modeling (MLM) objective, i.e., predicting the current masked token given its left and right context. GPT-based models, on the other hand, are uni-directional language models trained to always predict the next token in an autoregressive way. For a BERT-based model, we use the final-layer hidden state of its first token, [CLS], to represent an input sequence. This built-in token is designed to aggregate the information of the sequence. Since GPT-based models are uni-directional and have no counterpart to the [CLS] token, we use the mean pooling of their output hidden states to represent the input sequence, which works better than using only the last hidden state in our experiments.
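The two pooling strategies described above can be sketched as follows. This is a minimal illustration on plain Python lists, not the authors' code: the function names, the toy hidden states, and the use of an attention mask to skip padding positions are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def cls_pooling(hidden_states: List[Vector]) -> Vector:
    """Return the final-layer state of the first token ([CLS] in BERT-style models)."""
    return list(hidden_states[0])


def mean_pooling(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the final-layer states over the positions kept by the mask.

    Skipping padding positions via the mask is an assumption of this sketch.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for state, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += state[d]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Toy final-layer output: 4 positions, hidden size 2; the last position is padding.
states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

bert_style_representation = cls_pooling(states)
gpt_style_representation = mean_pooling(states, mask)
print(bert_style_representation, gpt_style_representation)
```

In a real pipeline, `states` would be the last hidden layer returned by the frozen language model for one utterance.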
BERT-based. BERT is a Transformer (Vaswani et al., 2017) encoder with a self-attention mechanism, which is trained on Wikipedia and BookCorpus using the MLM and next sentence prediction objectives. RoBERTa is a robustly optimized approach for BERT that improves it by training the model longer with bigger batches over more data and longer sequences, and by removing the next sentence prediction objective. Lan et al. (2019) proposed a lite BERT (AlBERT) that is trained with MLM and inter-sentence coherence losses and aims to lower memory consumption and increase training speed. With a similar motivation, Sanh et al. (2019) trained DistilBERT, which reduces the number of parameters by 40% using a triple loss that combines MLM, distillation, and cosine-distance losses. Clark et al. (2020) proposed ELECTRA using a sample-efficient pre-training task called replaced token detection. They used a generator network (ELECTRA-GEN) to replace tokens with plausible alternative tokens and trained a discriminative model (ELECTRA-DIS) to predict whether the generator replaced each token in the input.
Most of the pre-trained models above are trained on general text corpora with language modeling objectives. Henderson et al. (2019), on the other hand, used social media conversational data to train the ConveRT model. It is a Transformer-based dual-encoder model pre-trained on a dialogue response selection task using 727M Reddit (input, response) pairs. Very recently, Wu et al. (2020) proposed task-oriented dialogue BERT (TOD-BERT), which is initialized from BERT and further pre-trained on nine publicly available task-oriented dialogue corpora. They have one version with only the MLM objective (TOD-BERT-mlm) and another with both the MLM and contrastive response selection objectives (TOD-BERT-jnt). TOD-BERT has shown good performance on several task-oriented downstream tasks, especially in the few-shot setting.
GPT-based. GPT2 (Radford et al., 2019) is the representative uni-directional language model using a Transformer decoder, where the objective is to maximize the left-to-right generation likelihood. To ensure diverse and nearly unlimited text sources, they use Common Crawl to obtain 8M documents as its training data. Budzianowski and Vulić (2019) trained GPT2 on a task-oriented response generation task, taking the system belief, database result, and last dialogue turn as inputs. It uses only one dataset to train its model because few public datasets have database information available for pre-training. Zhang et al. (2019) pre-trained GPT2 on 147M open-domain Reddit data for response generation and called it DialoGPT. It aims to generate more relevant, contentful, and consistent responses for chit-chat dialogue systems. In this paper, following TOD-BERT's idea, we train a task-oriented GPT2 model (TOD-GPT2) that is built on the GPT2 model and further pre-trained with task-oriented datasets. We use the same collection of nine datasets as Wu et al. (2020) to pre-train the model as a reference.

Method
We define a dialogue corpus $D = \{D_1, \dots, D_M\}$ with $M$ dialogue samples, where each dialogue sample $D_m$ has $T$ turns of conversational exchange $\{U_1, S_1, \dots, U_T, S_T\}$ between a user and a system. For every utterance $U_t$ or $S_t$, we have human-annotated domain, user intent, slot, and dialogue act labels. We first feed all the utterances to a pre-trained model and obtain user and system representations. In this section, we first discuss how we design our classifier probe and then introduce the background and usage of our mutual information probe.

Classifier Probe
We use a simple classifier to transform those representations for a specific task and optimize it with annotated data:
$$V = A(\mathrm{FFN}(E)) \in \mathbb{R}^{N},$$
where $E \in \mathbb{R}^{d_B}$ is the output representation of an utterance from the fixed pre-trained model, $\mathrm{FFN}$ is a feed-forward layer that maps from dimension $d_B$ to a prediction with $N$ classes, and $A$ is an activation layer. For domain identification and intent detection, we use a Softmax layer and backpropagate with the cross-entropy loss. For dialogue slot and act prediction, we use a Sigmoid layer and the binary cross-entropy loss since they are multi-label classification tasks.
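To make the two probe heads concrete, here is a minimal sketch of a single feed-forward layer with either a softmax head (cross-entropy, for single-label tasks) or a sigmoid head (binary cross-entropy, for multi-label tasks). The representation `rep` stands in for the output of the frozen language model and is never updated. All names and toy values are illustrative assumptions, and the update uses plain SGD for brevity rather than the AdamW optimizer used in the experiments.

```python
import math
from typing import List


def linear(weights: List[List[float]], bias: List[float], rep: List[float]) -> List[float]:
    """Feed-forward layer: maps a d_B-dimensional representation to N logits."""
    return [sum(w * x for w, x in zip(row, rep)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits: List[float]) -> List[float]:
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def cross_entropy(probs: List[float], gold_class: int) -> float:
    """Single-label loss (domain identification, intent detection)."""
    return -math.log(probs[gold_class])


def binary_cross_entropy(probs: List[float], gold: List[int]) -> float:
    """Multi-label loss (slot and dialogue act prediction), averaged over classes."""
    terms = [y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(probs, gold)]
    return -sum(terms) / len(terms)


def sgd_step_softmax(weights, bias, rep, gold_class, lr=0.1):
    """One plain-SGD update of the probe only; `rep` (the frozen LM output) is untouched."""
    probs = softmax(linear(weights, bias, rep))
    for k in range(len(weights)):
        grad = probs[k] - (1.0 if k == gold_class else 0.0)  # d(loss)/d(logit_k)
        bias[k] -= lr * grad
        for d in range(len(rep)):
            weights[k][d] -= lr * grad * rep[d]


# Toy setup: d_B = 2, N = 3.
rep = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

loss_before = cross_entropy(softmax(linear(weights, bias, rep)), gold_class=1)
multi_label_loss = binary_cross_entropy(sigmoid(linear(weights, bias, rep)), [1, 0, 1])
sgd_step_softmax(weights, bias, rep, gold_class=1)
loss_after = cross_entropy(softmax(linear(weights, bias, rep)), gold_class=1)
print(loss_before, loss_after, multi_label_loss)
```

Because only `weights` and `bias` change, any gain in accuracy must come from information already present in the representation, which is the point of the probe.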

Mutual Information Probe
We first cluster utterances in an unsupervised fashion using either K-means (Lloyd, 1982) or a Gaussian mixture model (GMM) (Reynolds, 2009) with K clusters. Then we compute the adjusted mutual information score (Vinh et al., 2010) between the predicted clustering and each of the true clusterings (e.g., domain and intent) for different values of the hyperparameter K. Note that the predicted clustering does not depend on any particular labels.

Utterance Clustering
K-means is a common clustering algorithm that partitions N samples into K clusters $A = \{A_1, \dots, A_K\}$, assigning each sample to the cluster with the nearest centroid. Its objective is
$$\arg\min_{A} \sum_{i=1}^{K} \sum_{x \in A_i} \lVert x - \mu_i \rVert^2,$$
where $\mu_i$ is the centroid of cluster $A_i$, and the assignments and centroids are updated iteratively. On the other hand, GMM assumes a certain number of Gaussian distributions (K mixture components). It takes both the mean and the variance of the data into account, while K-means only considers the mean. Using the Expectation-Maximization algorithm, GMM first computes the probability that each sample belongs to a cluster $A_i$ during the E-step, then updates its density function with a new mean and variance during the M-step.
In our experiments, we cluster user utterances U and system responses S separately. Note that K is a hyper-parameter since we may not know the true distribution in a real scenario. To mitigate the local minimum issue, we run the clustering multiple times (typically ten runs) and use the best clustering result for mutual information evaluation.
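The restart strategy can be sketched with a small, self-contained K-means (Lloyd's algorithm). This is an illustration only: the experiments use the faiss implementation, and selecting the "best" run by lowest within-cluster squared distance (inertia) is an assumption of this sketch, as are the function names and toy data.

```python
import random
from typing import List, Tuple

Point = List[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points: List[Point], k: int, rng: random.Random,
          max_iter: int = 50) -> Tuple[List[int], float]:
    """One K-means run from a random initialization; returns (labels, inertia)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments did not change
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(p, centroids[label])
                  for p, label in zip(points, labels))
    return labels, inertia


def kmeans_best_of(points: List[Point], k: int, n_init: int = 10,
                   seed: int = 0) -> Tuple[List[int], float]:
    """Run K-means n_init times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1])


# Toy utterance representations forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, inertia = kmeans_best_of(points, k=2)
print(labels, inertia)
```

The resulting `labels` play the role of the predicted clustering that is later compared against the annotated labels.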

ANMI
To evaluate the agreement between two clusterings, we compute the ANMI score between a predicted clustering and its ground-truth annotation. ANMI is adjusted for chance, which corrects the bias of mutual information toward clusterings with a larger number of clusters. ANMI has a value of 1 when two partitions are identical, and an expected value of 0 for random (independent) partitions.
More specifically, we assume two clusterings, A and B, of the same N objects. The mutual information (MI) between A and B is defined by
$$\mathrm{MI}(A, B) = \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} P(i, j) \log \frac{P(i, j)}{P(i)\, P(j)},$$
where $P(i, j) = |A_i \cap B_j| / N$ is the probability that a randomly picked sample falls into both the $A_i$ and $B_j$ classes. Similarly, $P(i) = |A_i| / N$ and $P(j) = |B_j| / N$ are the probabilities that the sample falls into the $A_i$ or the $B_j$ class, respectively. The normalized mutual information (NMI) normalizes MI with the mean of the entropies, which is defined as
$$\mathrm{NMI}(A, B) = \frac{\mathrm{MI}(A, B)}{\mathrm{mean}(H(A), H(B))},$$
where
$$H(A) = -\sum_{i=1}^{|A|} P(i) \log P(i)$$
is the entropy of the A clustering, which measures the amount of uncertainty for the partition set.
MI and NMI are not adjusted for chance and tend to increase as the number of clusters increases, regardless of the actual amount of "mutual information" between the label assignments. Therefore, adjusted normalized mutual information (ANMI) is designed to correct the NMI score with its expectation, which is defined by
$$\mathrm{ANMI}(A, B) = \frac{\mathrm{NMI}(A, B) - \mathbb{E}[\mathrm{NMI}(A, B)]}{1 - \mathbb{E}[\mathrm{NMI}(A, B)]}.$$
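The quantities above can be computed directly from two label lists. The sketch below is a plain-Python illustration, not the implementation used in the experiments. It assumes that the expectation is taken under the permutation (hypergeometric) model of Vinh et al. (2010) with fixed cluster sizes, so that E[NMI] equals E[MI] divided by the mean entropy; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence

Labels = Sequence[Hashable]


def entropy(labels: Labels) -> float:
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Labels, b: Labels) -> float:
    n = len(a)
    size_a, size_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))  # |A_i ∩ B_j| for every non-empty pair
    return sum((nij / n) * math.log(n * nij / (size_a[i] * size_b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_information(a: Labels, b: Labels) -> float:
    """E[MI] over random partitions with the same cluster sizes as a and b."""
    n = len(a)
    lg = math.lgamma
    emi = 0.0
    for ai in Counter(a).values():
        for bj in Counter(b).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log-probability of observing an overlap of size nij.
                log_prob = (lg(ai + 1) + lg(bj + 1) + lg(n - ai + 1) + lg(n - bj + 1)
                            - lg(n + 1) - lg(nij + 1) - lg(ai - nij + 1)
                            - lg(bj - nij + 1) - lg(n - ai - bj + nij + 1))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(log_prob)
    return emi


def anmi(a: Labels, b: Labels) -> float:
    mean_h = 0.5 * (entropy(a) + entropy(b))
    if mean_h == 0.0:
        return 1.0  # both clusterings are a single cluster, hence identical
    nmi = mutual_information(a, b) / mean_h
    expected_nmi = expected_mutual_information(a, b) / mean_h
    return (nmi - expected_nmi) / (1.0 - expected_nmi)


same_partition = anmi([0, 0, 1, 1, 2, 2], ["x", "x", "y", "y", "z", "z"])
independent = anmi([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, independent)
```

Identical partitions score 1 regardless of how the cluster ids are named, while a pair with zero mutual information scores below 0 because chance agreement is subtracted.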

Datasets
The multi-domain Wizard-of-Oz (MWOZ) dataset (Budzianowski et al., 2018) is one of the most common benchmark datasets for task-oriented dialogue systems. We use MWOZ to evaluate the domain identification, dialogue slot tagging, and dialogue act prediction tasks. It contains 8420/1000/1000 dialogues for the training, validation, and test sets, respectively. There are seven domains in the training set and five domains in the others. There are 13 unique system dialogue acts and 18 unique slots, as shown in Table 2. In addition, we use the out-of-scope intent (OOS) dataset (Larson et al., 2019) for our intent detection experiment. The OOS dataset is one of the largest annotated intent datasets, including 15,100/3,100/5,500 samples for the train, validation, and test sets, respectively. It has 150 intent classes over ten domains and an additional out-of-scope intent class for user utterances that do not fall into any of the predefined intents. The whole intent list is shown in the Appendix.

Training Details
We first process user utterances and system responses using the tokenizer corresponding to each pre-trained model. To obtain each representation, we run most of the pre-trained models using the HuggingFace (Wolf et al., 2019a) library, except for ConveRT and TOD-BERT. To train the TOD-GPT2 model, we fine-tune GPT2 using its default hyper-parameters and the same nine datasets as Wu et al. (2020). For classifier probing, we fine-tune the top layer with a consistent hyper-parameter setting. We apply the AdamW (Loshchilov and Hutter, 2017) optimizer with a learning rate of 5e-5 and gradient clipping of 1.0. For MI probing, we use K = 4, 8, 16, 32, 64, 128, 256 with 50 iterations each and report the trend across K. We use GMM clustering from the scikit-learn library, and we adopt the K-means implementation from the faiss library (Johnson et al., 2017). Experiments were conducted on a single NVIDIA Tesla V100 GPU.

Evaluation
Domain identification and intent detection are multi-class classification problems. Therefore, we can directly use their annotated domain and intent labels to compute the ANMI scores. Slot tagging and dialogue act prediction, meanwhile, are multi-label classification problems. For example, an utterance can mention multiple slots (<food> and <price> slots) and trigger several actions (<greeting> and <inform> acts). In our experiment, we take a naive approach and treat each distinct combination of slots or acts as a separate label, e.g., the three slot sets <food>, <food, price>, and <price, location> belong to three different clusters.
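The set-to-label conversion for the multi-label tasks can be illustrated in a few lines. This is a sketch with a hypothetical helper name; it only assumes that the order of slots within an utterance does not matter.

```python
from typing import Dict, FrozenSet, Iterable, List


def label_sets_to_cluster_ids(label_sets: Iterable[Iterable[str]]) -> List[int]:
    """Map every distinct combination of slots (or acts) to its own cluster id."""
    ids: Dict[FrozenSet[str], int] = {}
    cluster_ids = []
    for labels in label_sets:
        key = frozenset(labels)  # order-insensitive: {food, price} == {price, food}
        cluster_ids.append(ids.setdefault(key, len(ids)))
    return cluster_ids


slot_sets = [["food"], ["food", "price"], ["price", "location"], ["price", "food"]]
print(label_sets_to_cluster_ids(slot_sets))  # → [0, 1, 2, 1]
```

The resulting ids can be used as the true clustering when computing ANMI for slot tagging and dialogue act prediction.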

Results
Classifier results are shown in Figure 1. We can observe that ConveRT, TOD-BERT-jnt, and TOD-GPT2 achieve the best performance, implying that pre-training with dialogue-related data captures better representations, at least in these sub-tasks. In addition, the performance of ConveRT and TOD-BERT-jnt suggests that it is helpful to pre-train with a response selection contrastive objective, especially when comparing TOD-BERT-jnt to TOD-BERT-mlm. Most of the pre-trained models also have a similar and high micro-F1 score in (d) system dialogue act prediction, as most of them [...] RoBERTa, and AlBERT show the worst classification results. Especially in (b) intent classification and (c) dialogue slot tagging, some of them seem to have zero useful information to make a prediction.
Mutual information results using K-means clustering are shown in Figure 2. Due to the space limit, we report the results using GMM in the Appendix, as the two have similar trends. In each subplot, the x-axis is the number of clusters, ranging from 4 to 256, and the y-axis is the ANMI score between a predicted clustering and its corresponding true clustering. In general, the mutual information probe results are similar to what we observe in the classifier probe. We find that TOD-BERT-jnt and ConveRT have the highest mutual information, usually followed by TOD-GPT2 and DistilBERT. Another observation is that representations from these pre-trained language models, especially the top ones, seem to have more connection with user intent and system dialogue act labels than with domain and slot labels. The average ANMI scores across 12 models and 7 different numbers of clusters for intent and dialogue act are 0.193 ± 0.169 and 0.226 ± 0.107, respectively, whereas domain and dialogue slot only reach 0.086 ± 0.087 and 0.077 ± 0.057 on average. We discuss each subplot in detail in the following.
Figure 2 (a) and (b) show the mutual information between the predicted clustering and the true domain labels on the MWOZ dataset. A user utterance seems to have higher domain mutual information than a system response. TOD-BERT-jnt, in this case, outperforms the others by a large margin, achieving around 0.4 ANMI with 8 clusters.
Figure 2 (c) shows user intent using user utterances. ConveRT surpasses the others by far in the mutual information of intent, achieving over 0.7 ANMI at 128 clusters, where the true number of classes is 151. Other than the top three models (ConveRT, TOD-BERT-jnt, and DistilBERT), the remaining pre-trained models have ANMI scores lower than 0.2.
Figure 2 (d) and (e) show the evaluation using the slot labels. When comparing (d) to (e), we find that user utterances contain more slot information than system responses (maximum ANMI of around 0.35 and 0.25, respectively). This is not surprising because the user in a task-oriented dialogue is usually the provider of slot information, stating which location or cuisine they prefer. TOD-BERT and ConveRT perform similarly in this case and still outperform the others by a large margin.
Figure 2 (f) shows the mutual information for the predicted clustering of system dialogue acts. We find that most of the pre-trained language models show a relatively high ANMI score (0.226 on average), closing the gap between their performance and that of the top model. ConveRT works best in this case, followed by TOD-BERT-jnt and TOD-GPT2, which have similar ANMI scores.

More Analysis
Difference Between Probes. Ideally, both probes should distinguish the quality of different pre-trained language models, i.e., features that can be easily classified or features that correlate highly with the true distributions are preferred. However, although the trends we observe from the two probing methods are similar in general, the resulting rankings are not the same. When comparing the rankings of GPT2 and DialoGPT in Figure 1 and Figure 2, we found that they obtain almost the worst ANMI scores but achieve fairly good classification accuracy. This observation means that their representations of different classes are "close" to each other, since a low ANMI score suggests a noisier clustering. Still, at the same time, it is not hard to find a hyperplane that discriminates those features well.
We discuss some possible reasons for this interesting observation in the following. The first guess is that these features may not follow a Gaussian distribution, as we assume during clustering, suggesting that more advanced clustering techniques can be investigated in future work. The second guess is that these features have an unavoidable clustering noise that can be denoised or debiased easily by a strong supervision signal. The third guess is that these features are clustered by some other factors that are not tested, while the factors we are interested in are scattered into groups for different classes in a similar way. Intuitively, there are four clustering results shown in Figure 5, where GPT2 and DialoGPT may fall into the (d) clustering type, which has a lower mutual information score but higher classification accuracy.

ConveRT
Cluster 1 (Failed Booking) i am sorry but dojo noodle bar is solidly booked at that time . i can try a different time or day for you . 1 moment while i try to make the reservation of table for 8 , friday at 16:30 . booking was unsuccessful . can you try another time slot ? i am very sorry i was unable to book at acorn guest house for 5 nights , would you like to try for a shorter stay ? i am afraid that booking is unsuccessful . would you like a different day or amount of days ?
Cluster 2 (Train Time) there are 5 trains available , may i book 1 for you that leaves at 7:40 and arrives at 10:23 ? tr0330 departs at 14:09 and arrives by 15:54 . would you like a ticket ? the tr2141 arrives by 15:27 . would you like me to reserve some seats for you ? i have train tr4283 that leaves cambridge at 5:29 and arrives in bishops stortford at 6:07 . would you like to make reservations ? i have a train that leaves cambridge 14:01 arriving in birmingham new street at 16:44 . would that work ?
Cluster 3 (Restaurant Request) there are 21 restaurant -s available in the centre of town . how about a specific type of cuisine ? there are 9 indian restaurant -s in centre what price range do you want ? i am sorry , there are no catalan dining establishments in the city centre . would you like to look for a different cuisine or area ? i found 4 restaurant -s with the name tandoori that serve indian food on the south , west , and east . do you have a location preference ? there are no singaporean restaurant -s , but there are cheap ones offering several different cuisines .
Cluster 4 (Confirm Booking) all set . your reference number is k2bo09vq . i have got you booked for 16:30 . the reference number is eq0yaq1g . your reservation was a success and the reference number is jtwxfm7m . i have got your booking set , the reference number is 9rmfgjma . i booked tr3932 , reference number is fiw5abo2 .
As a result, we suggest a simple rule of thumb regarding which probing results to rely on. In short, the results of the classifier probe could be useful if a supervised approach for a downstream task is designed, e.g., user dialogue act prediction and dialogue state tracking. On the other hand, the mutual information probe is more effective for an unsupervised problem, e.g., utterance clustering and dialogue parsing tasks.
Visualization. In Figure 3 and Figure 4, we visualize the embeddings of TOD-BERT-jnt and GPT2 given the same system responses from the MWOZ test set. Each point is reduced from its high-dimensional features to a two-dimensional point using t-distributed stochastic neighbor embedding (t-SNE). We use different colors to represent different domains (left), dialogue acts (middle), and turn slots (right). As one can observe, TOD-BERT-jnt has clearer group boundaries and better clustering results than GPT2. Visualization plots for other pre-trained models are shown in the Appendix.
What utterances are clustered together? In Table 3, we show clustering examples of system responses from the top-performing model, ConveRT. We use K = 32 clustering and randomly select five clusters and five samples each. We found that most of the utterances in cluster 1 are related to an unsuccessful booking, containing "I am sorry," "solidly booked," or "booking was unsuccessful." We also found other clusters showing good clustering results, such as selecting the departure or arrival time for a train ticket or requesting more user preferences for a restaurant reservation. More clustering results are shown in the Appendix.

Conclusion
We investigate representations from pre-trained language models for task-oriented dialogue tasks, including domain identification, intent detection, slot tagging, and dialogue act prediction. We use a supervised classifier probe and a proposed unsupervised mutual information probe. From the ranking results of the two probes, we show a list of interesting observations to provide model selection guidelines and shed light on future research towards more advanced language modeling for dialogue applications.

i can recommend the allenbell . it s in the east , is cheap yet has a 4 star rating and free wifi and parking . can i help you book ? the university arms is an expensive , 4 star hotel with free wifi . comparatively , the alexander bed and breakfast is a cheap -ly priced guesthouse , also 4 stars . i have found the guesthouse you were wanting . would you like me to book this for you ? how about the express by holiday inn cambridge , it s in the east . the expensive 1 is actually not much more than the other 2 . i would highly recommend it . that would be at the express by holiday inn cambridge . it s in the east .
Cluster 4 (Hotel Inform) the address is hills road city centre their address is unit g6 , cambridge leisure park , clifton road . the postcode is cb17dy . the address is corn exchange street . is there anything else i can help you with ? yes , the phone number is 01223277977 . the address is hotel felix whitehouse lane huntingdon road , and the post code is cb30lx . want to book ? the bridge guest house is at 151 hills road and their number is 01223247942 .
Cluster 5 (Welcome/End) you are welcome . is there anything else i can help you with today ? great . is there anything else that you need help with ? is there anything else that you would like ? no problem . can i help you with anything else ? is there something else i can help you with then ?
Table 6: Clustering results of the TOD-BERT-jnt model. The samples are randomly picked from each of five randomly selected clusters (using K=32).

Cluster 2 your reference number is x5ny66zv . i booked tr3932 , reference number is fiw5abo2 . nusha is in the south , and the phone number is 01223902158 . they are located at 12 lensfield road city centre , postcode cb21eg , and phone number 01842753771 . it would cost 16.50 pounds .
Cluster 3 i hope i have been of help the entrance fee is free . anything else i can do for you today ? sure , lookout for a blue volvo the contact number is 07941424083 . can i help with anything else ? 1 moment while i try to make the reservation of table for 8 , friday at 16:30 . i have 3 options for you 2 in the north in the moderate price range and 1 that s expensive in the east .
Cluster 4 when would you like to leave and arrive ? booking was unsuccessful . can you try another time slot ? on what day will you be traveling ? tr3823 will arrive at 16:55 , would that work for you ? okay , what day did you have in mind ?
Cluster 5 saffron brasserie is an expensive restaurant that serves italian food there are 21 restaurant -s available in the centre of town . how about a specific type of cuisine ? i have 5 different restaurant -s to choose from . there are 4 in the centre of town , and 1 in the west . do you have a preference ? i have about 5 different entertainment venue -s if that is what you are looking for . do you have a preference on the area its located in ? there are no colleges close to the area you are requesting , would you like to chose another destination ?
Table 7: Clustering results of the GPT2 model. The samples are randomly picked from each of five randomly selected clusters (using K=32).

DialoGPT
Cluster 1 it is located in jesus lane your booking was successful , the reference number is waeyaq0m . may i assist you with anything else today ? your booking is successful ! your reference number is iigra0mi . do you need anything else ? 1 moment while i try to make the reservation of table for 8 , friday at 16:30 . this booking is successful for 1 night . your reference number is 85bgkwo4 . is there anything else i can assist you with ?
Cluster 2 sure , how many days and how many people ? i recommend castle galleries and it s free to get in ! i have plenty of trains departing from leicester , what destination did you have in mind ? i have 5 colleges in the centre area . what specific college are you looking for ? oh yes quite a few . which part of town will you be dining in ?
Cluster 3 i have many options available for you ! is there a certain area or cuisine that interests you ? there are lots to choose from under that criteria . what day would you like to travel on ? actually all 5 have free wifi . what star rating would you like ? i have found the guesthouse you were wanting . would you like me to book this for you ? yes , the hamilton lodge has internet .
Cluster 4 its entrance fee is free . sure , lookout for a blue volvo the contact number is 07941424083 . can i help with anything else ? how many people is the reservation for ? how about train tr3934 ? it leaves at 12:34 and arrives at 13:24 . travel time is 50 minutes . sure , the phone number is 01223902112 and they are in postcode cb58sx . can i help you with anything else today ?