Self-Supervised Contrastive Learning for Efficient User Satisfaction Prediction in Conversational Agents

Turn-level user satisfaction is one of the most important performance metrics for conversational agents. It can be used to monitor the agent’s performance and provide insights about defective user experiences. While end-to-end deep learning has shown promising results, having access to the large number of reliable annotated samples required by these methods remains challenging. In a large-scale conversational system, there is a growing number of newly developed skills, making the traditional data collection, annotation, and modeling process impractical due to the required annotation costs and turnaround times. In this paper, we suggest a self-supervised contrastive learning approach that leverages the pool of unlabeled data to learn user-agent interactions. We show that models pre-trained using the self-supervised objective are transferable to user satisfaction prediction. In addition, we propose a novel few-shot transfer learning approach that ensures better transferability for very small sample sizes. The suggested few-shot method does not require any inner-loop optimization process and is scalable to very large datasets and complex models. Based on our experiments using real data from a large-scale commercial system, the suggested approach is able to significantly reduce the required number of annotations, while improving generalization to unseen skills.


Introduction
Nowadays, automated conversational agents such as Alexa, Siri, Google Assistant, and Cortana are widespread and play an important role in many different aspects of our lives. Their applications vary from storytelling and education for children to assisting the elderly and disabled with their daily activities. Any successful conversational agent should be able to communicate in different languages and accents, understand the conversation context, analyze query paraphrases, and route requests to the various skills available for handling the user's request (Ram et al., 2018).

* Work done as an intern at Amazon Alexa AI.
In such a large-scale system with many components, it is crucial to understand whether the human user is satisfied with the automated agent's responses and actions. In other words, it is desirable to know if the agent is communicating properly and providing the service expected by the user. In the literature, this is referred to as targeted turn-level satisfaction, as we are only interested in the user's satisfaction for a certain conversation turn given the context of the conversation, and not the overall satisfaction for the whole conversation (Park et al., 2020). Perhaps the most basic use of a user satisfaction model is to monitor the performance of an agent and to detect defects as a first step toward fixing issues and improving the system. By anticipating user dissatisfaction at a certain turn in a conversation, an agent can ask the user to repeat the request or provide more information, improving the final experience. Also, a powerful user satisfaction model can be used as a ranking or scoring measure to select the most satisfying response among a set of candidates and hence guide the conversation.
The problem of user satisfaction modeling has recently attracted significant research attention (Jiang et al., 2015; Bodigutla et al., 2019; Park et al., 2020; Pragst et al., 2017; Rach et al., 2017). These methods either rely on annotated datasets providing ground-truth labels to train and evaluate (Bodigutla et al., 2019) or rely on ad hoc or human-engineered metrics that do not necessarily model the true user satisfaction (Jiang et al., 2015). Access to reliable annotations for building satisfaction models has been very challenging, partly because a large-scale conversation system supports many different devices as well as voice, language, and application components, providing access to a wide variety of skills. The traditional approach of collecting samples from the live system traffic and tasking human annotators to label them does not scale, due to the cost of annotations as well as the turnaround time required to collect and annotate data for a new skill or feature. Note that onboarding new skills in a timely manner is crucial to ensuring active skill-developer engagement.
To address this problem, we propose a novel training objective and transfer learning scheme that significantly improves not only data efficiency but also model generalization to unseen skills. In summary, we make the following contributions:

• We propose a contrastive self-supervised training objective that can leverage virtually any unlabeled conversation data to learn user-agent interactions.
• We show that the proposed method can be used to pre-train state-of-the-art deep language models and the acquired knowledge is transferable to the user satisfaction prediction.
• We suggest a novel and scalable few-shot transfer learning approach that further improves label efficiency when only a handful of labeled samples is available.
• We conduct extensive experiments using data from a large-scale commercial conversational system, demonstrating significant improvements to label efficiency and generalization.

Related Work User Satisfaction in Conversational Systems
The traditional approach to evaluating a conversational system is to evaluate different functionalities or skills individually. For instance, for a knowledge question answering or web search skill, one can use response quality metrics commonly used to evaluate search and ranking systems, such as nDCG (Järvelin et al., 2008; Hassan, 2012; Fox et al., 2005). While these methods provide justifiable measures for certain skills, they are not extendable to a large number of skills, especially skills without a proper set of hand-engineered features and metrics, or newly developed third-party skills (Bodigutla et al., 2019). Another, more general, line of research is to evaluate the performance of a conversation system from the language point of view. Here, the objective is to measure how naturally, both syntactically and semantically, an automated agent is able to interact with a human user. For instance, using generic metrics such as BLEU (Papineni et al., 2002) or ROUGE (Lin, 2004), one can measure how consistent the agent's responses are with a set of provided ground-truth answers. However, these approaches not only suffer from shortcomings such as inconsistency with human judgments (Liu et al., 2016; Novikova et al., 2017) but are also impractical for a real-world conversation system due to their dependence on ground-truth responses.
A more recent approach is to use human annotations specifically tailored for the user satisfaction task as a source of supervision to train end-to-end prediction models (Bodigutla et al., 2019). Jiang et al. (2015) suggested training individual models for 6 general skills and devised engineered features to link user actions to the user satisfaction for each studied skill. Park et al. (2020) proposed a hybrid method to learn from human annotation and user feedback data that is scalable and able to model user satisfaction across a large number of skills.

Contrastive Learning
Gutmann and Hyvärinen (2010) were the first to propose the idea of noise-contrastive estimation, capturing a distribution using an objective function that distinguishes samples of the target distribution from samples of an artificially generated noise distribution. Contrastive predictive coding (CPC) (Oord et al., 2018) suggested using an NCE objective to train an autoregressive sequence representation model. Deep InfoMax (Hjelm et al., 2018) used self-supervised contrastive learning in an architecture where a discriminator is trained to distinguish between representations of the same image (positive samples) and representations of different images (negative samples). While many different variations of contrastive methods have been suggested, the main idea remains the same: defining a self-supervised objective to distinguish between the hidden representations of samples from the original distribution and samples from a noise distribution (Trinh et al., 2019; Devon et al., 2020; Yao et al., 2020).

Few-shot Transfer Learning
Few-shot transfer learning is a very active and broad subject of research. We limit the scope of our study to methods in which a form of gradient supervision is provided by a target task to ensure the efficient transferability of representations trained on a source task. Lopez-Paz and Ranzato (2017) suggested the idea of joint multi-task training using the cosine similarity of the concatenated network gradients from the source and target tasks. For gradients with negative cosine similarity, they project the source gradients to a more aligned direction by solving a quadratic programming problem. Luo et al. (2020) continued that line and suggested a method in the context of few-shot transfer learning, showing that using even a few samples from the target task can significantly improve the transferability of the trained models. Li et al. (2020) presented a similar idea but suggested adjusting learning rates for each layer to improve the cosine similarity of different tasks. While these methods show promising results, they only measure the similarity between concatenated gradient vectors consisting of all network parameters, which is a very rough measure of alignment. Also, they require solving a quadratic or iterative optimization problem as an inner loop in the training procedure, which can be computationally expensive and often prohibitive for large-scale problems.

Problem Definition
In this paper, we consider the conversational interaction between a human user and an automated agent. Each interaction consists of a set of turns in which the user provides an utterance and the agent provides an appropriate response. A set of turns happening within a certain time window is grouped as a conversation session. Formally, we can represent a session as a set of turns:

S_i = {(U_i^t, R_i^t)}, t = 0, …, T.   (1)

Here, S_i represents session i consisting of a set of turns as tuples of utterances and responses, (U_i^t, R_i^t), from the first turn t = 0 to the last turn t = T in that session.
In the context of turn-level user satisfaction modeling, we are interested in the classification of a certain targeted turn within a session as either satisfying (SAT) or dissatisfying (DSAT). Note that satisfaction here is defined based on the agent's response given a certain utterance and the context (i.e., the other session turns). We use the notation Y_i^{t*} ∈ {SAT, DSAT} to indicate the user satisfaction for the targeted turn t = t* of session i. See Figure 2 for examples of SAT/DSAT interactions.
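To make the notation concrete, a session and its targeted-turn label can be represented with a small container like the following. This is an illustrative sketch; the class and field names are hypothetical and not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A session S_i is a list of (utterance U_t, response R_t) tuples, with one
# targeted turn t* whose satisfaction label Y is "SAT" or "DSAT" (absent for
# unlabeled data). Names here are illustrative only.
@dataclass
class Session:
    turns: List[Tuple[str, str]]   # (U_t, R_t) for t = 0..T
    targeted_turn: int             # index t* of the turn being judged
    label: Optional[str] = None    # "SAT" / "DSAT", or None for D_unsup

    def targeted(self) -> Tuple[str, str]:
        return self.turns[self.targeted_turn]

session = Session(
    turns=[("play jazz", "Playing jazz."),
           ("stop", "OK."),
           ("play rock", "Playing rock music.")],
    targeted_turn=2,
    label="SAT",
)
assert session.targeted() == ("play rock", "Playing rock music.")
```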

Datasets
In this study, we use real-world data from Alexa, a large-scale commercial conversational agent. Specifically, we use a dataset of about 891,000 real-world conversation sessions in which a certain turn within each session is annotated by a human annotator as SAT or DSAT. Human annotators had access to the session context and followed a standard labeling protocol (further information is provided in Appendix A). As a preprocessing step, we limited turns within each session to a window of five turns: at most two turns before the targeted turn, the targeted turn, and at most two turns after the targeted turn. This labeled dataset is denoted as D_sup.
In addition to D_sup, we also use a large pool of real-world session data without any annotation or label. This dataset is about twice the size of D_sup, but as we are not limited to targeted turns, we keep all session turns and decide context windows based on a randomized data augmentation step. The resulting effective sample size is significantly larger than that of D_sup. We denote this unlabeled dataset as D_unsup. As both datasets were sampled from real traffic, we ensured that there is no overlap between D_unsup and the evaluation splits of D_sup.
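The randomized context-window step for the unlabeled data can be sketched as follows. The details are an assumption for illustration: the targeted turn is sampled uniformly and at most two turns are kept on each side, matching the five-turn windows used for D_sup.

```python
import random

# Sketch of the randomized context-window augmentation (assumed details):
# for an unlabeled session, pick a random targeted turn and keep at most
# `max_context` turns on each side, mirroring the 5-turn windows of D_sup.
def sample_context_window(turns, rng, max_context=2):
    t_star = rng.randrange(len(turns))          # random targeted turn
    lo = max(0, t_star - max_context)
    hi = min(len(turns), t_star + max_context + 1)
    return turns[lo:hi], t_star - lo            # window, targeted index within it

rng = random.Random(0)
window, t = sample_context_window([f"turn{i}" for i in range(10)], rng)
assert 1 <= len(window) <= 5
assert window[t].startswith("turn")
```

Because every turn of every unlabeled session can serve as a targeted turn, this step effectively multiplies the number of usable training windows, which is why the effective sample size of D_unsup is much larger than its session count suggests.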
The conversations cover a wide variety of internally developed (1p) and third-party (3p) developer skills. Due to the imbalanced traffic, there is a huge variation in the number of samples for different skills in our datasets. For instance, 1p skills such as music or weather have hundreds of thousands of samples, while many 3p skills have fewer than 10 samples throughout our datasets. To properly evaluate the performance of our predictors on such imbalanced data, we propose a novel approach to splitting and evaluating the data. We build two test sets: one measuring in-domain performance and another measuring out-of-domain generalization. The in-domain test set consists of samples from skills that the train set covers. The out-of-domain test set measures performance on skills that are not covered by the train set. Ideally, we would like to observe good classification performance on both test splits, indicating the ability of our models to learn and model the current major traffic and to generalize to less frequent or future traffic. Based on this, we split D_sup into 70% train, 15% validation, and the rest for test (about 1/5 of the test samples are out-of-domain and 4/5 are in-domain). The in-domain and out-of-domain test sets consist of 17 and 275 skills, respectively. D_unsup is randomly split into 80% train and the rest for validation, regardless of skills. Table 1 presents a summary of dataset statistics for D_sup.

Table 1: Summary of dataset statistics for D_sup.
Property                          Size
Total number of samples           ≈ 891,000
Total number of 1p skills         > 20
Total number of 3p skills         > 1,500
Ratio of SAT to DSAT samples      > 20


Network Architecture

Figure 1: Overview of the suggested network architecture. A BERT encoder with average pooling at the last layer is used as the LM. We consider a context window of at most 2T+1 turns. Heads are simple MLP classifiers with one hidden layer.

Figure 1 shows a high-level drawing of the network architecture used in our experiments. It consists of a language model (LM) that encodes utterance and response pairs into vector representations. Here, we consider up to T turns before and after the targeted turn. To further summarize the list of previous or next turns, we use GRU layers (Chung et al., 2014). Then, an average pool is used to produce a representation vector, z, for each session. Note that before the pooling, simple non-linear MLPs are used to transform each partial representation. Finally, z is used as an input to a set of different head networks, responsible for making predictions for different objectives.
Regarding the LM, we use the standard BERT encoder (Devlin et al., 2018) architecture, pre-trained as suggested by Liu et al. (2019). To make a fixed-length representation of the utterance-response pairs (i.e., turn semantics), we use an average pool over the BERT token representations at the last encoder layer. We also tried other approaches, such as using the classification token instead of pooling, but based on our initial results, simple pooling performed consistently better.
We share our BERT-based LM parameters across the network to encode the session turns. However, we train separate GRU networks to summarize the previous and next turns. The output dimension of the LM is 768, the size of the standard BERT hidden layer. The hidden layer and output size of our GRUs is 256, and we use 2-layer bidirectional GRUs. Each head is a simple MLP with a single hidden layer of size 256 followed by a ReLU nonlinearity. The final network consists of about 117.7 million parameters, of which about 110 million belong to BERT and the rest to the GRUs, heads, etc.
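As a rough sanity check on this parameter budget, the GRU contribution can be counted in pure Python using the standard PyTorch GRU parameterization (per layer and direction: a 3h×i input weight, a 3h×h recurrent weight, and two 3h bias vectors). This is a back-of-the-envelope sketch, not the paper's accounting.

```python
# Parameter count for a (possibly bidirectional, multi-layer) GRU following
# the standard PyTorch parameterization: per layer and direction,
# 3h*(i + h) weights + 6h biases, where the input to layer > 0 is h * num_dirs.
def gru_params(input_size, hidden_size, num_layers=2, bidirectional=True):
    dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        i = input_size if layer == 0 else hidden_size * dirs
        per_dir = 3 * hidden_size * (i + hidden_size) + 6 * hidden_size
        total += dirs * per_dir
    return total

one_gru = gru_params(768, 256)   # 2-layer bidirectional, input 768, hidden 256
both_grus = 2 * one_gru          # separate GRUs for previous and next turns
assert one_gru == 2_758_656
# ~5.5M for the two GRUs; together with BERT's ~110M plus the MLP transforms
# and heads, this is consistent with the ~117.7M total reported above.
```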

Supervised Learning Baseline
As a baseline approach, we use the network defined in Section 3.3 with a binary classification head to distinguish SAT and DSAT samples. Here, we use the labels provided by D_sup and a binary cross-entropy (BCE) loss function. An Adam optimizer (Kingma and Ba, 2014) with a batch size of 512 is used to train the network for 10 epochs. The base learning rate for all non-BERT layers is set to 10^-3, while for BERT layers, we use a smaller learning rate of 5 × 10^-5. The learning rates are decayed by a factor of 5 twice, at 60% and 80% of the total iterations. Unless indicated otherwise, we use a similar training setup for the other experiments in this paper.
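The step-decay schedule above can be sketched as a simple function. This is illustrative only; in actual training these rates would be attached to separate optimizer parameter groups for the BERT and non-BERT layers.

```python
# Step-decay schedule described above: the base rate (1e-3 for non-BERT
# layers, 5e-5 for BERT layers) is divided by 5 at 60% and again at 80%
# of the total training iterations.
def learning_rate(step, total_steps, base_lr, decay=5.0):
    lr = base_lr
    if step >= 0.6 * total_steps:
        lr /= decay
    if step >= 0.8 * total_steps:
        lr /= decay
    return lr

assert learning_rate(50, 100, 1e-3) == 1e-3            # before first decay
assert abs(learning_rate(70, 100, 1e-3) - 2e-4) < 1e-12  # after first decay
assert abs(learning_rate(90, 100, 5e-5) - 2e-6) < 1e-12  # after both decays
```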

Self-Supervised Objective
We define a self-supervised objective in which the model is tasked with distinguishing real sessions from unreal (or noisy) sessions. Any unlabeled dataset, such as D_unsup, can be used to sample real sessions. To generate unreal textual information, different approaches have been suggested in the literature, such as back-translation (Fang and Xie, 2020), generative modeling (Liu et al., 2020), or even random word substitutions.
In this work, we leverage the multi-turn and structured nature of sessions to generate noise samples by simply shuffling the targeted utterances/responses within each training batch (see Figure 3 for an example). Intuitively, the noise samples are sessions in which the targeted utterance or response does not belong to the rest of the session. Therefore, the model has to capture the joint distribution of the context and targeted turns. Algorithm 1 shows an overview of the sample generation and training process for the proposed contrastive objective.

Algorithm 1: Contrastive Self-Supervised Training
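A minimal sketch of the in-batch noise generation follows. The details are assumptions for illustration: here the negatives are built with a cyclic shift, which is one simple way to shuffle targeted turns so that no session keeps its own.

```python
# Sketch of in-batch noise generation: each positive pairs a context with its
# own targeted turn (label 1); each negative pairs a context with the targeted
# turn of another session in the batch (label 0). A cyclic shift guarantees
# no session keeps its own targeted turn when the batch has > 1 sample.
def make_contrastive_batch(contexts, targeted_turns):
    n = len(contexts)
    perm = list(range(1, n)) + [0]      # cyclic shift, never maps i -> i
    positives = [(c, t, 1) for c, t in zip(contexts, targeted_turns)]
    negatives = [(contexts[i], targeted_turns[perm[i]], 0) for i in range(n)]
    return positives + negatives

contexts = ["ctx-a", "ctx-b", "ctx-c"]
turns = ["turn-a", "turn-b", "turn-c"]   # turn-x belongs to ctx-x
batch = make_contrastive_batch(contexts, turns)
assert len(batch) == 6
assert all(c[-1] == t[-1] for c, t, y in batch if y == 1)  # positives match
assert all(c[-1] != t[-1] for c, t, y in batch if y == 0)  # negatives mismatch
```

The model is then trained with a binary objective to tell the label-1 pairs from the label-0 pairs, forcing it to model how a targeted turn relates to its surrounding context.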

Contrastive Pretraining
The objective introduced in Section 4.1 is not directly applicable as a user satisfaction model. One approach to leveraging the pool of unsupervised data is to pre-train the model on unlabeled data using the self-supervised objective, then attach a classifier head and fine-tune the network to distinguish SAT and DSAT samples. In our implementation, we pre-train using the self-supervised objective on D_unsup for 10 epochs, then train a classifier head on D_sup for another 10 epochs, scaling the learning rates for the network body to 0.1× the base learning rates (see Section 3.4 for more information on the learning rate setup).

Few-Shot Learning
In the pretraining approach, we solely relied on the loose semantic relationship between the self-supervised and user satisfaction modeling tasks. However, it is desirable to have a representation that not only solves the self-supervised task but is also useful for the final objective. In other words, we have a source task (S) for which we have a large number of training samples and a target task (T) with a limited number of samples that is our main interest. The idea is to use information from the target task during source training such that the trained model is most compatible with the target.
Let us assume we have datasets D_S and D_T corresponding to the source (S) and target (T) tasks, as well as inference functions for each task: f_S(·|θ, ω_S) and f_T(·|θ, ω_T). In this notation, θ represents the shared network parameters (i.e., the body in our architecture) and ω represents the task-specific parameters (i.e., a head in our architecture). Formally, when optimizing for task S, we are interested in:

θ*, ω_S* = argmin_{θ, ω_S} E_{(x,y)∼D_S} [ L_S(f_S(x|θ, ω_S), y) ],   (2)

where L_S is the loss function for the source task. A simple gradient descent step to solve this problem can be written as:

θ ← θ − η ∇_θ L_S(f_S(x|θ, ω_S), y).   (3)

However, we are interested in optimization steps that do not increase the loss value for task T:

L_T(f_T(x|θ − η ∇_θ L_S, ω_T), y) ≤ L_T(f_T(x|θ, ω_T), y).   (4)

Considering (4) as an optimization constraint can potentially halt the optimization, because improvements to the source objective do not directly translate to improvements to the target task. In other words, the constraint above may not always be directly satisfiable using gradient steps in the source domain.
To overcome this issue, instead of using gradient descent, we define the problem as a Randomized Block Coordinate Descent (RBCD) (Nesterov, 2012; Wright, 2015) optimization. At each RBCD iteration, only a subset of the model parameters, i.e., a block denoted as b, is sampled from a distribution B and used for the gradient descent update²:

θ_b ← θ_b − η ∇_{θ_b} L_S(f_S(x|θ, ω_S), y),   b ∼ B.   (5)

Note that we only use the RBCD optimization for the network body parameters (θ), while the head parameters (ω_S and ω_T) are optimized using regular gradient descent.

In this work, we propose adjusting the block selection distribution, B, such that parameters whose source and target gradients are more aligned have a higher chance of being selected:

P(b ∈ block) = 1 if ⟨∇_{θ_b} L_S, ∇_{θ_b} L_T⟩ > 0, and α otherwise,   (6)

where the inputs to L_S and L_T are omitted for brevity. Intuitively, (6) discourages parameter updates that are not aligned with the T task, which can be viewed as a soft method of enforcing the constraint in (4). There are multiple options for the granularity of the block selection, such as layer-wise, neuron-wise, or element-wise. Based on our initial experiments, we found that layer-wise block elements result in the best performance.

² Note that the block selection operation is discrete (a certain parameter either belongs to the block or not), but the distribution B can be a continuous or discrete probability distribution.
Algorithm 2 shows an outline of the proposed method. At each iteration within the training loop, we back-propagate the S and T losses and store the gradients of the layer parameters. For parameters belonging to the S head, we follow a simple gradient descent update. For body parameters, we only update a layer if the inner product of its S and T gradients is positive, or at random with a small probability α. To guarantee the convergence of the source task, we allow every parameter to be selected at each step with at least this small probability α. In our experiments, we treat α as a hyperparameter taking values in {0.001, 0.005, 0.01, 0.05, 0.1}. Additional care is required when updating the T head parameters, as D_T is usually much smaller than D_S and the T head is prone to overfitting. We use a validation set from task T to detect overfitting of the T head and early-stop its updates. A hyperparameter λ sets the frequency of the T head updates after early stopping. Less frequent head updates allow the T head to gradually improve and adapt to the changes in the body without overfitting. In our experiments, we search for proper λ values in {0.001, 0.002, 0.005, 0.01}.
In contrast to other works in the literature, which mostly leverage the alignment of concatenated gradients (Lopez-Paz and Ranzato, 2017; Luo et al., 2020), we propose layer-wise similarity measurements, providing more granularity and adaptability. Also, the suggested approach does not require any inner-loop optimization process or gradient projection and is hence scalable to large-scale problems. The only computational and memory overhead is storing the model gradients with respect to each task and computing the per-layer gradient inner products.
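The layer-wise block selection can be sketched in pure Python. This is an illustrative sketch, not the paper's implementation: gradients are flat lists per layer, and the function only decides which layers receive a gradient step this iteration.

```python
import random

# Layer-wise block selection sketch: a layer's parameters are updated when its
# source and target gradients have a positive inner product, and otherwise
# only with a small probability alpha (to guarantee source-task convergence).
def select_layers(grads_S, grads_T, alpha, rng):
    selected = []
    for name in grads_S:
        inner = sum(gs * gt for gs, gt in zip(grads_S[name], grads_T[name]))
        if inner > 0 or rng.random() < alpha:
            selected.append(name)
    return selected

grads_S = {"layer1": [1.0, -2.0], "layer2": [0.5, 0.5]}
grads_T = {"layer1": [1.0, 1.0],  "layer2": [1.0, 1.0]}   # layer1 misaligned
chosen = select_layers(grads_S, grads_T, alpha=0.0, rng=random.Random(0))
assert chosen == ["layer2"]   # inner products: layer1 = -1, layer2 = +1
```

In a training loop, layers outside the selected block would simply skip their gradient step for that iteration, which is all the method requires beyond storing the two sets of gradients.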
The method explained in this section is general to few-shot transfer learning and joint training settings where a large source dataset is used to learn representations that are most useful for a final target task. For our use case, we apply the suggested approach with the source task, S, being the self-supervised contrastive objective and the target task, T, being the user satisfaction prediction task. In our experiments, after the joint training process, we reinitialize the T head and fine-tune the network for the T task. We found this approach helpful in achieving the best results, as the jointly trained T head is often slightly overfitted.

Algorithm 2: The Proposed Few-Shot Training (outline)
for each training iteration:
    loss_S ← L_S(f_S(x_S), y_S); backpropagate and store the gradients
    loss_T ← L_T(f_T(x_T), y_T); backpropagate and store the gradients
    update the S head by gradient descent; update the T head at frequency λ
    // layer-wise RBCD update of the body
    for P in LayerParameters:
        update P if ⟨∇_P loss_S, ∇_P loss_T⟩ > 0, or with probability α

Experimental Setup
We used PyTorch (Paszke et al., 2017) to train our models. For each case, we continue training for the maximum number of epochs (10 in our experiments) and select the best model based on validation performance. We conducted our experiments on a cluster of 48 NVIDIA V100 GPUs (16 GB memory, 8 GPUs per machine). Individual experiments took between about 6 and 27 hours to run, depending on the case.
For each experiment, we report the Area Under the ROC Curve (AUC-ROC) and the Area Under the Precision-Recall Curve (AUC-PR) as performance measures. The results for the in-domain and out-of-domain held-out test sets are reported separately. Note that there is an imbalance in the frequency of SAT and DSAT labels, and also a difference in the label distribution between the in-domain and out-of-domain test sets. To ensure the statistical significance of the results, each experiment is repeated four times using random initializations, reporting the mean and standard deviation.

Quantitative Results

Figure 4 shows a comparison of the in-domain test results for the supervised training and the self-supervised contrastive pretraining methods. For each case, we report the in-domain test performance using models trained with different numbers of annotated training samples. The x-axis is plotted on a log scale. It can be seen that the contrastive self-supervised approach is much more data-efficient than the supervised approach, as it leverages the pool of unlabeled data. Figure 5 shows a comparison between the supervised training and the self-supervised pretraining methods on the out-of-domain test set. Similar to the in-domain case, there is a significant gap between the labeled-data efficiency of these approaches. However, in contrast to the in-domain case, the gap does not appear to close even when using all training samples. In other words, for the out-of-domain test set, the self-supervised approach is not only more data-efficient but also tends to generalize better. In a real-world conversation system, out-of-domain generalization can be crucial, as many new skills are developed and included in the system every day, making the traditional in-domain human annotation less practical due to the required annotation turnaround time.
In Figure 6 and Figure 7, we compare the in-domain and out-of-domain performance of the self-supervised pretraining method with the proposed few-shot learning method. As can be seen from Figure 6, the in-domain AUC-PR and AUC-ROC for the few-shot learning approach are consistently better than those of the self-supervised pretraining approach. Note that the performance gap closes at about 5000 samples, perhaps because that is enough training data to fine-tune and successfully transfer the pretrained model. The out-of-domain performance reported in Figure 7 shows better results for the few-shot approach, but the margin of improvement is smaller than in the in-domain case.
Note that in the presented results, we focus our comparisons on methods that are scalable and leverage human annotation data for turn-level satisfaction prediction, excluding approaches using human-engineered and skill-specific metrics as well as methods that only consider the quality of conversation from the language perspective. Table 3 in Appendix B presents a qualitative comparison of the baseline supervised training and the self-supervised approach suggested in this paper.

Conclusion
This paper suggested a self-supervised objective for learning user-agent interactions that leverages the large amounts of unlabeled data available. In addition to the standard fine-tuning approach, this paper presented a novel few-shot transfer learning method based on adjusting the RBCD block selection distribution to favor layer parameters whose source and target gradients point in similar directions. According to experiments using real-world data, the proposed approach not only requires significantly fewer annotations but also generalizes better to unseen out-of-domain skills.

A Annotation Protocol
In the following, we provide a summary of the main points considered in producing the annotations used in this paper:

• Human annotators were trained to annotate samples; i.e., we do not use domain-specific metrics or other automated success measures as annotations.
• It was made clear to the annotators that the task is turn-level user satisfaction, not the overall satisfaction over the session. Also, instructions were provided on how to handle ASR errors, repeated requests, multiple users in one utterance, and many other special cases.
• The annotators were provided the targeted turn as well as a few context turns. This helped them to better understand the actual user intention and judge accordingly.
• They were asked to rate the system's response quality in terms of user satisfaction on a scale of 1 to 5, from terrible to excellent. See Table 2 for the score categories and an example of each category.
• To ensure the quality of annotations, each sample was annotated multiple times by different annotators. In our analysis, we consider all samples with a score of 3 or better as SAT, and DSAT otherwise. Also, in our data pipeline, we treated different annotations of the same utterance as different samples. However, care was taken in the data split process to ensure there is no train data contamination in our validation and test sets.

B Qualitative Results

Table 3 presents a qualitative comparison of the baseline supervised training and the self-supervised approach suggested in this paper. Here, to highlight the generalization and data efficiency of each method, we limit the number of annotated samples to 1024 random samples from the training set of the D_sup dataset. For this table, we provide sample sessions chosen with an emphasis on more difficult requests, unclear requests, or requests involving 3p skills. U and R indicate the targeted utterance and response, while U+x and R+x indicate the context utterances and responses appearing x turns after the targeted turn.

From these examples, it can be inferred that the self-supervised approach provides a deeper understanding of the user-agent interaction and is able to generalize better, even for infrequent 3p skills. This is consistent with the quantitative results presented in the paper.