Joint Turn and Dialogue level User Satisfaction Estimation on Multi-Domain Conversations

Dialogue-level quality estimation is vital for optimizing data-driven dialogue management. Current automated methods to estimate turn and dialogue-level user satisfaction employ hand-crafted features and rely on complex annotation schemes, which reduce the generalizability of the trained models. We propose a novel user satisfaction estimation approach which minimizes an adaptive multi-task loss function in order to jointly predict turn-level Response Quality labels provided by experts and explicit dialogue-level ratings provided by end users. The proposed BiLSTM based deep neural net model automatically weighs each turn's contribution towards the estimated dialogue-level rating, implicitly encodes temporal dependencies, and removes the need to hand-craft features. On dialogues sampled from 28 Alexa domains, two dialogue systems and three user groups, the joint dialogue-level satisfaction estimation model achieved up to an absolute 27% (0.43 → 0.70) and 7% (0.63 → 0.70) improvement in linear correlation performance over baseline deep neural net and benchmark Gradient Boosting regression models, respectively.


Introduction
Automatic turn and dialogue-level quality evaluation of end-user interactions with Spoken Dialogue Systems (SDS) is vital for identifying problematic conversations and for optimizing dialogue policy using a data-driven approach, such as reinforcement learning. One of the main requirements for designing data-driven policies is to automatically and accurately measure the success of an interaction. Automated dialogue quality estimation approaches, such as Interaction Quality (IQ) (Schmitt et al., 2012) and, recently, Response Quality (RQ) (Bodigutla et al., 2019a), were proposed to capture satisfaction at turn level from an end user's perspective. Automated models to estimate IQ (Ultes et al., 2014; Schmitt et al., 2011; Asri et al., 2014) used a variety of features derived from the dialogue turn, the dialogue history, and the output of three Spoken Language Understanding (SLU) components, namely: Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), and the dialogue manager. RQ prediction models (Bodigutla et al., 2019a) further extended the feature sets with features derived from the dialogue context and the aggregate popularity and diversity of topics discussed within a dialogue session.

(* Currently at LinkedIn, but did this work at Amazon.)
Using automatically computed diverse feature sets and expert ratings to annotate turns overcame limitations of earlier approaches to measuring dialogue quality at turn level, such as relying on a sparse sentiment signal (Shi and Yu, 2018), intrusively soliciting user feedback after each turn, and using a manual feature extraction process to estimate turn-level ratings (Engelbrecht et al., 2009; Higashinaka et al., 2010).
For predicting user satisfaction at dialogue level, the IQ estimation approach was shown to generalize to dialogues from different domains (Schmitt and Ultes, 2015). Using annotated user satisfaction ratings to estimate dialogue-level quality overcame the limitation of using task success (Schatzmann et al., 2007) as the dialogue evaluation criterion: the task success metric does not capture frustration caused in intermediate turns and assumes the end user's goal is known in advance. However, the IQ annotation approach of rating each turn incrementally lowered Inter-Annotator Agreement (IAA) for multi-domain dialogues (Bodigutla et al., 2019b). Multi-domain dialogues are conversations that span multiple domains (Table 1) in a single dialogue session. On the contrary, RQ ratings were provided for each turn independently and were shown to be highly consistent, generalizable to multi-domain conversations and highly correlated with turn-level explicit user satisfaction ratings (Bodigutla et al., 2019b). Furthermore, using predicted turn-level RQ ratings as features, end users' explicit dialogue-level ratings for complex multi-domain conversations were accurately predicted across dialogues from both new and seasoned user groups (Bodigutla et al., 2019b). An earlier widely used approach, PARADISE (Walker et al., 2000), where the model is trained using noisy end-of-dialogue ratings provided by users, did not generalize to diverse user populations (Deriu et al., 2019).
Despite generalizing to different user groups and domains, both the turn and dialogue-level quality estimation models trained using annotated RQ ratings (Bodigutla et al., 2019a,b) used automated, yet hand-crafted, features. Modern-day SDS support interoperability between different dialogue systems, such as "pipeline based modular" and "end-to-end neural" dialogue systems. Hand-crafted features designed for one system are not guaranteed to generalize to dialogues on a new system. RQ based dialogue-level satisfaction estimation models (Bodigutla et al., 2019b) did not factor in noise in explicit user ratings and used the average estimated turn-level RQ rating as a feature to train the model. Each turn's success or failure was assumed to contribute equally to the overall dialogue rating. However, a user might be dissatisfied even if most of the turns in the dialogue were successful (example in Appendix Table 8).
In order to address the aforementioned limitations of hand-crafted features, we propose an LSTM (Hochreiter and Schmidhuber, 1997) based turn-level RQ estimation model, which implicitly encodes temporal dependencies and removes the hand-crafting of turn-level and temporal features. Along with turn-level features that are not dialogue-system or user-group specific, we use features derived from pre-trained Universal Sentence Encoder (USE) embeddings (Cer et al., 2018) of the user utterance and system response texts to train the model. Pre-trained sentence representations provided by the USE Transformer model achieved excellent results on semantic relatedness and textual similarity tasks (Perone et al., 2018).
Using an adaptive multi-task loss weighting technique (Kendall et al., 2017) and attention (Vaswani et al., 2017) over predicted turn-level ratings, we further extend the turn-level model to design a novel BiLSTM (Graves et al., 2013) based joint turn and dialogue-level quality estimation model. To test the generalization performance of the proposed approaches, we estimate turn and dialogue-level ratings on multi-turn, multi-domain conversations sampled from three user groups, spanning 28 domains (e.g., Music, Weather, Movie & Restaurant Booking) across two different dialogue systems.
To the best of our knowledge, this is the first attempt to leverage noise adaptive multi-task deep learning approach to jointly estimate annotated turn-level RQ and user provided dialogue level ratings for multi-domain conversations from multiple user groups and dialogue systems.
The outline of the paper is as follows: Section 2 discusses the choice of RQ annotation. Sections 3 and 4 present the novel approaches to estimate turn and dialogue-level quality ratings. Section 5 summarizes the turn and dialogue-level data and presents our experimental setup. Section 6 provides an empirical study of the models' performance. Section 7 concludes.

Response Quality for Turn and Dialogue level Quality Estimation
Interaction Quality (IQ) (Schmitt et al., 2012) and Task Success (TS) (Schatzmann et al., 2007) measures require an annotator to accurately determine the task that the user is aiming to accomplish through a dialogue, which is non-trivial for multi-domain conversations (Bodigutla et al., 2019b).

Turn-level Dialogue Quality Estimation
In this section we discuss previous turn-level satisfaction estimation models trained using RQ ratings, their limitations and our approach to overcome them.
Similar to Bodigutla et al. (2019a), we define a dialogue turn at time n as t_n = (t^u_n, t^s_n), where t^u_n and t^s_n represent the user request and system response on turn n respectively (Figure 1). A dialogue session of N turns is defined as (t_1 : t_N). In experiments conducted by Bodigutla et al. (2019a), a Gradient Boosting Regression (Friedman, 2001) model gave the best turn-level RQ prediction performance. Features used to train the model were derived from the current turn (t_n), the dialogue history (t_{1:n-1}) and the next turn's user request (t^u_{n+1}). In addition to deriving domain-independent features from three SLU components, namely Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), and the dialogue manager, five new feature sets were introduced by the authors to improve the performance of the turn-level satisfaction estimation model.
Features used in the model were automatically computed, yet they were carefully hand-engineered (see Appendix Table 11). Features were hand-crafted to identify and rank factors contributing to the predicted satisfaction rating, but they do not generalize easily to different dialogue systems. Introduced originally by the authors of RQ, the "un-actionable request" feature was computed by identifying the presence of particular key words (e.g., "sorry", "i don't know") in the system's response. This rule-based feature does not generalize to a system that uses a different set of phrases to indicate its inability to satisfy the user's request. Temporal dialogue-level features computed over turns (t_1 : t_n) were also hand-crafted, computed by taking simple aggregate statistics (e.g., mean) over turn-level features.

LSTM-based Response Quality Estimation Models
In order to overcome the limitation of hand-crafting temporal features, we propose using a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) based model to estimate turn-level satisfaction ratings sequentially on a continuous [1−5] scale. Rach et al. (2017) showed that by using only turn-level features, pre-computed temporal features were no longer required for estimating IQ with an LSTM network. To keep the turn-level dialogue quality estimation system causal (Li Tan, 2013), where the output at the current time step depends only on the current and previous steps, we do not introduce bi-directionality (Graves et al., 2013) into the network architecture (see Figure 2). Unlike the dialogue-level rating, which is computed at the end of a dialogue session, only past dialogue context is available to compute a turn's quality rating. Causality enables using the turn-level model to optimize dialogue policies online. Models encoding sentences into embedding vectors have been successfully used in transfer learning and in several downstream Natural Language Processing tasks (e.g., classification and semantic textual similarity detection). Pre-trained sentence representations provided by the Universal Sentence Encoder (USE) (Cer et al., 2018) model achieved excellent results on semantic relatedness and textual similarity tasks (Perone et al., 2018).
To address the limitation of features derived from hand-crafted rules, we use feature sets derived from the pre-trained (512-dimensional) USE embeddings produced by its Transformer variant. We introduce a set of five features derived from USE embeddings of the user request and system response texts (see Table 2). These features are then concatenated with turn-level features obtained from the SLU output (e.g., ASR confidence score), the dialogue manager output (e.g., system response) and predicted intent and domain popularity statistics. The concatenated features are passed as input to each time-step of the uni-directional turn-level satisfaction estimation deep LSTM network (Figure 2), which minimizes the mean square error loss between actual and predicted turn-level RQ ratings.

Figure 2: Uni-directional LSTM model to predict RQ ratings at each time-step to estimate dialogue quality at turn level. e_usr, e_sys and turn-features are the pre-trained Universal Sentence Encoder embeddings of the user request and system response, and the rest of the features in Table 2.

Table 2: Feature sets used in the proposed models (cf. Appendix Table 11). Note the ∼65% relative drop in the number of features (48 → 17). resp., conf., avg., sim., #, & req. indicate response, confidence, average, similarity, count and request respectively.
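As a concrete sketch of this architecture, the uni-directional turn-level estimator can be written in PyTorch roughly as follows. Dimensions and layer sizes here are illustrative assumptions (512-d USE embeddings plus 15 extra turn-level features), not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TurnQualityLSTM(nn.Module):
    """Sketch of the causal, uni-directional turn-level RQ estimator."""
    def __init__(self, use_dim=512, n_turn_feats=15, hidden=512):
        super().__init__()
        in_dim = 2 * use_dim + n_turn_feats  # e_usr ++ e_sys ++ turn features
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)     # one RQ score per time step

    def forward(self, x):                    # x: (batch, n_turns, in_dim)
        h, _ = self.lstm(x)                  # causal: step n sees turns 1..n only
        return self.head(h).squeeze(-1)      # (batch, n_turns) predicted RQ

# Mean square error against annotated RQ ratings, as described above:
model = TurnQualityLSTM()
x = torch.randn(4, 6, 2 * 512 + 15)          # 4 dialogues of 6 turns each
pred = model(x)
loss = nn.functional.mse_loss(pred, torch.full((4, 6), 3.0))
```

Because the LSTM is uni-directional, each per-turn prediction depends only on the current and earlier turns, which is what allows the model to be used online for dialogue policy optimization.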

Dialogue-level Quality Estimation
In this section we discuss the novel joint turn and dialogue quality estimation approach.

Joint Estimation of Turn and Dialogue Level Ratings
Turn-level satisfaction estimation helps identify a particular turn's success from an end user's perspective. In addition to predicting whether an individual turn was successful, we need a dialogue-level user satisfaction metric for learning dialogue policies that maximize end-user satisfaction with the overall dialogue. A dialogue-level metric also helps in identifying problematic dialogues which caused dissatisfaction to the end user. We propose a novel approach (Figure 3) to jointly predict turn and dialogue-level satisfaction ratings for a given dialogue. Unlike turn-level satisfaction estimation, we are not constrained to use only the historical context of a dialogue, as the entire context of the dialogue is available when predicting a dialogue-level rating. Hence, instead of LSTMs, we use a deep BiLSTM (Graves et al., 2013) network for the dialogue-level satisfaction estimation task. Ultes (2019) showed that a BiLSTM with self-attention (Zheng et al., 2018) gave the best performance on the IQ prediction task and implicitly encoded temporal dependencies. Feature inputs to the joint model are the same as those we use for turn-level quality estimation in Section 3.1.
An individual turn's predicted RQ rating does not provide enough information to estimate whether an entire dialogue is satisfactory. Bodigutla et al. (2019b) used the average turn-level predicted RQ rating as a feature to estimate dialogue-level quality. We hypothesize that users do not weigh each turn's success (or failure) equally when determining the end dialogue rating (example conversation in Appendix Table 8). We apply attention (Vaswani et al., 2017) over the turn-level ratings and concatenate the aggregate weighted turn-level rating with the entire dialogue's representation (hidden state h_{t_N} in Figure 3) before passing it through the sigmoid activation layer for dialogue rating prediction.
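The joint architecture described above can be sketched as follows. Layer sizes and the exact placement of the attention and output heads are assumptions for illustration; the sketch shows the key idea of attending over per-turn RQ predictions and concatenating the weighted rating with the dialogue representation before the sigmoid output:

```python
import torch
import torch.nn as nn

class JointQualityBiLSTM(nn.Module):
    """Sketch of the joint turn/dialogue satisfaction model."""
    def __init__(self, in_dim=2 * 512 + 15, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.turn_head = nn.Linear(2 * hidden, 1)    # per-turn RQ prediction
        self.attn = nn.Linear(2 * hidden, 1)         # attention scores over turns
        self.dial_head = nn.Linear(2 * hidden + 1, 1)

    def forward(self, x):                            # x: (batch, T, in_dim)
        h, _ = self.bilstm(x)                        # (batch, T, 2*hidden)
        turn_rq = self.turn_head(h).squeeze(-1)      # (batch, T)
        a = torch.softmax(self.attn(h).squeeze(-1), dim=1)
        weighted_rq = (a * turn_rq).sum(dim=1, keepdim=True)
        # Concatenate weighted turn rating with the final hidden state
        summary = torch.cat([h[:, -1], weighted_rq], dim=-1)
        # Sigmoid output rescaled to the 1-5 dialogue rating range
        dial = 1.0 + 4.0 * torch.sigmoid(self.dial_head(summary)).squeeze(-1)
        return turn_rq, dial

model = JointQualityBiLSTM()
turn_rq, dial = model(torch.randn(2, 5, 2 * 512 + 15))
```

Learned attention weights `a` let the model emphasize dissatisfactory turns more than satisfactory ones, rather than averaging all turns equally.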
Explicit dialogue-level ratings provided by end users are noisy, as users are not always attentive enough to provide correct feedback (Su et al., 2015). To address the difference in noisiness of the labels provided for each task, we followed the approach of Kendall et al. (2017) and use homoscedastic (task-dependent) uncertainty to weigh the losses from the two tasks, where the multi-task loss function is derived by maximizing a Gaussian likelihood with homoscedastic uncertainty (Equation 1). The sufficient statistic f^W(X) is the output of a neural network with weights W on input X; Y_t (turn ratings) and Y_d (dialogue ratings) are the model outputs.

Figure 3: BiLSTM based joint turn and dialogue-level satisfaction estimation model.

In the next section we describe the multi-task loss function we minimize for jointly estimating turn and dialogue-level quality ratings.
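Equation 1 is not reproduced in this copy of the text. Following the homoscedastic-uncertainty formulation of Kendall et al. (2017) that the paragraph cites, the factored Gaussian likelihood it refers to plausibly takes the form:

```latex
p\left(Y_t, Y_d \mid f^{W}(X)\right)
  = \mathcal{N}\!\left(Y_t;\, f^{W}(X),\, \sigma_t^{2}\right)
    \cdot \mathcal{N}\!\left(Y_d;\, f^{W}(X),\, \sigma_d^{2}\right)
```

Maximizing this likelihood with respect to W, σ_t and σ_d yields the weighted multi-task objective described in the Multi-task Loss section.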

Multi-task Loss
Equation 2 shows the multi-task loss function L we minimize. L_t and L_d are the mean square error losses computed on turn-level RQ ratings and dialogue-level user ratings respectively. Minimizing the objective function with respect to the noise parameters σ_t and σ_d can be interpreted as learning the weights for L_t and L_d adaptively from the data: the higher a task's noise, the lower the weight of the corresponding loss. Weighting the losses with learnt weights also helps bring the losses from the two tasks onto the same scale.
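Equation 2 itself is missing from this copy. The standard Kendall et al. (2017) form the paragraph describes, L = L_t/(2σ_t²) + L_d/(2σ_d²) + log σ_t + log σ_d, can be sketched as follows (function and parameter names are illustrative; in practice log σ is learnt as a free parameter for numerical stability):

```python
import numpy as np

def multitask_loss(L_t, L_d, log_sigma_t, log_sigma_d):
    """Homoscedastic-uncertainty weighting of the two MSE losses:
    L = L_t / (2*sigma_t^2) + L_d / (2*sigma_d^2)
        + log(sigma_t) + log(sigma_d)
    """
    precision_t = np.exp(-2.0 * log_sigma_t) / 2.0   # 1 / (2*sigma_t^2)
    precision_d = np.exp(-2.0 * log_sigma_d) / 2.0   # 1 / (2*sigma_d^2)
    return (precision_t * L_t + precision_d * L_d
            + log_sigma_t + log_sigma_d)

# With unit noise on both tasks, the losses are weighted equally:
balanced = multitask_loss(1.0, 1.0, log_sigma_t=0.0, log_sigma_d=0.0)
```

The log-sigma terms act as regularizers: without them, the model could trivially shrink both losses by inflating the noise estimates.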

Data and Experimental Setup
This section describes our turn and dialogue-level datasets and explains our experimentation setup.

Dialogue Quality Data
In order to test the generalizability of the turn and dialogue-level user satisfaction models across different domains, user groups and dialogue systems, we sampled 3,129 dialogue sessions (20,167 turns) (Table 3). These multi-domain dialogues (example goals users try to achieve are in Appendix Table 10) are representative of end-user interactions with Alexa and were randomly sampled from two dialogue systems. Dialogue-system A uses a pipelined modular dialogue agent comprising ASR, NLU, State Tracker, Dialogue Policy and Natural Language Generation components (Williams et al., 2016). Dialogue-system B is an end-to-end neural model (Ritter et al., 2011; Shah et al., 2018) that shares only the ASR component with system A (Fig. 4).
Each turn was rated by expert RQ annotators and dialogue-level ratings were provided by end users. Users provided their satisfaction rating with the dialogue on a discrete [1−5] scale at the end of each session, irrespective of the outcome. Similar to Bodigutla et al. (2019b), the rating scale we asked the users to follow was 1=Very dissatisfied, 2=Dissatisfied, 3=Moderately Satisfied (or Slightly dissatisfied), 4=Satisfied and 5=Extremely Satisfied. Since earlier attempts to estimate explicit dialogue-level satisfaction ratings did not generalize to different user populations (see Section 1), we collected dialogue ratings from users belonging to "novice" (15%), "some experience" (33%) and "experienced" (52%) groups. A novice user has minimal experience conversing with the SDS and has never used the functionality provided by the 28 domains prior to the study. A user with some experience has interacted with some (but not all) domains, whereas an experienced user is a seasoned user of Alexa and its domains.

Experimental Setup
This section describes the experimental setup we used for training and evaluating turn and dialogue level satisfaction estimation models.

Turn-level Dialogue Quality Estimation
Similar to Bodigutla et al. (2019a), we considered regression models for predicting the turn-level satisfaction rating on a continuous [1−5] scale. We experimented with two variants of the turn-level satisfaction estimation model described in Section 3.1. In the first variant (LSTM-embeddings) we passed the concatenated pre-trained USE sentence embeddings of the user request and system response as input to each time step of the LSTM based model. In the second variant (LSTM-embeddings-features) we concatenated the USE embeddings with the rest of the 15 turn-level features mentioned in Table 2. We benchmarked the performance of the two LSTM models against the best-performing (Bodigutla et al., 2019a) turn-level Gradient Boosting Regression model trained with 48 hand-crafted features (Appendix Table 11).

Dialogue-level Quality Estimation
We experimented with eight models to estimate dialogue-level user satisfaction ratings. Three of the eight models were used as baselines: 1) a Gradient Boosting Regression (G.Boost) model trained using features derived from the entire dialogue context (t_{1:N}), including hand-crafted turn-level and temporal features (see Appendix Table 11); 2) a two-layer BiLSTM model (BiLSTM-features) trained with all turn-level features (Table 2), except for the embeddings themselves; 3) the BiLSTM-features model with a self-attention mechanism (BiLSTM-attn-features), which is also a variant of the best-performing IQ estimation model (Ultes, 2019). For benchmarking we used the best-performing (Bodigutla et al., 2019b) G.Boost RQ dialogue-level quality estimation model, which used the average predicted RQ rating as an additional feature to train the G.Boost model.
The remaining models we experimented with included two variants of our proposed BiLSTM based joint dialogue quality estimation model, which use attention over the predicted RQ ratings to predict the dialogue-level rating (see Section 4.1). The first variant used only USE embeddings as features (Joint-attn-embeddings) and the second (Joint-attn-embeddings-features) used all the turn-level features mentioned in Table 2. The joint models minimized the adaptive weighted loss (Eq. 2). All the deep neural models we experimented with used the Adam (Kingma and Ba, 2014) optimizer with learning rate 0.0001, a mini-batch size of 64 and hidden vector size 512. We used early stopping and dropout (0.5) (Srivastava et al., 2014) regularization to avoid overfitting. Hyper-parameter ranges we experimented with are in Appendix Table 12.
For both dialogue and turn level quality estimation, dialogues were randomly split into training (80%), validation (10%) and test (10%) sets, so that turns from the same dialogue do not appear in both test and training sets.
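The dialogue-grouped 80/10/10 split described above can be sketched with scikit-learn's `GroupShuffleSplit`, which guarantees that all turns sharing a dialogue id fall on the same side of each split (the function and variable names here are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_dialogue(turns, dialogue_ids, seed=0):
    """80/10/10 train/validation/test split keyed on dialogue id, so that
    turns from the same dialogue never straddle a split boundary."""
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(outer.split(turns, groups=dialogue_ids))
    # Split the held-out 20% in half: 10% validation, 10% test.
    rest_groups = [dialogue_ids[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(rest_idx, groups=rest_groups))
    return (list(train_idx),
            [rest_idx[i] for i in val_rel],
            [rest_idx[i] for i in test_rel])

turns = list(range(100))
dialogue_ids = [i // 5 for i in turns]   # 20 toy dialogues of 5 turns each
train, val, test = split_by_dialogue(turns, dialogue_ids)
```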
We trained and evaluated the performance of the turn and dialogue-level quality estimation models on dialogues from dialogue-system A and from both systems A & B combined. In the first case we used all turn-level features mentioned in Table 2. In the second case we excluded features derived from NLU, as dialogue-system B did not use NLU output.

Evaluation Criteria
We used Pearson's linear correlation coefficient (r) for evaluating each model's 1-5 prediction performance. For the use case to identify problematic turns from an end user's perspective, it is sufficient to identify satisfactory (rating ≥3) and dissatisfactory (rating < 3) interactions (Bodigutla et al., 2019b). We used F-score for the dissatisfactory class as the binary classification metric, as most turns and dialogues belong to the satisfactory class. Dialogue-level ratings have a smoother distribution (Pearson's moment coefficient of skewness −0.27) over turn-level RQ ratings (skewness −0.64).
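Both evaluation criteria are straightforward to compute with SciPy and scikit-learn; a minimal sketch (the helper name and toy ratings are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred, threshold=3.0):
    """Pearson's r on the 1-5 ratings, plus F-score on the
    dissatisfactory (rating < 3) class."""
    r, _ = pearsonr(y_true, y_pred)
    f_dissat = f1_score(np.array(y_true) < threshold,
                        np.array(y_pred) < threshold)
    return r, f_dissat

r, f = evaluate([1, 2, 4, 5, 3], [1.5, 2.2, 3.8, 4.9, 3.1])
```

Treating the dissatisfactory class as the positive label matters here because most turns and dialogues are satisfactory, so accuracy alone would be dominated by the majority class.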

Results and Analysis
This section presents the turn and dialogue-level user satisfaction estimation results.

Table 4: Turn-level dialogue quality estimation models' performance, measured using correlation between predicted and true ratings and F-score on the dissatisfactory class (F-dissat.). Cells show the mean and 95% bootstrap confidence interval; the highest mean is in bold for statistically significant improvement over the benchmark Gradient Boosting Regression model's performance.

Turn-level satisfaction Estimation
As shown in Table 4, our proposed LSTM based turn-level quality estimation model outperformed the benchmark Gradient Boosting Regression model and removed the need to hand-craft features. Even when NLU features were not used, on dialogues from both dialogue systems, the best-performing (LSTM-embeddings-features) model achieved a ∼3% relative improvement in correlation (0.74 → 0.76) and a statistically significant (at 95% bootstrap confidence interval) 8.3% relative improvement (0.72 → 0.78) in F-score on the dissatisfactory class, over the benchmark model.

Analysis of turn-level model's performance on new domain
To further test the generalizability of the LSTM-embeddings-features model to new domains, we wanted to verify that the model was not overfitting to domain-specific vocabulary. To achieve this, we trained the turn-level model with varying percentages of dialogues from a new "movie reservation & recommendation" domain hosted on dialogue-system A. The training set consisted of dialogues from systems A & B and the specified percentage of dialogues from the new domain. Consistent with the results in (Bodigutla et al., 2019b), the prediction performance dropped when no dialogues from the new domain were in the training set (results in Appendix Table 14). However, when the model was trained with a randomly sampled mere 10% (9% train, 1% validation) of dialogues (∼6% slot-value coverage), the prediction performance on the F-dissatisfactory metric (0.75 ± 0.01) was on par (difference not statistically significant) with the overall performance achieved by the model when it was trained with 90% (80% train, 10% validation) of dialogues (Table 4). Performance parity in terms of correlation (0.74 ± 0.03) was achieved when the LSTM-embeddings-features model was trained with 60% (54% train, 6% validation) of dialogues (∼50% slot-value coverage). These two observations imply that improving binary prediction performance requires fewer training dialogues than accurately identifying the degree of user (dis)satisfaction.
In order to further understand the relationship between slot types and annotated RQ labels, we calculated the Pointwise Mutual Information (PMI) score for the new domain between its 8 slot types and 5 RQ labels (40 values in total). Most of the dissatisfactory turns were associated with the system not interpreting theater names (slot type 'theater') and instructions containing numbers (e.g., "pick the fourth one") correctly. Validating our hypothesis that users do not perceive all turns' failures equally, based on the PMI scores users seem more dissatisfied with the system's failure to identify "theater" (RQ rating 1) than with its failure to interpret numeric instructions (RQ rating 2). We calculated the cosine similarity between the 40-dimensional PMI score vector of (slot type, RQ label) pairs in each selected training set and the PMI score vector computed on the entire set of dialogues in the new domain. As shown in Appendix Table 14, the turn-level model's performance on the new domain improves with the similarity score. This observation suggests that the model is not overfitting to domain-specific vocabulary (e.g., movie names); instead, it learns the extent of user (dis)satisfaction with failures/successes of the different (slot) types of requests he/she makes.
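The PMI computation over (slot type, RQ label) pairs can be sketched as below. The toy counts are invented for illustration; they mimic the pattern reported above, where the 'theater' slot co-occurs disproportionately with low RQ labels:

```python
import math
from collections import Counter

def pmi_table(pairs):
    """PMI between slot types and RQ labels from observed (slot, label)
    pairs: PMI(s, l) = log( p(s, l) / (p(s) * p(l)) ).
    A smoothing-free sketch over observed cells only."""
    n = len(pairs)
    joint = Counter(pairs)
    slots = Counter(s for s, _ in pairs)
    labels = Counter(l for _, l in pairs)
    return {(s, l): math.log((c / n) / ((slots[s] / n) * (labels[l] / n)))
            for (s, l), c in joint.items()}

# Toy data: 'theater' turns skew toward RQ label 1 (dissatisfactory).
pairs = ([("theater", 1)] * 6 + [("theater", 5)] * 2
         + [("number", 2)] * 4 + [("number", 5)] * 4)
scores = pmi_table(pairs)
```

Flattening such a table into a vector (one entry per slot/label cell) allows the cosine-similarity comparison between training subsets and the full domain described in the text.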

Dialogue-level Satisfaction Estimation
As shown in Table 5, including USE embeddings as features improved the performance of the dialogue-level satisfaction estimation models. Specifically, on data from both systems, the BiLSTM-embeddings-features and BiLSTM-attn-embeddings-features models achieved around an absolute 15%-18% statistically significant improvement in both correlation and F-score on the dissatisfactory class over their respective counterparts BiLSTM-features and BiLSTM-attn-features.

Analysis of learnt Attention Weights
For the Joint-attn-embeddings-features model, Table 6 shows the attention weights learnt on the predicted turn-level (RQ) and true RQ ratings for each turn of a sample dialogue. The joint model puts more weight on the dissatisfactory turns than on the satisfactory ones, and the dialogue was correctly identified as dissatisfactory. Table 7 shows an example dialogue where a generous (Kulikov et al., 2018) user was satisfied (dialogue rating 4.0) even though the system did not offer any alternate time slots or restaurant suggestions when his/her initial request to book a table could not be fulfilled. However, the model predicted the dialogue as dissatisfactory.

Conclusions
In this paper, we proposed a novel approach to use consistent annotated turn-level Response Quality (RQ) ratings for dialogue-level user satisfaction estimation on conversations which span three user groups, 28 domains and two dialogue systems.

(Results are not broken down further by domain, since a multi-domain conversation session comprises turns which may belong to more than one domain and context is shared between them. The example dialogue discussed is not a real user conversation with the live system.)
With the help of pre-trained Universal Sentence Encoder (USE) embeddings, we removed the need to hand-craft features. Leveraging the noise-adaptive multi-task loss weighting technique and aggregating predicted RQ ratings using an attention mechanism, we developed the BiLSTM based deep joint turn & dialogue-level satisfaction estimation model. The best-performing joint model achieved up to a 27% absolute statistically significant improvement in correlation (Pearson's r) over the baseline deep neural network model and a 7% absolute improvement over the benchmark G.Boost model.
Analysis of the learnt attention weights showed that the joint model exhibited the desired behavior of weighing successful and failed turns unequally. However, the model was not calibrated to factor in users' preferences and biases, which we plan to address in future work.

RQ rating scale (Rating: Description):
1: Terrible (system fails to understand and fulfill user's request)
2: Bad (understands the request but fails to satisfy it in any way)
3: OK (understands the user's request and either partially satisfies the request or provides information on how the request can be fulfilled)
4: Good (understands and satisfies the user request, but provides more information than what the user requested or takes extra turns before meeting the request)
5: Excellent (understands and satisfies the user request completely and efficiently)