Optimizing NLU Reranking Using Entity Resolution Signals in Multi-domain Dialog Systems

In dialog systems, the Natural Language Understanding (NLU) component typically makes the interpretation decision (including domain, intent and slots) for an utterance before the mentioned entities are resolved. This may result in intent classification and slot tagging errors. In this work, we propose to leverage Entity Resolution (ER) features in NLU reranking and introduce a novel loss term based on ER signals to better learn model weights in the reranking framework. In addition, for a multi-domain dialog scenario, we propose a score distribution matching method to ensure scores generated by the NLU reranking models for different domains are properly calibrated. In offline experiments, we demonstrate our proposed approach significantly outperforms the baseline model on both single-domain and cross-domain evaluations.


Introduction
In spoken dialog systems, natural language understanding (NLU) typically includes domain classification (DC), intent classification (IC), and named entity recognition (NER) models. After NER extracts entity mentions, an Entity Resolution (ER) component is used to resolve the ambiguous entities. For example, NLU interprets an utterance to Alexa (or Siri) such as "play hello by adele" as belonging to the 'Music' domain and 'play music' intent, and labels "hello" as a song name and "adele" as an artist name. ER queries are then formulated based on such a hypothesis to retrieve entities from music catalogs. Oftentimes, NLU generates a list of hypotheses for DC, IC, and NER, and a reranking model then uses various confidence scores to rerank these candidates (Su et al., 2018).
Since ER is performed after the NLU models, the current NLU interpretation of the utterance is limited to the raw text rather than its underlying entities. Even in NLU reranking (Su et al., 2018), only DC, IC, and NER confidence scores were used, and as a result, the top hypothesis picked by NLU reranking might not be the best interpretation of the utterance. For example, in the absence of entity information, "the beatles" in the utterance "play with the beatles" is interpreted as an artist name. If the reranker could search the ER catalog, it would promote the hypothesis that has "with the beatles" as an album name. Such NLU errors may propagate to ER and downstream components and potentially lead to end-customer friction.

* The first two authors have equal contribution.
In this work, we thus propose to incorporate ER features in the NLU reranking model, called NLU-ER reranking. For each domain, we use its corresponding catalogs to extract entity-related features for that domain's NLU reranking. To enhance ER feature learning, we add a novel loss term for the case where an NER hypothesis cannot be found in the catalog. An additional challenge arises in multi-domain systems. In large-scale NLU systems, one design approach is to modularize the system around the concept of domains (such as Music, Video, Smart Home), where each domain has its own NLU (DC, IC, NER) and reranking models that are trained independently. Under this scheme, each domain's NLU reranking plays an important role in both in-domain and cross-domain reranking, since it not only ranks hypotheses within a domain to promote the correct hypothesis, but also produces ranking scores that need to be comparable across all domains. In (Su et al., 2018), the scores for the hypotheses from different domains are calibrated through training on the same utterance data with similar models. However, we may only use NLU-ER reranking for some domains (due to reasons such as lack of an entity catalog, different production launch schedules, etc.), and the scores from such rerankers may no longer be comparable with those of domains using the original reranker model. To mitigate this issue, we introduce a score distribution matching method to adjust the score distributions.
We evaluate our NLU-ER reranking model on multiple data sets, including synthetic and real dialog data, in both single-domain and cross-domain setups. Our results show improved NLU performance compared to the baseline, and the improvement is attributable to our proposed ER features, loss term, and score matching method.

Related Work
Early reranking approaches in NLU systems use a single reranker for all domains. Robichaud et al. (Robichaud et al., 2014) proposed a system for multi-domain hypothesis ranking (HR) that uses the LambdaMART algorithm (Burges et al., 2007) to train a ranking system. The features in the ranking system include confidence scores for intents and slots, relevant database hits, and contextual features that embed relationships to previous utterances. The authors showed improved accuracy in top domains using both non-contextual and contextual features. Crook et al. adapted a similar reranking scheme for multi-language hypothesis ranking (Crook et al., 2015). The set of features in their reranker includes binary presence variables, for example presence of an intent, coverage of tagged entities, and contextual features. They adapted the LambdaMART algorithm to train a Gradient Boosted Decision Trees model (Friedman, 2001) for cross-language hypothesis ranking, and demonstrated performance of the cross-language reranker comparable to the language-specific reranker. These models did not explicitly use ER signals for reranking. In addition, reranking is done across domains; such a single-reranker approach is not practical in NLU systems with a large set of independent domains. In contrast, our approach emphasizes domain independence, allowing reranking to be performed for each domain independently. Furthermore, we rely on the ER signal as a means to improve reranking.
To the best of our knowledge, the work most related to ours is Su et al. (Su et al., 2018), which proposed a re-ranking scheme to maximize the accuracy of the top hypothesis while maintaining the independence of different domains through implicit calibration. Each domain has its own NLU reranker, and the scores for the hypotheses from reranking are compared across all domains to pick the best hypothesis. The feature vector for each reranker is composed of intent, domain, and slot tagging scores from the corresponding domain. Additionally, a cross-entropy loss term is used to ensure calibration across domains. In a series of experiments, they demonstrated improvements in semantic understanding. Our work is an extension of that work, as we utilize ER signals in addition to the DC, IC, and NER scores, and introduce a new loss term to improve reranking accuracy.
To resolve the score comparability problem in a multi-domain system, traditional calibration methods utilize Platt Scaling or Isotonic Regression to calibrate the prediction distribution into a uniform distribution (Elkan, 2001, 2002; Platt et al., 1999; Niculescu-Mizil and Caruana, 2005; Wilks, 1990). However, this does not work in our scenario, since the data in different domains are imbalanced, which causes high-traffic domains to have lower confidence scores. Instead of using probability calibration methods, we propose a solution based on power transformation to match the prediction score distribution back to the original score distribution, thus making the scores comparable even after ER information is added to NLU reranking.

Reranking Model
The baseline NLU reranking model is implemented as a linear function that predicts the ranking score from DC, IC, and NER confidence scores. We augment its feature vector using ER signals and introduce a novel loss term that penalizes the hypotheses that do not have a matched entity in the catalog. Similar to (Su et al., 2018), we tested using a neural network model for reranking, but observed no improvements, therefore we focus on the linear model.

ER Features in Reranking
The features used in the baseline NLU reranker include the scores of the DC (d), IC (i), and NER (n) hypotheses, plus ASR scores obtained from upstream components; these are used for all domains. The additional ER features used in the NLU-ER reranker are extracted and computed from the ER system, and can be designed differently for individual domains. For example, in this work, the ER features for the Music domain are aggregated from NER slot types such as SongName and ArtistName, and are defined as follows:

ER success e_{s_i}: if a hypothesis contains a slot s_i that is successfully matched by any of the ER catalogs, this feature is set to 1, otherwise 0. The ER success feature serves as a positive signal to promote the corresponding hypothesis score.
ER no match m_{s_i}: if a slot s_i in a hypothesis does not have any matched entities in the ER catalogs, this feature value is 1, otherwise 0. The ER no match feature serves as a negative signal to penalize the hypothesis score. We find 'ER no match' is a stronger signal than 'ER success', because over 90% of the time an ER no match implies the corresponding hypothesis does not agree with the ground truth.
Similarity feature l_{s_i}: this feature is nonzero only if the ER success feature e_{s_i} is 1. In each catalog, a lexical or semantic similarity score between the slot value and every resolved entity is computed, and the maximum score among them is selected as the feature value. This indicates the confidence of the ER success signal.
Not in Gazetteer: this feature is set to 1 when no ER features for the hypothesis are present in the gazetteer (neither ER success nor ER no match), otherwise 0. We discuss the gazetteer in the next section.
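To make the feature definitions concrete, the per-slot extraction can be sketched as below. This is a minimal illustration, not the production system: the catalog is modeled as a plain set of entity names, an exact lexical match stands in for ER success, and `difflib.SequenceMatcher` stands in for whatever similarity function the ER system actually uses.

```python
from difflib import SequenceMatcher

def er_features(slot_value, catalog):
    """Compute the per-slot ER features (er_success, er_no_match, similarity).

    `catalog` is a hypothetical set of resolved entity names cached in the
    gazetteer.  An empty catalog corresponds to the Not-in-Gazetteer case,
    where all three features are zero.
    """
    if not catalog:                       # slot type absent from gazetteer
        return 0.0, 0.0, 0.0
    # Lexical similarity against every catalog entity; keep the maximum.
    sims = [SequenceMatcher(None, slot_value.lower(), e.lower()).ratio()
            for e in catalog]
    best = max(sims)
    if best == 1.0:                       # exact match: ER success
        return 1.0, 0.0, best             # similarity only set on success
    return 0.0, 1.0, 0.0                  # no matched entity: ER no match
```

For instance, the slot value "the beatles" fails against a catalog containing only "With The Beatles", which yields the ER no match signal that down-ranks the artist-name hypothesis in the earlier example.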

ER Gazetteer Selection
Since NLU and reranking happen before ER, retrieving ER features from large catalogs at runtime for NLU reranking is not trivial. Therefore, we propose to cache the ER signals offline and make them accessible to NLU reranking in the form of a gazetteer. To make the best use of the allocated runtime memory, we design a gazetteer selection algorithm that includes the most relevant and effective ER features in the gazetteer.
We define the Frequent Utterance Database (FUD) as the live traffic data where the same utterance has been spoken by more than 10 unique customers. To formalize the selection procedure, we define outperforming and underperforming utterances by friction rate f_r (e.g., the request cannot be handled) and 30s playback rate q_r (playback ≥ 30s). For all FUD utterances in a given period, an utterance u is defined as outperforming if f_r(u) ≤ μ_{f_r} − λ_1 · σ_{f_r} and q_r(u) ≥ μ_{q_r} + λ_2 · σ_{q_r}, where μ and σ are the mean and standard deviation, and λ_1 and λ_2 are hyperparameters. Underperforming utterances are defined likewise with the inequalities reversed.
The detailed gazetteer selection algorithm is described in Algorithm 1. u_{h_1}, ..., u_{h_n} denote the n-best NLU hypotheses of the utterance u. The idea is to encourage the successful hypotheses and avoid the friction hypotheses based on the historical data. For instance, if u is an underperforming utterance and u_{h_1} is ER_NO_MATCH, we want to penalize u_{h_1} to down-rank it, and promote other hypotheses u_{h_i} (i ≠ 1) that receive the ER_SUCCESS signal. For utterance hypotheses that are not selected in the gazetteer, we use the Not_in_gazetteer (NG) feature.
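The outperforming/underperforming split that drives this selection can be sketched as follows. This is a hypothetical rendering of the criterion above, not the authors' Algorithm 1: `fud` is assumed to map each frequent utterance to its observed friction rate and 30s-playback rate, and the λ thresholds are the hyperparameters from the text.

```python
from statistics import mean, stdev

def select_gazetteer(fud, lam1=1.0, lam2=1.0):
    """Split FUD utterances into outperforming / underperforming sets.

    `fud` maps utterance -> (friction rate f_r, 30s playback rate q_r).
    Outperforming: f_r at least lam1 std devs below the mean AND q_r at
    least lam2 std devs above the mean; underperforming is the mirror case.
    """
    frs = [v[0] for v in fud.values()]
    qrs = [v[1] for v in fud.values()]
    mu_fr, sd_fr = mean(frs), stdev(frs)
    mu_qr, sd_qr = mean(qrs), stdev(qrs)
    out, under = [], []
    for u, (fr, qr) in fud.items():
        if fr <= mu_fr - lam1 * sd_fr and qr >= mu_qr + lam2 * sd_qr:
            out.append(u)          # cache ER_SUCCESS signals for these
        elif fr >= mu_fr + lam1 * sd_fr and qr <= mu_qr - lam2 * sd_qr:
            under.append(u)        # cache ER_NO_MATCH signals for these
    return out, under
```

Hypotheses of utterances in neither set fall back to the Not_in_gazetteer feature at runtime.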

NLU-ER Reranker
For an utterance, the hypothesis score y is defined as:

y = W_G G + Σ_{s_i ∈ S} 1(s_i) W_{s_i} ER_{s_i} + 1_d w_d    (1)

The first part in (1) is the baseline NLU reranker model:

y = W_G G    (2)

where G = [g_1, g_2, ..., g_p]^T is the NLU general feature vector and W_G = [w_1, w_2, ..., w_p] is the corresponding weight vector. The rest of the features are ER related. 1 is the indicator function, S is the set of all slot types, ER_{s_i} = [er_1, er_2, ..., er_q]^T is the ER feature vector, and W_{s_i} = [w_{s_i,1}, w_{s_i,2}, ..., w_{s_i,q}] is the corresponding weight vector. If an utterance in Music only contains the SongName slot s_1, then y = W_G G + W_{s_1} ER_{s_1}, and the remaining terms are all zeros. If an utterance has no ER features for any of the defined slot types, y = W_G G + w_d; w_d serves as the default ER feature value for the reranker when no corresponding ER features are found in the gazetteer described above, and its value is also learned during model training.
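The scoring rule in Eq (1) can be sketched as a few lines of code. All names here are illustrative stand-ins for the learned parameters: `W_G` for the general weights, `W_slot` for the per-slot-type ER weights, and `w_default` for the learned default value w_d.

```python
def rerank_score(general, er_by_slot, W_G, W_slot, w_default):
    """Sketch of Eq (1).

    `general`    : NLU general feature vector G (DC/IC/NER/ASR scores).
    `er_by_slot` : maps slot types present in the gazetteer to their ER
                   feature vectors; empty when no ER features were found.
    """
    y = sum(w * g for w, g in zip(W_G, general))      # baseline term, Eq (2)
    if not er_by_slot:
        return y + w_default                          # default ER value w_d
    for slot, er_vec in er_by_slot.items():           # active slot types only
        y += sum(w * f for w, f in zip(W_slot[slot], er_vec))
    return y
```

With an empty `er_by_slot`, the score reduces to the baseline term plus w_d, matching the fallback case described above.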

Loss Function
We use SemER (Semantic Error Rate) (Su et al., 2018) to evaluate NLU performance. For a hypothesis, SemER is defined as E/T, where E is the total number of slot substitution, insertion, and deletion errors, and T is the total number of slots.
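A simplified computation of this E/T ratio over (slot type, value) pairs might look as follows. This is a sketch of the definition given here, not the official metric implementation (which also scores the intent): leftover slots that share a type are paired as substitutions, and the rest count as deletions or insertions.

```python
from collections import Counter

def semer(ref_slots, hyp_slots):
    """SemER = E / T over lists of (slot_type, value) pairs."""
    ref, hyp = Counter(ref_slots), Counter(hyp_slots)
    ref_left, hyp_left = ref - hyp, hyp - ref          # unmatched slots
    per_type = lambda c, t: sum(n for (t2, _), n in c.items() if t2 == t)
    types = {t for t, _ in ref_left} | {t for t, _ in hyp_left}
    # same type, different value -> substitution
    subs = sum(min(per_type(ref_left, t), per_type(hyp_left, t))
               for t in types)
    dels = sum(ref_left.values()) - subs               # missed slots
    ins = sum(hyp_left.values()) - subs                # spurious slots
    total = sum(ref.values())                          # T: reference slots
    return (subs + dels + ins) / total if total else 0.0
```

For example, tagging "hello" with the wrong song value while getting the artist right gives one substitution over two slots, i.e., SemER = 0.5.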
One choice of the loss function is a combination of the expected SemER loss and the expected cross-entropy loss (Su et al., 2018). The loss function L_u of an utterance is defined as:

L_u = k_1 · S_u + k_2 · C_u    (3)

where S_u is the expected SemER loss, S_u = Σ_{i=1}^{N} p_i × SemER_i with p_i = e^{y_i} / Σ_{j=1}^{N} e^{y_j}, C_u is the expected cross-entropy loss, C_u = −Σ_{i=1}^{N} [t_i log(e^{y_i}/(1+e^{y_i})) + (1−t_i) log(1/(1+e^{y_i}))] with t_i the binary correctness label of hypothesis i, and N is the number of hypotheses for utterance u.
Since our analysis showed that ER_NO_MATCH is a stronger signal and we expect the top hypothesis to get ER hits, we add a penalty term N u to the loss function to penalize the loss when the 1-best hypothesis gets ER_NO_MATCH.
Let r_j = max_i(r_i) be the best score in the current training step, with j the index of the current best hypothesis. The no-match loss term is then defined as:

N_u = e_j × r_j    (4)

where e_i = #(slot_er_no_match) / #(slot) is the ratio of slots with ER_NO_MATCH to all slots in the i-th hypothesis; if no slot gets ER_NO_MATCH, the loss term is zero. The overall loss function is then updated as:

L_u = k_1 · S_u + k_2 · C_u + k_3 · N_u    (5)

N_u penalizes most heavily a hypothesis that has a high score but gets no ER hits. k_1, k_2, k_3 are hyperparameters, and L_u is the final loss term for the NLU-ER reranker.
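The combined loss can be sketched numerically as below. This is a hedged reconstruction rather than the authors' implementation: S_u uses a softmax over hypothesis scores, C_u is the per-hypothesis sigmoid cross-entropy as in Su et al. (2018), and N_u multiplies the top hypothesis' no-match ratio by its score.

```python
import math

def reranker_loss(scores, semers, labels, no_match_ratios,
                  k1=0.01, k2=0.9, k3=0.1):
    """L_u = k1*S_u + k2*C_u + k3*N_u for one utterance's N hypotheses.

    scores          : reranker scores y_i
    semers          : SemER_i per hypothesis
    labels          : t_i, 1 for the correct hypothesis else 0
    no_match_ratios : e_i, fraction of slots with ER_NO_MATCH
    """
    # expected SemER under a softmax over hypothesis scores
    z = [math.exp(s) for s in scores]
    p = [x / sum(z) for x in z]
    S_u = sum(pi * e for pi, e in zip(p, semers))
    # per-hypothesis sigmoid cross-entropy for calibration
    sig = lambda s: 1.0 / (1.0 + math.exp(-s))
    C_u = -sum(t * math.log(sig(s)) + (1 - t) * math.log(1 - sig(s))
               for t, s in zip(labels, scores))
    # penalize a top-ranked hypothesis whose slots got ER_NO_MATCH
    j = max(range(len(scores)), key=lambda i: scores[i])
    N_u = no_match_ratios[j] * scores[j]
    return k1 * S_u + k2 * C_u + k3 * N_u
```

With the default k values from the experimental setup, an utterance whose top hypothesis hits ER_NO_MATCH incurs a strictly larger loss than the same utterance with ER hits, which is the intended down-ranking pressure.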
In our experiments, we observed that the learned weight is higher for the ER no match feature, and the model with the new loss term performed better in the in-domain setup, as expected. Also, giving a higher weight to 'ER no match' decreases the confidence scores generated by a domain's NLU-ER reranker, which can help with the cross-domain calibration problem. We discuss how to ensure comparable scores in the next section.

Score Distribution Matching
Before adding the ER features, the reranking scores are calibrated through training on the same utterance data with similar models. However, adding ER features to NLU reranking for a single domain may lead to scores that are incomparable with other domains. Using the loss function in Eq (3), we have the following theorem:

Theorem 4.1. Under the loss function in Eq (3), assume hypothesis 1 is the ground truth, 0 = SemER_1 < SemER_2 < SemER_3 < SemER_4 < SemER_5, and the uniform score assumption Σ_{j=1}^{5} e^{y_j} = c holds. Then Eq (1) will obtain a higher score for the positive-label hypothesis and a lower score for the negative-label hypotheses than Eq (2).
Proof. For the expected SemER loss S_u, since it is a linear combination of SemER_i, the solution of the minimization problem is p_1 → 1, p_2 = p_3 = p_4 = p_5 → 0, which leads to y_1 → ∞, y_2 = y_3 = y_4 = y_5 → −∞. For the expected cross-entropy loss C_u, let x_i = e^{y_i}; the minimization of C_u becomes:

min C_u = min [ −log(x_1 / (1 + x_1)) − Σ_{j=2}^{5} log(1 / (1 + x_j)) ] = min (−I_1 − I_2)

The first part (I_1) is monotonically increasing, while the second part (I_2) is monotonically decreasing for x_j > 0. This also leads to y_1 → ∞, y_2 = y_3 = y_4 = y_5 → −∞. Thus, solving the minimization problem min L_u is equivalent to solving the linear system:

F_+ w = y · 1_+,  F_− w = −y · 1_−,  as y → ∞    (6)

associated with the loss in Eq (3), where F_+ is the feature matrix for the positive labels, F_− is the feature matrix for the negative labels, w is the weight vector we need to solve for, and 1_+, 1_− are all-ones vectors with the same dimension as the number of positive and negative samples, respectively. We can rewrite Eq (6) as F w = ỹ, and its solution is the projection, associated with the loss in Eq (3), of ỹ onto the space spanned by the column vectors of the matrix F. Define this projection as P_F(ỹ). For the feature matrix of the NLU model in Eq (2) we have F_N = G, and for the feature matrix of the NLU-ER model in Eq (1) we have F_ER = [G, ER_{s_1}, ER_{s_2}, ..., ER_{s_q}, 1_default]. Since F_N is a submatrix of F_ER, we have span F_N ⊂ span F_ER, so the projection onto the larger space lies closer to the target ỹ.

Theorem 4.1 shows that a candidate hypothesis from the more complicated model is likely to receive a higher score than hypotheses from domains using the original reranker model. Thus the domains using the NLU-ER reranker are no longer comparable to the domains using the original model. We observed this empirically in our experiments: when experimenting with the Music domain alone, it generates higher confidence scores and produces more false positives.
To solve this problem, since we would like the confidence scores for each domain to have stabilized variance and minimized skewness, we propose to use a power transformation, which can map data from any distribution to an approximately standard Gaussian distribution. In our case, the confidence scores from Eq (1) might be zero or negative, so we use the Yeo-Johnson transformation, which handles the special cases λ = 0 and λ = 2:

ψ(y, λ) = ((y + 1)^λ − 1) / λ,                y ≥ 0, λ ≠ 0
        = log(y + 1),                         y ≥ 0, λ = 0
        = −((1 − y)^{2−λ} − 1) / (2 − λ),     y < 0, λ ≠ 2
        = −log(1 − y),                        y < 0, λ = 2    (7)

We have the inverse function:

ψ^{−1}(x, λ) = (λx + 1)^{1/λ} − 1,            x ≥ 0, λ ≠ 0
             = e^x − 1,                       x ≥ 0, λ = 0
             = 1 − (1 − (2 − λ)x)^{1/(2−λ)},  x < 0, λ ≠ 2
             = 1 − e^{−x},                    x < 0, λ = 2    (8)

where the parameter λ is determined through maximum likelihood estimation. The idea is to first map both the NLU reranker scores and the NLU-ER reranker scores to a standard Gaussian distribution, obtaining λ_NLU and λ_NLU-ER. To calibrate a new score from the NLU-ER reranker, we first use Eq (7) with λ = λ_NLU-ER to transform the score into a standard Gaussian score, then apply Eq (8) with λ = λ_NLU to transform the standard Gaussian score back into the original NLU reranker score distribution. Notice that for λ > 0, both Eq (7) and Eq (8) are monotonic functions, so the mapping changes the score distribution while preserving the in-domain ranking order.
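The forward and inverse transforms of Eqs (7)-(8) and the resulting score mapping can be sketched as below. This sketch omits the standardization step (centering and scaling to unit variance) for brevity and takes the λ values as given rather than fitting them by maximum likelihood; in practice something like scikit-learn's PowerTransformer with method='yeo-johnson' would estimate them.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson transform, Eq (7); covers the lam = 0 / lam = 2 cases."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1) ** lam - 1) / lam
    return (-math.log1p(-y) if lam == 2
            else -(((-y + 1) ** (2 - lam) - 1) / (2 - lam)))

def yeo_johnson_inv(x, lam):
    """Inverse transform, Eq (8)."""
    if x >= 0:
        return math.expm1(x) if lam == 0 else (lam * x + 1) ** (1 / lam) - 1
    return (-math.expm1(-x) if lam == 2
            else 1 - (1 - (2 - lam) * x) ** (1 / (2 - lam)))

def match_score(score, lam_er, lam_nlu):
    """Map an NLU-ER reranker score into the baseline NLU score
    distribution: forward transform with lam_er, inverse with lam_nlu."""
    return yeo_johnson_inv(yeo_johnson(score, lam_er), lam_nlu)
```

Because both functions are monotonic, `match_score` preserves the in-domain ranking order while shifting the score distribution, which is exactly the property the calibration argument relies on.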

Experimental Setup
We use the following data sets for training and evaluation: Annotation Data (AD): contains around 1 million annotated utterances from internal traffic, with a 50:50 training/testing split. For testing, we evaluate two conditions: (i) 'AD All', using utterances from all domains for cross-domain evaluation; (ii) 'AD Music', 'AD Video', and 'AD LS', using utterances from the Music, Video, and Local Search domains, respectively, for in-domain evaluation.
Synthetic Data (SD): synthetically generated ambiguous utterances used for in-domain evaluation. For the Music and Video domains, utterances are of the form "play X", where the slot type of X can be ArtistName, SongName, AlbumName, VideoName, etc. X is an actual entity sampled from the corresponding ER song, video, artist, or album catalogs, and it is not in the training data, so the model cannot infer the slot by simply "memorizing" it from the training data. We only report SongName (10K utterances) results for the Music domain and VideoName results for the Video domain, due to space limitations. For the Local Search domain, utterances are of the form "give me the direction to X", where the slot type of X can be PlaceName, DestinationName, etc. Note this data set is more ambiguous than the real-traffic data above, in that "X" has multiple interpretations, whereas in real traffic users often add other words to help disambiguate, for example 'play music ...'.
We initialize the general feature weights to the same weights used in the baseline model. ER feature weights are initialized to smaller values (one third of the general feature weights). We find the expected SemER loss is less effective, so we set k_1 = 0.01, k_2 = 0.9, k_3 = 0.1. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001 and train the model for 10 epochs.

Results
Table 1 presents the NLU-ER reranker results for the cross-domain (AD All) and in-domain (AD Music, SD) settings. All results are relative SemER improvements over the baseline reranker. We use the DC, IC, and NER scores as the general NLU features. The NLU-ER reranker uses the additional ER features: ER success, ER no match, and lexical similarity for different slot types, with the gazetteer selection algorithm applied to retrieve the ER features. For the in-domain results, the NLU-ER reranker shows statistically significant improvement on both AD and SD. The improvement is more substantial on SD, over 20%, indicating that ER features are more helpful when utterances are ambiguous. Note there is some degradation in the cross-domain results on AD All when NLU-ER is used, due to the non-comparable score issue. After adding the loss term for the ER no match feature, we observed additional improvements in both the in-domain and cross-domain settings.
As discussed earlier, because the scores from the baseline model are already well calibrated across domains, we use the Yeo-Johnson transformation to match each domain's score distribution back to the baseline score distribution. For the Music domain, maximum likelihood estimation gives λ_NLU = 1.088 and λ_NLU-ER = 1.104. With these two estimates, we map NLU-ER reranker scores back into the baseline reranker score distribution. Using this updated score, the cross-domain SemER is improved by 0.32% relative. Among the improved cases, the number of false-positive utterances decreases by 7.37% relative. For comparison, we also trained a univariate neural network regression model to predict the original reranker score from the NLU-ER reranker score. Although this method also yields improvements, the power transformation performs better and is easier to implement. Note again that the in-domain performance remains the same, since these score mapping approaches do not affect the in-domain ranking order. We performed the same experiments for the Video and Local Search domains and observed similar results.
To illustrate the effectiveness of our proposed NLU-ER reranker and analyze the reasons for performance improvement, we compare the generated 1-best hypothesis from the baseline model with our new reranker. For utterance "play hot chocolate by polar express", the correct type for "polar express" is album. The baseline model predicts "polar express" as an artist because it is not in the training set, and "Song by Artist" appears more frequently than "Song by Album". However, our model successfully selected this hypothesis ("polar express" is an album), since ER_SUCCESS signal is found from the ER album catalog but ER_NO_MATCH is found from ER artist catalog. Similarly, in another example "play a sixteen z" where "a sixteen z" is ambiguous and not in the training set, the baseline model predicts it as a song since utterances with SongName slot have higher frequency in the training data, whereas our model can correctly select ProgramName as the 1-best hypothesis using ER signals.

Conclusion
In this work, we proposed a framework to incorporate ER information in NLU reranking. We developed a new feature vector for the domain reranker by utilizing entity resolution features such as hits or no hits. To provide the ER features to the NLU reranker, we proposed an offline solution that distills the ER signals into a gazetteer. We also introduced a novel loss term based on ER signals to discourage the domain reranker from promoting hypotheses with ER no match and showed that it leads to better model performance. Finally, since domain rerankers predict the ranking scores independently, we introduced a score matching method to transform the NLU-ER model score distribution to make the final scores comparable across domains. Our experiments demonstrated that the Music domain reranker performance is significantly improved when ER information is incorporated in the feature vector. Also with score calibration, we achieve moderate gain for the cross-domain scenario.