Supervised Domain Enablement Attention for Personalized Domain Classification

In large-scale domain classification for natural language understanding, leveraging each user's domain enablement information, which refers to the domains preferred or authenticated by the user, with an attention mechanism has been shown to improve the overall domain classification performance. In this paper, we propose a supervised enablement attention mechanism, which utilizes sigmoid activation for the attention weighting so that the attention can be computed with more expressive power, without the weight sum constraint of softmax attention. The attention weights are explicitly encouraged to be similar to the corresponding elements of the output one-hot vector, and self-distillation is used to leverage the attention information of the other enabled domains. By evaluating on actual utterances from a large-scale IPDA, we show that our approach significantly improves domain classification performance.


Introduction
Due to recent advances in deep learning techniques, intelligent personal digital assistants (IPDAs) such as Amazon Alexa, Google Assistant, Microsoft Cortana, and Apple Siri have been widely used as real-life applications of natural language understanding (Sarikaya et al., 2016; Sarikaya, 2017).
In natural language understanding, domain classification is the task of finding the most relevant domain given an input utterance (Tur and de Mori, 2011). For example, "make a lion sound" and "find me an apple pie recipe" should be classified as ZooKeeper and AllRecipe, respectively. Recent IPDAs cover more than several thousand diverse domains by including third-party developed domains such as Alexa Skills (Kumar et al., 2017; Kim et al., 2018a; Kim and Kim, 2018), Google Actions, and Cortana Skills, which makes domain classification a more challenging task.
Given a large number of domains, leveraging the user's enabled domain information has been shown to improve the domain classification performance, since enabled domains reflect the user's context in terms of domain usage (Kim et al., 2018b). (Enabled domain information specifically refers to preferred or authenticated domains in Amazon Alexa, but it can be extended to other information such as the list of recently used domains.) For an input utterance, Kim et al. (2018b) use an attention mechanism so that a weighted sum of the enabled domain vectors is used as an input signal in addition to the utterance vector. The enabled domain vectors and the attention weights are automatically trained in an end-to-end fashion to be helpful for the domain classification.
In this paper, we propose a supervised enablement attention mechanism for more effective attention on the enabled domains. First, we use logistic sigmoid instead of softmax as the attention activation function to relax the constraint that the weight sum over all the enabled domains is 1 to the constraint that each attention weight is between 0 and 1 regardless of the other weights (Martins and Astudillo, 2016; Kim et al., 2017). Therefore, all the attention weights can be very low if no enabled domain is relevant to the ground-truth, so that we can disregard the irrelevant enabled domains, and multiple attention weights can have high values when multiple enabled domains are helpful for disambiguating an input utterance. Second, we encourage each attention weight to be high if the corresponding enabled domain is a ground-truth domain and to be low otherwise, by a supervised attention method (Mi et al., 2016), so that the attention weights can be directly tuned for the downstream classification task. Third, we apply self-distillation (Furlanello et al., 2018) on top of the enablement attention weights so that we can better utilize the enabled domains that are not ground-truth domains but still relevant.
Evaluating on datasets obtained from real usage in a large-scale IPDA, we show that our approach significantly improves domain classification performance by utilizing the domain enablement information effectively.

Model
Figure 1 shows the overall architecture of the proposed model.
Given an input utterance, each word of the utterance is represented as a dense vector through word embedding followed by bidirectional long short-term memory (BiLSTM) (Graves and Schmidhuber, 2005). Then, an utterance vector is composed by concatenating the last outputs of the forward LSTM and the backward LSTM. (We have also evaluated word vector summation, CNN (Kim, 2014), BiLSTM mean-pooling, and BiLSTM max-pooling (Conneau et al., 2017) as alternative utterance representation methods, but they did not show better performance on our task.)
To represent the domain enablement information, we obtain a weighted sum of the domain enablement vectors, where the weights are calculated by the logistic sigmoid function on top of the multiplicative attention (Luong et al., 2015) for the utterance vector and the domain enablement vectors. The attention weight of an enabled domain $e$ is formulated as follows:
$$a_e = \sigma(u \cdot v_e)$$
where $u$ is the utterance vector, $v_e$ is the enablement vector of enabled domain $e$, and $\sigma$ is the sigmoid function. Compared to the conventional attention mechanism using the softmax function, which constrains the sum of the attention weights to be 1, sigmoid attention has more expressive power, since each attention weight can be between 0 and 1 regardless of the other weights. We show that using sigmoid attention is actually more effective for improving prediction performance in Section 3.
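The contrast between the two activations can be illustrated with a minimal sketch in plain Python. The vectors below are made-up toy values, and the helper names (`sigmoid_attention`, `softmax_attention`, `weighted_sum`) are hypothetical; the attention score is assumed to be a plain dot product between the utterance vector and each enablement vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid_attention(u, enabled_vectors):
    # Each weight lies in (0, 1) independently of the other weights.
    return [sigmoid(dot(u, v)) for v in enabled_vectors]

def softmax_attention(u, enabled_vectors):
    # Weights are forced to sum to 1 over the enabled domains.
    scores = [dot(u, v) for v in enabled_vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    # Weighted sum of the enablement vectors, one value per dimension.
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

# Toy case: both enabled domains are irrelevant to the utterance.
u = [1.0, -1.0]
enabled = [[-3.0, 3.0], [-4.0, 2.0]]
sig = sigmoid_attention(u, enabled)
soft = softmax_attention(u, enabled)
enablement_signal = weighted_sum(sig, enabled)
```

In this toy case the sigmoid weights are both close to 0, so the enablement signal nearly vanishes, whereas softmax still distributes a total weight of 1 over the two irrelevant domains.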
The utterance vector and the weighted sum of the domain enablement vectors are concatenated to represent the utterance and the domain enablement as a single vector. Given the concatenated vector, a feed-forward neural network with a single hidden layer is used to predict the confidence score for each domain with the logistic sigmoid function.
One issue of the proposed architecture is that the domain enablement can be trained to be a very strong signal, so that in many cases one of the enabled domains would be predicted regardless of the relevance of the utterance to that domain. To reduce this prediction bias, during training we replace the correct enabled domains of an input utterance with randomly sampled enabled domains with 50% probability, so that the domain enablement is used as an auxiliary signal rather than a determining signal. During inference, we always use the correct enabled domains of the given utterances.
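The random replacement during training can be sketched as follows. This is only an illustration: the function name `training_enablement` is hypothetical, and drawing a random set of the same size as the true enabled set is an assumption, since the sampling distribution is not specified here.

```python
import random

def training_enablement(true_enabled, all_domains, rng, swap_prob=0.5):
    """Return the enabled domains to feed the model for one training example."""
    if rng.random() < swap_prob:
        # Assumption: the random set has the same size as the true set.
        return rng.sample(all_domains, len(true_enabled))
    return list(true_enabled)

all_domains = ["domain_%d" % i for i in range(100)]
true_enabled = ["domain_3", "domain_17", "domain_42"]
rng = random.Random(0)

trials = 2000
swapped = sum(
    training_enablement(true_enabled, all_domains, rng) != true_enabled
    for _ in range(trials)
)
swap_rate = swapped / trials

# At inference time the true enabled domains are always used.
inference_enabled = training_enablement(true_enabled, all_domains, rng, swap_prob=0.0)
```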
The main loss function of our model is formulated as the binary log loss between the confidence score and the ground-truth vector as follows:
$$L_m = -\sum_{i=1}^{n} \left[ y_i \log o_i + (1 - y_i) \log (1 - o_i) \right]$$
where $n$ is the number of all domains, $o$ is an $n$-dimensional confidence score vector from the model, and $y$ is an $n$-dimensional one-hot vector whose element corresponding to the position of the ground-truth domain is set to 1.
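As a small worked example of the binary log loss described above, the sketch below evaluates it for a toy three-domain case. The function name and the `eps` clamp (added only to avoid `log(0)`) are assumptions for illustration.

```python
import math

def binary_log_loss(o, y, eps=1e-12):
    """Sum of per-domain binary log losses between scores o and one-hot y."""
    return -sum(
        yi * math.log(max(oi, eps)) + (1 - yi) * math.log(max(1 - oi, eps))
        for oi, yi in zip(o, y)
    )

y = [1, 0, 0]                      # ground-truth domain is the first one
confident = [0.9, 0.1, 0.2]        # high score on the ground truth
mistaken = [0.1, 0.9, 0.2]         # high score on a wrong domain

loss_confident = binary_log_loss(confident, y)
loss_mistaken = binary_log_loss(mistaken, y)
```

Here `loss_confident` equals -(ln 0.9 + ln 0.9 + ln 0.8), about 0.434, and is much smaller than `loss_mistaken`.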

Supervised Enablement Attention
Attention weights are originally intended to be automatically trained in an end-to-end fashion (Bahdanau et al., 2015), but it has been shown that applying proper explicit supervision to the attention improves the downstream tasks such as machine translation given the word alignment and constituent parsing given annotations between surface words and nonterminals (Mi et al., 2016;Liu et al., 2016;Kamigaito et al., 2017).
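We hypothesize that if the ground-truth domain is one of the enabled domains, the attention weight for the ground-truth domain should be high, and the attention weights for the other enabled domains should be low. To apply this hypothesis in the model training as a supervised attention method, we formulate an auxiliary loss function as follows:
$$L_a = -\sum_{e \in E} \left[ y_e \log a_e + (1 - y_e) \log (1 - a_e) \right]$$
where $E$ is the set of enabled domains, $a_e$ is the attention weight for the enabled domain $e$, and $y_e$ is the element of the one-hot ground-truth vector $y$ corresponding to $e$.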
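A minimal sketch of such an auxiliary loss is given below, assuming it takes the form of a binary log loss over the enabled domains with the one-hot ground truth as the target. The function name, the dictionary representation, and the attention values are hypothetical.

```python
import math

def supervised_attention_loss(attention, ground_truth, eps=1e-12):
    """attention maps each enabled domain to its attention weight in (0, 1)."""
    loss = 0.0
    for domain, weight in attention.items():
        target = 1.0 if domain == ground_truth else 0.0
        loss -= target * math.log(max(weight, eps))
        loss -= (1.0 - target) * math.log(max(1.0 - weight, eps))
    return loss

attention = {"CryptoPrice": 0.8, "WeatherChannel": 0.1}

# Ground truth is enabled: its weight is pushed up, the other is pushed down.
loss_enabled = supervised_attention_loss(attention, "CryptoPrice")

# Ground truth is not enabled: every enabled domain gets a target of 0.
loss_not_enabled = supervised_attention_loss(attention, "Expedia")
```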

Self-Distilled Attention
One issue of supervised attention in Section 2.1 is that enabled domains that are not ground-truth domains are encouraged to have lower attention weights regardless of their relevance to the input utterances and the ground-truth domains. Distillation methods utilize not only the ground-truth but also all the output activations of a source model so that all the prediction information from the source model can be utilized for more effective knowledge transfer between the source model and the target model (Hinton et al., 2014). Self-distillation, which trains a model leveraging the outputs of a source model with the same architecture or capacity, has been shown to improve the target model's performance (Furlanello et al., 2018).
We use a variant of self-distillation, where the model outputs at the previous epoch with the best dev set performance are used as the soft targets for the distillation (footnote 4), so that the enabled domains that are not ground-truths can also be used for the supervised attention. While conventional distillation methods utilize softmax activations as the target values, we show that distillation on top of sigmoid activations is also effective without loss of generality. The loss function for the self-distillation on the attention weights is formulated as follows:
$$L_d = -\sum_{e \in E} \left[ \tilde{a}_e \log a_e + (1 - \tilde{a}_e) \log (1 - a_e) \right]$$
where $\tilde{a}_e$ is the attention weight of the model showing the best dev set performance in the previous epochs. It is formulated as:
$$\tilde{a}_e = \sigma\left( \frac{\tilde{u} \cdot \tilde{v}_e}{T} \right)$$
where $\tilde{u}$ and $\tilde{v}_e$ are the utterance vector and the enablement vector computed by that earlier model, and $T$ is the temperature for sufficient usage of all the attention weights as the soft target. In this work, we set $T$ to be 16, which shows the best dev set performance.
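The effect of the temperature on sigmoid soft targets can be sketched as follows. This assumes the soft target is the sigmoid of the earlier model's attention score divided by T; the scores are made-up values and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_targets(teacher_scores, temperature=16.0):
    # Dividing by the temperature pulls the targets toward 0.5, so weights
    # that would otherwise saturate near 0 or 1 still carry information.
    return [sigmoid(s / temperature) for s in teacher_scores]

def distillation_loss(student_weights, targets, eps=1e-12):
    # Binary log loss of the current attention weights against soft targets.
    return -sum(
        t * math.log(max(a, eps)) + (1 - t) * math.log(max(1 - a, eps))
        for a, t in zip(student_weights, targets)
    )

teacher_scores = [8.0, -8.0]       # attention scores from the earlier model
unscaled = [sigmoid(s) for s in teacher_scores]
soft = soft_targets(teacher_scores)

loss_matching = distillation_loss(soft, soft)
loss_saturated = distillation_loss(unscaled, soft)
```

With T = 16 the targets become roughly 0.62 and 0.38 instead of being saturated near 1 and 0, and the loss is smallest when the current attention weights match the soft targets.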
We have also evaluated soft-target regularization (Aghajanyan, 2017), where a weighted sum of the hard ground-truth target vector and the soft target vector is used as a single target vector, but it did not show better performance than self-distillation.
All the described loss functions are added to compose a single loss function as follows:
$$L = L_m + \alpha \left( (1 - \beta_t) L_a + \beta_t L_d \right)$$
where $L_m$ is the main loss, $L_a$ is the supervised attention loss, $L_d$ is the self-distillation loss, $\alpha$ is a coefficient representing the degree of supervised enablement attention, and $\beta_t$ denotes the degree of the self-distillation. We set $\alpha$ to be 0.01 in this work. Following Hu et al. (2016), $\beta_t = 1 - 0.95^t$, where $t$ denotes the current training epoch starting from 0, so that the hard ground-truth targets are more influential in the early epochs and the self-distillation is more utilized in the late epochs.
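The schedule for the self-distillation weight can be checked numerically with the short sketch below. The `beta` function follows the stated schedule; the `total_loss` combination is an assumption (an interpolation between the supervised and self-distilled attention losses, scaled by α) and should be read as one plausible form rather than a confirmed one.

```python
def beta(t, decay=0.95):
    """Degree of self-distillation at epoch t (t starts from 0)."""
    return 1.0 - decay ** t

def total_loss(main, supervised, distilled, t, alpha=0.01):
    # Assumed combination: hard targets dominate early, soft targets later.
    b = beta(t)
    return main + alpha * ((1.0 - b) * supervised + b * distilled)

schedule = [beta(t) for t in range(25)]   # 25 training epochs
first_epoch_loss = total_loss(1.0, 2.0, 4.0, t=0)
```

The weight starts at 0 in the first epoch, is 0.05 in the second, and reaches about 0.71 by the 25th epoch.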

Experiments
We evaluate our proposed model on domain classification leveraging enabled domains. The enabled domains can be a crucial disambiguating signal, especially when there are multiple similar domains. For example, assume that the input utterance is "what's the weather" and there are multiple weather-related domains such as NewYorkWeather, AccuWeather, and WeatherChannel.
In this case, if WeatherChannel is included as an enabled domain of the current user, it is likely to be the most relevant domain to the user.

Datasets
Following the data collection methods used in Kim et al. (2018b), our models are trained using utterances with explicit invocation patterns. For example, given a user's utterance, "Ask {ZooKeeper} to {play peacock sound}," "play peacock sound" and ZooKeeper are extracted to compose a pair of the utterance and the ground-truth, respectively. In this way, we have generated train, development, and test sets containing 4.4M, 500K, and 500K utterances, respectively. All the utterances are from the usage log of Amazon Alexa and the ground-truth of each utterance is one of 1K frequently used domains. The average number of enabled domains per utterance in the test sets is 8.47.
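The extraction of an (utterance, ground-truth) pair from an explicit invocation can be sketched with a single regular expression. The pattern below covers only the "Ask {domain} to {utterance}" form from the example; the actual set of invocation patterns is not given here, so the pattern and the function name `extract_pair` are illustrative assumptions.

```python
import re

# Hypothetical pattern for one invocation form: "ask <domain> to <utterance>".
INVOCATION = re.compile(
    r"^ask\s+(?P<domain>.+?)\s+to\s+(?P<utterance>.+)$", re.IGNORECASE
)

def extract_pair(text):
    """Return (utterance, ground_truth_domain), or None if no pattern matches."""
    match = INVOCATION.match(text.strip())
    if match is None:
        return None
    return match.group("utterance"), match.group("domain")

pair = extract_pair("Ask ZooKeeper to play peacock sound")
no_pair = extract_pair("what's the weather")
```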
One issue of these collected data sets is that the ground-truth is included in the enabled domains for more than 90% of the utterances, so the ground-truths are biased toward enabled domains. For more correct and unbiased evaluation of the models on input utterances from real live traffic, we also evaluate the models on same-sized train, development, and test sets where the utterances are sampled so that the ratio of ground-truth inclusion in the enabled domains is 70%, which is closer to the ratio for actual input traffic.
4 In contrast to Laine and Aila (2017), we just leverage the model outputs at the previous epoch rather than accumulating the outputs over multiple epochs.

Results
Table 1 shows the accuracies of our proposed models on the two test sets. We also show mean reciprocal rank (MRR) and top-3 accuracy, which are meaningful when a post-reranker is used, but we do not cover reranking issues in this paper (Robichaud et al., 2014; Kim et al., 2018a).
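For reference, the three metrics can be computed from per-domain confidence scores as in the sketch below. The scores and domain names are toy values, and the function names are hypothetical.

```python
def rank_of(scores, truth):
    """1-based rank of the ground-truth domain when sorted by confidence."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(truth) + 1

def evaluate(examples, k=3):
    """examples is a list of (scores_by_domain, ground_truth_domain) pairs."""
    ranks = [rank_of(scores, truth) for scores, truth in examples]
    total = len(ranks)
    return {
        "accuracy": sum(r == 1 for r in ranks) / total,
        "mrr": sum(1.0 / r for r in ranks) / total,
        "top_k": sum(r <= k for r in ranks) / total,
    }

scores = {"AccuWeather": 0.9, "WeatherChannel": 0.5,
          "NewYorkWeather": 0.1, "ZooKeeper": 0.05}
examples = [
    (scores, "AccuWeather"),      # rank 1
    (scores, "WeatherChannel"),   # rank 2
    (scores, "ZooKeeper"),        # rank 4
]
metrics = evaluate(examples)
```

For these three examples the ranks are 1, 2, and 4, giving an accuracy of 1/3, an MRR of (1 + 1/2 + 1/4) / 3, and a top-3 accuracy of 2/3.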
From Table 1, we can first see that changing softmax attention to sigmoid attention significantly improves the performance. This means that having more expressive power for the domain enablement information by relaxing the softmax constraint is effective in terms of leveraging the domain enablement information for domain classification. Along with sigmoid attention, supervised attention leveraging the ground-truth slightly improves the performance, and supervised attention combined with self-distillation shows significant performance improvement. This demonstrates that supervised domain enablement attention leveraging ground-truth enabled domains is helpful, and that utilizing attention information from other enabled domains is synergistic.
The model of Kim et al. (2018b) also adds a domain enablement bias vector to the final output, which is helpful when the ground-truth domain is one of the enabled domains. Such models, (5) and (6), also show good performance for the test set where the ground-truth is one of the enabled domains with more than 90% probability. However, for the unbiased test set where the ground-truth is included in the enabled domains with a smaller probability, not adding the bias vector is shown to be better overall.
Table 2 shows sample utterances correctly predicted with model (4) but not with models (1) and (2). For the first two utterances, the ground-truths are included in the enabled domains, but there were only hundreds or fewer training instances whose ground-truths are CryptoPrice or Expedia. In these cases, we can see that model (1) attends to unrelated domains and model (2) attends to none of the enabled domains, but model (4), which uses supervised attention, attends to the ground-truth even without many training examples. The utterance "find my phone" has a single enabled domain, which is not the ground-truth. In this case, model (1) still fully attends to the unrelated domain because of softmax attention, while models (2) and (4) do not attend to it strongly, so that the unrelated enabled domain has little impact.

Implementation Details
The word vectors are initialized with off-the-shelf GloVe vectors (Pennington et al., 2014), and all the other model parameters are initialized with Xavier initialization (Glorot and Bengio, 2010). Each model is trained for 25 epochs and the parameters showing the best performance on the development set are chosen as the model parameters. We use ADAM (Kingma and Ba, 2015) for the optimization with an initial learning rate of 0.0002 and a mini-batch size of 128. We use gradient clipping, where the threshold is set to 5. We use a variant of LSTM, where the input gate and the forget gate are coupled and peephole connections are used (Gers and Schmidhuber, 2000; Greff et al., 2017). We also use variational dropout for the LSTM regularization (Gal and Ghahramani, 2016). All the models are implemented with DyNet (Neubig et al., 2017).

Conclusion
We have introduced a novel domain enablement attention mechanism that improves domain classification performance by utilizing domain enablement information more effectively. The proposed attention mechanism uses sigmoid attention for more expressive power of the attention weights, supervised attention leveraging ground-truth information for explicit guidance of the attention weight training, and self-distillation for attention supervision leveraging enabled domains that are not ground-truth domains. Evaluating on utterances from real usage in a large-scale IPDA, we have demonstrated that our proposed model significantly improves domain classification performance by better utilizing domain enablement information.

Table 2: Sample utterances correctly predicted with model (4) but not with models (1) and (2).