Efficient Large-Scale Domain Classification with Personalized Attention

In this paper, we explore the task of mapping spoken language utterances to one of thousands of natural language understanding domains in intelligent personal digital assistants (IPDAs). This scenario is observed for many mainstream IPDAs in industry that allow third parties to develop thousands of new domains to augment built-in ones to rapidly increase domain coverage and overall IPDA capabilities. We propose a scalable neural model architecture with a shared encoder, a novel attention mechanism that incorporates personalization information and domain-specific classifiers that solves the problem efficiently. Our architecture is designed to efficiently accommodate new domains that appear in-between full model retraining cycles with a rapid bootstrapping mechanism two orders of magnitude faster than retraining. We account for practical constraints in real-time production systems, and design to minimize memory footprint and runtime latency. We demonstrate that incorporating personalization results in significantly more accurate domain classification in the setting with thousands of overlapping domains.


Introduction
Intelligent personal digital assistants (IPDAs) are one of the most advanced and successful applications that have spoken language understanding (SLU) or natural language understanding (NLU) capabilities. Many IPDAs have recently emerged in industry including Amazon Alexa, Google Assistant, Apple Siri, and Microsoft Cortana (Sarikaya, 2017). IPDAs have traditionally supported only tens of well-separated domains, each defined in terms of a specific application or functionality such as calendar and local search (Tur and de Mori, 2011). To rapidly increase domain coverage and extend capabilities, some IPDAs have released Software Development Toolkits (SDKs) to allow third-party developers to promptly build and integrate new domains, which we refer to as skills henceforth. Amazon's Alexa Skills Kit (Kumar et al., 2017a), Google's Actions and Microsoft's Cortana Skills Kit are all examples of such SDKs. Alexa Skills Kit is the largest of these services and hosts over 25,000 skills.
For IPDAs, finding the most relevant skill to handle an utterance is an open scientific and engineering challenge for three reasons. First, the sheer number of potential skills makes the task difficult. Unlike traditional systems that have on the order of 10-20 built-in domains, large-scale IPDAs can have 1,000-100,000 skills. Second, the number of skills rapidly expands with 100+ new skills added per week compared to 2-4 built-in domain launches per year in traditional systems. Large-scale IPDAs should be able to accommodate new skills efficiently without compromising performance. Third, unlike traditional built-in domains that are carefully designed to be disjoint, skills can cover overlapping functionalities. For instance, there are over 50 recipe skills in Alexa that can handle recipe-related utterances.
One simple solution to this problem has been to require an utterance to explicitly mention a skill name and follow a strict invocation pattern as in "Ask {Uber} to {get me a ride}." However, this significantly limits users' ability to interact with IPDAs naturally. Users have to remember skill names and invocation patterns, which places a cognitive burden on users who tend to forget both. Skill discovery is also difficult with a pure voice user interface: it is hard for users to know the capabilities of thousands of skills a priori, which leads to lowered user engagement with skills and ultimately with IPDAs. In this paper, we propose a solution that addresses all three practical challenges without requiring skill names or invocation patterns. Our approach is based on a scalable neural model architecture with a shared encoder, a skill attention mechanism and skill-specific classification networks that can efficiently perform large-scale skill classification in IPDAs using a weakly supervised training dataset. We will demonstrate that our model achieves high accuracy on a manually transcribed test set after being trained with weak supervision. Moreover, our architecture is designed to efficiently accommodate new skills that appear in-between full model retraining cycles. We keep practical constraints in mind and focus on minimizing memory footprint and runtime latency, while ensuring the architecture is scalable to thousands of skills, all of which are important for real-time production systems. Furthermore, we investigate two different ways of incorporating user personalization information in the model. Our naive baseline method adds the information as a 1-bit flag in the feature space of the skill-specific networks. The personalized attention technique computes a convex combination of skill embeddings for the user's enabled skills and significantly outperforms the naive personalization baseline.
We show the effectiveness of our approach with extensive experiments using 1,500 skills from a deployed IPDA system.

Related Work
Traditional multi-domain SLU/NLU systems are designed hierarchically, starting with domain classification to classify an incoming utterance into one of many possible domains, followed by further semantic analysis with domain-specific intent classification and slot tagging (Tur and de Mori, 2011). Traditional systems have typically been limited to a small number of domains, designed by specialists to be well-separable. Therefore, domain classification has been considered a less complex task compared to other semantic analysis such as intent and slot predictions. Traditional domain classifiers are built using simple linear models such as Multinomial Logistic Regression or Support Vector Machines in a one-versus-all setting for multi-class prediction. The models typically use word n-gram features and also those based on static lexicon match, and there have been several recent studies applying deep learning techniques (Xu and Sarikaya, 2014).
There is also a line of prior work on enhancing sequential text classification or tagging. Hierarchical character-to-word level LSTM (Hochreiter and Schmidhuber, 1997) architectures similar to our models have been explored for the Named Entity Recognition task by Lample et al. (2016). Character-informed sequence models have also been explored for simple text classification with small sets of classes by Xiao and Cho (2016). Joulin et al. (2016) explored highly scalable text classification using a shared hierarchical encoder, but their hierarchical softmax-based output formulation is unsuitable for incremental model updates. Work on zero-shot domain classifier expansion by Kumar et al. (2017b) struggled to rank incoming domains higher than training domains. The attention-based approach of Kim et al. (2017d) does not require retraining from scratch, but it requires keeping all models stored in memory, which is computationally expensive. Multi-task learning was used in the context of SLU by Tur (2006) and has been further explored using neural networks for phoneme recognition (Seltzer and Droppo, 2013) and semantic parsing (Fan et al., 2017; Bapna et al., 2017). There have been many other pieces of prior work on improving NLU systems with pre-training (Kim et al., 2015b; Kim et al., 2017e), multi-task learning (Zhang and Wang, 2016; Liu and Lane, 2016; Kim et al., 2017b), transfer learning (El-Kahky et al., 2014; Kim et al., 2015a,c; Chen et al., 2016a; Yang et al., 2017), domain adaptation (Jaech et al., 2016; Liu and Lane, 2017; Kim et al., 2017d,c) and contextual signals (Bhargava et al., 2013; Chen et al., 2016b; Hori et al., 2016; Kim et al., 2017a).

Weakly Supervised Training Data Generation
Our model addresses the domain classification task in SLU systems. In traditional IPDA systems, these domains are hand-crafted by experts to be well separable and can easily be annotated by humans because they are small in number. The emergence of self-service SLU results in a large number of potentially mutually overlapping SLU domains. This means that eliciting large volumes of high quality human annotations to train our model is no longer feasible, and we cannot assume that domains are designed to be well separable. Instead we can generate training data by adopting the weak supervision paradigm introduced by Hoffmann et al. (2011), which proposes using heuristic labeling functions to generate large numbers of noisy data samples. Clean data generation with weak supervision is a challenging problem, so we address it by decomposing it into two simpler problems: candidate generation and noise suppression. However, it remains important for our model to be noise robust.

Data Programming
The key insight of the Data Programming approach is that O(1) simple labeling functions can be used to approximate O(n) human annotated data points with much less effort. We adopt the formalism used by Ratner et al. (2016) to treat each instance of a data generation rule as a rich generative model, defined by a labeling function $\lambda$, and describe different families of labeling functions. Our data programming pipeline is analogous to the noisy channel model proposed for spelling correction by Kernighan et al. (1990), and consists of a set of candidate generation and noise detection functions.
Here $\mu$ and $s_i$ represent utterances and the $i$th skill respectively. $P(s_i \mid \mu)$, the probability of a skill being valid for an utterance, is approximated by simple functions that act as candidate data generators $\lambda_g \in \Lambda_g$ based on recognitions produced by a family of query patterns $\lambda_q \in \Lambda_q$. $P(\mu)$ is represented by a family of simple functions that act as noise detectors $\lambda_n \in \Lambda_n$, which mark utterances as likely being noise.
We apply the technique to the query logs of a popular IPDA, which has support for personalized third party domains. Looking at the structure of utterances that match query pattern $\lambda_q$, each utterance of the form "Ask {Uber} to {get me a car}" can be considered as being parametrized by the underlying latent command $\mu_z$, that is "Get me a car"; a target domain corresponding to service $s_t$, which in this case is Uber; and the query recognition pattern $\lambda_q$, in this case "Ask {$s_t$} to {$\mu_z$}". Next we assume that the distribution of latent commands over domains is independent of the query pattern.
Making this simple distributional approximation allows us to generate a large number of noisy training samples. The family of generator functions $\lambda_g \in \Lambda_g$ is thus defined such that $\mu_z = \lambda_g^i(\mu, \lambda_q^i)$.

Noise Reduction
The distribution defined above contains a large number of noisy positive samples. Related to $P(\mu)$ in the noisy channel in the spell correction context, we defined a small family of heuristic noise detection functions $\lambda_n \in \Lambda_n$ that discards training data instances that are not likely to be well formed. For instance,
• $\lambda_n^1$ requires $\mu$ to contain a minimum threshold of information by removing those with $\mu_z$ that has token length fewer than 3. Utterances shorter than this mostly consist of non-actionable commands.
• $\lambda_n^2$ discards all data samples below a certain threshold of occurrences in live traffic, since utterances that are rarely observed are more likely to be ASR errors or unnatural.
• $\lambda_n^3$ discards the data samples for a domain if they come from an overly broad pattern with a catch-all behavior.
• $\lambda_n^4$ discards utterances that belong to shared intents provided by the SLU SDK.
The end result of this stage is to retain utterances such as 'call me a cab' from 'Ask Uber to call me a cab' but discard 'Boston' from 'Ask Accuweather for Boston'. One can easily imagine extending this framework with other high recall noise detectors, for example, using language models to discard candidates that are unlikely to be spoken sentences.
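The candidate generation and noise reduction steps above can be illustrated with a minimal Python sketch. It is not the production pipeline: the regular expression standing in for a query pattern, the function names, and the frequency threshold `MIN_COUNT` are assumptions made for illustration. Only the minimum token length of 3 comes from the text, and the third and fourth noise detectors are omitted because they depend on skill metadata that is not described here.

```python
import re
from collections import Counter

# Hypothetical stand-in for one query pattern: "Ask {skill} to/for {command}".
QUERY_PATTERN = re.compile(
    r"^ask (?P<skill>.+?) (?:to|for) (?P<command>.+)$", re.IGNORECASE
)

MIN_TOKENS = 3  # from the text: drop latent commands shorter than 3 tokens
MIN_COUNT = 2   # assumed occurrence threshold; the text gives no value


def generate_candidate(utterance):
    """Candidate generator: split an invocation into (latent command, skill)."""
    match = QUERY_PATTERN.match(utterance.strip())
    if match is None:
        return None
    return match.group("command").lower(), match.group("skill").lower()


def build_training_set(query_log):
    """Generate candidates, then apply the first two noise detectors."""
    candidates = [c for c in map(generate_candidate, query_log) if c is not None]
    counts = Counter(candidates)
    kept = []
    for (command, skill), count in counts.items():
        if len(command.split()) < MIN_TOKENS:  # too little information
            continue
        if count < MIN_COUNT:                  # rarely observed in traffic
            continue
        kept.append((command, skill))
    return sorted(kept)
```

On a toy log, "Ask Uber to call me a cab" seen twice would be kept as the pair ("call me a cab", "uber"), while "Ask Accuweather for Boston" would be dropped by the token-length filter regardless of how often it occurs.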

Model Architecture
Our model consists of a shared encoder network, an orthography-sensitive hierarchical LSTM encoder, that feeds into a set of domain-specific classification layers trained to make a binary decision for each output label.
Our main novel contribution is the extension of this architecture with a personalized attention mechanism, which uses the attention mechanism of Bahdanau et al. (2014) to attend to memory locations corresponding to the specific domains enabled by a user, and allows the system to learn semantic representations of each domain. As we will show, incorporating personalization features is key to disambiguating between multiple overlapping domains, and the personalized attention mechanism outperforms more naive forms of personalization. The personalized attention mechanism first computes an attention weight for each of the enabled domains, performs a convex combination to compute a context vector and then concatenates this vector to the encoded utterance before the final domain classification. Figure 1 depicts the model in detail.
Our model can efficiently accommodate new domains not seen during initial training by keeping the shared encoder frozen, bootstrapping a domain embedding based on existing ones, then optimizing a small number of network parameters corresponding to the domain-specific classifier, which is orders of magnitude faster and more data efficient than retraining the full classifier.
We make design decisions to ensure that our model has a low memory and latency footprint. We avoid expensive large vocabulary matrix multiplications on both the input and output stages, and instead use a combination of character embeddings and word embeddings in the input stage. The output matrix is lightweight because each domain-specific classifier is a matrix of only 200×2 parameters. The inference task can be trivially parallelized across cores since there's no requirement to compute a partition function across a high-dimensional softmax layer, which is the slowest component of large label multiclass neural networks. Instead, we achieve comparability between the probability scores generated by individual models by using a customized loss formulation.

Shared Encoder
First we describe our shared hierarchical utterance encoder. Our hierarchical character to word to utterance design is motivated by the need to make the model operate on an open vocabulary in terms of words and to make it robust to small changes in orthography resulting from fluctuations in the upstream ASR system, all while avoiding expensive large matrix multiplications associated with one-hot word encoding in large vocabulary systems.
We denote an LSTM simply as a mapping $\phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ that takes a $d$-dimensional input vector $x$ and a $d$-dimensional state vector $h$ to output a new $d$-dimensional state vector $h' = \phi(x, h)$. Let $C$ denote the set of characters and $W$ the set of words in a given utterance. Let $\oplus$ denote the vector concatenation operation. We encode an utterance using BiLSTMs, and the model parameters $\Theta$ associated with this BiLSTM layer include char embeddings $e_c \in \mathbb{R}^{25}$ for each $c \in C$.
Let $w_1 \ldots w_n \in W$ denote a word sequence where word $w_i$ has character $w_i(j) \in C$ at position $j$. First, the model computes a character-sensitive word representation $v_i \in \mathbb{R}^{150}$ for each $i = 1 \ldots n$. These word representation vectors are then encoded by forward and backward LSTMs, which induce a character- and context-sensitive word representation $h_i \in \mathbb{R}^{200}$ for each $i = 1 \ldots n$. For convenience, we write the entire operation as a mapping $\text{BiLSTM}_\Theta$.

Domain Classification
Our Multitask domain classification formulation is motivated by a desire to avoid computing the full partition function during test time, which tends to be the slowest component of a multiclass neural network classifier, as has been documented before by Jozefowicz et al. (2016) and Mikolov et al. (2013), amongst others. However, we also want access to reliable probability estimates instead of raw scores; we accomplish this by constructing a custom loss function. During training, each domain classifier receives in-domain (IND) and out-of-domain (OOD) utterances, and we adapt the one-sided selection mechanism of Kubat et al. (1997) to prevent OOD utterances from overpowering IND utterances. Thus an utterance in a domain $d \in D$ is considered an IND utterance from the viewpoint of domain $d$ and OOD for all other domains.
We first use the shared encoder to compute the utterance representation $\bar{h}$ as previously described. Then we define the probability of domain $\tilde{d}$ for the utterance by mapping $\bar{h}$ to a 2-dimensional output vector $z^{\tilde{d}}$ with a linear transformation for each domain $\tilde{d}$. The activation $\sigma$ used in this mapping is the scaled exponential linear unit (SeLU) for normalized activation outputs (Klambauer et al., 2017), and $[z^{\tilde{d}}]_{IND}$ and $[z^{\tilde{d}}]_{OOD}$ denote the values in the IND and OOD positions of the vector $z^{\tilde{d}}$.
We define the joint domain classification loss $L_D$ as the sum of positive ($L_P$) and negative ($L_N$) class loss functions, where $k$ is the total number of domains. We divide the second term by $k - 1$ so that $L_P$ and $L_N$ are balanced in terms of the ratio of the training examples for a domain to those for other domains.
This Multitask formulation enables us to extend the model for new incoming domains without impacting the relative scores for the existing domains. It also outperforms the standard softmax in terms of accuracy on our task.
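One plausible reading of this loss can be sketched in a few lines of Python. The exact formulas are not reproduced in this text, so the sketch rests on two assumptions: that each domain's probability is a two-way softmax over its IND and OOD scores, and that both loss terms are negative log-likelihoods. The function names are hypothetical, and the per-domain scores are taken as given, so the linear transformation and SeLU activation are left out.

```python
import math


def ind_probability(z_ind, z_ood):
    """Assumed two-way softmax over one domain's IND and OOD scores."""
    m = max(z_ind, z_ood)  # subtract the max for numerical stability
    e_ind, e_ood = math.exp(z_ind - m), math.exp(z_ood - m)
    return e_ind / (e_ind + e_ood)


def multitask_loss(scores, gold):
    """scores: list of (z_ind, z_ood) per domain; gold: index of the true domain."""
    k = len(scores)
    # Positive term: the gold domain should judge the utterance in-domain.
    loss_p = -math.log(ind_probability(*scores[gold]))
    # Negative term: every other domain should judge it out-of-domain.
    # Dividing by k - 1 keeps the two terms on a comparable scale.
    loss_n = -sum(
        math.log(1.0 - ind_probability(*scores[d]))
        for d in range(k)
        if d != gold
    ) / (k - 1)
    return loss_p + loss_n
```

Because each domain's probability depends only on that domain's own two scores, adding a classifier for a new domain leaves the probabilities of the existing domains unchanged, which is the property the formulation relies on.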

Personalized Attention
We explore encoding a user's domain preferences in two ways. Our baseline method is a 1-bit flag that is appended to the input features of each domain-specific classifier. Our novel personalized attention method induces domain embeddings by supervising an attention mechanism to attend to a user's enabled domains with different weights depending on their relevance. We hypothesize that attention enables the network to learn richer representations of user preferences and domain co-occurrence features.
Let $e_D(\tilde{d}) \in \mathbb{R}^{100}$ and $\bar{h} \in \mathbb{R}^{100}$ denote the domain embedding for domain $\tilde{d}$ and the utterance representation calculated by Eq. (1), respectively. The domain attention weights for a given user $u$ who has a preferred domain list $d^{(u)} = (d^{(u)}_1, \ldots, d^{(u)}_k)$ are calculated by the dot-product operation,
$$a_i = \bar{h} \cdot e_D(d^{(u)}_i), \quad i = 1, \ldots, k.$$
The final, normalized attention weights $\bar{a}$ are obtained after normalization via a softmax layer,
$$\bar{a}_i = \frac{\exp(a_i)}{\sum_{j=1}^{k} \exp(a_j)}.$$
The weighted combination of domain embeddings is
$$\bar{S} = \sum_{i=1}^{k} \bar{a}_i \, e_D(d^{(u)}_i).$$
Finally, the two representations of enabled domains, namely the attention model and the 1-bit flag, are concatenated with the utterance representation as $\bar{h} \oplus \bar{S} \oplus I(\tilde{d} \in \text{enabled})$ and used to make per-domain predictions via domain-specific affine transformations, where $I(\tilde{d} \in \text{enabled})$ is a 1-bit indicator for whether the domain is enabled by the user or not. In this way we can ascertain whether the two personalization signals are complementary via an ablation study.
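The attention computation described here (dot-product scores, softmax normalization, a convex combination of embeddings, then concatenation with the 1-bit flag) can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: the function names and the dictionary-based embedding table are assumptions, and the vectors in any real system would be 100-dimensional learned parameters rather than hand-written lists.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def personalized_features(h, domain_emb, enabled, target):
    """Build the classifier input for domain `target`.

    h          : utterance representation (list of floats)
    domain_emb : dict mapping domain id -> embedding with the same length as h
    enabled    : non-empty list of domain ids the user has enabled
    """
    # Raw attention score per enabled domain: dot product with the utterance.
    scores = [dot(h, domain_emb[d]) for d in enabled]
    # Softmax makes the weights positive and sum to one.
    weights = softmax(scores)
    # Convex combination of the enabled domains' embeddings.
    context = [
        sum(w * domain_emb[d][i] for w, d in zip(weights, enabled))
        for i in range(len(h))
    ]
    flag = 1.0 if target in enabled else 0.0  # the 1-bit baseline signal
    return h + context + [flag], weights
```

The returned feature vector is what a domain-specific affine layer would consume; dropping either `context` or `flag` from the concatenation gives the two ablation settings.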

Experiments
In this section we aim to demonstrate the effectiveness of our model architecture in two settings. First, we will demonstrate that attention-based personalization significantly outperforms the baseline approach. Second, we will show that our model's new domain bootstrapping procedure results in accuracies comparable to full retraining while requiring less than 1% of the original training time.

Experimental Setup
Weak: This weakly supervised dataset was generated by preprocessing utterances with strict invocation patterns according to the setup mentioned in Section 3. The dataset consists of 5.34M utterances from 637,975 users across 1,500 different skills. Since we are interested in capturing the temporal effects of the dataset as well as personalization effects, we partitioned the data based both on user and time. Our core training data for the experiments in this paper was drawn from one month of live usage, the validation data came from the next 15 days of usage, and the test data came from the subsequent 15 days. The training, validation and test sets are user-independent, and each user belongs to only one of the three sets to ensure no leakage of information.
MTurk: Since the Weak dataset is generated by weak supervision, we verified the performance of our approach with human generated utterances. A random sample of 12,428 utterances from the test partition of users were presented to 300 human judges, who were asked to produce two natural ways to issue the same command. This dataset is treated as a representative clean held out test set on which we can observe the generalization of our weakly supervised training and validation data to natural language.
New Skills: In order to simulate the scenario in which new skills appear within a week between model updates, we selected 250 new skills which do not overlap with the skills in the Weak dataset. The vocabulary size of 1,500 skills is 200K words, and on average, 5% of the vocabulary for new skills is not covered. We randomly sampled 4,000 unique utterances for each skill using the same weak supervision method, and split them into 3,000 utterances for training and 1,000 for testing.

Results and Discussion
Generalization from Weakly Supervised to Natural Utterances
We first show the progression of model performance as we add more components to show their individual contribution. Secondly, we show that training our models on a weakly supervised dataset can generalize to natural speech by showing their test performance on the human-annotated test data. Finally, we compare two personalization strategies.
The full results are summarized in Table 1, which shows the top-N test results separately for the Weak dataset (weakly supervised) and MTurk dataset (human-annotated). We report top-N accuracy to show the potential for further re-ranking or disambiguation downstream. For top-1 results on the Weak dataset, using a separate binary classifier for each domain (Binary) yields a prediction accuracy of 78.29%, and using a softmax layer on top of the shared encoder (MultiClass) yields a comparable accuracy of 78.58%. The performance shows a slight improvement when using the Multitask neural loss structure, but adding personalization signals to the Multitask structure gives a significant boost in performance. We note the large difference between the 1-bit and attention architectures. At 94.83% accuracy, attention yields a 35.6% relative error reduction over the 1-bit baseline (91.97%) on the Weak validation set and a 23.25% relative error reduction on the MTurk test set. We hypothesize that this may be because the attention mechanism allows the model to focus on complementary features in case of overlapping domains as well as learning domain co-occurrence statistics, both of which are not possible with the simple 1-bit flag.
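The relative error reduction quoted for the Weak dataset follows directly from the two reported accuracies, as a short calculation shows. The helper name below is hypothetical and the snippet only restates arithmetic on numbers given in the text.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed; accuracies are in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# 1-bit baseline at 91.97% versus attention at 94.83% on the Weak dataset.
print(round(100 * relative_error_reduction(91.97, 94.83), 1))  # prints 35.6
```

In other words, the error rate falls from 8.03% to 5.17%, which removes a little over a third of the baseline's mistakes.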
When both personalization representations were combined, the performance peaked at 95.19% for the Weak dataset and a more modest 89.65% for the MTurk dataset. The improvement trend is extremely consistent across all top-N results for both the Weak and MTurk datasets across all experiments. The disambiguation task is complex due to similar and overlapping skills, but the results suggest that incorporating personalization signals equips the models with much better discriminative power. The results also suggest that the two mechanisms for encoding personalization provide a small amount of complementary information, since combining them is better than using them individually. Although the performance on the Weak dataset tends to be more optimistic, the best performance on the human-annotated test data is still close to 90% for top-1 accuracy, which suggests that training our model with the samples derived from the invocation patterns can generalize well to natural utterances.
As reported in Table 2, adapting a new skill is two orders of magnitude faster (30.34 seconds) than retraining the model (5300.18 seconds) while achieving 94.03% accuracy, which is comparable to the 94.58% accuracy of full retraining. The first two techniques can also be easily parallelized, unlike the Refresh configuration.
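As a quick check on the timing claim, the two reported training times give the following ratios. This is only arithmetic on the figures quoted above.

```latex
\frac{5300.18\ \text{s}}{30.34\ \text{s}} \approx 174.7,
\qquad
\frac{30.34\ \text{s}}{5300.18\ \text{s}} \approx 0.57\%
```

A speedup of roughly 175 times is consistent with "two orders of magnitude", and 0.57% is consistent with requiring less than 1% of the full training time.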
Behavior of Attention Mechanism
Our expectation is that the model is able to learn to attend to the relevant skills during the inference process. To study the behavior of the attention layer, we compute the top-N prediction accuracy based on the most relevant skills defined by the attention weights. In this experiment, we considered the subset of users who had enabled more than 20 domains to exclude trivial cases. (Thus, the random prediction accuracy on enabled domains is less than 5%, and across the Full domain list it is 0.066%.) The results are shown in Table 3. When the model attends to the entire set of 1,500 skills (Full), the top-5 prediction accuracy is 20.41%, which indicates that a large number of skills can process the utterance, and thus it is highly likely to miss the correct one in the top-5 predictions. This ambiguity issue can be significantly improved by users' enabled domain lists, as shown by the accuracies (Enabled): 85.62% for top-1, 96.15% for top-3, and 98.06% for top-5. The attention mechanism can thus be viewed as an initial soft selection which is then followed by a fine-grained selection at the classification stage.

Table 3: Top-N prediction accuracy (%) on the full skill set (Full) and only enabled skills (Enabled).
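The top-N measurement over attention weights can be sketched as follows. The data layout (one dictionary of skill-to-weight per utterance) and the function names are assumptions for illustration; the sketch only shows how a ranking by attention weight turns into a top-N accuracy.

```python
def rank_by_attention(weights):
    """weights: dict mapping skill -> attention weight; highest weight first."""
    return sorted(weights, key=weights.get, reverse=True)


def top_n_accuracy(attention_per_utterance, gold_skills, n):
    """Share of utterances whose gold skill is among the n most attended skills."""
    hits = 0
    for weights, gold in zip(attention_per_utterance, gold_skills):
        if gold in rank_by_attention(weights)[:n]:
            hits += 1
    return hits / len(gold_skills)
```

Restricting each dictionary to the user's enabled skills corresponds to the Enabled setting, while including all skills corresponds to the Full setting.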
End-to-End User Evaluation
All intermediate metrics on this task are proxies for a human customer's eventual evaluation. In order to assess the user experience, we need to measure end-to-end performance. For a brief end-to-end system evaluation, 983 utterances from 283 domains were randomly sampled from the test set in the large-scale IPDA setting. 15 human judges (male=12, female=3) rated the system responses, 1 judge per utterance, on a 5-point Likert scale with 1 being Terrible and 5 being Perfect. A judgment score of 3 or above was taken as SUCCESS and 2 or below as DEFECT. The end-to-end SUCCESS rate, our measure of user satisfaction, was 95.52%. The discrepancy between this score and the score produced on the MTurk dataset indicates that even in cases in which the model makes classification mistakes, some of these interpretations remain perceptually meaningful to humans.

Conclusions
We have described a neural model architecture to address large-scale skill classification in an IPDA used by tens of millions of users every day. We have described how personalization features and an attention mechanism can be used for handling ambiguity between domains. We have also shown that the model can be extended efficiently and incrementally for new domains, saving multiple orders of magnitude in terms of training time. The model also addresses practical constraints of having a low memory footprint, low latency and being easily parallelized, all of which are important characteristics for real production systems.

... learned using the semantic content as well as personalization signals, so we hypothesize clusters like this may be capturing user tendencies to enable these domains in a correlated manner.