Domain Attention with an Ensemble of Experts



Abstract
An important problem in domain adaptation is to quickly generalize to a new domain with limited supervision given K existing domains. One approach is to retrain a global model across all K + 1 domains using standard techniques, for instance Daumé III (2009). However, it is desirable to adapt without having to re-estimate a global model from scratch each time a new domain with potentially new intents and slots is added. We describe a solution based on attending an ensemble of domain experts. We assume K domain-specific intent and slot models trained on respective domains. When given domain K + 1, our model uses a weighted combination of the K domain experts' feedback along with its own opinion to make predictions on the new domain. In experiments, the model significantly outperforms baselines that do not use domain adaptation and also performs better than the full retraining approach.

Introduction
An important problem in domain adaptation is to quickly generalize to a new domain with limited supervision given K existing domains. In spoken language understanding, new domains of interest for categorizing user utterances are added on a regular basis.1 For instance, we may add an ORDERPIZZA domain and desire a domain-specific intent and semantic slot tagger with a limited amount of training data. Training only on the target domain fails to utilize the existing resources in other domains that are relevant (e.g., labeled data for the PLACES domain with place name and location as slot types), but naively training on the union of all domains does not work well since different domains can have widely varying distributions.

1 A scenario frequently arising in practice is having a request for creating a new virtual domain targeting a specific application. One typical use case is that of building natural language capability through intent and slot modeling (without actually building a domain classifier) targeting a specific application.
Domain adaptation offers a balance between these extremes by using all data but simultaneously distinguishing domain types. A common approach for adapting to a new domain is to retrain a global model across all K + 1 domains using well-known techniques, for example the feature augmentation method of Daumé III (2009), which trains a single model that has one domain-invariant component along with K + 1 domain-specific components, each of which is specialized in a particular domain. While such a global model is effective, it requires re-estimating a model from scratch on all K + 1 domains each time a new domain is added. This is burdensome, particularly in our scenario in which new domains can arise frequently.
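To make the feature augmentation idea concrete, here is a minimal NumPy sketch (our own illustration, not code from any of the cited works): each input vector is copied into a shared block plus one of K + 1 domain-specific blocks, so a downstream linear model can learn both domain-invariant and domain-specific weights.

```python
import numpy as np

def augment(x, domain, K):
    """Daumé III (2009) feature augmentation: copy the feature vector x
    into a shared block and into one of K + 1 domain-specific blocks,
    leaving the remaining blocks as zeros."""
    d = len(x)
    out = np.zeros((K + 2) * d)                    # shared + (K + 1) domain blocks
    out[:d] = x                                    # shared copy
    out[(domain + 1) * d:(domain + 2) * d] = x     # domain-specific copy
    return out

x = np.array([1.0, 2.0])
z = augment(x, domain=1, K=2)   # K existing domains plus the new one
print(z)                        # [1. 2. 0. 0. 1. 2. 0. 0.]
```

The shared block forces all domains to contribute to one set of weights, while each domain block can override it; the downside, as the text notes, is that the whole model must be re-estimated whenever a domain is added.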
In this paper, we present an alternative solution based on attending an ensemble of domain experts. We assume that we have already trained K domain-specific models on respective domains. Given a new domain K + 1 with a small amount of training data, we train a model on that data alone but query the K experts as part of the training procedure. We compute an attention weight for each of these experts and use their combined feedback along with the model's own opinion to make predictions. This way, the model is able to selectively capitalize on relevant domains much like in standard domain adaptation but without explicitly re-training on all domains together.
In experiments, we show clear gains in a domain adaptation scenario across 7 test domains, yielding average error reductions of 44.97% for intent classification and 32.30% for slot tagging compared to baselines that do not use domain adaptation. Moreover, we achieve higher accuracy than the full re-training approach of Kim et al. (2016c), a neural analog of Daumé III (2009).

Related Work

Domain Adaptation
There is a venerable history of research on domain adaptation (Daumé III and Marcu, 2006; Daumé III, 2009; Blitzer et al., 2006, 2007; Pan et al., 2011), which is concerned with the shift in data distribution from one domain to another. In the context of NLP, a particularly successful approach is the feature augmentation method of Daumé III (2009), whose key insight is that if we partition the model parameters into those that handle common patterns and those that handle domain-specific patterns, the model is forced to learn from all domains yet preserve domain-specific knowledge. The method is generalized to the neural paradigm by Kim et al. (2016c), who jointly use a domain-specific LSTM and a global LSTM shared across all domains. In the context of SLU, Jaech et al. (2016) proposed K domain-specific feedforward layers with a shared word-level LSTM layer across domains; Kim et al. (2016c) instead employed K + 1 LSTMs. Other work proposed to employ a sequence-to-sequence model by introducing a fictitious symbol at the end of an utterance whose tag represents the corresponding domain and intent.
All these methods require one to re-train a model from scratch to make it learn the correlation and invariance between domains. This becomes difficult to scale when new domains arrive at high frequency. We address this problem by proposing a method that only calls K trained domain experts; we do not have to re-train these domain experts. This gives a clear computational advantage over the feature augmentation method.

Spoken Language Understanding
Recently, there has been much investment in personal digital assistant (PDA) technology in industry (Sarikaya, 2015). Apple's Siri, Google Now, Microsoft's Cortana, and Amazon's Alexa are some examples of personal digital assistants. Spoken language understanding (SLU) is an important component of these systems that allows natural communication between the user and the agent (Tur, 2006; El-Kahky et al., 2014). PDAs support a number of scenarios including creating reminders, setting up alarms, note taking, scheduling meetings, finding and consuming entertainment (e.g., movies, music, games), finding places of interest, and getting driving directions to them (Kim et al., 2016a).
Naturally, there has been an extensive line of prior studies on domain scaling problems, aiming to easily scale to a larger number of domains: pre-training (Kim et al., 2015c), transfer learning (Kim et al., 2015d), constrained decoding with a single model (Kim et al., 2016a), multi-task learning (Jaech et al., 2016), neural domain adaptation (Kim et al., 2016c), domainless adaptation (Kim et al., 2016b), sequence-to-sequence models, adversarial domain training (Kim et al., 2017), and zero-shot learning (Ferreira et al., 2015).

Method
We use an LSTM simply as a mapping φ : R^d × R^d → R^d that takes an input vector x and a state vector h to output a new state vector h′ = φ(x, h). See Hochreiter and Schmidhuber (1997) for a detailed description. At a high level, the individual model builds on several ingredients shown in Figure 1: character and word embeddings, a BiLSTM at the character level, a bidirectional LSTM (BiLSTM) at the word level, and a feedforward network at the output.

Figure 1: The overall network architecture of the individual model.

Individual Model Architecture
Let C denote the set of character types and W the set of word types. Let ⊕ denote the vector concatenation operation. A wildly successful architecture for encoding a sentence (w_1 . . . w_n) ∈ W^n is given by bidirectional LSTMs (BiLSTMs) (Schuster and Paliwal, 1997; Graves, 2012). Our model first constructs a network over an utterance closely following Lample et al. (2016). The model parameters Θ associated with this BiLSTM layer are a character embedding e_c ∈ R^25 for each c ∈ C, a word embedding e_w ∈ R^100 for each w ∈ W, and the parameters of a character-level BiLSTM and a word-level BiLSTM. For each word w_i, the character-level BiLSTM produces a representation that is concatenated with the word embedding e_{w_i} and fed to the word-level BiLSTM, which induces a character- and context-sensitive word representation

h_i = f_i^W ⊕ b_i^W ∈ R^200    (1)

for each i = 1 . . . n, where f_i^W and b_i^W are the forward and backward states of the word-level LSTM. These vectors can be used to perform intent classification or slot tagging on the utterance.
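To illustrate the shape of this encoder (a sketch under our own assumptions, with a plain tanh RNN standing in for the LSTM cell and random toy inputs in place of real embeddings), the forward and backward passes can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(inputs, W, U, b):
    """One directional pass of a plain tanh RNN (a stand-in for the LSTM)."""
    h = np.zeros(U.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def bi_encode(embeddings, params):
    """Concatenate forward and backward states: h_i = f_i ⊕ b_i."""
    Wf, Uf, bf, Wb, Ub, bb = params
    fwd = rnn_pass(embeddings, Wf, Uf, bf)
    bwd = rnn_pass(embeddings[::-1], Wb, Ub, bb)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

d_in, d_h = 150, 100      # word-LSTM input/output dims reported in the paper
params = tuple(rng.normal(scale=0.1, size=s)
               for s in [(d_h, d_in), (d_h, d_h), (d_h,)] * 2)
sent = [rng.normal(size=d_in) for _ in range(5)]   # 5 fake word-level inputs
H = bi_encode(sent, params)
print(len(H), H[0].shape)   # 5 state vectors, each in R^200
```

Each h_i sees the whole utterance through the two directional passes, which is what makes the representation context-sensitive.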

Intent Classification
We can predict the intent of the utterance using (h_1 . . . h_n) ∈ R^200 in (1) as follows. Let I denote the set of intent types. We introduce a single-layer feedforward network g^i : R^200 → R^|I| whose parameters are denoted by Θ^i.2 We compute the |I|-dimensional score vector g^i(h_n) and define the conditional probability of the correct intent τ as

p(τ | h_1 . . . h_n) ∝ exp(g^i(h_n)_τ)

2 For simplicity, we assume some random initial state vectors such as f_0^C and b_{|w_i|+1}^C when we describe LSTMs.

The intent classification loss is given by the negative log likelihood:

L^i(Θ, Θ^i) = − Σ_l log p(τ^(l) | w^(l))

where l iterates over intent-annotated utterances.
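A minimal NumPy sketch of this intent loss, with a hypothetical 7-intent inventory and random toy parameters standing in for a trained g^i (names and sizes are our own illustration):

```python
import numpy as np

def intent_nll(h, W, b, tau):
    """Negative log likelihood of the correct intent tau.

    h: a 200-dim utterance summary vector; W, b: the single-layer
    feedforward network g^i; tau: index of the gold intent."""
    scores = W @ h + b                       # |I|-dimensional score vector
    scores = scores - scores.max()           # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log softmax
    return -log_probs[tau]

rng = np.random.default_rng(1)
h = rng.normal(size=200)
W, b = rng.normal(scale=0.1, size=(7, 200)), np.zeros(7)   # |I| = 7 intents
loss = intent_nll(h, W, b, tau=3)
print(float(loss))   # a positive scalar; summed over utterances in training
```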
Slot Tagging

We predict the semantic slots of the utterance using (h_1 . . . h_n) ∈ R^200 in (1) as follows. Let S denote the set of semantic slot types and L the set of corresponding BIO label types, that is, L = {B-s : s ∈ S} ∪ {I-s : s ∈ S} ∪ {O}. We add a transition matrix T ∈ R^{|L|×|L|} and a single-layer feedforward network g^t : R^200 → R^|L| to the network; denote these additional parameters by Θ^t. The conditional random field (CRF) tagging layer defines a joint distribution over label sequences y_1 . . . y_n ∈ L of w_1 . . . w_n as

p(y_1 . . . y_n | h_1 . . . h_n) ∝ exp( Σ_{i=1}^n T_{y_{i−1}, y_i} + g^t(h_i)_{y_i} )

The tagging loss is given by the negative log likelihood:

L^t(Θ, Θ^t) = − Σ_l log p(y^(l) | h^(l))

where l iterates over tagged sentences in the data. Alternatively, we can optimize the local loss, in which each label y_i is predicted independently from g^t(h_i).

Domain Attention Architecture

Figure 2: The overall network architecture of domain attention: (1) K domain experts + 1 target BiLSTM layer to induce a feature representation, (2) K domain experts + 1 target feedforward layer to output pre-trained label embeddings, and (3) a final feedforward layer to output an intent or slot. We have two separate attention mechanisms to combine feedback from domain experts.

We assume K domain experts trained on their respective domains, k = 1 . . . K. For each word w_i, the model computes an attention weight for each domain k = 1 . . . K as, in the simplest case, the dot product

q_i^k = h_i · h_i^(k)

where h_i^(k) is the k-th expert's representation of w_i. We also experiment with the bilinear function q_i^k = h_i^T B h_i^(k), where B is an additional model parameter, and also a feedforward function of h_i and h_i^(k). The attention weights a_i^1 . . . a_i^K are obtained by using a softmax layer

a_i^k = exp(q_i^k) / Σ_{k′} exp(q_i^{k′})

The weighted combination of the experts' feedback is given by

h_i^experts = Σ_{k=1}^K a_i^k h_i^(k)

and the model makes predictions by using h̄_1 . . . h̄_n where h̄_i = h_i + h_i^experts. These vectors replace the original feature vectors h_i in defining the intent or tagging losses.
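The dot-product attention and weighted expert combination described above can be sketched in a few lines of NumPy (the function name and toy dimensions are our own illustration):

```python
import numpy as np

def domain_attention(h, expert_h):
    """Combine K experts' feature vectors via dot-product attention.

    h: target model's state for one word, shape (d,).
    expert_h: stacked expert states for that word, shape (K, d).
    Returns the attention weights a and the combined vector h_bar."""
    q = expert_h @ h                       # dot-product scores, shape (K,)
    q = q - q.max()                        # stability shift
    a = np.exp(q) / np.exp(q).sum()        # softmax over the K experts
    h_experts = a @ expert_h               # weighted expert feedback, shape (d,)
    return a, h + h_experts                # h_bar replaces h downstream

rng = np.random.default_rng(2)
h = rng.normal(size=8)
experts = rng.normal(size=(5, 8))          # K = 5 toy experts, d = 8
a, h_bar = domain_attention(h, experts)
print(a.sum(), h_bar.shape)                # weights sum to 1; h_bar stays in R^d
```

The bilinear and feedforward scoring variants only change how `q` is computed; the softmax and weighted combination are unchanged.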

Domain Attention Variants
We also consider two variants of the domain attention architecture in Section 4.1.
Label Embedding

In addition to the state vectors h^(1) . . . h^(K) produced by the K experts, we further incorporate their final (discrete) label predictions using pre-trained label embeddings. We induce embeddings e_y for labels y from all domains using the method of Kim et al. (2015d). At the i-th word, we predict the most likely label y_i^(k) under the k-th expert and compute an attention weight over the corresponding label embeddings e_{y_i^(1)} . . . e_{y_i^(K)} as before. Then we compute an expectation over the experts' predictions

ā_i = Σ_{k=1}^K a_i^k e_{y_i^(k)}

and use it in conjunction with h̄_i. Note that this makes the objective a function of discrete decisions and thus non-differentiable, but we can still optimize it in a standard way, treating it as learning a stochastic policy.
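A NumPy sketch of this label-embedding feedback (our own illustration; scoring each expert's predicted label embedding against the target state by dot product assumes the label embeddings share the state dimension):

```python
import numpy as np

def label_feedback(h, pred_label_emb):
    """Expected label embedding over K experts' discrete predictions.

    h: target state for the current word, shape (d,).
    pred_label_emb[k]: embedding e_y of the label the k-th expert
    predicts for this word, shape (K, d)."""
    q = pred_label_emb @ h                 # score each expert's label against h
    a = np.exp(q - q.max())
    a /= a.sum()                           # softmax attention over experts
    return a @ pred_label_emb              # a_bar, used alongside h_bar

rng = np.random.default_rng(3)
h = rng.normal(size=8)
emb = rng.normal(size=(5, 8))              # K = 5 experts' predicted labels
a_bar = label_feedback(h, emb)
print(a_bar.shape)                         # one expected label embedding
```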
Selective Attention

Instead of computing attention over all K experts, we only consider the top K′ ≤ K experts that predict the highest label scores, and compute attention over only these K′ vectors. We experiment with various values of K′.
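Selective attention amounts to masking all but the top scores before the softmax, as in this illustrative NumPy sketch:

```python
import numpy as np

def selective_attention(scores, k):
    """Softmax restricted to the top-k expert scores; the rest get zero mass."""
    scores = np.asarray(scores, dtype=float)
    keep = np.argsort(scores)[-k:]               # indices of the k best experts
    masked = np.full_like(scores, -np.inf)       # -inf -> exp(.) = 0
    masked[keep] = scores[keep]
    e = np.exp(masked - masked[keep].max())
    return e / e.sum()

a = selective_attention([2.0, -1.0, 0.5, 3.0, 0.0], k=2)
print(a.round(3))   # only experts 0 and 3 receive probability mass
```

This hard cut-off prevents weak experts from collectively draining probability mass away from the few genuinely relevant ones.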

Experiments
In this section, we describe the set of experiments conducted to evaluate the performance of our model. In order to fully assess the contribution of our approach, we also consider several baselines and variants besides our primary expert model.

Test domains and Tasks
To test the effectiveness of our proposed approach, we apply it to a suite of 7 Microsoft Cortana domains with 2 separate tasks in spoken language understanding: (1) intent classification and (2) slot (label) tagging. The intent classification task is a multi-class classification problem with the goal of determining to which one of the |I| intents a user utterance belongs within a given domain. The slot tagging task is a sequence labeling problem with the goal of identifying entities and chunking useful information snippets in a user utterance. For example, a user could say "reserve a table at joeys grill for thursday at seven pm for five people". The goal of the first task would be to classify this utterance with the "make reservation" intent in the PLACES domain, and the goal of the second task would be to tag "joeys grill" as restaurant, "thursday" as date, "seven pm" as time, and "five" as number people.
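For instance, the slot tagging target for this utterance can be written as a BIO label sequence (a sketch; the exact label names, e.g. number_people, are illustrative):

```python
# BIO tagging of the example utterance: B- opens a slot, I- continues it,
# O marks tokens outside any slot.
tokens = ("reserve a table at joeys grill for thursday "
          "at seven pm for five people").split()
tags   = ["O", "O", "O", "O", "B-restaurant", "I-restaurant", "O",
          "B-date", "O", "B-time", "I-time", "O", "B-number_people", "O"]
assert len(tokens) == len(tags)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```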
Short descriptions of the 7 test domains are shown in Table 1. As the table shows, the test domains have different granularities and diverse semantics. For each personal assistant test domain, we used only 1000 training utterances to simulate scarcity of newly labeled data. The development and test sets contained 100 and 10k utterances, respectively.
The similarities between the test domains and the expert (source) domains, measured by overlap percentage, are shown in Table 2. The intent overlap ranges from 30% on the FITNESS domain to 70% on EVENTS, averaging 51.49%. The slots of the test domains overlap more with those of the source domains, ranging from 60% on the TV domain to 100% on both M-TICKET and TAXI, averaging 81.69%.

Experimental Setup

In testing our approach, we consider a domain adaptation (DA) scenario, where a target domain has limited training data and the source domains have a sufficient amount of labeled data. We further consider a scenario of creating a new virtual domain targeting a specific application, given a large inventory of intent and slot types and underlying models built for many different applications and scenarios. One typical use case is that of building natural language capability through intent and slot modeling (without actually building a domain classifier) targeting a specific application. Therefore, our experimental settings differ from previously considered settings for domain adaptation in two aspects:
• Multiple source domains: In most previous work, only a pair of domains (source vs. target) has been considered, although such methods can often be generalized to K > 2. Here, we experiment with the K = 25 domains shown in Table 3.
• Variant output: In a typical setting for domain adaptation, the label space is invariant across all domains. Here, the label space can be different in different domains, which is a more challenging setting. See Kim et al. (2015d) for details of this setting.
For this DA scenario, we test whether our approach can enable a system to quickly generalize to a new domain with limited supervision, given the K existing domain experts shown in Table 3.
In summary, our approach is tested on 7 Microsoft Cortana personal assistant domains across the 2 tasks of intent classification and slot tagging. Below we describe the baselines and variants used in our experiments.

Baselines: All models below use the same underlying architecture described in Section 3.1.

• TARGET: a model trained on the target domain without DA techniques.
• UNION: a model trained on the union of the target domain and the 25 domain experts' data.

• DA: the neural domain adaptation method of Kim et al. (2016c), which trains K domain-specific LSTMs with a generic LSTM on training data from all domains.

Domain Experts (DE) variants: All models below are based on attending over an ensemble of 25 domain experts (DE) described in Section 4.1, where a separate set of intent and slot models is trained for each domain. We use two types of feedback from the domain experts: (1) feature representations from the LSTM layer, and (2) label embeddings from the feedforward layer, described in Section 4.1 and Section 4.2, respectively.

• DE_B: DE without the domain attention mechanism. It uses an unweighted combination of the first type of feedback from the experts, analogous to a bag-of-words model.

• DE_1: DE with domain attention, using the weighted combination of the first type of feedback from the experts.

• DE_2: DE_1 with an additional weighted combination of the second type of feedback.

• DE_S2: DE_2 with the selective attention mechanism described in Section 4.2.
In our experiments, all models were implemented using DyNet (Neubig et al., 2017) and trained using stochastic gradient descent (SGD) with Adam (Kingma and Ba, 2015), an adaptive learning rate algorithm. We used an initial learning rate of 4 × 10^−4 and left all other hyperparameters as suggested in Kingma and Ba (2015). Each SGD update was computed without minibatching, using Intel MKL (Math Kernel Library). We used dropout regularization (Srivastava et al., 2014) with a keep probability of 0.4 at each LSTM layer.
To encode user utterances, we used bidirectional LSTMs (BiLSTMs) at the character level and the word level, with 25-dimensional character embeddings and 100-dimensional word embeddings. The dimensions of the input and output of the character LSTMs were both 25, and the dimensions of the input and output of the word LSTMs were 150 and 100, respectively. The dimension of the input of the final feedforward network for intent and slot was 200, and the output dimension was the number of labels for the corresponding task. Its activation was the rectified linear unit (ReLU).
To initialize word embeddings, we used embeddings trained as in Lample et al. (2016). In the following sections, we report intent classification results in accuracy percentage and slot tagging results in F1-score. To compute the slot F1-score, we used the standard CoNLL evaluation script.

Results
We show our results in the DA setting where we had sufficient labeled data in the 25 source domains shown in Table 3, but only 1000 labeled utterances in the target domain. The performance of the baselines and our domain expert (DE) variants is shown in Table 4. The top half of the table shows the results of intent classification, and the bottom half shows the results of slot tagging.
The baseline trained only on the target domain (TARGET) shows reasonably good performance, yielding on average 87.7% accuracy on intent classification and 83.9% F1-score on slot tagging. Simply training a single model on the aggregated utterances across all domains (UNION) brings the performance down to 77.4% and 75.3%. Using the DA approach of Kim et al. (2016c) shows a significant increase in performance in all 7 domains, yielding on average 90.3% intent accuracy and 86.2% slot F1-score.
The DE model without domain attention (DE_B) shows performance similar to DA. Using the DE model with domain attention (DE_1) shows a further increase, yielding on average 90.9% intent accuracy and 86.9% slot F1-score. The performance increases again when we use both feature representations and label embeddings (DE_2), yielding on average 91.4% and 88.2%, and reaches 93.6% and 89.1% with selective attention (DE_S2). Note that DE_S2 selects the appropriate number of experts per layer by evaluation on a development set.
The results show that our expert variant (DE_S2) achieves a significant performance gain in all 7 test domains, yielding average error reductions of 47.97% for intent classification and 32.30% for slot tagging. The results suggest that our expert approach can quickly generalize to a new domain with limited supervision given K existing domains, using only 1k newly labeled data points. The poor performance of training on the union of the source and target domain data might be due to the relatively small size of the target domain data, which is overwhelmed by the data in the source domains. For example, a word such as "home" can be labeled as place type under the TAXI domain, but in the source domains it can be labeled as either home screen under the PHONE domain or contact name under the CALENDAR domain.

Training Time
Figure 3 shows the time required for training DE_S2 and the DA method of Kim et al. (2016c).

Figure 4: Learning curves in accuracy across all seven test domains as the number of expert domains increases.

Learning Curve
We also measured the performance of our methods as a function of the number of domain experts. For each test domain, we consider all possible expert-set sizes ranging from 1 to 25 and take the average of the resulting performances over expert sets of each size. Figure 4 shows the resulting learning curves for each test domain. The overall trend is clear: as more expert domains are added, test performance improves. With ten or more expert domains, our method starts to saturate, achieving more than 90% accuracy across all seven domains.

Attention Weights
From the heatmap shown in Figure 5, we can see that the attention strengths generally agree with common sense. For example, the M-TICKET and TAXI domains selected MOVIE and PLACES as their top experts, respectively.

Oracle Expert
The results in Table 5 show the intent classification accuracy of DE_2 when the expert pool already contains an expert for the same domain. To simulate this situation, we randomly sampled 1,000, 100, and 100 utterances from each domain as training, development, and test data, respectively. In both the ALARM and HOTEL domains, models trained on only the 1,000 training utterances (TARGET) achieved just 70.1% and 65.2% accuracy, respectively. With our method (DE_2), we reached almost the full-training performance by selectively paying attention to the oracle expert, yielding 98.2% and 96.9%, respectively. This result again confirms that the behavior of the trained attention network indeed matches the semantic closeness between domains.

Selective Attention
The results in Table 6 show the performance of selective attention as a function of the number of experts in the pool. The rationale behind DE_S2 is to alleviate the downside of soft attention, namely distributing probability mass over all items even when some are bad. To deal with this issue, we apply a hard cut-off at the top k domains. From the results, a threshold of the top 3 or 5 experts yielded better results than using either 1 or all 25 experts. This matches our intuition that only a few domains are close enough to be of help to a test domain. Thus it is advisable to find the optimal k value through several rounds of experiments on a development dataset.

Conclusion
In this paper, we proposed a solution for scaling domains and experiences, potentially to a large number of use cases, by reusing existing data labeled for different domains and applications. Our solution is based on attending over an ensemble of domain experts. When given a new domain, our model uses a weighted combination of the domain experts' feedback along with its own opinion to make predictions on the new domain. In both intent classification and slot tagging tasks, the model significantly outperformed baselines that do not use domain adaptation and also performed better than the full re-training approach. This approach enables the creation of new virtual domains through a weighted combination of domain experts' feedback, reducing the need to collect and annotate similar intent and slot types multiple times for different domains. Future work includes an extension of domain experts to take into account dialog history, aiming for a holistic framework that can handle contextual interpretation as well.