Adversarial Adaptation of Synthetic or Stale Data



Abstract
Two types of data shift common in practice are 1. transferring from synthetic data to live user data (a deployment shift), and 2. transferring from stale data to current data (a temporal shift). Both cause a distribution mismatch between training and evaluation, leading to a model that overfits the flawed training data and performs poorly on the test data. We propose a solution to this mismatch problem by framing it as domain adaptation, treating the flawed training dataset as a source domain and the evaluation dataset as a target domain. To this end, we use and build on several recent advances in neural domain adaptation such as adversarial training (Ganin et al., 2016) and domain separation network (Bousmalis et al., 2016), proposing a new effective adversarial training scheme. In both supervised and unsupervised adaptation scenarios, our approach yields clear improvement over strong baselines.

Introduction
Spoken language understanding (SLU) systems analyze various aspects of a user query by classifying its domain, intent, and semantic slots. For instance, the query how is traffic to target in bellevue has domain PLACES, intent CHECK ROUTE TRAFFIC, and slots PLACE NAME: target and ABSOLUTE LOCATION: bellevue.
We are interested in addressing two types of data shift common in SLU applications. The first data shift problem happens when we transfer from synthetic data to live user data (a deployment shift). This is also known as the "cold-start" problem; a model cannot be trained on the real usage data prior to deployment simply because it does not exist. A common practice is to generate a large quantity of synthetic training data that mimics the expected user behavior. Such synthetic data is crafted using domain-specific knowledge and can be time-consuming. It is also flawed in that it typically does not match the live user data generated by actual users; the real queries submitted to these systems are different from what the model designers expect to see.
The second data shift problem happens when we transfer from stale data to current data (a temporal shift). In our use case, we have one set of training data from 2013 and wish to handle data from 2014-2016. This is problematic since the content of the user queries changes over time (e.g., new restaurant or movie names may be added). Consequently, the model performance degrades over time.
Both shifts cause a distribution mismatch between training and evaluation, leading to a model that overfits the flawed training data and performs poorly on the test data. We propose a solution to this mismatch problem by framing it as domain adaptation, treating the flawed training dataset as a source domain and the evaluation dataset as a target domain. To this end, we use and build on several recent advances in neural domain adaptation such as adversarial training (Ganin et al., 2016) and domain separation network (Bousmalis et al., 2016), proposing a new adversarial training scheme based on randomized predictions.
We consider both supervised and unsupervised adaptation scenarios (i.e., presence/absence of labeled data in the target domain). We find that unsupervised DA can greatly improve performance without requiring additional annotation. Supervised DA with a small amount of labeled data gives further improvement on top of unsupervised DA. In experiments, we show clear gains in both deployment and temporal shifts across 5 test domains, yielding average error reductions of 74.04% and 41.46% for intent classification and 70.33% and 32.0% for slot tagging compared to baselines without adaptation.
Related Work

Domain Adaptation
Our work builds on the recent success of DA in the neural network framework. Notably, Ganin et al. (2016) propose an adversarial training method for unsupervised DA. They partition the model parameters into two parts: one inducing domain-specific (or private) features and the other domain-invariant (or shared) features.
The domain-invariant parameters are adversarially trained using a gradient reversal layer to be poor at domain classification; as a consequence, they produce representations that are domain agnostic. This approach is motivated by a rich literature on the theory of DA pioneered by Ben-David et al. (2007). We describe our use of adversarial training in Section 3.2.3. A special case of Ganin et al. (2016) is developed independently by Kim et al. (2016c), who motivate the method as a generalization of the feature augmentation method of Daumé III (2009). Bousmalis et al. (2016) extend the framework of Ganin et al. (2016) by additionally encouraging the private and shared features to be mutually exclusive. This is achieved by minimizing the dot product between the two sets of features and simultaneously reconstructing the input (for all domains) from the features induced by these parameters.
Both Ganin et al. (2016) and Bousmalis et al. (2016) discuss applications in computer vision. Zhang et al. (2017) apply the method of Bousmalis et al. (2016) to tackle transfer learning in NLP. They focus on transfer learning between classification tasks over the same domain ("aspect transfer"). They assume a set of keywords associated with each aspect and use these keywords to inform the learner of the relevance of each sentence for that aspect.

Spoken Language Understanding
Recently, there has been much investment in personal digital assistant (PDA) technology in industry (Sarikaya, 2015). Apple's Siri, Google Now, Microsoft's Cortana, and Amazon's Alexa are some examples of personal digital assistants. Spoken language understanding (SLU) is an important component of these systems that allows natural communication between the user and the agent (Tur, 2006; El-Kahky et al., 2014). PDAs support a number of scenarios including creating reminders, setting up alarms, note taking, scheduling meetings, finding and consuming entertainment (e.g., movies, music, games), finding places of interest, and getting driving directions to them (Kim et al., 2016a).
Naturally, there has been an extensive line of prior studies on domain scaling problems, i.e., how to easily scale to a larger number of domains: pretraining (Kim et al., 2015c), transfer learning (Kim et al., 2015d), constrained decoding with a single model (Kim et al., 2016a), multi-task learning (Jaech et al., 2016), neural domain adaptation (Kim et al., 2016c), domainless adaptation (Kim et al., 2016b), a sequence-to-sequence model, domain attention (Kim et al., 2017), and zero-shot learning (Ferreira et al., 2015).
All of the above works assume that there are no data shift issues, which is precisely the problem our work aims to solve.

BiLSTM Encoder
We use an LSTM simply as a mapping φ : R^d × R^d' → R^d' that takes an input vector x and a state vector h to output a new state vector h' = φ(x, h). See Hochreiter and Schmidhuber (1997) for a detailed description.
Let C denote the set of character types and W the set of word types. Let ⊕ denote the vector concatenation operation. We encode an utterance using the wildly successful architecture given by bidirectional LSTMs (BiLSTMs) (Schuster and Paliwal, 1997; Graves, 2012). The model parameters Θ associated with this BiLSTM layer are

• Character embedding e_c ∈ R^25 for each c ∈ C
• Word embedding e_w ∈ R^100 for each w ∈ W
• Character LSTMs φ^C_f, φ^C_b : R^25 × R^25 → R^25
• Word LSTMs φ^W_f, φ^W_b : R^150 × R^100 → R^100

Let w_1 … w_n ∈ W denote a word sequence where word w_i has character w_i(j) ∈ C at position j. First, the model computes a character-sensitive word representation v_i ∈ R^150 as

f^C_j = φ^C_f(e_{w_i(j)}, f^C_{j−1})   for j = 1 … |w_i|
b^C_j = φ^C_b(e_{w_i(j)}, b^C_{j+1})   for j = |w_i| … 1
v_i = f^C_{|w_i|} ⊕ b^C_1 ⊕ e_{w_i}

and induces a character- and context-sensitive word representation h_i ∈ R^200 as

f_i = φ^W_f(v_i, f_{i−1})   for i = 1 … n
b_i = φ^W_b(v_i, b_{i+1})   for i = n … 1
h_i = f_i ⊕ b_i

for each i = 1 … n. For convenience, we write the entire operation as a mapping BiLSTM_Θ : W^n → (R^200)^n.
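To make the dimensions above concrete, here is a minimal numpy sketch of the encoder. It is not the paper's implementation: a plain tanh recurrence stands in for the LSTM cell φ, the utterance and all weights are made up, and only the shapes and the concatenation pattern mirror the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D_CHAR, D_WORD, D_STATE = 25, 100, 100  # dimensions from the text

def rnn_step(x, h, Wx, Wh):
    # toy stand-in for the LSTM mapping phi(x, h) -> new state h'
    return np.tanh(Wx @ x + Wh @ h)

def run_rnn(xs, d_h, Wx, Wh, reverse=False):
    h, states = np.zeros(d_h), []
    for x in (reversed(xs) if reverse else xs):
        h = rnn_step(x, h, Wx, Wh)
        states.append(h)
    return states[::-1] if reverse else states  # states in sequence order

words = ["how", "is"]  # hypothetical utterance
e_char = {c: rng.normal(size=D_CHAR) for c in set("".join(words))}
e_word = {w: rng.normal(size=D_WORD) for w in words}

# character RNN: R^25 x R^25 -> R^25; word RNN: R^150 x R^100 -> R^100
Wcx, Wch = rng.normal(size=(D_CHAR, D_CHAR)), rng.normal(size=(D_CHAR, D_CHAR))
Wwx, Wwh = rng.normal(size=(D_STATE, 150)), rng.normal(size=(D_STATE, D_STATE))

# v_i in R^150: final forward char state ++ final backward char state ++ word embedding
vs = []
for w in words:
    cs = [e_char[c] for c in w]
    f_final = run_rnn(cs, D_CHAR, Wcx, Wch)[-1]
    b_final = run_rnn(cs, D_CHAR, Wcx, Wch, reverse=True)[0]
    vs.append(np.concatenate([f_final, b_final, e_word[w]]))

# h_i in R^200: forward word state ++ backward word state
fs = run_rnn(vs, D_STATE, Wwx, Wwh)
bs = run_rnn(vs, D_STATE, Wwx, Wwh, reverse=True)
hs = [np.concatenate([f, b]) for f, b in zip(fs, bs)]
print(hs[0].shape)  # (200,)
```

The 25 + 25 + 100 concatenation explains the R^150 input dimension of the word LSTMs, and the two 100-dimensional directional states explain the R^200 output of BiLSTM_Θ.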

Unsupervised DA
In unsupervised domain adaptation, we assume labeled data for the source domain but not the target domain. Our approach closely follows the previous work on unsupervised neural domain adaptation by Ganin et al. (2016) and Bousmalis et al. (2016). We have three BiLSTM encoders described in Section 3.1:

1. Θ_src: induces source-specific features
2. Θ_tgt: induces target-specific features
3. Θ_shd: induces domain-invariant features

We now define a series of loss functions defined by these encoders.
1 For simplicity, we assume some random initial state vectors such as f^C_0 and b^C_{|w_i|+1} when we describe LSTMs.

Source Side Tagging Loss
The most obvious objective is to minimize the model's error on labeled training data for the source domain. Let w_1 … w_n ∈ W be an utterance in the source domain annotated with labels y_1 … y_n ∈ L. We induce

h_i = BiLSTM_{Θ_src}(w_1 … w_n)_i ⊕ BiLSTM_{Θ_shd}(w_1 … w_n)_i

Then we define the probability of tag y ∈ L for the i-th word as

p(y | h_i) ∝ exp(W^tag_y · h_i + b^tag_y)

where Θ_tag = {W^tag, b^tag} denotes additional feedforward parameters. The tagging loss is given by the negative log likelihood

L_tag(Θ_src, Θ_shd, Θ_tag) = − Σ_i log p(y_i | h_i)

where we iterate over annotated words (w_i, y_i) on the source side.
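As a sanity check on the shape of this loss, the following sketch computes a tagging negative log likelihood for a toy utterance. The feature vectors, weights, tag inventory, and sizes are all made up for illustration; only the softmax-then-NLL form mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_tags = 4, 8, 3  # toy sizes, not the paper's dimensions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = rng.normal(size=(n, d))          # h_i: source-specific ++ shared features
W = rng.normal(size=(num_tags, d))   # hypothetical tagging parameters
b = np.zeros(num_tags)
gold = [0, 2, 1, 1]                  # gold tags y_1 ... y_n

# L_tag = - sum_i log p(y_i | h_i), with p(y | h_i) = softmax(W h_i + b)_y
loss = -sum(np.log(softmax(W @ h[i] + b)[y]) for i, y in enumerate(gold))
print(loss > 0)  # True: the negative log likelihood is positive
```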

Reconstruction Loss
Following previous work, we ground feature learning by reconstructing encoded utterances. Both Bousmalis et al. (2016) and Zhang et al. (2017) use mean squared error for reconstruction, the former of image pixels and the latter of words in a context window. In contrast, we use an attention-based LSTM that fully re-generates the input utterance and use its log loss.
More specifically, let w_1 … w_n ∈ W be an utterance in domain d ∈ {src, tgt}. We first use the relevant encoders as before to induce h_1 … h_n, and then use an attention-based LSTM decoder (Bahdanau et al., 2014) to define the probability of word w at each position i with state vector μ_{i−1} (where μ_0 = h_n). The reconstruction loss is given by the negative log likelihood

L_rec = − Σ_i log p(w_i | μ_{i−1})

where we iterate over words w_i in both the source and target utterances.


Adversarial Domain Classification Loss

Ganin et al. (2016) propose introducing an adversarial loss to make shared features domain-invariant. This is motivated by a theoretical result of Ben-David et al. (2007), who show that the generalization error on the target domain depends on how "different" the source and the target domains are. This difference is approximately measured by

d_A(src, tgt) = 2 (1 − 2 inf_Θ error(Θ))

where error(Θ) is the domain classification error using model Θ. It is assumed that the source and target domains are balanced so that inf_Θ error(Θ) ≤ 1/2 and the difference lies in [0, 2]. In other words, we want to make error(Θ) as large as possible in order to generalize well to the target domain. The intuition is that the more domain-invariant our features are, the easier it is to benefit from the source side training when testing on the target side. It can also be motivated as a regularization term (Ganin et al., 2016).

Let w_1 … w_n ∈ W be an utterance in domain d ∈ {src, tgt}. We first use the shared encoder to induce h^shd_1 … h^shd_n; it is important that we only use the shared encoder for this loss. Then we define the probability of domain d for the utterance as

p(d | w_1 … w_n) ∝ exp(W^adv_d · h^shd_n + b^adv_d)

where Θ_adv = {W^adv, b^adv} denotes additional feedforward parameters. The adversarial domain classification loss is given by the positive log likelihood

L_adv(Θ_shd, Θ_adv) = + Σ_i log p(d^(i) | w^(i))

where we iterate over domain-annotated utterances (w^(i), d^(i)).
Random prediction training While past work only considers using a negative gradient (Ganin et al., 2016; Bousmalis et al., 2016) or the positive log likelihood (Zhang et al., 2017) to perform adversarial training, it is unclear whether these approaches are optimal for the purpose of "confusing" the domain predictor. For instance, minimizing log likelihood can lead to a model accurately predicting the opposite domain, compromising the goal of inducing domain-invariant representations. Thus we propose to instead optimize the shared parameters for random domain predictions. Specifically, the above loss is replaced with the negative log likelihood

L_adv(Θ_shd, Θ_adv) = − Σ_i log p(d̃^(i) | w^(i))

where d̃^(i) is set to be src with probability 0.5 and tgt with probability 0.5. By optimizing for random predictions, we achieve the desired effect: the shared parameters are trained to induce features that cannot discriminate between the source and the target domains.
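The contrast between the two objectives can be sketched as follows. The vectors, weights, and the `softmax` helper are toy stand-ins for the shared-encoder domain classifier, not the paper's model; the sketch only shows how the two loss values are assembled.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W = rng.normal(size=(2, 8))                       # 2-way domain classifier: [src, tgt]
shared_feats = [rng.normal(size=8) for _ in range(6)]  # toy shared representations
true_domains = [0, 1, 0, 1, 0, 1]

# Ganin-style adversarial loss: the *positive* log likelihood of the true
# domain, so minimizing it pushes the shared features to misclassify.
adv_loss = sum(np.log(softmax(W @ h)[d])
               for h, d in zip(shared_feats, true_domains))

# Random-prediction variant: negative log likelihood of a fair coin-flip
# label, pushing the shared features toward 50/50 domain predictions instead.
rand_loss = -sum(np.log(softmax(W @ h)[rng.integers(2)])
                 for h in shared_feats)

print(adv_loss < 0 < rand_loss)  # log-probs are negative; an NLL is positive
```

The design difference is visible in the targets: the first objective still references the true domain labels (and so can be satisfied by confidently predicting the opposite domain), while the second references labels that carry no domain information at all.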

Non-Adversarial Domain Classification Loss
In addition to the adversarial loss for domain-invariant parameters, we also introduce a non-adversarial loss for domain-specific parameters. Given w_1 … w_n ∈ W in domain d ∈ {src, tgt}, we use the private encoder to induce h^d_1 … h^d_n; it is important that we only use the private encoder for this loss. Then we define the probability of domain d for the utterance as

p(d | w_1 … w_n) ∝ exp(W^dom_d · h^d_n + b^dom_d)

The non-adversarial domain classification loss is given by the negative log likelihood

L_dom = − Σ_i log p(d^(i) | w^(i))

where we iterate over domain-annotated utterances (w^(i), d^(i)).

Orthogonality Loss
Finally, following Bousmalis et al. (2016), we further encourage the domain-specific features to be mutually exclusive with the shared features by imposing soft orthogonality constraints. This is achieved as follows. Given an utterance w_1 … w_n ∈ W in domain d ∈ {src, tgt}, we compute the private and shared representations h^d_i and h^shd_i for each word. The orthogonality loss for this utterance is given by

L_orth = Σ_i (h^d_i · h^shd_i)^2

where we iterate over words i in both the source and target utterances.
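Reading the per-utterance loss as a sum of squared dot products between the private and shared word representations (our interpretation of the soft orthogonality constraint; the matrices and sizes below are toy values), a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                          # toy sizes
H_private = rng.normal(size=(n, d))  # rows are h_i^d  (domain-specific features)
H_shared = rng.normal(size=(n, d))   # rows are h_i^shd (shared features)

# L_orth = sum_i (h_i^d . h_i^shd)^2
# zero exactly when every private/shared pair is orthogonal
loss = float(np.sum((H_private * H_shared).sum(axis=1) ** 2))
print(loss >= 0.0)  # True: a sum of squares is never negative
```

Because the penalty is zero only when each private vector is orthogonal to its shared counterpart, minimizing it discourages the two encoders from duplicating information.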

Joint Objective
For unsupervised DA, we optimize the sum of the above losses with respect to all model parameters. In an online setting, given an utterance, we compute its reconstruction, adversarial, and orthogonality losses, as well as the tagging loss if the utterance is in the source domain, and take a gradient step on the sum of these losses.

Supervised DA
In supervised domain adaptation, we assume labeled data for both the source domain and the target domain. We can easily incorporate supervision in the target domain by adding L_tag(Θ_tgt, Θ_shd, Θ_tag) to the unsupervised DA objective. We mention that the approach by Kim et al. (2016c) is a special case of this objective; they optimize only the two tagging losses, which is motivated as a neural extension of the feature augmentation method of Daumé III (2009).

Experiments
In this section, we conduct a series of experiments to evaluate the proposed techniques on datasets obtained from real usage.

Test Domains and Tasks
We test our approach on a suite of 5 Microsoft Cortana domains with 2 separate spoken language understanding tasks: (1) intent classification and (2) slot (label) tagging. The intent classification task is a multi-class classification problem with the goal of determining which one of n intents a user utterance belongs to, conditioned on the given domain. The slot tagging task is a sequence labeling problem with the goal of identifying and chunking entities and useful information snippets in a user utterance. For example, a user could say reserve a table at joeys grill for thursday at seven pm for five people. The goal of the first task would be to classify this utterance as the MAKE RESERVATION intent given the domain PLACES, and the goal of the second task would be to tag joeys grill as RESTAURANT, thursday as DATE, seven pm as TIME, and five as NUMBER PEOPLE.

Experimental Setup
We consider 2 possible domain adaptation (DA) scenarios: (1) adaptation of an engineered dataset to a live user dataset and (2) adaptation of an old dataset to a new dataset. For the first DA scenario, we test whether our approach can effectively make a system adapt from experimental, engineered data to real-world, live data. As the engineered data, we use synthetic data which domain experts manually create based on a given domain schema 2 before the system goes live. As the live user data, we use a transcribed dataset from users' speech input. For the second scenario, we test whether our approach can effectively make a system adapt over time. A large number of users will quickly generate a large amount of data, and the usage pattern can also change. We use one month of annotated data from 2013 (specifically, August 2013) as our old dataset, and use all data between 2014 and 2016 as our new dataset, regardless of whether the data type is engineered or live user. As described in the earlier sections, we consider both supervised and unsupervised DA: we apply our DA approach with labeled data in the target domain for the supervised setting and with unlabeled data for the unsupervised one. We give details of the baselines and variants of our approach below.
Unsupervised DA baselines and variants: In our experiments, all models were implemented using Dynet (Neubig et al., 2017) and were trained using Stochastic Gradient Descent (SGD) with Adam (Kingma and Ba, 2015), an adaptive learning rate algorithm. We used an initial learning rate of 4 × 10^−4 and left all other hyperparameters as suggested by Kingma and Ba (2015). Each SGD update was computed without a minibatch using Intel MKL (Math Kernel Library) 3. We used dropout regularization (Srivastava et al., 2014) with a keep probability of 0.4. We encode user utterances with BiLSTMs as described in Section 3.1 and initialize word embeddings with the pre-trained embeddings used by Lample et al. (2016). In the following sections, we report intent classification results in accuracy percentage and slot results in F1-score. To compute slot F1-score, we used the standard CoNLL evaluation script 4.

Results: Unsupervised DA
We first show our results in the unsupervised DA setting where we have a labeled dataset in the source domain, but only unlabeled data in the target domain. We assume that the amount of data in both datasets is sufficient. Dataset statistics are shown in Table 2.
The performance of the baselines and our model variants is shown in Table 3. The left side of the table shows the results of the DA scenario of adapting from engineered data to live user data; the baseline trained only on the source domain (SRC) shows poor performance, yielding on average 48.5% accuracy on intent classification and 42.7% F1-score on slot tagging. Using our DA approach with a word-level decoder (DA_W) shows a significant increase in performance in all 5 test domains, yielding on average 82.2% intent accuracy and 80.5% slot F1-score. The performance increases further using the DA approach with a sentence-level decoder (DA_S), yielding on average 85.6% intent accuracy and 83.0% slot F1-score.
The right side of the table shows the results of the DA scenario of adapting from old data to new data; the baseline trained only on SRC also shows poor performance.

Table 2: Dataset statistics.

              | Engineered → Live User      | Old → New
Domain        | Train  Train*  Dev    Test  | Train  Train*  Dev   Test
calendar      | 16904  50000   1878   10k   | 13165  13165   1463  10k
communication | 32072  50000   3564   10k   | 12631  12631   1403  10k
places        | 23410  50000   2341   10k   | 21901  21901   2433  10k
reminder      | 19328  50000   1933   10k   | 16245  16245   1805  10k
weather       | 20794  50000   2079   10k   | 15575  15575   1731  10k
AVG           | 23590  50000   2359   10k   | 15903  15903   1767  10k

Our DA approach variants yield average error reductions of 72.04% and 79.71% for intent classification and 70.33% and 80.99% for slot tagging. The results suggest that our DA approach can quickly make a model adapt from synthetic data to real-world data and from old data to new data with the additional use of only 2 to 2.5 times more data from the target domain. Aside from the performance boost itself, the approach is even more appealing because the new data from the target domain does not need to be labeled; it only requires collecting a little more data from the target domain. We note that the model development sets were created only from the source domain for a fully unsupervised setting. Using a development set from the target domain yields an even larger boost in performance (not shown in the results), and labeling only the development set from the target domain is relatively less expensive than labeling the whole dataset.

Results: Supervised DA
Second, we show our results in the supervised DA setting, where we have a sufficient amount of labeled data in the source domain but a relatively insufficient amount of labeled data in the target domain. Having more labeled data in the target domain would most likely help performance, but we intentionally made the setting more disadvantageous for our DA approach to better simulate real-world scenarios, where there is usually a lack of resources and time to label a large amount of new data. For each personal assistant test domain, we only used 1000 training utterances to simulate the scarcity of newly labeled data; dataset statistics are shown in Table 2. Unlike the unsupervised DA scenario, here we used the development sets created from the target domain shown in Table 4.
The left side of Table 5 shows the results of the supervised DA approach of adapting from engineered data to live user data. The baseline trained only on the source (SRC) shows on average 48.5% intent accuracy and 42.7% slot F1-score. Training only on the target domain (TGT) increases the performance to 71.3% and 65.0%, but training on the union of the source and target domains (Union) again brings the performance down to 48.7% and 42.3%. As in the unsupervised setting, using our DA approach (DA) shows a significant performance increase in all 5 test domains, yielding on average 81.7% intent accuracy and 76.2% slot F1-score. The DA approach with adversarial domain training (DA_A) shows similar performance to DA, and performance increases further when using our DA approach with sufficient unlabeled data 5 (DA_U), yielding on average 83.6% and 77.6%. For the second scenario of adapting from the old to the new dataset, the results show a very similar trend in performance.
The results show that our supervised DA (DA) approach also achieves a significant performance gain in all 5 test domains, yielding average error reductions of 68.18% and 51.35% for intent classification and 60.90% and 50.09% for slot tagging. The results suggest that effective domain adaptation can be done using supervised DA with only 1k newly labeled data points. In addition, a small amount of newly labeled data combined with sufficient unlabeled data helps the models perform even better. The poor performance of the union of source and target domain data might be due to the relatively very small size of the target domain data, which is overwhelmed by the data in the source domain.

The impact on performance of the two different adversarial classification losses is shown in Table 6. RAND represents the unsupervised DA model with a sentence-level decoder (DA_S) using the random prediction loss. ADV shows the performance of the same model using the adversarial loss of Ganin et al. (2016) as described in Section 3.2.3. In the deployment shift scenario, the adversarial loss fails to provide any improvement in intent classification accuracy and slot tagging F1-score, achieving 82.5% intent accuracy and 79.8% slot F1-score. These results align with our hypothesis that the adversarial loss does not confuse the classifier sufficiently.

Proxy A-distance

The results shown in Table 7 report the Proxy A-distance (Ganin et al., 2016), which we use to check whether our adversarial domain training generalizes well to the target domain. The distance between two datasets is computed by

d̂_A = 2 (1 − 2ε)

where ε is the generalization error in discriminating between the source and target datasets. The range of the d̂_A distance is between 0 and 2.0; 0 is the best case, where adversarial training successfully fools the shared encoder into uninformative domain predictions. In other words, thanks to adversarial training, our model makes the features in the shared encoder domain-invariant in order to generalize well to the target domain.
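Plugging the domain-classification error ε into this formula gives the distance directly; a one-line helper (the function name is ours):

```python
def proxy_a_distance(error):
    """Proxy A-distance: d_hat_A = 2 * (1 - 2 * eps), where eps is the
    generalization error of a classifier discriminating source from target."""
    return 2.0 * (1.0 - 2.0 * error)

print(proxy_a_distance(0.5))  # 0.0: a maximally confused classifier
print(proxy_a_distance(0.0))  # 2.0: perfectly separable domains
```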

Vocabulary distance between engineered data and live user data
The results shown in Table 8 quantify the discrepancy between the two datasets. We measure the degree of overlap between the vocabularies employed by the two datasets, simply taking the Jaccard coefficient between the two sets:

d_V = 1 − JC(V_s, V_t)

where V_s is the vocabulary of the source domain s, V_t is the corresponding set for the target domain t, and JC(A, B) = |A ∩ B| / |A ∪ B| is the Jaccard coefficient, measuring the similarity of two sets. The higher the distance d_V, the fewer words the two datasets share. Overall, the distance between the old and new datasets is still large and the vocabulary overlap small, but better than in the live user case.
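Assuming the distance is one minus the Jaccard coefficient (consistent with "higher distance means fewer shared words"), a short sketch with a made-up vocabulary pair:

```python
def jaccard(a, b):
    """JC(A, B) = |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def vocab_distance(v_src, v_tgt):
    # d_V = 1 - JC(V_s, V_t): high distance means little shared vocabulary
    return 1.0 - jaccard(v_src, v_tgt)

v_src = {"how", "is", "traffic", "to", "bellevue"}  # hypothetical vocabularies
v_tgt = {"traffic", "to", "seattle", "now"}
print(round(vocab_distance(v_src, v_tgt), 3))  # 0.714 (2 shared of 7 total words)
```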

Conclusion
In this paper, we have addressed two types of data shift common in SLU applications: 1. transferring from synthetic data to live user data (a deployment shift), and 2. transferring from stale data to current data (a temporal shift). Our method is based on domain adaptation, treating the flawed training dataset as a source domain and the evaluation dataset as a target domain. We use and build on several recent advances in neural domain adaptation such as adversarial training and domain separation network, proposing a new effective adversarial training scheme based on randomized predictions. In both supervised and unsupervised adaptation scenarios, our approach yields clear improvement over strong baselines.