New Transfer Learning Techniques for Disparate Label Sets

In natural language understanding (NLU), a user utterance can be labeled differently depending on the domain or application (e.g., weather vs. calendar). Standard domain adaptation techniques are not directly applicable to take advantage of the existing annotations because they assume that the label set is invariant. We propose a solution based on label embeddings induced from canonical correlation analysis (CCA) that reduces the problem to a standard domain adaptation task and allows use of a number of transfer learning techniques. We also introduce a new transfer learning technique based on pretraining of hidden-unit CRFs (HUCRFs). We perform extensive experiments on slot tagging on eight personal digital assistant domains and demonstrate that the proposed methods are superior to strong baselines.


Introduction
The main goal of NLU is to automatically extract the meaning of spoken or typed queries. In recent years, this task has become increasingly important as more and more speech-based applications have emerged. Recent releases of personal digital assistants such as Siri, Google Now, Dragon Go and Cortana on smartphones provide natural language based interfaces for a variety of domains (e.g., places, weather, communications, reminders). The NLU in these domains is based on statistical machine-learned models, which require annotated training data. Typically each domain has its own schema for annotating words and queries. However, the meaning of words and utterances can differ from domain to domain. For example, "sunny" is considered a weather condition in the weather domain but it may be a song title in a music domain. Thus every time a new application is developed or a new domain is built, a significant amount of resources is invested in creating annotations specific to that application or domain.
One might attempt to apply existing domain adaptation techniques (Blitzer et al., 2006; Daumé III, 2007) to this problem, but a straightforward application is not possible because these techniques assume that the label set is invariant.
In this work, we provide a simple and effective solution to this problem by abstracting the label types using canonical correlation analysis (CCA) (Hotelling, 1936), a powerful and flexible statistical technique for dimensionality reduction. We derive a low-dimensional representation for each label type that is maximally correlated with the average context of that label via CCA. These shared label representations, or label embeddings, allow us to map label types across different domains and reduce the setting to a standard domain adaptation problem. After the mapping, we can apply standard transfer learning techniques to solve the problem.
Additionally, we introduce a novel pretraining technique for hidden-unit CRFs (HUCRFs) to effectively transfer knowledge from one domain to another. In our experiments, we find that our pretraining method is almost always superior to strong baselines such as the popular domain adaptation method of Daumé III (2007).

Problem description and related work
Let $D$ be the number of distinct domains. Let $X_i$ be the space of observed samples for the $i$-th domain. Let $Y_i$ be the space of possible labels for the $i$-th domain. In most previous work in domain adaptation (Blitzer et al., 2006; Daumé III, 2007), observed data samples may vary but the label space is invariant. That is, $Y_i = Y_j$ for all domains $i$ and $j$, but $X_i \neq X_j$ for some domains $i$ and $j$. For example, in part-of-speech (POS) tagging on newswire and biomedical domains, the observed data samples may be radically different but the POS tag set remains the same. In practice, there are cases where the same query is labeled differently depending on the domain or application and the context. For example, Fred Myer can be tagged differently in "send a text message to Fred Myer" and "get me driving directions to Fred Myer". In the first case, Fred Myer is a person in the user's contact list, but in the second it is a grocery store.
So, we relax the constraint that label spaces must be the same. Instead, we assume that surface forms (i.e., words) are similar. This is a natural setting when developing multiple applications on speech utterances: input spaces (service request utterances) do not change drastically but output spaces (slot tags) might.
Multi-task learning differs from our task. In general, multi-task learning aims to improve performance across all domains, while our domain adaptation objective is to optimize the performance of the semantic slot tagger on the target domain.
Below, we review related work in domain adaptation and natural language understanding (NLU).
Multi-task learning has become popular in NLP. Sutton and McCallum (2005) showed that joint learning and/or decoding of sub-tasks helps to improve performance. Collobert and Weston (2008) made a similar claim for a deep learning architecture. While our problem resembles their settings, there are two clear distinctions. First, we aim to optimize performance on the target domain by minimizing the gap between the source and target domains, while multi-task learning jointly learns the shared tasks. Second, in our problem the domains are different but closely related, whereas prior work focuses on multiple subtasks over the same data.
Despite the increasing interest in NLU (De Mori et al., 2008; Xu and Sarikaya, 2013; Xu and Sarikaya, 2014; Anastasakos et al., 2014; El-Kahky et al., 2014; Marin et al., 2014; Celikyilmaz et al., 2015; Ma et al., 2015; Kim et al., 2015), transfer learning in the context of NLU has not been much explored. The most relevant previous work is Tur (2006) and Li et al. (2011), both of which described the effectiveness of multi-task learning in the context of NLU. For multi-task learning, they used shared slots by associating each slot type with an aggregate active feature weight vector based on an existing domain-specific slot tagger. Our empirical results show that these vector representations might be helpful for finding shared slots across domains, but cannot find a bijective mapping between domains.
Also, Jeong and Lee (2009) presented a transfer learning approach in multi-domain NLU, where the model jointly learns slot taggers in multiple domains and simultaneously predicts domain detection and slot tagging results. To share parameters across domains, they added an additional node for domain prediction on top of the slot sequence. However, this framework is also limited to a setting in which the label set remains invariant. In contrast, our method is not restricted to this setting and requires no modification of the models.

Sequence Modeling Technique
The techniques proposed in Sections 4 and 5 are generic and not tied to any particular model; they apply to sequence models and instance-based models alike. However, because of its superior performance over CRFs, we use the hidden-unit CRF (HUCRF) of Maaten et al. (2011). While popular and effective, a CRF is still a linear model. In contrast, a HUCRF benefits from nonlinearity, leading to superior performance over CRFs (Maaten et al., 2011). Thus we focus on HUCRFs to demonstrate our techniques in experiments.

Hidden Unit CRF (HUCRF)
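A HUCRF introduces a layer of binary-valued hidden units $z = z_1 \ldots z_n \in \{0, 1\}^n$ for each pair of label sequence $y = y_1 \ldots y_n$ and observation sequence $x = x_1 \ldots x_n$. A HUCRF parametrized by $\theta \in \mathbb{R}^d$ and $\gamma \in \mathbb{R}^{d'}$ defines a joint probability of $y$ and $z$ conditioned on $x$ as follows:

$$p_{\theta,\gamma}(y, z \mid x) = \frac{\exp\left(\theta^\top \Phi(x, z) + \gamma^\top \Psi(z, y)\right)}{\sum_{z' \in \{0,1\}^n} \sum_{y' \in \mathcal{Y}(x, z')} \exp\left(\theta^\top \Phi(x, z') + \gamma^\top \Psi(z', y')\right)} \tag{1}$$

where $\mathcal{Y}(x, z)$ is the set of all possible label sequences for $x$ and $z$, and $\Phi(x, z) \in \mathbb{R}^d$ and $\Psi(z, y) \in \mathbb{R}^{d'}$ are global feature functions that decompose into local feature functions:

$$\Phi(x, z) = \sum_{j=1}^{n} \phi(x, j, z_j), \qquad \Psi(z, y) = \sum_{j=1}^{n} \psi(z_j, y_{j-1}, y_j)$$

The HUCRF forces the interaction between the observations and the labels at each position $j$ to go through a latent variable $z_j$; see Figure 1 for an illustration. The probability of labels $y$ is then given by marginalizing over the hidden units:

$$p_{\theta,\gamma}(y \mid x) = \sum_{z \in \{0,1\}^n} p_{\theta,\gamma}(y, z \mid x)$$

As in restricted Boltzmann machines (Larochelle and Bengio, 2008), hidden units are conditionally independent given observations and labels. This allows for efficient inference with HUCRFs despite their richness (see Maaten et al. (2011) for details). We use the perceptron-style algorithm of Maaten et al. (2011) for training HUCRFs.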

Transfer learning between domains with different label sets
In this section, we describe three methods for utilizing annotations in domains with different label types. The first two methods transfer features and the last transfers model parameters. Each of these methods requires some sort of mapping for label types. A fine-grained label type needs to be mapped to a coarse one; a label type in one domain needs to be mapped to the corresponding label type in another domain. We provide a solution for obtaining these label mappings automatically in Section 5.

Coarse-to-fine prediction
This approach has some similarities to the method of Li et al. (2011) in that shared slots are used to transfer information between domains. In this two-stage approach, we train a model on the source domain, make predictions on the target domain, and then use the predicted labels as additional features to train a final model on the target domain. This can be helpful if there is some correlation between the label types in the source domain and the label types in the target domain. However, it is not desirable to directly use the label types in the source domain since they can be highly specific to that particular domain. An effective way to combat this problem is to reduce the original label types, such as start-time, contract-info, and restaurant, to a set of coarse label types such as name, date, time, and location that are universally shared across all domains. By doing so, we can use the first model to predict generic labels such as time and then use the second model to turn this information into fine-grained labels such as start-time and end-time.
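The second stage of this scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mapping `COARSE`, the helper names, and the rule-based `toy_coarse_tagger` (a stand-in for a model trained on coarse source-domain labels) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical fine-to-coarse mapping for source-domain labels (illustrative only).
COARSE: Dict[str, str] = {
    "start-time": "time",
    "end-time": "time",
    "restaurant": "location",
}


def to_coarse(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace fine-grained labels with coarse ones; unmapped labels are kept."""
    return [mapping.get(label, label) for label in labels]


def add_coarse_features(
    tokens: List[str],
    base_features: List[List[str]],
    coarse_tagger: Callable[[List[str]], List[str]],
) -> List[List[str]]:
    """Append the stage-1 coarse prediction to each token's feature list."""
    predicted = coarse_tagger(tokens)
    return [feats + [f"coarse={tag}"] for feats, tag in zip(base_features, predicted)]


def toy_coarse_tagger(tokens: List[str]) -> List[str]:
    # Stand-in for the first-stage model; a real system would use a trained tagger.
    return ["time" if tok in {"9", "am", "pm"} else "O" for tok in tokens]


tokens = ["meeting", "at", "9", "am"]
base = [[f"w(0)={tok}"] for tok in tokens]
augmented = add_coarse_features(tokens, base, toy_coarse_tagger)
```

The final target-domain model is then trained on `augmented` instead of `base`, so a coarse prediction such as `time` is available as evidence when choosing between fine-grained labels.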

Method of Daumé III (2007)
In this popular technique for domain adaptation, we train a model on the union of the source domain data and the target domain data but with the following preprocessing step: each feature is duplicated and the copy is conjoined with a domain indicator. For example, in a WEATHER domain dataset, a feature that indicates the identity of the string "Sunny" will generate both w(0) = Sunny and (w(0) = Sunny) ∧ (domain = WEATHER) as feature types. This preprocessing allows the model to utilize all data through the common features and at the same time specialize to specific domains through the domain-specific features. This is especially helpful when there is label ambiguity on particular features (e.g., "Sunny" might be a weather-condition in a WEATHER domain dataset but a music-song-name in a MUSIC domain dataset).
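The duplication step can be sketched in a few lines. The function name `augment` and the string encoding of conjoined features are assumptions chosen for illustration; any encoding that keeps the shared and domain-specific copies distinct would do.

```python
from typing import List


def augment(features: List[str], domain: str) -> List[str]:
    """Return each feature plus a copy conjoined with the domain indicator."""
    shared = list(features)
    specific = [f"({feat})&(domain={domain})" for feat in features]
    return shared + specific


weather = augment(["w(0)=Sunny"], "WEATHER")
music = augment(["w(0)=Sunny"], "MUSIC")

# The shared copy is identical in both domains; the conjoined copies differ.
common = set(weather) & set(music)
```

Here `common` contains only the shared feature, which is how the model pools evidence across domains while the conjoined copies let it specialize.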
Note that a straightforward application of this technique is in general not feasible in our situation. This is because we have features conjoined with label types and our domains do not share label types. This breaks the sharing of features across domains: many feature types in the source domain are disjoint from those in the target domain due to different labeling.
Thus it is necessary to first map source domain label types to target domain label types. After the mapping, features are shared across domains and we can apply this technique.

Transferring model parameters
In this approach, we train a HUCRF on the source domain and transfer the learned parameters to initialize the training process on the target domain. This can be helpful for at least two reasons:
1. The resulting model will have parameters for feature types observed in the source domain as well as the target domain. Thus it has better feature coverage.
2. If the training objective is non-convex, this initialization can be helpful in avoiding bad local optima.
Since the training objective of HUCRFs is nonconvex, both benefits can apply. We show in our experiments that this is indeed the case: the model benefits from both better feature coverage and better initialization.
Note that in order to use this approach, we need to map source domain label types to target domain label types so that we know which parameter in the source domain corresponds to which parameter in the target domain. This can be a many-to-one, one-to-many, or one-to-one mapping depending on the label sets.

Pretraining with HUCRFs
In fact, pretraining HUCRFs in the source domain can be done in various ways. Recall that there are two parameter types: $\theta \in \mathbb{R}^d$ for scoring observations and hidden states and $\gamma \in \mathbb{R}^{d'}$ for scoring hidden states and labels (Eq. (1)). In pretraining, we first train a model $(\theta_1, \gamma_1)$ on the source data. Then we train a model $(\theta_2, \gamma_2)$ on the target data, initialized from the source model. Here, we can choose to initialize only $\theta_2 \leftarrow \theta_1$ and discard the parameters for hidden states and labels, since the label sets may not be the same. The $\theta_1$ parameters model the hidden structures in the source domain data and serve as a good initialization point for learning the $\theta_2$ parameters in the target domain. This can be helpful if the mapping between the label types in the source data and the label types in the target data is unreliable. This process is illustrated in Figure 2.
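The two initialization choices can be sketched as follows, assuming for illustration that parameters are stored in dictionaries. The key layout, the function `init_target`, and the reduction of $\gamma$ to (hidden unit, label) pairs without label transitions are simplifying assumptions, not the authors' data structures.

```python
from typing import Dict, Optional, Tuple

Theta = Dict[Tuple[str, int], float]  # (observation feature, hidden unit) -> weight
Gamma = Dict[Tuple[int, str], float]  # (hidden unit, label) -> weight


def init_target(
    theta_src: Theta,
    gamma_src: Gamma,
    label_map: Optional[Dict[str, str]] = None,
) -> Tuple[Theta, Gamma]:
    """Build the starting point for target-domain training from a source model.

    theta is always copied. gamma is carried over only when a source-to-target
    label mapping is supplied; otherwise it is discarded and training on the
    target domain starts with no label weights.
    """
    theta_trg = dict(theta_src)
    gamma_trg: Gamma = {}
    if label_map is not None:
        for (unit, src_label), weight in gamma_src.items():
            if src_label in label_map:
                # With a many-to-one mapping, a later source label overwrites
                # an earlier one; a real system might average instead.
                gamma_trg[(unit, label_map[src_label])] = weight
    return theta_trg, gamma_trg


theta_1 = {("w(0)=9", 0): 0.5}
gamma_1 = {(0, "start-time"): 1.0, (0, "title"): -0.3}

theta_only = init_target(theta_1, gamma_1)
with_mapping = init_target(theta_1, gamma_1, {"start-time": "time"})
```

The first call corresponds to transferring only $\theta$, which needs no label mapping at all; the second also reuses the label weights for every source label that has a target counterpart.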

Automatic generation of label mappings
All methods described in Section 4 require a way to propagate the information in label types across different domains. A straightforward solution would be to manually construct such mappings by inspection. For instance, we can specify that start-time and end-time are grouped as the same label time, or that the label public-transportation-route in the PLACES domain maps to the label implicit-location in the CALENDAR domain. Instead, we propose a technique that automatically generates the label mappings. We induce vector representations for all label types through canonical correlation analysis (CCA), a powerful and flexible technique for deriving low-dimensional representations. We give a review of CCA in Section 5.1 and describe how we use the technique to construct label mappings in Section 5.2.

Canonical Correlation Analysis (CCA)
CCA is a general technique that operates on a pair of multi-dimensional variables. CCA finds k dimensions (k is a parameter to be specified) in which these variables are maximally correlated.
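Let $x_1 \ldots x_n \in \mathbb{R}^d$ and $y_1 \ldots y_n \in \mathbb{R}^{d'}$ be $n$ samples of the two variables. For simplicity, assume that these variables have zero mean. Then CCA computes the following for $i = 1 \ldots k$:

$$(u_i, v_i) = \arg\max_{u \in \mathbb{R}^d,\; v \in \mathbb{R}^{d'}} \frac{\sum_{l=1}^{n} (u^\top x_l)(v^\top y_l)}{\sqrt{\sum_{l=1}^{n} (u^\top x_l)^2}\,\sqrt{\sum_{l=1}^{n} (v^\top y_l)^2}}$$

subject to $\sum_{l=1}^{n} (u^\top x_l)(u_{i'}^\top x_l) = 0$ and $\sum_{l=1}^{n} (v^\top y_l)(v_{i'}^\top y_l) = 0$ for all $i' < i$. In other words, each $(u_i, v_i)$ is a pair of projection vectors such that the correlation between the projected variables $u_i^\top x_l$ and $v_i^\top y_l$ (now scalars) is maximized, under the constraint that this projection is uncorrelated with the previous $i - 1$ projections.
This is a non-convex problem due to the interaction between $u_i$ and $v_i$. Fortunately, a method based on singular value decomposition (SVD) provides an efficient and exact solution to this problem (Hotelling, 1936). The resulting solution $u_1 \ldots u_k \in \mathbb{R}^d$ and $v_1 \ldots v_k \in \mathbb{R}^{d'}$ can be used to project the variables from the original $d$- and $d'$-dimensional spaces to a $k$-dimensional space:

$$x \in \mathbb{R}^d \mapsto \bar{x} \in \mathbb{R}^k, \quad \bar{x}_i = u_i^\top x; \qquad y \in \mathbb{R}^{d'} \mapsto \bar{y} \in \mathbb{R}^k, \quad \bar{y}_i = v_i^\top y$$

The new $k$-dimensional representation of each variable now contains information about the other variable. The value of $k$ is usually selected to be much smaller than $d$ or $d'$, so the representation is typically also low-dimensional.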
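For reference, the SVD-based solution mentioned above can be written out as follows. This is the standard textbook form rather than a formula taken from this paper; the symbols $C_{XX}$, $C_{YY}$, $C_{XY}$, $\Omega$, $A$, $\Sigma$ and $B$ are introduced here, and the sketch assumes that $C_{XX}$ and $C_{YY}$ are invertible.

```latex
\begin{align*}
C_{XX} &= \sum_{l=1}^{n} x_l x_l^\top, \qquad
C_{YY} = \sum_{l=1}^{n} y_l y_l^\top, \qquad
C_{XY} = \sum_{l=1}^{n} x_l y_l^\top, \\
\Omega &= C_{XX}^{-1/2}\, C_{XY}\, C_{YY}^{-1/2} \in \mathbb{R}^{d \times d'}, \\
\Omega &\approx A \Sigma B^\top
  \quad \text{(rank-$k$ SVD with singular vectors $a_1 \ldots a_k$ and $b_1 \ldots b_k$)}, \\
u_i &= C_{XX}^{-1/2} a_i, \qquad v_i = C_{YY}^{-1/2} b_i, \qquad i = 1 \ldots k.
\end{align*}
```

Under these assumptions the $i$-th singular value in $\Sigma$ is the correlation attained by the pair $(u_i, v_i)$.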

Inducing label embeddings
We now describe how to use CCA to induce vector representations for label types. Using the same notation, let $n$ be the number of instances of labels in the entire data. Let $x_1 \ldots x_n$ be the original representations of the label samples and $y_1 \ldots y_n$ be the original representations of the sets of words spanned by the labels.
We employ the following definition for the original representations, for reasons we explain below. Let $d$ be the number of distinct label types and $d'$ be the number of distinct word types.
• $x_l \in \mathbb{R}^d$ is a zero vector in which the entry corresponding to the label type of the $l$-th instance is set to 1.
• $y_l \in \mathbb{R}^{d'}$ is a zero vector in which the entries corresponding to words spanned by the label are set to 1.
The motivation for this definition is that similar label types often share the same or similar words. For instance, consider two label types: start-time (the start time of a calendar event) and end-time (the end time of a calendar event). Each type is frequently associated with phrases about time. The phrases {"9 pm", "7", "8 am"} might be labeled as start-time; the phrases {"9 am", "7 pm"} might be labeled as end-time. In these examples, both label types share the words "am", "pm", "9", and "7" even though the phrases may not match exactly. Figure 3 gives the CCA algorithm for inducing label embeddings. It produces a $k$-dimensional vector for each label type corresponding to the CCA projection of the one-hot encoding of that label.
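One way to realize this procedure is sketched below with NumPy. It should be read as an approximation under stated assumptions rather than as the algorithm of Figure 3: the word covariance is replaced by its diagonal (per-word counts), each embedding is length-normalized, and the function name and toy data are invented for the example.

```python
import numpy as np


def label_embeddings(instances, k):
    """instances: list of (label, [words]). Returns {label: k-dim vector}."""
    labels = sorted({lab for lab, _ in instances})
    words = sorted({w for _, ws in instances for w in ws})
    label_index = {lab: i for i, lab in enumerate(labels)}
    word_index = {w: i for i, w in enumerate(words)}

    X = np.zeros((len(instances), len(labels)))  # one-hot label per instance
    Y = np.zeros((len(instances), len(words)))   # words spanned by the label
    for row, (lab, ws) in enumerate(instances):
        X[row, label_index[lab]] = 1.0
        for w in ws:
            Y[row, word_index[w]] = 1.0

    # Assumption: scale co-occurrence counts by diagonal count matrices.
    # This is exact for the one-hot labels and an approximation for words.
    label_counts = X.sum(axis=0)
    word_counts = Y.sum(axis=0)
    omega = (X.T @ Y) / np.sqrt(np.outer(label_counts, word_counts))

    U, _, _ = np.linalg.svd(omega, full_matrices=False)
    emb = U[:, :k]
    # Row normalization removes the per-label 1/sqrt(count) scale factor.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.where(norms > 1e-12, norms, 1.0)
    return {lab: emb[label_index[lab]] for lab in labels}


instances = [
    ("start-time", ["9", "pm"]),
    ("start-time", ["8", "am"]),
    ("end-time", ["9", "am"]),
    ("end-time", ["7", "pm"]),
    ("location", ["seattle"]),
    ("location", ["downtown", "seattle"]),
]
emb = label_embeddings(instances, k=2)
time_similarity = float(emb["start-time"] @ emb["end-time"])
place_similarity = float(emb["start-time"] @ emb["location"])
```

In this toy data start-time and end-time share most of their words and location shares none, so `time_similarity` comes out far larger than `place_similarity`.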

Discussion on alternative label representations
We point out that there are other options for inducing label representations besides CCA. For instance, one could simply use the sparse feature vector representation of each label. However, CCA's low-dimensional projection is computationally more convenient and arguably more generalizable. One can also consider training a predictive model similar to word2vec (Mikolov et al., 2013), but this requires significant implementation effort and a very long training time.
In contrast, CCA is simple, efficient, and effective and can be readily implemented. Also, CCA is theoretically well understood while methods inspired by neural networks are not.

Constructing label mappings
Vector representations of label types allow for natural solutions to the task of constructing label mappings.

Mapping to a coarse label set
Given a domain and the label types that occur in the domain, we can reduce the number of label types by simply clustering their vector representations. For instance, if the embeddings for start-time and end-time are close together, they will be grouped as a single label type. We run the k-means algorithm on the label embeddings to obtain this coarse label set. Table 1 shows examples of this clustering. It demonstrates that the CCA representations obtained by the procedure described in Section 5.2 are indeed informative of the labels' properties.
One tag, move earlier time, is used in requests to move an appointment to an earlier time. For example, in the query "move the dentist's appointment up by 30 minutes.", the phrase "30 minutes" is tagged with move earlier time. The role of this tag is very similar to the role of Travel time in PLACES (not Time) and Duration in ALARMS (not Start date), and CCA is able to recover this relation.
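The clustering step that produces the coarse label set can be sketched in plain Python. The embeddings below are made-up two-dimensional vectors, and seeding the centroids with the first `k` labels in sorted order is a simplification chosen only to keep the example deterministic; a practical run would use a better initialization.

```python
import math
from typing import Dict, Sequence


def kmeans_labels(
    embeddings: Dict[str, Sequence[float]], k: int, iters: int = 20
) -> Dict[str, int]:
    """Cluster label embeddings with k-means; returns {label: cluster id}."""
    names = sorted(embeddings)
    centroids = [list(embeddings[name]) for name in names[:k]]  # deterministic seed
    assignment: Dict[str, int] = {}
    for _ in range(iters):
        # Assignment step: each label goes to its nearest centroid.
        assignment = {
            name: min(range(k), key=lambda c: math.dist(embeddings[name], centroids[c]))
            for name in names
        }
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [embeddings[name] for name in names if assignment[name] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


toy_embeddings = {
    "start-time": (1.0, 0.0),
    "end-time": (0.9, 0.1),
    "location": (0.0, 1.0),
    "restaurant": (0.1, 0.9),
}
clusters = kmeans_labels(toy_embeddings, k=2)
```

Each cluster id then plays the role of one coarse label, so start-time and end-time in this toy example collapse into a single time-like type.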

Experiments
In this section, we turn to experimental findings to provide empirical support for our proposed methods.

Setup
To test the effectiveness of our approach, we apply it to a suite of eight Cortana personal assistant domains for slot sequence tagging tasks, where the goal is to find the correct semantic tagging of the words in a given user utterance. The data statistics and short descriptions are shown in Table 2. As the table indicates, the domains have very different granularity and diverse semantics.

Baselines
In all our experiments, we trained HUCRFs and used only n-gram features, including unigrams, bigrams, and trigrams within a window of five words (±2 words) around the current word, as binary feature functions. With these features, we compare the following methods for slot tagging:
• NoAdapt: train only on target training data.
• Union: train on the union of source and target training data.
• Daume: train with the feature duplication method described in 4.2.
• C2F: train with the coarse-to-fine prediction method described in 4.1.
• Pretrain: train with the pretraining method described in 4.3.1.
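The n-gram features used by all of these methods can be sketched as below. The exact feature templates are not spelled out in the text, so this assumes every contiguous unigram, bigram, and trigram that lies inside the ±2 window, named by its offsets relative to the current word; the function name and naming scheme are illustrative.

```python
from typing import List


def ngram_features(
    tokens: List[str], pos: int, window: int = 2, max_n: int = 3
) -> List[str]:
    """Binary n-gram features (n = 1..max_n) inside a +/-window span around pos."""
    feats = []
    for n in range(1, max_n + 1):
        # start ranges over every offset at which an n-gram still fits in the window
        for start in range(-window, window - n + 2):
            idx = [pos + start + i for i in range(n)]
            if idx[0] < 0 or idx[-1] >= len(tokens):
                continue  # the n-gram would run past the utterance boundary
            gram = " ".join(tokens[i] for i in idx)
            name = f"w({start})" if n == 1 else f"w({start}..{start + n - 1})"
            feats.append(f"{name}={gram}")
    return feats


utterance = ["wake", "me", "up", "at", "7", "am"]
feats = ngram_features(utterance, pos=4)  # features for the token "7"
```

A token with a full window on both sides yields 5 unigrams, 4 bigrams, and 3 trigrams; tokens near the utterance boundary yield fewer.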
To apply these methods (except for NoAdapt), we treat each of the eight domains in turn as the test domain, with one of the remaining seven domains as the source domain. As in the general domain adaptation setting, we assume that the source domain has a sufficient amount of labeled data but the target domain has an insufficient amount of labeled data. Specifically, for each test or target domain, we only use 10% of the training examples to simulate data scarcity. In the following experiments, we report the slot F-measure, using the standard CoNLL evaluation script.
To assess the quality of our automatic mapping methods via CCA described in Section 5, we compared against manually established mappings and also the mapping method of Li et al. (2011). The method of Li et al. (2011) is to associate each slot type with the aggregate active feature weight vectors based on an existing domain-specific slot tagger (a CRF). Manual mappings were performed ...
In contrast, the proposed CCA-based technique consistently outperforms the NoAdapt baselines by significant margins. More importantly, it also outperforms the manual results under all conditions. This is perhaps not so surprising: the CCA-derived mapping is completely data-driven, while human annotators have nothing but prior linguistic knowledge about the slot tags and the domain.

Main Results
The full results are shown in Table 5, where all pairs of source and target domains are considered for domain adaptation. It is clear from the table that we can always achieve better results using adaptation techniques than the non-adapted models trained only on the target data. Also, our proposed pretraining method outperforms other types of adaptation in most cases.
The overall results of our experiments are shown in Table 4. In this experiment, we compare different adaptation techniques using our suggested CCA-based mapping. Here, except for NoAdapt, we use both the target and the nearest source domain data. To find the nearest domain, we first map the fine-grained label set to a coarse label set using the method described in Section 5.4.1 and then count how often each coarse label is used in a domain. We then find the nearest source domain by calculating the ℓ2 distance between the multinomial distributions of the source domain and the target domain over the set of coarse labels.
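The nearest-domain selection can be sketched as follows. The reading of "count" as relative frequency of coarse label occurrences, the function names, and the toy label lists are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def coarse_distribution(coarse_labels: List[str]) -> Dict[str, float]:
    """Relative frequency of each coarse label in a domain's annotations."""
    counts = Counter(coarse_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def l2_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Euclidean distance between two distributions over coarse labels."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2 for key in keys))


def nearest_source(target: List[str], sources: Dict[str, List[str]]) -> str:
    """Name of the source domain whose coarse-label distribution is closest."""
    p = coarse_distribution(target)
    return min(sources, key=lambda name: l2_distance(p, coarse_distribution(sources[name])))


# Toy coarse-label occurrences per domain (illustrative, not the paper's data).
calendar = ["time", "time", "time", "name"]
candidates = {
    "REMINDER": ["time", "time", "name", "name"],
    "PLACES": ["location", "location", "name", "time"],
}
best = nearest_source(calendar, candidates)
```

With these toy counts the time-heavy REMINDER domain is selected for CALENDAR.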
For example, for CALENDAR, we identify REMINDER as the nearest domain and vice versa because most of their labels are attributes related to time. In all experiments, the domain-adapted models perform better than using only target domain data, which achieves a 75.1% F1 score. Simply combining the source and target domains using our automatically mapped slot labels performs slightly better than the baseline. C2F boosts the performance up to 77.61% and Daume is able to reach 78.99%. Finally, our proposed method, Pretrain, achieves an F1 score of 81.02%.

Conclusion
We presented an approach to take advantage of existing annotations when the data are similar but the label sets are different. This approach was based on label embeddings from CCA, which reduces the setting to a standard domain adaptation problem. Combined with a novel pretraining scheme applied to hidden-unit CRFs, our approach is shown to be superior to strong baselines in extensive experiments for slot tagging on eight distinct personal assistant domains.