A General Regularization Framework for Domain Adaptation

We propose a domain adaptation framework, and formally prove that it generalizes the feature augmentation technique of Daumé III (2007) and the multi-task regularization framework of Evgeniou and Pontil (2004). We show that our framework is strictly more general than these approaches and allows practitioners to tune hyper-parameters to encourage transfer between close domains and avoid negative transfer between distant ones.


Introduction
Domain adaptation (DA) is an important problem that has received substantial attention in natural language processing (Blitzer et al., 2006; Daumé III, 2007; Finkel and Manning, 2009; Daumé III et al., 2010). In this paper, we propose a novel regularization framework which allows DA practitioners to tune hyper-parameters to encourage transfer between close domains, and avoid negative transfer (Rosenstein et al., 2005) between distant ones. In our framework, model parameters in multiple domains are learned jointly and constrained to remain close to one another. In the transfer learning taxonomy (Pan and Yang, 2010), our framework falls under the parameter-transfer category for multi-task inductive learning. We show that our framework generalizes the frustratingly easy domain adaptation (FEDA) approach of Daumé III (2007) and Finkel and Manning (2009), as well as the regularized multi-task learning of Evgeniou and Pontil (2004). At the same time, it provides us with hyper-parameters to control the amount of transfer between domains.

Domain Adaptation Framework
Given labeled data from $N$ domains, $D_1, \ldots, D_N$, traditional machine learning maximizes the following objective function for each domain $D_i$:

$$\mathcal{L}_i(D_i; w_i) - \lambda_i \|w_i\|^2 \tag{1}$$

and we maximize it by tuning the parameter vector $w_i$. For example, $\mathcal{L}_i$ can be the log-likelihood or the negative hinge loss. The term $\lambda_i \|w_i\|^2$ is the $L_2$ regularization term, where $\lambda_i$ is a positive scalar. In our framework, we propose to maximize

$$\sum_{i=1}^{N} \left[ \mathcal{L}_i(D_i; w_i) - \eta_{0,i} \|w_i\|^2 \right] - \sum_{1 \le j < k \le N} \eta_{j,k} \|w_j - w_k\|^2 \tag{2}$$

where $\eta_{j,k}$ are parameters controlling the transfer between domains. In the next sections, we show how our framework generalizes existing works.
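A minimal sketch of how such a joint objective could be evaluated is given below. It assumes the regularizer combines a within-domain penalty $\eta_{0,i}\|w_i\|^2$ with a pairwise penalty $\eta_{j,k}\|w_j - w_k\|^2$ for $1 \le j < k \le N$. The names `rf_objective`, `sq_dist`, and `domain_scores` are hypothetical, and each entry of `domain_scores` stands in for whatever per-domain term $\mathcal{L}_i$ (log-likelihood, negative hinge loss) a model supplies.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance ||u - v||^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def rf_objective(
    weights: List[Vector],
    domain_scores: List[Callable[[Vector], float]],
    eta: Dict[Tuple[int, int], float],
) -> float:
    """Sum of per-domain scores minus the within-domain and pairwise penalties.

    weights[i - 1] holds w_i for domain i = 1..N (assumed layout).
    eta[(0, i)] weights ||w_i||^2; eta[(j, k)] with 1 <= j < k weights
    ||w_j - w_k||^2.
    """
    n = len(weights)
    zero = [0.0] * len(weights[0])
    total = sum(score(w) for score, w in zip(domain_scores, weights))
    for i in range(1, n + 1):
        total -= eta[(0, i)] * sq_dist(weights[i - 1], zero)
    for j in range(1, n + 1):
        for k in range(j + 1, n + 1):
            total -= eta[(j, k)] * sq_dist(weights[j - 1], weights[k - 1])
    return total


# Toy usage: two domains, constant per-domain scores, so only the
# regularizer contributes to the value.
example_value = rf_objective(
    weights=[[1.0, 0.0], [0.0, 1.0]],
    domain_scores=[lambda w: 0.0, lambda w: 0.0],
    eta={(0, 1): 1.0, (0, 2): 2.0, (1, 2): 0.5},
)
```

Setting every `eta[(j, k)]` with `j >= 1` to zero removes the coupling between domains, so each domain is then trained independently.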

Frustratingly Easy DA
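The FEDA approach was introduced by Daumé III (2007) and later formalized by Finkel and Manning (2009) within a hierarchical Bayesian DA framework. While simple, the approach has often been shown to be effective. In this section, we show that our framework generalizes the FEDA approach.

The FEDA approach defines a new augmented feature space by duplicating each feature in $D_i$ to a "general" domain. Therefore each parameter in $w_i$ has a corresponding parameter in $w_0$, and:

$$\mathcal{L}_i(D_i; w_0, w_i) = \mathcal{L}_i(D_i; w_0 + w_i)$$

This directly leads to the following remark:

Remark For all $i$, for any $w_i, w_0, d \in \mathbb{R}^m$:

$$\mathcal{L}_i(D_i; w_0, w_i) = \mathcal{L}_i(D_i; w_0 - d, w_i + d)$$

The complete objective function involving $N$ ($N \ge 2$) domains is defined as follows:

$$\sum_{i=1}^{N} \mathcal{L}_i(D_i; w_0, w_i) - \lambda_0 \|w_0\|^2 - \sum_{i=1}^{N} \lambda_i \|w_i\|^2$$

We first prove the following relation:

Lemma 2.1 If $(w_0^*, w_1^*, \ldots, w_N^*)$ maximizes the complete objective function above, where $\lambda_0, \lambda_1, \ldots, \lambda_N > 0$, then:

$$\lambda_0 w_0^* = \sum_{i=1}^{N} \lambda_i w_i^* \tag{4}$$

Proof Let us introduce the vector $d$ as follows:

$$d = \frac{\lambda_0 w_0^* - \sum_{i=1}^{N} \lambda_i w_i^*}{\sum_{i=0}^{N} \lambda_i}$$

Based on the remark, replacing $(w_0^*, w_1^*, \ldots, w_N^*)$ by $(w_0^* - d, w_1^* + d, \ldots, w_N^* + d)$ leaves every $\mathcal{L}_i$ unchanged, so the objective changes only through the regularization terms, by the amount

$$\Delta = \left[ \lambda_0 \|w_0^*\|^2 + \sum_{i=1}^{N} \lambda_i \|w_i^*\|^2 \right] - \left[ \lambda_0 \|w_0^* - d\|^2 + \sum_{i=1}^{N} \lambda_i \|w_i^* + d\|^2 \right] = \Big( \sum_{i=0}^{N} \lambda_i \Big) \|d\|^2 \ge 0$$

Since $(w_0^*, w_1^*, \ldots, w_N^*)$ is optimal, the objective cannot increase, so $\Delta \le 0$. Hence, $\Delta = 0$, implying $\|d\| = 0$ and so $d = 0$. From the definition of $d$, Equation 4 holds.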
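The feature duplication described above can be sketched as follows. This is an illustration rather than the authors' implementation: the function names `augment` and `score`, the `general::` and per-domain key prefixes, and the sparse-dictionary feature representation are all assumptions made for the example. The sketch checks that scoring an augmented example with separate general and domain-specific weights gives the same value as scoring the original example with the summed weights $w_0 + w_i$.

```python
from typing import Dict

Features = Dict[str, float]


def augment(features: Features, domain: str) -> Features:
    """Copy every feature into a shared 'general' block and a domain block."""
    out: Features = {}
    for name, value in features.items():
        out[f"general::{name}"] = value
        out[f"{domain}::{name}"] = value
    return out


def score(weights: Features, features: Features) -> float:
    """Sparse dot product; missing weights count as zero."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Toy example with a hypothetical domain called "news".
x = {"a": 2.0, "b": -1.0}
w0 = {"a": 0.5, "b": 1.0}   # general weights
wi = {"a": 1.5, "b": 0.0}   # weights specific to "news"

# Weights laid out in the augmented feature space.
augmented_weights = {f"general::{k}": v for k, v in w0.items()}
augmented_weights.update({f"news::{k}": v for k, v in wi.items()})

augmented_score = score(augmented_weights, augment(x, "news"))
summed_score = score({k: w0[k] + wi[k] for k in w0}, x)
```

Because only the sum $w_0 + w_i$ enters the score, shifting a vector $d$ from $w_0$ into every $w_i$ changes nothing except the regularization terms, which is the observation the remark relies on.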
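Next we state the following lemma (see supplementary material for the proof).

Lemma 2.2 For any vectors $w_0, w_1, \ldots, w_N \in \mathbb{R}^m$, if $\lambda_0 w_0 = \sum_{i=1}^{N} \lambda_i w_i$, then the following always holds:

$$\lambda_0 \|w_0\|^2 + \sum_{i=1}^{N} \lambda_i \|w_i\|^2 = \frac{1}{\sum_{l=0}^{N} \lambda_l} \left[ \sum_{i=1}^{N} \lambda_0 \lambda_i \|w_0 + w_i\|^2 + \sum_{1 \le j < k \le N} \lambda_j \lambda_k \|w_j - w_k\|^2 \right]$$

Now we state and prove the following theorem, which shows that our framework generalizes FEDA.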
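Theorem 2.3 If $\eta_{j,k} = \lambda_j \lambda_k / \sum_{l=0}^{N} \lambda_l$ for all $0 \le j < k \le N$, the following holds:

$$\max_{w_0, w_1, \ldots, w_N} \left[ \sum_{i=1}^{N} \mathcal{L}_i(D_i; w_0, w_i) - \lambda_0 \|w_0\|^2 - \sum_{i=1}^{N} \lambda_i \|w_i\|^2 \right] = \max_{w_1, \ldots, w_N} \left[ \sum_{i=1}^{N} \left( \mathcal{L}_i(D_i; w_i) - \eta_{0,i} \|w_i\|^2 \right) - \sum_{1 \le j < k \le N} \eta_{j,k} \|w_j - w_k\|^2 \right]$$

Proof Write $F$ for the objective of the first optimization problem and $G$ for that of the second. Let $(w_0^*, w_1^*, \ldots, w_N^*)$ be a solution to the first optimization problem. By Lemma 2.1, we have $\lambda_0 w_0^* = \sum_{i=1}^{N} \lambda_i w_i^*$. Let $(u_1^*, \ldots, u_N^*)$ be an optimal solution to the second problem. Given the relation between $\eta_{i,j}$ and $\lambda_0, \lambda_1, \ldots, \lambda_N$, the vectors $v_0 = \sum_{i=1}^{N} \lambda_i u_i^* / \sum_{l=0}^{N} \lambda_l$ and $v_i = u_i^* - v_0$ satisfy $v_0 + v_i = u_i^*$ and $\lambda_0 v_0 = \sum_{i=1}^{N} \lambda_i v_i$. Based on these and Lemma 2.2, we have:

$$F(w_0^*, w_1^*, \ldots, w_N^*) = G(w_0^* + w_1^*, \ldots, w_0^* + w_N^*) \le G(u_1^*, \ldots, u_N^*) = F(v_0, v_1, \ldots, v_N) \le F(w_0^*, w_1^*, \ldots, w_N^*)$$

so both inequalities hold with equality and the two maxima coincide.

This formally shows that FEDA is equivalent to solving the objective function given in Equation 2. In this new optimization problem, if we drop the terms involving $\eta_{j,k}$ for $j \neq 0$, we have:

$$\sum_{i=1}^{N} \left[ \mathcal{L}_i(D_i; w_i) - \eta_{0,i} \|w_i\|^2 \right]$$

This is learning without domain adaptation. The additional regularization terms allow us to keep the parameters from different domains close to one another. In the special case with two domains, if we use the same $\lambda$ for all regularization terms, we have the following corollary:

Corollary 2.4 For any $\lambda > 0$:

$$\max_{w_0, w_1, w_2} \left[ \mathcal{L}_1(D_1; w_0, w_1) + \mathcal{L}_2(D_2; w_0, w_2) - \lambda \left( \|w_0\|^2 + \|w_1\|^2 + \|w_2\|^2 \right) \right] = \max_{w_1, w_2} \left[ \mathcal{L}_1(D_1; w_1) + \mathcal{L}_2(D_2; w_2) - \tfrac{\lambda}{3} \left( \|w_1\|^2 + \|w_2\|^2 + \|w_1 - w_2\|^2 \right) \right]$$

Hence, the FEDA feature augmentation technique indirectly introduces a regularization term that pushes the source and target parameters as close as possible. This is related to the technique of Chelba and Acero (2006), who regularize the model parameters for the target domain using the term $\lambda \|w - w_s\|$, where $w_s$ is the parameter vector learned from the source domain. The difference is that in their work the parameters for the source domain are learned first and then fixed. The relation between their work and the feature augmentation technique was also briefly discussed by Daumé III (2007). Here we have formally established the precise relation.

Evgeniou and Pontil (2004) proposed multi-task regularized learning using support vector machines (SVM). They decomposed the model weight vector as a sum of domain-specific vectors and a general vector, in much the same way as FEDA. Hence, both Lemma 2.1 and Theorem 2.3 of this paper apply, and our framework also generalizes multi-task regularized learning.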
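The equivalence of the two regularizers can be checked numerically. The sketch below takes $\eta_{j,k} = \lambda_j \lambda_k / \sum_l \lambda_l$, the value obtained by minimizing the FEDA regularizer over the shared vector $w_0$ in closed form; treat this formula as an assumption of the sketch. All function names are hypothetical. Here `u[i - 1]` plays the role of the combined vector $w_0 + w_i$ for domain $i$.

```python
import random
from typing import List

Vector = List[float]


def sq_norm(v: Vector) -> float:
    return sum(x * x for x in v)


def feda_penalty(w0: Vector, u: List[Vector], lam: List[float]) -> float:
    """lam[0]*||w0||^2 + sum_i lam[i]*||u_i - w0||^2, where w_i = u_i - w0."""
    total = lam[0] * sq_norm(w0)
    for i, u_i in enumerate(u, start=1):
        total += lam[i] * sq_norm([a - b for a, b in zip(u_i, w0)])
    return total


def pairwise_penalty(u: List[Vector], lam: List[float]) -> float:
    """Within-domain plus pairwise penalty with eta_{j,k} = lam_j*lam_k/sum(lam)."""
    lam_sum = sum(lam)
    n = len(u)
    total = 0.0
    for i in range(1, n + 1):
        total += lam[0] * lam[i] / lam_sum * sq_norm(u[i - 1])
    for j in range(1, n + 1):
        for k in range(j + 1, n + 1):
            diff = [a - b for a, b in zip(u[j - 1], u[k - 1])]
            total += lam[j] * lam[k] / lam_sum * sq_norm(diff)
    return total


def best_shared_vector(u: List[Vector], lam: List[float]) -> Vector:
    """Minimizer of feda_penalty over w0: the lambda-weighted mean of the u_i."""
    lam_sum = sum(lam)
    return [
        sum(lam[i] * u[i - 1][d] for i in range(1, len(u) + 1)) / lam_sum
        for d in range(len(u[0]))
    ]


rng = random.Random(0)
lam = [rng.uniform(0.1, 5.0) for _ in range(4)]               # lambda_0..lambda_3
u = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(3)]

w0_star = best_shared_vector(u, lam)
gap = abs(feda_penalty(w0_star, u, lam) - pairwise_penalty(u, lam))
```

With the shared vector chosen optimally, `gap` should be zero up to floating-point error, and any other choice of `w0` gives a strictly larger FEDA penalty.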

Experimental Results
In this section we apply our framework to both structured and unstructured tasks. For structured prediction, we use the named-entity recognition (NER) ACE-2005 dataset with 7 classes and 6 domains. We apply the linear-chain CRF (Lafferty et al., 2001), and show results using the standard CRF and the softmax-margin CRF (SM-CRF) (Gimpel and Smith, 2010), with features consisting of word shape features, neighboring words, previous prediction and prefixes/suffixes. The second task is sentiment classification on the Amazon review data set (Blitzer et al., 2007) from 4 domains, labeled positive or negative. We apply logistic regression (LR) and SVM using unigram and bigram features. All the models used in this section are implemented on top of a common framework, which was also used to implement various structured prediction models previously (Lu, 2015; Lu and Roth, 2015; Muis and Lu, 2016). For each task we compare four settings: TGT, trained only on the specific domain data; ALL, trained on the data from all domains; AUG, the FEDA approach; and RF, our proposed regularization framework.
We use a 40/30/30 train/development/test split and report the results on the test set. The regularization parameters were tuned on the development set over a logarithmic scale from $10^{-3}$ to $10^{3}$. For our framework, we used random search to tune the parameters, since an exhaustive search is too expensive (21 parameters for 6 domains). We choose the within-domain $\eta_{0,i}$ to be close to those used for the ALL and AUG models, while choosing the other $\eta_{j,k}$ to be 1-2 orders of magnitude higher. A good model could quickly be found that generally beats the baselines on the development set and also generalizes well to the test set. We show the results for NER in Table 1 and for the sentiment task in Table 2.
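A random search of this kind could look roughly like the sketch below. It is a simplification under stated assumptions: every $\eta_{j,k}$ is drawn independently and log-uniformly from $[10^{-3}, 10^{3}]$, whereas the procedure described above additionally anchors the within-domain values near those of the baselines. The names `sample_eta` and `random_search` are hypothetical, and `dev_score` is a placeholder callback that would train a model with the given parameters and return its development-set score.

```python
import random
from typing import Callable, Dict, Optional, Tuple

Eta = Dict[Tuple[int, int], float]


def sample_eta(
    n_domains: int,
    rng: random.Random,
    low_exp: float = -3.0,
    high_exp: float = 3.0,
) -> Eta:
    """Draw one eta_{j,k} per pair 0 <= j < k <= n_domains, log-uniformly."""
    eta: Eta = {}
    for j in range(0, n_domains + 1):
        for k in range(j + 1, n_domains + 1):
            eta[(j, k)] = 10.0 ** rng.uniform(low_exp, high_exp)
    return eta


def random_search(
    n_domains: int,
    dev_score: Callable[[Eta], float],
    trials: int,
    seed: int = 0,
) -> Tuple[Optional[Eta], float]:
    """Keep the sampled configuration with the highest development score."""
    rng = random.Random(seed)
    best_eta: Optional[Eta] = None
    best_score = float("-inf")
    for _ in range(trials):
        eta = sample_eta(n_domains, rng)
        current = dev_score(eta)
        if current > best_score:
            best_eta, best_score = eta, current
    return best_eta, best_score
```

For 6 domains, `sample_eta` produces 6 within-domain values plus 15 pairwise values, matching the 21 parameters mentioned above.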

Our proof did not require any assumption about $\mathcal{L}$, as long as $L_2$ regularization is used. This means our result is applicable to a variety of models such as SVM, LR, and CRF (where $L_2$ regularization is used for the latter two models). Theoretically, we have shown the equivalence of the DA optimization problems. Empirically, for non-convex objectives, different approaches may arrive at different solutions. However, for convex loss functions, our objective (Equation 2) is also convex, and all approaches should share the same solution.
We have shown that the FEDA optimization problem can be mapped to our framework. The converse is false: a problem in this family (with arbitrary choices of $\eta$) can be solved using FEDA only if there are only 2 domains, or if all regularization hyper-parameters are equal. Some parameter configurations in this family are therefore "unreachable" by the feature augmentation technique. This is because in Theorem 2.3 the values of the $\eta$'s are defined in terms of the $\lambda$'s and therefore possess certain properties. For example, they must at least satisfy constraints such as $\eta_{i,k} / \eta_{k,j} = \eta_{i,l} / \eta_{l,j}$ for any $i < k, l < j$. We have seen that some of those unreachable problems could give us better empirical results. Can we find an alternative simple adaptation method such that all problems in this family are "reachable"? We leave this question to future research.
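A rough way to test whether a given configuration could be reachable is sketched below. It assumes that FEDA induces $\eta_{j,k} = \lambda_j \lambda_k / \sum_l \lambda_l$, in which case $\eta_{i,k}\,\eta_{l,j} = \eta_{i,l}\,\eta_{k,j}$ for any four distinct indices. The check is only a necessary condition, not a proof of reachability, and the names `feda_eta`, `lookup`, and `satisfies_ratio_constraint` are hypothetical.

```python
import math
from itertools import permutations
from typing import Dict, List, Tuple

Eta = Dict[Tuple[int, int], float]


def feda_eta(lam: List[float]) -> Eta:
    """eta_{j,k} = lam_j * lam_k / sum(lam) for 0 <= j < k <= N (assumed form)."""
    total = sum(lam)
    n = len(lam) - 1
    return {
        (j, k): lam[j] * lam[k] / total
        for j in range(n + 1)
        for k in range(j + 1, n + 1)
    }


def lookup(eta: Eta, a: int, b: int) -> float:
    """Symmetric access: eta is stored only for a < b."""
    return eta[(a, b)] if a < b else eta[(b, a)]


def satisfies_ratio_constraint(eta: Eta, n_domains: int, rel_tol: float = 1e-9) -> bool:
    """Check eta_{i,k} * eta_{l,j} == eta_{i,l} * eta_{k,j} for distinct indices."""
    for i, k, l, j in permutations(range(n_domains + 1), 4):
        lhs = lookup(eta, i, k) * lookup(eta, l, j)
        rhs = lookup(eta, i, l) * lookup(eta, k, j)
        if not math.isclose(lhs, rhs, rel_tol=rel_tol):
            return False
    return True


reachable = feda_eta([1.0, 2.0, 3.0, 4.0])   # three domains plus index 0
perturbed = dict(reachable)
perturbed[(1, 2)] *= 10.0                     # strengthen one pairwise coupling
```

With only two domains there are just three indices, so no set of four distinct indices exists and the check passes for any configuration, which agrees with the two-domain case noted above.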

Conclusion
In this paper, we presented a framework for domain adaptation that generalizes several previous works (Daumé III, 2007; Finkel and Manning, 2009; Evgeniou and Pontil, 2004). Our approach allows practitioners to specify the amount of transfer between domains via regularization hyper-parameters. These parameters could be tuned based on intuition or using held-out data. Future work could seek methods that optimize these parameters automatically. The supplementary material of this paper is available at http://statnlp.org/research/ml/.