Domain-Invariant Feature Distillation for Cross-Domain Sentiment Classification

Cross-domain sentiment classification has drawn much attention in recent years. Most existing approaches focus on learning domain-invariant representations in both the source and target domains, while few of them pay attention to the domain-specific information. Despite the non-transferability of the domain-specific information, simultaneously learning domain-dependent representations can facilitate the learning of domain-invariant representations. In this paper, we focus on aspect-level cross-domain sentiment classification, and propose to distill the domain-invariant sentiment features with the help of an orthogonal domain-dependent task, i.e. aspect detection, which is built on the aspects varying widely in different domains. We conduct extensive experiments on three public datasets and the experimental results demonstrate the effectiveness of our method.


Introduction
Sentiment classification based on deep learning methods has developed rapidly in recent years. While achieving outstanding performance, these methods need large-scale datasets with sentiment polarity labels to train a robust sentiment classifier. However, in most cases, large-scale labeled datasets are not available in practice and manual annotation is costly. One solution to this problem is cross-domain sentiment classification, which aims to exploit the rich labeled data in one domain, i.e. the source domain, to help the sentiment analysis task in another domain with little or no labeled data, i.e. the target domain. The rationale for this solution is that the source domain and target domain share some domain-invariant knowledge that can be transferred across domains.
[Figure 1: Example sentences. Source domain: "The fried rice is amazing here." Target domain: "Surprisingly, Britney Spears is amazing."]
Previous works on cross-domain sentiment classification mainly focus on learning domain-invariant representations in both the source and target domains, based either on manual feature selection (Blitzer et al., 2006; Pan et al., 2010) or on automatic representation learning (Glorot et al., 2011; Chen et al., 2012; Ganin and Lempitsky, 2015). The sentiment classifier, which makes decisions based on the domain-invariant features and receives supervisory signals from the source domain, can also be applied to the target domain. We can draw an empirical conclusion: the better the domain-invariant features a method obtains, the better the performance it achieves. However, few studies explore the use of domain-specific information, which is also helpful for cross-domain sentiment classification. Peng et al. (2018) propose to extract the domain-invariant and domain-dependent features of the target domain data and train two classifiers accordingly, but they require a few sentiment polarity labels in the target domain, which limits the practical application of the method.
* Work performed while interning at IBM Research - China.
In this paper, we exploit the domain-specific information by adding an orthogonal domain-dependent task to "distill" the domain-invariant features for cross-domain sentiment classification. The proposed method, domain-invariant feature distillation (DIFD), does not need any sentiment polarity labels in the target domain, which is more consistent with practical settings. Specifically, we focus on aspect-level cross-domain sentiment classification, and train a shared sentiment classifier and two separate aspect detectors, one for the source domain and one for the target domain. We argue that aspect detection is an orthogonal domain-dependent task with respect to sentiment classification. As shown in Figure 1, given an input sentence, the sentiment classifier predicts its sentiment polarity based on the opinion words shared by different domains, while the aspect detector identifies the aspect terms, which vary significantly across domains. The information on which the two tasks depend is mutually exclusive in the sentence, i.e. orthogonal. Therefore, by training these two tasks simultaneously, the aspect detectors will try to strip the domain-specific features from the input sentence and make the domain-invariant features purer, which is helpful for cross-domain sentiment classification.
Moreover, we design two effective modules to boost the distillation process. One is the word-level context allocation mechanism. It modulates the importance of the words in the input sentence according to the property of different tasks. The other is the domain classifier. It tries to correctly judge which domain the domain-invariant feature comes from, while the other modules in the proposed method try to "fool" it, and the whole framework is trained in an adversarial way.
To summarize, the main contributions of our paper are as follows:
• We distill the domain-invariant sentiment features to improve cross-domain sentiment classification by simultaneously training an aspect detection task that strips the domain-specific aspect features from the input sentence.
• We boost the separation of the domain-invariant and domain-specific features with two effective modules: the context allocation mechanism and the domain classifier.
• Experimental results demonstrate the effectiveness of the proposed method, and we further verify the rationality of the context allocation mechanism by visualization.

Related Work
Cross-domain sentiment analysis: Many domain adaptation methods have been proposed for sentiment analysis. SCL (Blitzer et al., 2006) learns correspondences among features from different domains. SFA (Pan et al., 2010) aims at reducing the gap between domains by constructing a bipartite graph to model the co-occurrence relationship between domain-specific words and domain-independent words. SDA (Glorot et al., 2011) learns to extract a meaningful representation for each review in an unsupervised fashion. mSDA (Chen et al., 2012) is an efficient method to marginalize noise and learn features. The Gradient Reversal Layer (GRL) (Ganin and Lempitsky, 2015; Ganin et al., 2016) is employed to learn domain-invariant representations by fooling the domain classifier. Replacing gradient reversal with alternating minimization (Shu et al., 2018) stabilizes domain adversarial training, and we adopt this method for our adversarial training.
Aspect-level sentiment domain adaptation: To the best of our knowledge, there are two works on aspect-related cross-domain sentiment classification. Li et al. (2019) propose a method that employs abundant aspect-category data to assist the scarce aspect-term level sentiment prediction. IATN was proposed to address the fact that aspects have different effects in different domains. This method predicts sentiment polarity for the whole sentence rather than for a specific aspect. Our method concentrates on aspect-term level sentiment domain adaptation by separating the domain-specific aspect features. Bousmalis et al. (2016) and Liu et al. (2017) separate features into two subspaces by introducing constraints on the learned features. The difference is that our method is more fine-grained and utilizes explicit aspect knowledge.
Auxiliary task for sentiment domain adaptation: Auxiliary tasks have been employed to improve cross-domain sentiment analysis. Yu and Jiang (2016) use two pivot-prediction auxiliary tasks to help induce a sentence embedding that works well across domains for sentiment classification. Yu and Jiang (2017) propose to jointly learn domain-independent sentence embeddings with auxiliary tasks that predict sentiment scores of domain-independent words. Chen et al. (2018)

Methodology

Formulation and Overview
Suppose the source domain contains labeled data $D_s = \{(x_s^k, a_s^k), y_s^k\}_{k=1}^{N_s}$, and the target domain contains unlabeled data $D_t = \{(x_t^k, a_t^k)\}_{k=1}^{N_t}$, where $x$ is a sentence, $a$ is one of the aspects in $x$, and $y$ is the sentiment polarity label of $a$. The proposed method handles two kinds of tasks. One is the main task, Aspect-level Sentiment Classification (ASC). It learns a mapping $F: \{x\} \to \{f\} \to \{y\}$ shared by the source and target domains, where $f$ is the domain-invariant feature of $x$. The other is the orthogonal domain-dependent task, Aspect Detection (AD). It learns a mapping $G_s: \{x_s\} \to \{z_s\} \to \{a_s\}$ in the source domain and a mapping $G_t: \{x_t\} \to \{z_t\} \to \{a_t\}$ in the target domain, where $z_s$ and $z_t$ are the domain-dependent features of $x_s$ and $x_t$ respectively. The domain-invariant and domain-dependent features are orthogonal, i.e. $f \perp z_s$ and $f \perp z_t$. We facilitate the distillation of $f$ by simultaneously learning $G_s$ and $G_t$, which try to strip $z_s$ and $z_t$ from $x$; a purer $f$ leads to a better $F$ for cross-domain sentiment classification.
Figure 2 illustrates the architecture of our method. Given an input sentence from either the source or the target domain, we first feed it into the sentence encoder to obtain its distributed representation. Then, the context allocation mechanism divides the distributed representation into two orthogonal parts: domain-invariant and domain-dependent features. Finally, the two orthogonal features are fed into their corresponding downstream tasks. Specifically, we input the domain-invariant feature into the sentiment classifier to predict the sentiment polarity, and input the domain-dependent feature into the aspect detector of the specific domain to identify the aspect terms. In addition, we add a domain classifier to the architecture. It tries to correctly judge which domain the domain-invariant feature comes from. The whole framework is trained in an adversarial way. Next, we introduce the components of our method in detail.

Sentence Encoder
Given an input sentence $x = \{w_1, w_2, \ldots, w_n\}$, we first map it into an embedding sequence $\hat{E} = \{e_1, e_2, \ldots, e_n\} \in \mathbb{R}^{n \times d_e}$. Then we inject the positional information of each token in $x$ into $\hat{E}$ to obtain the final embedded representation $E$, following the Position Encoding (PE) method of Vaswani et al. (2017):
$$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_e}\right)$$
$$PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_e}\right)$$
where $pos$ is the word position in the sentence and $i$ indexes the dimensions of the $d_e$-dimensional embedding. We consider that the injected positional information can facilitate aspect-level sentiment classification, based on the observation that sentiment words tend to be close to their related aspect terms (Tang et al., 2016; Chen et al., 2017).
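The position encoding above can be illustrated with a minimal sketch in plain Python. It assumes that positions are counted from 0 and that the encoding is injected by element-wise addition to the embeddings, as in Vaswani et al. (2017); the function names are hypothetical and chosen only for this illustration.

```python
import math


def positional_encoding(n: int, d_e: int) -> list[list[float]]:
    """Return an n x d_e table of sinusoidal position encodings."""
    table = [[0.0] * d_e for _ in range(n)]
    for pos in range(n):  # assumption: positions start at 0
        for j in range(d_e):
            i = j // 2  # dimension j = 2i uses sin, j = 2i + 1 uses cos
            angle = pos / (10000 ** (2 * i / d_e))
            table[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return table


def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Inject positions by element-wise addition (an assumed injection rule)."""
    n, d_e = len(embeddings), len(embeddings[0])
    pe = positional_encoding(n, d_e)
    return [[e + p for e, p in zip(row, pe_row)]
            for row, pe_row in zip(embeddings, pe)]
```

With this convention the first position receives the vector (0, 1, 0, 1, ...), so nearby words receive similar encodings and distant words receive increasingly different ones.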
Next we employ a Bi-directional LSTM (BiLSTM) (Graves et al., 2013) to encode $E$ into the contextualized sequence representation $H = [h_1, h_2, \ldots, h_n] \in \mathbb{R}^{n \times 2d_h}$, which preserves the contextual information of each token in the input sentence.
We refer to the embedding layer and the BiLSTM together as the sentence encoder, whose weights are shared by all tasks and domains. The advantages of sharing the weights are two-fold: first, different tasks in the same domain can benefit from each other in a multi-task manner; second, distilling the domain-invariant feature from a common transformation is simpler.

Context Allocation (CA)
In an input sentence, some words have a strong bias towards domain-specific information, such as the aspect terms, e.g. "pizza" in the Restaurant domain, while others carry domain-invariant knowledge, such as the opinion words, e.g. "amazing". Meanwhile, the ASC task and the AD task require orthogonal information, as discussed before. Therefore, we argue that different words contribute differently according to the property of the downstream task. To facilitate the distillation of the domain-invariant features, we propose a Context Allocation (CA) mechanism that assigns different weights to the same word in different downstream tasks. The values of the weights depend on how well the information contained in the word matches the need of the specific task. Concretely, at each time step $i$, the module divides the contextualized representation $h_i$ of word $w_i$ into the sentiment-dominant context $h_i^c$ and the aspect-dominant context $h_i^d$ according to a two-dimensional vector $\beta_i = (\beta_i^c, \beta_i^d)$. The vector $\beta_i$ is normalized, considering that the domain-specific information and the domain-invariant knowledge are mutually exclusive. It reflects the importance of $w_i$ to the ASC task and the AD task respectively, and is calculated from $h_i$ with the parameters $W_a \in \mathbb{R}^{2d_h \times 2d_h}$ and $W_b \in \mathbb{R}^{2 \times 2d_h}$. The whole division process over all time steps is expressed with $\beta = [\beta_1, \beta_2, \ldots, \beta_n]$. The sentiment-dominant context $H^c \in \mathbb{R}^{n \times 2d_h}$ and the aspect-dominant context $H^d \in \mathbb{R}^{n \times 2d_h}$ of the input sentence are then fed into the ASC task and the AD task respectively for downstream processing.
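The allocation step can be sketched as follows. This is an illustration under stated assumptions, not the exact formulation of the method: it assumes that $\beta_i$ is a softmax over $W_b \tanh(W_a h_i)$ and that each context is $h_i$ scaled by the corresponding component of $\beta_i$. The tanh nonlinearity, the scaling rule and all function names are assumptions; only the shapes of $W_a$ and $W_b$ and the normalization of $\beta_i$ come from the text.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_allocation(H, W_a, W_b):
    """Split each h_i into a sentiment-dominant and an aspect-dominant part.

    H is a list of n hidden vectors, W_a is square and W_b has two rows.
    Returns (H_c, H_d, betas) with betas[i] = (beta_c, beta_d).
    """
    H_c, H_d, betas = [], [], []
    for h in H:
        hidden = [math.tanh(u) for u in matvec(W_a, h)]  # assumed nonlinearity
        beta_c, beta_d = softmax(matvec(W_b, hidden))    # normalized pair
        betas.append((beta_c, beta_d))
        H_c.append([beta_c * x for x in h])  # assumed: scale h_i by beta_c
        H_d.append([beta_d * x for x in h])  # assumed: scale h_i by beta_d
    return H_c, H_d, betas
```

Because the two weights of a word sum to one under this assumption, the two contexts add back up to the original hidden state, which matches the view that a word's information is divided between the two tasks.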

Aspect-level Sentiment Classification (ASC) Task
Aspect-Opinion Attention: In the ASC task, we design an attention mechanism to model the relationship between the position of the aspect terms and their corresponding opinion words. For a specific aspect term, the domain-invariant feature based on the aspect-opinion attention contains more information about its corresponding opinion words, which benefits the final aspect-level sentiment classification. Specifically, we first calculate the position representation $h^a \in \mathbb{R}^{2d_h}$ of a specific aspect term from its position $x^a$ and the sentiment-dominant context $H^c$, where $x^a = \{0_1, \ldots, 1_{i+1}, \ldots, 1_{i+m}, \ldots, 0_n\}$ marks the word positions of the aspect subsequence in the input sentence $x$ with non-zero values and $m$ is the length of the aspect. The representation $h^a$ is then used to calculate the sentiment-dominant feature $f$, which is domain-invariant and should be aligned across the source and target domains. In this calculation, $\gamma_i$ reflects how much the word $w_i$ corresponds to the opinion on the aspect term, and $W_p \in \mathbb{R}^{2d_h \times 2d_h}$ and $b_p \in \mathbb{R}^1$ are the weight matrix and bias respectively.
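One way to realize the aspect-opinion attention is sketched below. The concrete form is an assumption made for illustration: $h^a$ is taken as the mean of the sentiment-dominant vectors at the aspect positions, $\gamma_i$ as a softmax over bilinear scores between each word vector and $h^a$ using $W_p$ and $b_p$, and $f$ as the $\gamma$-weighted sum of the word vectors. The function name and these three choices are hypothetical; only the shapes of $W_p$, $b_p$ and $x^a$ follow the description above.

```python
import math


def aspect_opinion_feature(H_c, x_a, W_p, b_p):
    """Return (f, gamma) for one aspect term.

    H_c: n sentiment-dominant vectors; x_a: 0/1 mask of aspect positions;
    W_p: square weight matrix; b_p: scalar bias.
    """
    d = len(H_c[0])
    m = sum(x_a)
    if m == 0:
        raise ValueError("the aspect mask must mark at least one word")
    # Assumed pooling: average the vectors at the aspect positions.
    h_a = [sum(mask * h[j] for mask, h in zip(x_a, H_c)) / m for j in range(d)]
    # Assumed scoring: bilinear form h_i^T W_p h_a + b_p for every word.
    projected = [sum(W_p[r][c] * h_a[c] for c in range(d)) for r in range(d)]
    scores = [sum(h[r] * projected[r] for r in range(d)) + b_p for h in H_c]
    # Normalize the scores into attention weights gamma.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    gamma = [e / total for e in exps]
    # Sentiment-dominant feature: attention-weighted sum of the word vectors.
    f = [sum(g * h[j] for g, h in zip(gamma, H_c)) for j in range(d)]
    return f, gamma
```

In this sketch a word receives a large $\gamma_i$ when its sentiment-dominant vector aligns with the aspect representation, so $f$ is dominated by the words most related to the given aspect.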
Sentiment Classification Loss: The sentiment-dominant features $f_s$ and $f_t$, generated from the source and target domains respectively, share the same sentiment classifier. Note that the source domain data has sentiment polarity labels, while the target domain is unlabeled. Thus we train the sentiment classifier only with the labeled data in the source domain, while using it for inference in both the source and target domains. The training objective of the sentiment classifier is to minimize a loss on the source domain dataset, denoted $L_s^c$, in which $y$ is the ground-truth sentiment polarity label. For simplicity, we omit the index of the instance in the loss.
Domain Adversarial Loss: The domain classifier maps the sentiment-dominant feature $f$ into a two-dimensional normalized value $y = (y_s, y_t)$, which indicates the probability that $f$ comes from the source and target domains respectively. The ground-truth domain label is $g_s = (1, 0)$ for instances in the source domain, and $g_t = (0, 1)$ for instances in the target domain. The training objective of the domain classifier is to minimize a loss on both the source and target domain datasets, denoted $L_a^{\theta_D}$. The part of our architecture that takes part in generating $f$ (including the Sentence Encoder, Context Allocation and Aspect-Opinion Attention in Figure 2) can be regarded as a domain-invariant feature extractor, which works with the domain classifier in an adversarial way. To further accelerate the distillation of the domain-invariant features, we also introduce an adversarial loss of the domain classifier for the feature extractor. Specifically, we calculate the loss in Equation 10 with flipped domain labels, inspired by Shu et al. (2018).
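The relation between the two adversarial losses can be made concrete with a small sketch. It assumes that the domain classification loss is a cross-entropy between the predicted pair $(y_s, y_t)$ and the one-hot domain label; the cross-entropy form and the function names are assumptions for illustration. The domain classifier is trained with the true labels, while the feature extractor is trained on the same predictions with the labels flipped.

```python
import math

SOURCE_LABEL = (1, 0)  # g_s in the text
TARGET_LABEL = (0, 1)  # g_t in the text


def cross_entropy(pred, label):
    """Cross-entropy between a predicted distribution and a one-hot label."""
    return -sum(g * math.log(p) for g, p in zip(label, pred) if g > 0)


def domain_classifier_loss(pred, from_source):
    """Loss for the domain classifier: use the true domain label."""
    label = SOURCE_LABEL if from_source else TARGET_LABEL
    return cross_entropy(pred, label)


def feature_extractor_loss(pred, from_source):
    """Adversarial loss for the feature extractor: use the flipped label."""
    label = TARGET_LABEL if from_source else SOURCE_LABEL
    return cross_entropy(pred, label)
```

For a source-domain feature that the domain classifier recognizes confidently, e.g. a prediction of (0.9, 0.1), the classifier loss is small while the flipped-label loss is large, which pushes the feature extractor towards features whose domain cannot be identified.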

Aspect Detection (AD) Task
We model the AD task as a sequence labeling problem, in which each word in the sentence is assigned a tag in {B, I, O}, indicating that the word is at the beginning (B) of an aspect term, inside (I) an aspect term, or any other word (O). In this way, we can detect all the aspect terms of an input sentence in one forward pass. Specifically, we first linearly transform the aspect-dominant hidden state $h^d$ into a three-dimensional vector. Then we calculate the aspect detection loss of the source domain, in which $y^d$ is the ground-truth aspect label, $n$ is the sentence length and $\lambda_l$ is the weight of each label. The weight $\lambda_l$ addresses the class imbalance problem, because the words labeled O usually make up the majority of a sentence. It is calculated dynamically in the training phase according to the ratio of the words with a specific label in each batch. We denote the corresponding loss of the AD task in the target domain as $L_t^d$.

Algorithm (training procedure):
1: repeat
2: Train all parameters except the Domain Classifier with $L$;
3: Train the Domain Classifier with $\lambda_a L_a^{\theta_D}$;
4: until performance on the validation set does not improve in 10 epochs.
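The B/I/O labeling and the label-weighted loss of the AD task can be sketched as follows. The sketch assumes a weighted cross-entropy averaged over the $n$ words of a sentence and inverse-frequency label weights computed per batch; both are assumptions for illustration, since the text only states that the weights depend on the ratio of each label in the batch. All function names are hypothetical.

```python
import math
from collections import Counter

TAGS = ("B", "I", "O")


def bio_tags(n, aspect_spans):
    """Tag n words given aspect spans as (start, end) with end exclusive."""
    tags = ["O"] * n
    for start, end in aspect_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def label_weights(batch_tags):
    """Assumed weighting: inverse relative frequency of each label in a batch."""
    counts = Counter(tag for sentence in batch_tags for tag in sentence)
    total = sum(counts.values())
    return {tag: total / counts[tag] for tag in TAGS if counts[tag] > 0}


def aspect_detection_loss(probs, tags, weights):
    """Weighted cross-entropy over one sentence, averaged over its n words.

    probs[i] is the predicted distribution over (B, I, O) for word i.
    """
    n = len(tags)
    total = 0.0
    for p, tag in zip(probs, tags):
        total -= weights[tag] * math.log(p[TAGS.index(tag)])
    return total / n
```

For the sentence "The fried rice is amazing here" with the aspect "fried rice", `bio_tags(6, [(1, 3)])` yields the tags O B I O O O, and the rare B and I labels receive larger weights than the frequent O label.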

Training
We combine the component losses into an overall objective function $L$, in which $\lambda_a$ and $\lambda_d$ balance the effect of the domain classifier and the auxiliary task (i.e. aspect detection). $L$ and $\lambda_a L_a^{\theta_D}$ are optimized alternately. The aspect-level sentiment in the unlabeled target domain is predicted by the ASC task.
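The alternating optimization with early stopping can be sketched as a simple loop. The three callables are hypothetical placeholders for one epoch of training all parameters except the domain classifier with $L$, one epoch of training the domain classifier with $\lambda_a L_a^{\theta_D}$, and an evaluation on the validation set; the `max_epochs` cap is an added safety assumption and not part of the described procedure.

```python
from typing import Callable, Tuple


def train_alternating(
    step_main: Callable[[], None],
    step_domain_classifier: Callable[[], None],
    validate: Callable[[], float],
    patience: int = 10,
    max_epochs: int = 1000,
) -> Tuple[float, int]:
    """Alternate the two updates until validation stops improving.

    Returns the best validation score and the number of epochs run.
    """
    best_score = float("-inf")
    stale_epochs = 0
    epochs_run = 0
    while stale_epochs < patience and epochs_run < max_epochs:
        step_main()               # update everything except the domain classifier
        step_domain_classifier()  # update only the domain classifier
        epochs_run += 1
        score = validate()
        if score > best_score:
            best_score, stale_epochs = score, 0
        else:
            stale_epochs += 1
    return best_score, epochs_run
```

Keeping the two updates in separate steps reflects the alternating minimization adopted in place of gradient reversal: the domain classifier and the rest of the model never share a single gradient step.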

Datasets
To make an extensive evaluation, we employ three different datasets: Restaurants (R) and Laptops (L) from SemEval 2014 Task 4 (Pontiki et al., 2014), and Twitter (T) from Dong et al. (2014). The statistics of these three datasets are shown in Table 1. Specifically, we collect the aspect-term level sentences and corresponding labels from these datasets. Comparing the aspect terms in these three datasets, we find that more than 98% of the aspect terms differ between the Restaurants and Laptops domains, that no aspect term is shared between Restaurants and Twitter, and that only 0.09% of the aspect terms are shared between Laptops and Twitter. This indicates that the aspect terms vary widely across domains.

Experimental Settings
To evaluate our proposed method, we construct six aspect-level sentiment transfer tasks: R→L, L→R, R→T, T→R, L→T, T→L. The arrow indicates the transfer direction from the source domain to the target domain. For each transfer pair $D_s \to D_t$, the training set is composed of two parts: one is the labeled training set in $D_s$, and the other is all the unlabeled data in $D_t$, which only contain the aspect term information. The test set in $D_s$ is employed as the validation set. The reported results are evaluated on all the data of $D_t$. The word embeddings are initialized with 100-dimensional GloVe vectors (Pennington et al., 2014) and fine-tuned during training. The model hidden size $d_h$ is set to 64. The model is optimized with SGD with a learning rate of 0.01. The batch size is 32. We employ ReLU as the activation function.
We adopt early stopping: training stops if the performance on the validation set does not improve for 10 epochs, and the best model is chosen for evaluation.

Compared Methods
We compare with a range of baselines to validate the effectiveness of the proposed method. Some variants of our approach are also compared to analyze the impact of individual components.
Transfer Baseline: Aspect-level cross-domain sentiment classification has rarely been explored. We choose the state-of-the-art method IATN, whose settings are the most similar to ours, as the transfer baseline. It incorporates the information of both sentences and aspect terms in cross-domain sentiment classification.
Non-Transfer Baselines: The non-transfer baselines are all representative methods from recent years for aspect-level sentiment classification in a single domain. We train the models on the training set of the source domain, and directly test them on the target domain without domain adaptation.
• AT-LSTM (Wang et al., 2016): It utilizes the attention mechanism to generate an aspect-specific sentence representation.
• ATAE-LSTM (Wang et al., 2016): It also employs attention. Unlike AT-LSTM, it additionally feeds the aspect embedding as input to the LSTM.

Table 3: Evaluation results of variants of our model in terms of accuracy (%) and macro-f1 (%). The minus sign (-) means removing the module, and the plus sign (+) means adding the module.

Experimental Analysis
We report the classification accuracy and macro-f1 of various methods in Table 2 and Table 3, and the best scores on each metric are marked in bold. To validate the effectiveness of our method, we analyze the results from the following perspectives.
Comparison with the baselines: We display the comparison results with the baselines in Table 2. Compared with the transfer baseline IATN, DIFD significantly outperforms IATN on all metrics, by +5.51% accuracy and +7.36% macro-f1 on average. This shows that the distillation of domain-invariant features facilitates the transfer of sentiment information across domains. In addition, for a fair comparison with the non-transfer methods, which only exploit the source domain data, we also train our DIFD model without the target domain data and denote this variant as DIFD(S). We observe that DIFD(S) outperforms all the non-transfer baselines on most metrics. It is worth noting that, compared to the strong baseline IAN, DIFD(S) achieves significant improvements of +4.49% accuracy on T→R and +7.05% macro-f1 on R→L. This verifies that the orthogonal task helps in stripping the domain-specific features from the source domain and is effective for accelerating the domain adaptation.
Comparison of the variants of our method: The results of the variants of our method are reported in Table 3. We first observe that DIFD significantly outperforms ASC+AT on all metrics. This validates that the orthogonality helps to distill the domain-invariant features and improves the performance of cross-domain sentiment classification.
Then we can see that DIFD-CA performs much worse than DIFD, which reveals that the context allocation mechanism plays an important role in our method. We further visualize the allocation scores in Figure 3, and the result also supports the soundness of the CA module. The gray tokens and red tokens are biased towards the ASC task and the AD task respectively. The allocation scores are consistent with the bias of the words: red tokens get larger scores for the aspect detection task, while gray tokens get larger scores for the sentiment classification task. This shows that our model generates task-oriented contexts successfully.
Finally, DIFD also achieves improvement over DIFD-AT on most metrics. This indicates that adversarial training with the domain classifier promotes the distillation of the domain-invariant features. To further validate the effectiveness of adversarial training, we also try to directly minimize the divergence between the domain-invariant features from the source and target domains based on MMD and CORAL. Compared with DIFD-AT+MMD and DIFD-AT+CORAL, DIFD is more robust, considering that it outperforms the two methods in most experimental settings.
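For reference, the two metrics reported in Table 2 and Table 3 can be computed as in the following sketch. It uses the standard definitions of accuracy and macro-averaged F1 (the unweighted mean of the per-class F1 scores); the function names and the example labels are illustrative and not taken from the experiments.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)
```

Macro-f1 weights every polarity class equally, so it penalizes a model that ignores a rare class even when its accuracy remains high.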

Transfer Distance Analysis
In this section, we analyze the similarity of features between domains. We use the A-distance (Ben-David et al., 2007) to measure the similarity between two probability distributions. The proxy A-distance is $2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a classifier (a linear SVM) trained on the binary classification problem of distinguishing inputs from the two domains. We focus on the methods ASC+AT and DIFD, and first compare the similarity of the domain-invariant features $f_s$ and $f_t$. Figure 4 reports the results for each pair of domains. The proxy A-distance on DIFD is generally smaller than its corresponding value on ASC+AT. This indicates that DIFD can learn purer domain-invariant features than ASC+AT. Secondly, we compare the domain-specific features learned by ASC+AT and DIFD, which are represented by the average hidden state of the BiLSTM in ASC+AT and the average aspect-dominant context $H^d$ in DIFD respectively. Figure 5 reports the results for each pair of domains. The proxy A-distance on DIFD is generally larger than its corresponding value on ASC+AT, which demonstrates that DIFD can strip more domain-specific information through the aspect detection task than ASC+AT. There are exceptions in both figures, i.e., T→R in Figure 4, and T→L and L→R in Figure 5. A possible explanation is that the balance between the ASC and AD losses causes some domain-specific information to remain in the domain-invariant space, and vice versa.
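The proxy A-distance is a direct function of the domain classifier's error, as the following sketch shows. The helper that computes the error from predicted domain labels is a hypothetical convenience; training the linear SVM itself is outside the scope of the sketch.

```python
def classification_error(gold, pred):
    """Fraction of held-out instances whose domain is predicted incorrectly."""
    return sum(g != p for g, p in zip(gold, pred)) / len(gold)


def proxy_a_distance(error: float) -> float:
    """Proxy A-distance 2 * (1 - 2 * error) for a domain classifier's error."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)
```

An error of 0.5 (the two domains cannot be told apart) gives a distance of 0, while an error of 0 (perfectly separable domains) gives the maximum distance of 2. Smaller values therefore indicate more domain-invariant features, and larger values more domain-specific ones.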

Conclusion
In this work, we study the problem of aspect-level cross-domain sentiment analysis and propose a domain-invariant feature distillation method that simultaneously learns domain-invariant and domain-specific features. With the help of the orthogonal domain-dependent task (i.e., aspect detection), the aspect sentiment classification task can learn better domain-invariant features and improve transfer performance. Experimental results clearly verify the effectiveness of our method.

Acknowledgement
This work is supported by National Science and Technology Major Project, China (Grant No. 2018YFB0204304).