Unified Feature and Instance Based Domain Adaptation for End-to-End Aspect-based Sentiment Analysis

The supervised models for aspect-based sentiment analysis (ABSA) rely heavily on labeled data. However, fine-grained labeled data are scarce for the ABSA task. To alleviate the dependence on labeled data, prior works mainly focused on feature-based adaptation, which uses domain-shared knowledge to construct auxiliary tasks or domain adversarial learning to bridge the gap between domains, while ignoring instance-based adaptation. To resolve this limitation, we propose an end-to-end framework to jointly perform feature and instance based adaptation for the ABSA task in this paper. Based on BERT, we learn domain-invariant feature representations by using part-of-speech features and syntactic dependency relations to construct auxiliary tasks, and jointly perform word-level instance weighting in the framework of sequence labeling. Experimental results on four benchmarks show that the proposed method achieves significant improvements over the state-of-the-art methods on both cross-domain End2End ABSA and cross-domain aspect extraction.


Introduction
Aspect extraction and aspect sentiment classification are two important sub-tasks in Aspect Based Sentiment Analysis (ABSA) (Liu, 2012; Pontiki et al., 2016), which aim to extract aspect terms and predict the sentiment polarities of the given aspect terms, respectively. Since these two sub-tasks have been well studied in the literature, a number of recent studies focus on the End2End ABSA task by employing a unified tagging scheme to tackle the two sub-tasks in an end-to-end manner (Mitchell et al., 2013; Zhang et al., 2015; Li et al., 2019a). The unified tagging scheme fuses aspect boundary tags {B, I, O} and sentiment polarities {POS, NEG, NEU} together, and formulates End2End ABSA as a sequence labeling problem. For example, given the sentence "The price is reasonable, although the service is poor.", the End2End ABSA task aims to jointly extract aspect terms and detect the sentiment polarities over them. The extracted pairs in this example are {"price": Positive; "service": Negative}. However, existing studies heavily rely on supervised learning over a large amount of labeled data, which is usually hard to obtain for ABSA due to the intensive nature of human annotation. Therefore, it is very attractive to explore the End2End ABSA task in a cross-domain setting, which allows us to train a robust ABSA model for a resource-poor target domain based on enough annotated data from a resource-rich source domain. Traditional domain adaptation methods primarily focus on coarse-grained sentiment classification (Blitzer et al., 2007; Pan et al., 2010; Glorot et al., 2011; Bollegala et al., 2012; Xia et al., 2013; Yu and Jiang, 2016; Ganin et al., 2016; Li et al., 2018b). Most of these methods can be grouped into two categories: feature-based domain adaptation and instance-based domain adaptation. Feature-based methods focus on finding a new feature representation that reduces the domain discrepancy. Instance-based methods aim to re-weight training samples in the source domain, essentially assigning higher weights to instances similar to the target domain and lower weights to instances different from it.
In contrast, due to the difficulty of fine-grained domain adaptation, only a few approaches have been proposed for cross-domain ABSA. Most of them explore cross-domain ABSA from the feature-based adaptation perspective, aiming to induce domain-invariant representations for each word. Specifically, Ding et al. (2017) and Wang and Pan (2018) proposed to use domain-shared syntactic knowledge to construct auxiliary tasks to reduce domain disparity. More recently, Li et al. (2019b) used a memory network to model the syntactic relations between words and designed a selective adversarial learning strategy to achieve word-level adaptation. However, all these methods are still based on traditional neural network architectures. With the recent trend of pre-training in NLP (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018), many pre-trained text encoders such as BERT have demonstrated a strong capability for domain-invariant representation learning, which poses new challenges for domain adaptation. Based on our preliminary experiments, we find that simply using BERT without domain adaptation already achieves performance comparable to previous domain adaptation methods. Therefore, it is attractive to extend these feature-based adaptation approaches to pre-trained models and further improve domain adaptation performance.
Apart from the feature-based domain adaptation methods, Jiang and Zhai (2007) pointed out the importance of performing instance-based adaptation for different NLP tasks. As revealed by their theoretical analysis, the domain discrepancy mainly comes from feature mismatches and instance mismatches, and needs to be jointly modeled from these two perspectives. However, previous studies only demonstrated the importance of instance-based domain adaptation for coarse-grained sentiment classification (Xia et al., 2014), and it remains unclear how to perform instance adaptation for the ABSA task.
To address the two challenges mentioned above, we first utilize BERT to learn domain-invariant features for the ABSA task, then propose an instance weighting method for cross-domain ABSA, and finally integrate them into an end-to-end framework to jointly perform feature and instance adaptation. Specifically, for feature-based adaptation, we use domain-shared part-of-speech information and dependency relations as self-supervised signals to enhance BERT to learn domain-invariant representations for cross-domain ABSA. For instance-based adaptation, since ABSA is typically modeled as a word-level prediction task, we propose to leverage a domain classifier to dynamically learn an importance weight for each word and re-weight the words from the source domain during supervised training. Finally, we propose a unified framework to perform feature- and instance-based adaptation via either sequential learning or joint learning. Experimental results on four benchmark datasets show that our method can significantly improve the performance of cross-domain End2End ABSA and cross-domain aspect extraction, and we further carry out ablation studies to quantitatively measure the effectiveness of each component in our unified framework.
The main contributions of this paper can be summarized as follows:
• To the best of our knowledge, we are the first to address both cross-domain End2End ABSA and cross-domain aspect extraction based on BERT.
• We propose a Unified Domain Adaptation (UDA) framework encompassing both feature-based adaptation and instance-based adaptation, which significantly improves over the fine-tuned BERT model without domain adaptation.
• Compared with the state-of-the-art domain adaptation method, our UDA approach gains an average improvement of 6.92% in Micro-F1 for cross-domain End2End ABSA.

Problem Statement
Following Li et al. (2019b), we model both the End2End ABSA task and the aspect extraction task as sequence labeling problems. The input is a sequence of tokens w = {w_1, w_2, ..., w_T}, and the output is a sequence of labels y = {y_1, y_2, ..., y_T}. For the End2End ABSA task, y_i ∈ {B-POS, I-POS, B-NEG, I-NEG, B-NEU, I-NEU, O}; for the aspect extraction task, y_i ∈ {B, I, O}.
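To make the unified tagging scheme concrete, the sketch below converts the running example from the introduction into unified tags. The helper function and its span format are our own illustration, not the authors' code.

```python
def unified_tags(tokens, aspects):
    """Map aspect spans with polarities onto the unified tagging scheme.

    `aspects` maps a (start, end) token span (end exclusive) to a polarity
    in {POS, NEG, NEU}; all remaining tokens are tagged O. The helper and
    its span format are illustrative, not from the paper.
    """
    tags = ["O"] * len(tokens)
    for (start, end), polarity in aspects.items():
        tags[start] = "B-" + polarity          # aspect boundary: begin
        for i in range(start + 1, end):
            tags[i] = "I-" + polarity          # aspect boundary: inside
    return tags

tokens = ["The", "price", "is", "reasonable", ",", "although",
          "the", "service", "is", "poor", "."]
tags = unified_tags(tokens, {(1, 2): "POS", (7, 8): "NEG"})
```

A model trained under this scheme predicts one such label per token, so extraction and sentiment classification are solved in a single pass.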
In this paper, we focus on unsupervised domain adaptation, where labeled data are not available in the target domain. Given a set of labeled tokens from a source domain and a set of unlabeled instances from a target domain, our goal is to predict token labels for target test instances. The essential cause of domain adaptation is that the data distributions of the source and target domains differ, i.e., P_s(w, y) ≠ P_t(w, y). The optimal model f*_t for the target domain could be obtained by minimizing the following expected loss:

f*_t = argmin_f E_{(w,y)~P_t(w,y)} [L(f(w), y)].  (1)

In unsupervised domain adaptation, since labeled data are not available in the target domain, we instead minimize an empirical loss over data drawn from the source domain:

f*_t = argmin_f E_{(w,y)~P_s(w,y)} [(P_t(w,y) / P_s(w,y)) · L(f(w), y)]
     = argmin_f E_{(w,y)~P_s(w,y)} [(P_t(y|w) / P_s(y|w)) · (P_t(w) / P_s(w)) · L(f(w), y)].  (2)

According to the last line of Equation 2, as P(w, y) can be factored into P(y|w)P(w), an ideal domain adaptation model considers the following two attributes:
• feature-based adaptation, which needs to find a general feature representation w under which P_t(y|w) / P_s(y|w) → 1;
• instance-based adaptation, which uses r(w) = P_t(w) / P_s(w) as weights for sampling the instances in the source domain.
However, most previous domain adaptation methods in ABSA only consider feature-based adaptation, which leverages auxiliary tasks or domain adversarial networks to learn domain-invariant feature representations, while ignoring instance-based adaptation. In this work, we take both attributes into consideration within a joint framework based on BERT for domain adaptation of the ABSA task.

Approach
Overview: As discussed above, the domain differences mainly come from two attributes, namely feature discrepancy and instance discrepancy. Therefore, we approach cross-domain End2End ABSA and cross-domain aspect extraction with a Unified Domain Adaptation (UDA) framework encompassing two components, the feature-based and the instance-based domain adaptation components, which are shown in Figure 1. To reduce the feature discrepancy, we introduce two auxiliary tasks based on domain-shared knowledge. To reduce the instance discrepancy, we perform word-level instance weighting to focus more on words that are important for the target domain. Finally, we unify the two components in a sequential or joint manner.

Feature-Based Domain Adaptation
Structural correspondence learning (Ando and Zhang, 2005; Blitzer et al., 2007) is the core idea of feature-based domain adaptation, whose goal is to use structural correspondences to narrow the gap between domains. As a self-supervised learning mechanism based on a large-scale corpus, the masked language model task of BERT is essentially a structural correspondence learning method. However, it does not use pivot words as masked objects, but randomly selects words to mask and predict. Based on our preliminary observations, in both End2End ABSA and aspect extraction, although aspect words vary a lot across domains, there are still universal language structure correspondences between domains, such as part-of-speech tags and dependency relations, which can serve as pivots to connect the domains. Nevertheless, this information has not been explicitly captured by BERT. Motivated by this, we propose to use part-of-speech information and dependency relations as self-supervised signals to fine-tune BERT to learn the structural correspondence between domains for cross-domain ABSA. The overall architecture of our feature-based domain adaptation component is shown in Figure 1(a).

Masked POS Tag Prediction
We first convert the word sequence w = {w_1, w_2, ..., w_T} into continuous embeddings e = {e_1, e_2, ..., e_T}. The embedding of each word is the sum of four types of embeddings: the word embedding; the segment embedding seg_i ∈ R^d, which is used as a segmentation mark between sentences; the embedding p_i ∈ R^d for the absolute position of a word; and the POS tag embedding tag_i ∈ R^d.
The first three kinds of embeddings are the same as those defined in Devlin et al. (2018), and are initialized using the pre-trained BERT embeddings. The POS tag embedding matrix is randomly initialized and trained with unlabeled data from the source and target domains. Since BERT uses a sub-word tokenizer, we assume that sub-words share the same POS tags. The embedding sequence e = {e_1, e_2, ..., e_T} is converted into a context-aware representation H = {h_1, h_2, ..., h_T} through a multi-layer Transformer:

H = Transformer(e).

To prepare the input for the masked POS tag prediction task, we randomly select about 25% of the tokens and replace both the original tokens and their POS tags with [MASK]. After being encoded by the Transformer, each masked feature h_i in H is fed into a softmax layer and converted into a probability distribution p^pos_i over POS tag types:

p^pos_i = softmax(W_p h_i + b_p),

where p^pos_i ∈ R^{n_tags}, n_tags is the number of POS tag types, and W_p and b_p are the weight matrix and bias vector of the softmax layer. We only use the masked features for prediction, and optimize with the cross-entropy loss:

L_pos = − Σ_{i=1}^{T} I(i) · log p^pos_i [y^pos_i],  (3)

where I(i) is an indicator function that equals 1 if the i-th token is masked and 0 otherwise, and y^pos_i is the real POS tag type of the i-th token.
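The input preparation for the masked POS tag prediction task can be sketched as follows. The 25% masking rate follows the text above; the function itself and its fixed seed are illustrative assumptions, not the authors' implementation.

```python
import random

MASK = "[MASK]"

def mask_for_pos_prediction(tokens, pos_tags, mask_prob=0.25, seed=13):
    """Randomly mask ~25% of positions: both the token and its POS tag are
    replaced with [MASK], and only the masked positions (I(i) = 1) will
    contribute to the POS prediction loss. A sketch, not the authors' code."""
    rng = random.Random(seed)
    masked_tokens, masked_tags, indicator = [], [], []
    for tok, tag in zip(tokens, pos_tags):
        if rng.random() < mask_prob:
            masked_tokens.append(MASK)
            masked_tags.append(MASK)
            indicator.append(1)   # I(i) = 1: predict the original POS tag here
        else:
            masked_tokens.append(tok)
            masked_tags.append(tag)
            indicator.append(0)
    return masked_tokens, masked_tags, indicator
```

The indicator list plays the role of I(i) in the loss: positions with 0 are encoded but excluded from the cross-entropy sum.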

Dependency Relation Prediction
To reconstruct the syntactic relations in H that are useful for ABSA, we feed the context-aware representation H into the dependency relation prediction module, where h_i and h_j can be viewed as the representations of the head token and child token in the dependency tree, respectively. Suppose the i-th and j-th words in the input sequence are connected in the dependency tree and represent the head node and the child node, respectively. We use o_ij to predict their dependency relation:

o_ij = [h_i ; h_j ; h_i − h_j ; h_i ∘ h_j],  (4)

where [;] indicates the concatenation operation, and − and ∘ indicate element-wise subtraction and multiplication, respectively. The vector o_ij is then converted into p^dep_ij by a softmax layer:
p^dep_ij = softmax(W_d o_ij + b_d),  (5)

where W_d is the weight matrix for relation classification, and n_arc is the number of relation classes. We use token pairs that are directly connected in the dependency tree to construct training examples; I(ij) indicates whether the token pair (i, j) has a direct edge in the dependency tree. If the two tokens are connected, we predict their dependency relation. The optimization objective is:

L_dep = − Σ_{i,j} I(ij) · log p^dep_ij [y^dep_ij].  (6)

We perform feature-based domain adaptation through the two auxiliary tasks, and optimize the following objective:

L_f = L_pos + λ L_dep,  (7)

where λ is a trade-off hyper-parameter to control the contributions of the two tasks, and L_pos and L_dep are defined in Equation 3 and Equation 6, respectively.
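The pair representation o_ij = [h_i ; h_j ; h_i − h_j ; h_i ∘ h_j] described above can be sketched with plain lists standing in for hidden vectors; the softmax relation classifier on top is omitted.

```python
def pair_features(h_head, h_child):
    """Build o_ij = [h_i ; h_j ; h_i - h_j ; h_i * h_j] for a head token i
    and a child token j that share an edge in the dependency tree. Plain
    Python lists stand in for d-dimensional hidden vectors."""
    diff = [a - b for a, b in zip(h_head, h_child)]   # element-wise subtraction
    prod = [a * b for a, b in zip(h_head, h_child)]   # element-wise multiplication
    return list(h_head) + list(h_child) + diff + prod
```

For d-dimensional inputs the result has 4d dimensions, which is what the relation classifier consumes.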

Instance-Based Domain Adaptation
As analyzed above, instance-based domain adaptation aims to use the ratio P_t(w)/P_s(w) to re-weight instances in the source domain so as to reduce the gap across domains. However, unlike coarse-grained domain adaptation, our fine-grained ABSA tasks are modeled as sequence labeling tasks, which are essentially word-level classification problems. Since each sentence contains both domain-invariant words and domain-specific words, we need to obtain the domain distribution of each word and re-weight it at the word level.
Specifically, while training the main task, we also train a word-level domain classifier on unlabeled data, whose goal is to identify whether each word comes from the source domain or the target domain. The output of the Transformer, H, is fed into a softmax layer to obtain the domain distribution probability of the i-th word w_i:

p^D_i = softmax(W_d h_i + b_d),

where p^D_i is the domain distribution probability over y_d = {source, target}. The domain classifier D is trained with the cross-entropy loss between p^D_i and the ground truth y^D_i:

L_d = − Σ_{i=1}^{T} log p^D_i [y^D_i].  (8)

Through the domain classifier D, we can obtain the domain distribution of each word, and we use the ratio of its target-domain probability to its source-domain probability, i.e., p^D_{i,t} / p^D_{i,s}, as the weight of each word when training the main task. Since training the domain classifier would make it harder to generalize across domains, we cut off the gradient back-propagation, so that L_d only optimizes the parameters W_d and b_d of the softmax layer. As shown in Figure 1(b), when training D, the red dashed line represents the feed-forward computation, with no gradient flowing back. The weight of each word is obtained by re-normalizing the ratios over all T tokens:

α_i = (p^D_{i,t} / p^D_{i,s}) / Σ_{j=1}^{T} (p^D_{j,t} / p^D_{j,s}),  (9)

where the probabilities p^D_{i,t} and p^D_{i,s} are obtained by the domain classifier D. The main task is then optimized with the weighted cross-entropy loss:

L_m = − Σ_{i=1}^{T} α_i · log p_i [y_i].  (10)
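The word-level weights can be sketched as below. Normalizing the ratios to sum to one over the sentence is our reading of the re-normalization step; in a tensor implementation, the hidden states fed to D would additionally be detached (e.g., `h.detach()` in PyTorch) to realize the cut-off gradient.

```python
def word_weights(p_target, p_source):
    """Compute per-word weights alpha_i from the domain classifier outputs:
    r_i = p_{i,t} / p_{i,s}, then re-normalize over the T tokens of the
    sentence. Normalizing the ratios to sum to one is an assumption about
    the exact re-normalization used."""
    ratios = [pt / ps for pt, ps in zip(p_target, p_source)]
    total = sum(ratios)
    return [r / total for r in ratios]
```

Words the classifier deems more target-like receive larger weights, so the main-task loss concentrates on them.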
Although AD-SAL (Li et al., 2019b) also learns an importance weight for each word, our method is significantly different. First, AD-SAL still essentially belongs to feature-based domain adaptation, while our method belongs to instance-based domain adaptation. The goal of AD-SAL is to learn domain-invariant representations for each word through domain adversarial learning. As aspect words are the core of ABSA (which is also consistent with our motivation), AD-SAL introduces aspect attention weights in domain adversarial learning to learn domain-invariant representations for aspect words. In contrast, our method uses a domain classifier to automatically learn the importance of each word for the target domain, so that it pays more attention to words (including both aspect words and opinion words) that are closer to the target domain during main-task training. Second, the training process of AD-SAL is independent of the main task. In our method, the weight of each word is learned through the domain classifier, and the learning process is combined with the main task, which makes the model automatically learn which words are more important for the target domain and the main task.

Training Mechanism
As analyzed before, our work contains two components, a feature-based and an instance-based component, which correspond to the two attributes of domain adaptation, respectively. To dynamically learn the weights for the instance-based component, L_d (Equation 8) and L_m (Equation 10) are updated jointly. The training objective of instance-based domain adaptation is:

L_ins = L_m + L_d.  (11)

The feature-based domain adaptation aims to learn a shared feature space for the target domain, which can be trained separately from the main task. Thus, we can merge the instance-based component and the feature-based component in a sequential or joint training manner.
Sequential Training: In the sequential training, we first train the auxiliary tasks to learn a shared feature space, and the training objective is given in Equation 7. Based on the learned shared feature space, we then perform instance-based domain adaptation, and the training objective is given in Equation 11.
Joint Training: We can also merge the two components in a joint manner, i.e., training the auxiliary tasks and the main task in a multi-task fashion. The training objective is:

L = L_m + L_d + L_pos + λ L_dep,  (12)

i.e., the sum of the objectives in Equation 11 and Equation 7. As revealed by Ando and Zhang (2005) and Blitzer et al. (2007), the success of the target task comes from multiple related tasks that help discover common structures between domains. As the tasks are trained jointly, the information from the auxiliary tasks can be propagated to the main task.
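Under joint training, the combined objective can be sketched as a simple sum of the main-task loss, the domain-classifier loss, and the two auxiliary losses. Whether the feature-based part receives an extra trade-off weight beyond λ is not specified in the text, so the equal weighting below is an assumption.

```python
def joint_objective(l_main, l_domain, l_pos, l_dep, lam=0.1):
    """Combine the weighted main-task loss L_m, the domain-classifier loss
    L_d, and the two auxiliary losses L_pos and L_dep, with `lam` (the
    paper's lambda) trading off the dependency task. Equal weighting of
    the instance-based and feature-based parts is an assumption."""
    return (l_main + l_domain) + (l_pos + lam * l_dep)
```

In sequential training, by contrast, (l_pos + lam * l_dep) would be minimized first to fix the shared feature space, and only (l_main + l_domain) would be optimized afterwards.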

Data & Experiment Setup
Datasets: We conduct experiments on four benchmark datasets: Laptop (L), Restaurant (R), Device (D), and Service (S). L contains reviews from the laptop domain in the SemEval-2014 ABSA challenge (Pontiki et al., 2014). Following the setup in Li et al. (2019a), R is the union of the restaurant datasets from the SemEval ABSA challenges 2014, 2015, and 2016 (Pontiki et al., 2014, 2015, 2016). D is a combination of device reviews introduced by Hu and Liu (2004), and S contains reviews from web services introduced by Toprak et al. (2010). Detailed statistics are shown in Table 1.
Settings & Implementation Details: We conduct experiments on 10 source→target pairs using the four domains above. Following the setup in Li et al. (2019b), we remove D→L and L→D, as D and L are similar. For each source→target pair, the training data consist of the labeled training data from the source domain and the unlabeled training data from the target domain. The evaluation results are obtained on the test data from the target domain. We use spaCy to extract part-of-speech tags and dependency relations, resulting in 54 types of part-of-speech tags and 47 types of dependency relations. Since our proposed UDA approach is a general DA framework, we can potentially use any pre-trained BERT model or its variants as the base model. In this work, we adopt two base models: BERT_B and BERT_E. BERT_B refers to the uncased BERT base model pre-trained by Devlin et al. (2018) 1 . BERT_E refers to an extended version of BERT_B, which further incorporates domain knowledge (Xu et al., 2019) by fine-tuning the pre-trained BERT_B model with the BERT language model objective on product reviews from a combination of the Yelp Challenge datasets 2 and the Amazon Electronics dataset 3 (He and McAuley, 2016). For the BERT language model fine-tuning, we use 32-bit floating point computation with the Adam optimizer (Kingma and Ba, 2014), a batch size of 32, and a learning rate of 3e-5. For training the downstream tasks, we set λ to 0.1 and use the Adam optimizer. We perform grid search over learning rates of {2e-5, 3e-5, 5e-5} and batch sizes of {16, 32, 64}. All these hyper-parameters are tuned on the validation set, which is composed of 10% of the samples from the training set 4 .
Evaluation Metric: The evaluation metric is Micro-F1. Following the settings of existing work, only exact matches are counted as correct. All experiments are repeated 5 times and we report the average results over the 5 runs.
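Exact-match Micro-F1 over the extracted (span, label) pairs can be sketched as follows; the pair encoding is our own illustration of the standard metric.

```python
def micro_f1(gold, pred):
    """Micro-F1 over extracted (span, label) pairs, where only an exact
    match of both the aspect boundary and its label counts as correct.
    `gold` and `pred` are lists of per-sentence sets of pairs."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))       # exact matches
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A prediction with the right boundary but wrong polarity (or vice versa) contributes to neither precision nor recall, which is why End2End ABSA scores are much lower than plain aspect extraction scores.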

Baselines & Main Results
We compare our Unified Domain Adaptation (UDA) approach with several highly competitive DA methods:
• Hier-Joint (Ding et al., 2017): a recurrent neural network (RNN) with manually designed rule-based auxiliary tasks.
• RNSCN (Wang and Pan, 2018): a recursive neural structural correspondence network that incorporates syntactic structures.
• AD-SAL (Li et al., 2019b): a recent deep model that achieves state-of-the-art performance on End2End ABSA across domains.
• BERT_B (Devlin et al., 2018) and BERT_E (Xu et al., 2019): directly fine-tuning the two kinds of pre-trained models on the downstream task.
• BERT_B-DANN and BERT_E-DANN: we respectively use BERT_B and BERT_E as the base models and perform adversarial training on each word, which can be viewed as the BERT version of the widely used DANN approach proposed by Ganin et al. (2016).

1 We make use of the uncased BERT-base model as part of the pytorch-transformers library: https://github.com/huggingface/pytorch-transformers
2 https://www.yelp.com/dataset/challenge
3 http://jmcauley.ucsd.edu/data/amazon/links.html
4 The source code and corpus can be obtained at https://github.com/NUSTM/BERT-UDA
The overall comparison results on cross-domain End2End ABSA are shown in Table 2. On the one hand, we can observe that BERT B -UDA generally performs better than the state-of-the-art DA approach (i.e., AD-SAL) on most transfer pairs for cross-domain End2End ABSA. Moreover, with BERT E as the base model, our BERT E -UDA approach can significantly boost the average performance of BERT B -UDA from 35.75% to 40.63%, which outperforms AD-SAL by 6.92% on average. On the other hand, by comparing BERT-based approaches, we can clearly see that simply performing adversarial training (i.e., DANN) for each word does not give satisfactory improvements over BERT B and BERT E , whereas our UDA approach can significantly outperform all the BERT-based baselines and consistently achieve the best performance on all the transfer pairs. All these observations demonstrate the effectiveness of our UDA framework.
We also report the results on cross-domain AE in Table 3. The overall trend across approaches is similar to that in cross-domain End2End ABSA, but the results of End2End ABSA are much lower than those of AE, which is reasonable as AE is one of its sub-tasks. Compared with AD-SAL, our BERT_E-UDA approach is 1.62% higher in terms of the average performance over all transfer pairs for cross-domain AE. Compared with cross-domain End2End ABSA, the improvement on cross-domain AE is relatively smaller.

Ablation Study
Since our UDA framework includes two components, i.e., feature-based and instance-based domain adaptation, we further conduct experiments over different variants of the proposed model in Table 4 to show the effect of each component. Only Feature and Only Instance denote the feature-based and the instance-based domain adaptation on the basis of BERT_E, respectively. Compared with BERT_E, both components achieve much better F1 scores on most transfer pairs, indicating that the two proposed components effectively reduce the domain discrepancy. Besides, we also merge the two components in a sequential and a joint way, denoted by Sequential and Joint respectively. Joint performs slightly better than Sequential, which shows the advantage of joint optimization. To qualitatively show the effect of our word-level instance weighting method, we list the most important words for the target domain on three transfer pairs in Table 5. The results show that common opinion words (e.g., beauty, amazement and satisfactory) and aspect words (e.g., employee, desk and kitchen) gain more weight in the word-level instance weighting.

Related Work

Aspect extraction (Liu et al., 2015; Poria et al., 2016; Wang et al., 2016a; Li et al., 2018a; Xu et al., 2018) and aspect-level sentiment classification (Dong et al., 2014; Tang et al., 2016; Wang et al., 2016b; Ma et al., 2017; Li et al., 2019c) have been extensively studied in the literature.
Since these two tasks are strongly related to each other, a number of previous studies propose to tackle them together in an end-to-end manner (Mitchell et al., 2013; Zhang et al., 2015). Some recent studies have further demonstrated that a unified tagging scheme can effectively eliminate the error propagation issue of traditional pipeline methods, and thus achieve state-of-the-art performance. However, since annotating each word with a fine-grained label is time-consuming, it is next to impossible to obtain enough annotated data for the ABSA task in every new domain. Therefore, in this work, we resort to transfer learning and focus on proposing an effective domain adaptation approach for the ABSA task.
Existing domain adaptation studies in sentiment analysis primarily focus on the coarse-grained domain adaptation problem. Most of them can be grouped into two categories: feature-based methods (Blitzer et al., 2007; Pan et al., 2010; Chen et al., 2012; Zhuang et al., 2015; Ganin et al., 2016; Li et al., 2018b) and instance-based methods (Jiang and Zhai, 2007; Bickel et al., 2007; Xia et al., 2013, 2014). The former attempts to learn a domain-invariant representation with auxiliary tasks or domain adversarial learning, while the latter tries to re-weight source instances so as to assign higher weights to instances similar to the target domain and lower weights to instances different from it. Due to the challenges of fine-grained domain adaptation, only a few studies have explored the ABSA task in cross-domain settings. Ding et al. (2017) and Wang and Pan (2018) used domain-general syntactic relations to construct auxiliary tasks to bridge the domains. Li et al. (2019b) proposed a selective adversarial learning method to learn domain-invariant representations for aspect words. However, these methods are still based on traditional networks such as LSTMs, and fail to exploit recent pre-trained text encoders such as BERT. Moreover, all these methods only perform feature-based adaptation, but ignore instance-based adaptation. In contrast, our work proposes a unified feature and instance based method built on BERT for cross-domain ABSA. Besides, it is worth noting that Rietzler et al. (2019) explored BERT for cross-domain aspect sentiment classification, where the aspect terms or categories are provided for both the source and target domains. Different from their work, we primarily focus on the cross-domain End2End ABSA task, which first extracts aspect terms and then identifies the sentiment towards each detected aspect term.

Conclusion
In this paper, we explored the potential of BERT for domain adaptation, and proposed a unified feature and instance based adaptation approach for both cross-domain End2End ABSA and cross-domain aspect extraction. For feature-based domain adaptation, we use domain-shared syntactic relations and POS tags to construct auxiliary tasks, which help learn domain-invariant representations. For instance-based domain adaptation, we employ a domain classifier to learn to assign an appropriate weight to each word. Extensive experiments on four benchmark datasets demonstrate the superiority of our Unified Domain Adaptation (UDA) approach over existing methods on both cross-domain End2End ABSA and cross-domain aspect extraction.