SentiX: A Sentiment-Aware Pre-Trained Model for Cross-Domain Sentiment Analysis

Pre-trained language models have been widely applied to cross-domain NLP tasks like sentiment analysis, achieving state-of-the-art performance. However, due to the variety of users' emotional expressions across domains, fine-tuning the pre-trained models on the source domain tends to overfit, leading to inferior results on the target domain. In this paper, we pre-train a sentiment-aware language model (SentiX) via domain-invariant sentiment knowledge from large-scale review datasets, and utilize it for cross-domain sentiment analysis tasks without fine-tuning. We propose several pre-training tasks based on existing lexicons and annotations at both the token and sentence levels, such as emoticons, sentiment words, and ratings, without human intervention. A series of experiments are conducted, and the results indicate the great advantages of our model. We obtain new state-of-the-art results on all cross-domain sentiment analysis tasks, and our proposed SentiX can be trained with only 1% of the samples (18 samples), achieving better performance than BERT trained with 90% of the samples.


Introduction
Sentiment analysis, which aims to judge the sentiment polarity of a given text (Liu, 2012), has gained widespread attention from both industry and academia. Most existing works heavily rely on labeled data, which is expensive and time-consuming to obtain (Socher et al., 2013), to train separate sentiment classifiers for each domain. Therefore, cross-domain sentiment analysis, which transfers (invariant) sentiment knowledge from the source domain to the target domain, has become a promising direction 1 .
The major challenge here is that language expressions for sentimental text usually vary across different domains. For instance, "fast" has a positive sentiment towards "service" in the restaurant domain (Figure 1), while in the laptop domain, "fast" expresses a negative sentiment for "power consumption". Furthermore, models trained on the source domain tend to overfit, since they learn domain-specific knowledge excessively. Therefore, many studies (Du et al., 2020; Ziser and Reichart, 2018; Li et al., 2018) propose to address this issue by extracting domain-invariant features.
Recently, pre-trained language models like BERT (Devlin et al., 2019) have achieved state-of-the-art performance on multiple sentiment analysis tasks (Hoang et al., 2019; Munikar et al., 2019; Raffel et al., 2019). However, when they are directly applied to cross-domain sentiment analysis (Du et al., 2020), two problems arise: 1) existing pre-trained models focus on learning semantic content via self-supervision strategies, while ignoring sentiment-specific knowledge at the pre-training phase; 2)

Yuanbin Wu and Liang He are the corresponding authors of this paper. This work was conducted while Jie Zhou was interning at Alibaba DAMO Academy. This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
1 Usually, we assume that there are abundant labeled data in the source domain, while little or no labeled data in the target domain. Thus, for this task, the model is trained on the source domain and tested on the target domain.

Figure 1: Review examples from two domains. Laptop domain: "This computer consumes power so fast :(. It was a bad experience. I will never buy it again." Restaurant domain: "This is a beautiful place (^_^). The service is fast and food is delicious."
during the fine-tuning phase, pre-trained models may overfit the source domain by learning too much domain-specific sentiment knowledge, leading to degraded performance on the target domain.
To address the above-mentioned problems, we propose a sentiment-aware pre-trained model, named SENTIX, to learn domain-invariant sentiment knowledge at the pre-training phase; it does not need to be fine-tuned for the cross-domain tasks. In particular, we observe that many widely available review datasets contain rich sentiment information, which can be utilized to enhance domain-invariant knowledge acquisition. Large-scale review datasets, such as Yelp and Amazon, consist of 240 million reviews across 30 domains, full of sentiment words, emoticons, and ratings. Taking Figure 1 as an example, these reviews contain opinion words (like "bad", "beautiful") and emoticons (like ":(", "(^^)"), and their ratings are 1 and 5, respectively.
In order to obtain the above domain-invariant sentiment knowledge, we propose several sentiment-aware pre-training objectives, including token-level and sentence-level prediction. At the token level, sentiment words and emoticons are masked at a higher rate than general words to emphasize the sentiment knowledge, and we pre-train SENTIX to predict sentiment-aware words, emoticons, and token sentiments. At the sentence level, we introduce a rating prediction strategy to learn sentiment knowledge based on the whole sentence.
We conduct extensive experiments on cross-domain sentiment analysis tasks to evaluate the effectiveness of SENTIX, and obtain state-of-the-art results on all settings. SENTIX achieves more than 90% accuracy over all cross-domain sentiment analysis datasets with only 1% samples, outperforming BERT trained with 90% samples. Through visualization of the feature representation, we observe that SENTIX significantly reduces the overfitting issue, while the in-domain tests prove that our SENTIX also obtains significant improvement over BERT for both the sentence-level and aspect-based sentiment classification tasks.
The main contributions of this paper can be summarized as follows:
• We propose SENTIX for cross-domain sentiment classification to learn rich domain-invariant sentiment knowledge from large-scale unlabeled multi-domain data.
• We design several pre-training objectives at both the token level and sentence level to learn such domain-invariant sentiment knowledge by masking and prediction.
• The experiments clearly show that SENTIX obtains state-of-the-art performance for cross-domain sentiment analysis and requires less annotated data than BERT to reach equivalent performance.

Preliminaries
Reviews contain many semi-supervised sentiment signals, such as sentiment words, emoticons, and ratings, and large-scale review data can be obtained from online review websites like Yelp. This sentiment knowledge can help learn domain-common sentiment features for the cross-domain task.
• Sentiment Words. Sentiment lexicons contain rich sentiment information and are widely used in sentiment analysis. Words that appear in a lexicon are regarded as sentiment words: words in the positive and negative sentiment lexicons are labeled as "P" and "N" respectively, and words outside the lexicons are labeled as "0". We adopt HowNet 2 and the opinion lexicon (Hu and Liu, 2004).

Figure 2: The framework of SENTIX. First, we design three sentiment masking strategies, including SWM (e.g., "cheerful", "good"), EM (e.g., ":)"), and GWM (e.g., "food"). Then, we propose four sentiment-aware prediction objectives at the token level and sentence level.
• Emoticons. Emoticons are often used in text to express emotion (Zhao et al., 2012). Each emoticon is made up of typographical symbols (e.g., ")", "(", ":", "D", "-") and denotes a facial expression. It can be read either sideways (e.g., a sad face ":-(") or normally (e.g., a happy face "(ˆˆ)") (Hogenboom et al., 2013). We extract emoticons via regular expressions and keep the top-100 emoticons in the corpus (Table 1). Words matched by the regular expression are labeled as "E", otherwise "0". 3

• Rating. In addition to the above token-level sentiment knowledge, reviews contain sentence-level rating scores, which represent the overall sentiment polarity. Ratings have five levels: very negative, negative, neutral, positive, and very positive. Since the score distribution is unbalanced, we perform average sampling on the original data. Notably, rating labels are more difficult to obtain than sentiment words and emoticons, since only review data contain ratings. We therefore study an ablation in Section 5.3, which demonstrates that our model still outperforms the state-of-the-art approach without ratings.
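The token-level labeling described above can be sketched as follows. The tiny lexicons are illustrative stand-ins for HowNet and the opinion lexicon of Hu and Liu (2004), and the emoticon pattern is an assumption (the paper does not give its exact regular expression):

```python
import re

# Illustrative stand-ins for the real sentiment lexicons.
POSITIVE = {"beautiful", "delicious", "good", "cheerful"}
NEGATIVE = {"bad", "terrible", "slow"}
# Hypothetical emoticon pattern: simple sideways faces like ":)", ":-(", ":D".
EMOTICON_RE = re.compile(r"[:;=][-^]?[)(DPp]")

def label_tokens(tokens):
    """Return per-token sentiment labels ('P'/'N'/'0') and emoticon labels ('E'/'0')."""
    sent_labels, emo_labels = [], []
    for tok in tokens:
        word = tok.lower()
        sent_labels.append("P" if word in POSITIVE
                           else "N" if word in NEGATIVE else "0")
        emo_labels.append("E" if EMOTICON_RE.fullmatch(tok) else "0")
    return sent_labels, emo_labels

s, e = label_tokens(["It", "was", "a", "bad", "experience", ":("])
# s -> ['0', '0', '0', 'N', '0', '0'], e -> ['0', '0', '0', '0', '0', 'E']
```

These labels are exactly the supervision signals used by the masking strategies and prediction objectives in Section 3.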
To make full use of this rich sentiment knowledge for cross-domain sentiment analysis, we design several pre-training objectives to enhance the model with domain-invariant sentiment knowledge.

SENTIX
SENTIX is a sentiment-aware pre-trained model for cross-domain sentiment analysis. It learns domain-invariant features from the above sentiment knowledge, including sentiment lexicons, emoticons, and ratings. The framework contains sentiment masking and pre-training objectives, as shown in Figure 2. Sentiment masking (Section 3.1) recognizes the sentiment information of an input sequence using the sentiment knowledge. The pre-training objectives require the encoder not only to reconstruct the masked sentiment tokens, but also to distinguish word sentiment polarities, emoticons, and ratings (Section 3.2).
Formally, given a sentence $x = \{x_1, x_2, ..., x_{|x|}\}$, we first obtain a corrupted sentence $\hat{x} = \{\hat{x}_1, \hat{x}_2, ..., \hat{x}_{|x|}\}$ ($\hat{x} \in \hat{X}$, where $\hat{X}$ is the corrupted corpus) via sentiment masking. The sentiment-aware pre-training tasks predict the word $x_i$, sentiment $s_i$, and emoticon label $e_i$ at the token level, and the rating $r$ at the sentence level. Here $s_i \in \{P, N, 0\}$ represents the sentiment polarity (positive, negative, other) of word $x_i$, $e_i \in \{E, 0\}$ indicates whether word $x_i$ is an emoticon, and $r \in \{1, 2, 3, 4, 5\}$ is the rating of $x$.

Sentiment Masking
Sentiment masking aims to enhance the sentiment information at the token level. Previous pre-trained models adopt masked language modeling (MLM) to learn semantic information: some input tokens are randomly masked, and the goal is to predict these masked tokens. In addition to this general word masking, we propose sentiment word masking and emoticon masking to learn sentiment knowledge by recovering the masked items.
• Sentiment Word Masking (SWM). To enrich the sentiment information, we mask sentiment words at a 30% rate, as these words are important for sentiment analysis 4 .
• Emoticon Masking (EM). Since the number of emoticons in a sentence is relatively small, and deleting emoticons does not change the semantic content of the sentence, we mask 50% of the emoticons in each sentence.
• General Word Masking (GWM). If we only focus on sentiment words and emoticons, SENTIX may lose the general semantic information carried by the other words. Thus, following the original BERT, we replace general words in the sentence with [MASK] at a 15% rate to learn semantic information.
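The three masking strategies can be sketched as follows. The `sentiment_mask` function and its label inputs are hypothetical, but the masking rates (30% for sentiment words, 50% for emoticons, 15% for general words) are the ones stated above:

```python
import random

# Masking rates from the paper: SWM, EM, and GWM respectively.
SWM_RATE, EM_RATE, GWM_RATE = 0.30, 0.50, 0.15

def sentiment_mask(tokens, sent_labels, emo_labels, rng=random):
    """Return a corrupted copy of `tokens` with [MASK] substitutions.

    `sent_labels` ('P'/'N'/'0') and `emo_labels` ('E'/'0') are the
    token-level labels derived from the lexicons and emoticon pattern.
    """
    masked = []
    for tok, s, e in zip(tokens, sent_labels, emo_labels):
        if e == "E":                      # Emoticon Masking (EM)
            rate = EM_RATE
        elif s in ("P", "N"):             # Sentiment Word Masking (SWM)
            rate = SWM_RATE
        else:                             # General Word Masking (GWM)
            rate = GWM_RATE
        masked.append("[MASK]" if rng.random() < rate else tok)
    return masked
```

In practice the corrupted sequence would then be fed to the encoder, with the original tokens kept as targets for the prediction objectives below.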

Pre-training Objectives
Sentiment masking produces corrupted sentences $\hat{x}$ in which part of the sentiment words, emoticons, and general words are substituted with masked tokens. Three token-level and one sentence-level prediction objectives are designed to learn domain-invariant sentiment knowledge during the pre-training phase.
Sentiment-aware Word Prediction (SWP) Based on our sentiment masking strategies, the corrupted sentence contains masked sentiment words and emoticons in addition to general words, and predicting them captures the sentiment information. The corrupted sentence $\hat{x}$ is fed into a Transformer encoder to obtain each word representation $h_i$ and the sentence representation $h_{[CLS]}$. Then a Softmax layer computes each word's probability $P(x_i \mid \hat{x}) = \mathrm{Softmax}(W_w \cdot h_i + b_w)$. The loss $L_w$ is the cross-entropy between the predicted probability and the true word label.
Word Sentiment Prediction (WSP) According to the sentiment knowledge, we label each word's sentiment as positive, negative, or other. Thus, we design WSP for learning the sentiment knowledge of the tokens. We aim to infer the sentiment polarity $s_i$ of word $x_i$ from $h_i$: $P(s_i \mid \hat{x}) = \mathrm{Softmax}(W_s \cdot h_i + b_s)$. The loss $L_s$ is the cross-entropy between the predicted and true sentiment labels.

Emoticon Prediction (EP) To further capture token-level sentiment knowledge, we predict the emoticon label $e_i$ of word $x_i$ in the same way, with cross-entropy loss $L_e$.

Rating Prediction (RP) The above tasks focus on learning token-level sentiment knowledge. Ratings represent the overall sentiment score of a review at the sentence level, so inferring the rating brings in sentence-level sentiment knowledge. Similar to BERT, we use the final state $h_{[CLS]}$ as the sentence representation. The rating is predicted by $P(r \mid \hat{x}) = \mathrm{Softmax}(W_r \cdot h_{[CLS]} + b_r)$, and the loss $L_r$ is calculated based on the predicted rating distribution.

Joint Training
Finally, we jointly optimize the token-level objective $L_T$ and the sentence-level objective $L_S$. The overall loss is $L = L_T + L_S$, where $L_T = L_w + L_s + L_e$ and $L_S = L_r$.
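A minimal NumPy sketch of the joint objective, assuming the encoder outputs are given. The head parameters `W_w`, `W_s`, `W_e`, `W_r` (with biases) stand for the Softmax layers of SWP, WSP, EP, and RP; all shapes and inputs are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def joint_loss(h, h_cls, heads, word_ids, sent_ids, emo_ids, rating):
    """L = L_T + L_S, where L_T = L_w + L_s + L_e and L_S = L_r.

    h:      token representations, shape (n_tokens, d)
    h_cls:  [CLS] representation, shape (d,)
    heads:  ((W_w, b_w), (W_s, b_s), (W_e, b_e), (W_r, b_r))
    """
    (W_w, b_w), (W_s, b_s), (W_e, b_e), (W_r, b_r) = heads
    L_w = cross_entropy(softmax(h @ W_w + b_w), word_ids)   # SWP
    L_s = cross_entropy(softmax(h @ W_s + b_s), sent_ids)   # WSP
    L_e = cross_entropy(softmax(h @ W_e + b_e), emo_ids)    # EP
    L_r = cross_entropy(softmax(h_cls @ W_r + b_r)[None, :],
                        np.array([rating]))                 # RP
    return (L_w + L_s + L_e) + L_r
```

In the real model the four heads share the same Transformer encoder, so gradients from all objectives update a single set of encoder parameters.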
Experimental Setup

Datasets
Pre-training The pre-training phase is conducted on two large-scale datasets: the Amazon review dataset (Ni et al., 2019) and the Yelp 2020 challenge dataset 5 . The Amazon dataset contains 233 million reviews across 29 domains (Ni et al., 2019), and the Yelp dataset contains about 8 million reviews. We preprocess the text via NLTK 6 and lowercase all letters. We filter out texts that contain fewer than 50 or more than 512 tokens, and sample the rating data to deal with the class-imbalance problem. Statistics of the top-20 emoticons are shown in Table 1.

Fine-tuning We adopt two sentence-level sentiment classification datasets, IMDB (Maas et al., 2011) and Yelp (Zhang et al., 2015), and three aspect-based sentiment classification datasets: Restaurant14 (Pontiki et al., 2016), Laptop14 (Pontiki et al., 2016), and the Twitter dataset (Dong et al., 2014).

Baselines
We compare our model with the following strong baselines for cross-domain sentiment analysis, including DANN (Ganin et al., 2016), PBLM (Ziser and Reichart, 2018), HATN (Li et al., 2018), ACAN (Qu et al., 2019), IATN (Zhang et al., 2019a) and BERT-DAAT (Du et al., 2020). BERT-DAAT is regarded as the state-of-the-art model, which uses BERT for cross-domain sentiment analysis with adversarial training. We adopt the results of these baselines reported in (Du et al., 2020). For in-domain sentiment analysis, we compare our model with SentiLR-B (Ke et al., 2019), which is one of the state-of-the-art models based on BERT.
BERT is extensively compared in our experiments. To exclude the impact of the pre-training dataset, we also compare SENTIX with BERT*, which is pre-trained on the same dataset with the standard MLM task. Moreover, to verify that SENTIX learns sentiment knowledge in the pre-training phase, we also run experiments with the parameters of the pre-trained models fixed (marked with "Fix"). In other words, we adopt the pre-trained models as feature extractors, and their parameters are not updated during the fine-tuning phase.
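The "Fix" setting can be illustrated as follows. Here `encode` is a hypothetical stub standing in for the frozen pre-trained encoder, and the classifier is a simple softmax head trained by gradient descent; only the head's weights are updated:

```python
import numpy as np

def encode(texts, dim=8):
    # Frozen-encoder stub: deterministic pseudo-features per text.
    # In the "Fix" setting the real encoder's weights would never change.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), dim))

def train_linear_head(features, labels, n_classes, lr=0.1, epochs=200):
    """Train a softmax classifier on fixed features (encoder not updated)."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        W -= lr * features.T @ (probs - onehot) / n   # gradient step on the head only
    return W
```

Because the gradient never reaches the encoder, the quality of the features themselves determines the accuracy, which is why the Fix results isolate the sentiment knowledge acquired during pre-training.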

Settings
Pre-training For the pre-training phase, we use BERT as the base model and train SENTIX for 3 epochs over all the review data. The batch size is 256. Adam is adopted, and the learning rate is set to 2e-5.

Sentiment Analysis Similar to BERT (Devlin et al., 2019), we adopt the same settings for the downstream tasks. Specifically, we input the sequence {[CLS], w_1, ..., w_n, [SEP]} into the pre-trained model. The last state of [CLS] is used as the sequence representation for classification. For aspect-based sentiment analysis, the text and aspect are concatenated as input. We search for the best random seed and learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) for BERT since it is unstable, while SENTIX is much more stable, so we run it with a fixed seed and learning rate (2e-5). We tune SENTIX on downstream tasks for 15 epochs and keep the best model on the development set. Accuracy is adopted as the metric for all tasks.
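The input construction above can be sketched as follows; `tokenize` is a hypothetical whitespace tokenizer standing in for the real subword tokenizer:

```python
def tokenize(text):
    # Stand-in for the real subword tokenizer.
    return text.lower().split()

def build_input(text, aspect=None):
    """Build {[CLS], w_1, ..., w_n, [SEP]}; the aspect is appended for ABSA."""
    tokens = ["[CLS]"] + tokenize(text) + ["[SEP]"]
    if aspect is not None:  # aspect-based sentiment analysis: concatenate aspect
        tokens += tokenize(aspect) + ["[SEP]"]
    return tokens

print(build_input("The service is fast", aspect="service"))
# -> ['[CLS]', 'the', 'service', 'is', 'fast', '[SEP]', 'service', '[SEP]']
```

The classifier then reads only the encoder's final state at the [CLS] position.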

Results and Analysis
In this section, we conduct a series of experiments to validate the performance of SENTIX. First, we test our model on in-domain sentiment analysis tasks (Section 5.1) to prove that SENTIX performs well for sentiment analysis. Second, to verify the effectiveness of SENTIX, we conduct extensive experiments on cross-domain sentiment analysis (Section 5.2). Third, we also perform an ablation test to investigate the effectiveness of each component (Section 5.3). Fourth, we explore the influence of the number of training samples (Section 5.4) and visualize the feature representations of SENTIX (Section 5.5). Finally, we investigate the time complexity, space complexity, and convergence of SENTIX (Section 5.6).

In-Domain Sentiment Analysis
We test SENTIX on two in-domain sentiment analysis tasks (sentiment classification and aspect-based sentiment classification) to verify its effectiveness (Table 2). We observe that: 1) SENTIX performs better than the state-of-the-art model SentiLR-B, which is also based on pre-trained BERT, in most cases. For Yelp, our model is not as good as (Ke et al., 2019), since their model is pre-trained on the Yelp review data directly; different from (Ke et al., 2019), we pre-train our model on datasets from 30 domains, and part-of-speech information is not used in our model. 2) Compared with BERT-based baselines, SENTIX obtains a significant improvement. In particular, SENTIX (SENTIX_large) performs much better than BERT (BERT_large) and BERT* over all seven datasets, showing that our pre-training objectives are effective for learning sentiment knowledge from the pre-training data. 3) To measure how much sentiment knowledge is learned from pre-training, we fix the parameters of SENTIX; the results show that SENTIX_Fix learns sentiment information from the large-scale dataset well, while BERT_Fix performs only a little better than the random baseline.

Cross-Domain Sentiment Analysis
Apart from the in-domain tasks, we conduct cross-domain sentiment analysis experiments as well. SENTIX is tuned on the source domain and tested on the target domain. From Table 3, we obtain the following observations:
• Compared with other works, SENTIX and SENTIX_Fix achieve the best performance. SENTIX_Fix shows superior performance across all 12 cross-domain tasks, and improves by 2.56 absolute points on average over the previous state-of-the-art method (BERT-DAAT). This demonstrates that SENTIX has learned domain-invariant knowledge and transferred sentiment knowledge from the source to the target domain. In particular, sentiment words, emoticons, and ratings from reviews are transferable signals across all domains.
• The performance of the BERT-based models is listed in the second group. BERT_Fix only achieves 54.86% accuracy on average, which is consistent with the in-domain experiments. Compared with BERT*, which is also pre-trained on the review dataset, SENTIX improves by 2.8 absolute points, which we attribute to the proposed sentiment masking and sentiment-aware pre-training objectives.
• SENTIX_Fix performs better than SENTIX (a 1.21-point absolute improvement on average). SENTIX_Fix adopts the pre-trained model as a feature extractor and does not update its parameters during fine-tuning, while SENTIX fine-tunes the parameters. We speculate that during fine-tuning, SENTIX learns too much domain-specific sentiment knowledge from the source domain, leading to degraded performance on the target domain. Overall, SENTIX effectively learns domain-invariant sentiment knowledge from large-scale unlabeled data and serves as a strong sentiment feature extractor.

Ablation Study
We conduct an ablation study to investigate the influence of different components from two perspectives: we remove -Sentiment, -Emoticon, and -Rating respectively to evaluate the impact of each sentiment-related pre-training task; and we remove -Token and -Sentence respectively to compare the different granularities.
-Token indicates that we remove the sentiment word and emoticon objectives in the pre-training phase, and -Sentence is equivalent to -Rating, which excludes RP. Table 4 lists the results, and we observe that: First, each kind of sentiment knowledge (sentiment lexicon, emoticons, and ratings) improves the performance of sentiment analysis. Second, even without rating prediction (-Rating), our model still performs better than the state-of-the-art model (BERT-DAAT). Third, since cross-domain sentiment analysis focuses on sentence-level sentiment, SENTIX_Fix without the sentence-level objective (-Rating) does not perform well, while SENTIX can still learn sentence-level sentiment information from the token-level objectives through fine-tuning.

Influence of Sample Numbers
To study the learning curve in the source domain, we test the performance of SENTIX, SENTIX_Fix, BERT, and BERT_Fix on the target domain with different proportions of training samples (Figure 3). First, we find that our model can be trained with only 1% of the samples (18 samples), while BERT does not work well with such a limited data size. Furthermore, SENTIX with 1% of the samples even performs better than BERT with 90% of the samples. These observations indicate that SENTIX can significantly reduce the number of required training samples. Second, SENTIX_Fix obtains better results than SENTIX, while BERT_Fix performs poorly. This indicates that the representations of SENTIX contain much more sentiment knowledge than those of standard BERT.

Visualization of Representation
To understand why SENTIX works, we visualize the sentence representations of BERT and SENTIX for the B → E task (Figure 4), i.e., the representations of source data points (books) and target data points (electronics) with positive and negative sentiment labels. In particular, we project the 768-dimensional features into two dimensions via t-SNE. From these figures, we obtain the following observations. First, for BERT, the sentences with different sentiment polarities are clearly separated in the source domain, but some data points are mixed in the target domain, which indicates that fine-tuning BERT overfits the source domain. Besides, the representations of BERT_Fix can hardly separate the positive samples from the negative ones. Second, SENTIX performs well on both the source and target domains, even though it slightly overfits the source domain. The samples can be easily separated in both domains for SENTIX_Fix, though the difference between positive and negative samples is not as pronounced as for SENTIX in the source domain. These results demonstrate that SENTIX has learned rich sentiment knowledge via the pre-training tasks and avoids overfitting to a large extent.

Complexity and Convergence Analysis
In this section, we investigate the time complexity, space complexity, and convergence of SENTIX on the B → E task.

Related Work
Cross-domain Sentiment Analysis Due to the heavy cost of obtaining large quantities of labeled data for each domain, many approaches have been proposed for cross-domain sentiment analysis (Blitzer et al., 2007; Yu and Jiang, 2016; Li et al., 2013; Zhang et al., 2019a; Peng et al., 2018). Most previous works focus on capturing pivots that are useful for both the source domain and the target domain (Ziser and Reichart, 2018; Li et al., 2018). Domain-adaptation adversarial training (Ganin et al., 2016) is widely used to learn domain-common sentiment knowledge (Li et al., 2017; Qu et al., 2019). Recently, Du et al. (2020) integrated BERT into cross-domain sentiment analysis tasks to learn domain-shared feature representations. However, most of the existing work focuses on learning domain-shared representations during training or fine-tuning; how to learn domain-invariant sentiment knowledge from the pre-training phase has not been explored.

Pre-trained Model
Existing studies (Peters et al., 2018; Devlin et al., 2019) have shown that pre-training on large-scale unlabelled corpora obtains state-of-the-art performance in natural language processing (Qiu et al., 2020). On the one hand, many studies applied pre-trained models to downstream tasks via fine-tuning (Devlin et al., 2019; Dodge et al., 2020; Xu et al., 2019). Devlin et al. (2019) fine-tuned the BERT model on many downstream tasks, such as named entity recognition and sentiment analysis. Sun et al. (2019a) converted the aspect-based sentiment analysis task into a sentence-pair classification task to better utilize the powerful representations of BERT. On the other hand, some work proposed to add external knowledge into BERT pre-training to enhance the representations (Zhang et al., 2020). LIBERT (Lauscher et al., 2019) integrated linguistic knowledge through an additional linguistic constraint task. ERNIE (Zhang et al., 2019b) and KnowBERT (Peters et al., 2019) integrated entity representations into BERT. Alternatively, Levine et al. (2019) introduced SenseBERT to improve lexical understanding by predicting tokens' supersenses in WordNet. Tian et al. (2020) and Ke et al. (2019) integrated external knowledge to learn sentiment information; they focused on improving performance through fine-tuning on downstream sentiment analysis tasks by training on a relatively small or single-domain dataset. Different from existing studies, we design several pre-training objectives that exploit rich domain-invariant sentiment knowledge in large-scale multi-domain unlabeled data for cross-domain sentiment analysis.

Conclusions
In this paper, we pre-train our SENTIX model to induce a general low-dimensional representation based on domain-invariant sentiment knowledge for cross-domain sentiment analysis. In particular, we design several pre-training tasks to learn sentiment knowledge from semi-supervised labels (such as sentiment words, emoticons, and ratings) based on sentiment masking. SENTIX obtains state-of-the-art performance on 12 cross-domain sentiment analysis tasks. The visualization of the feature representations indicates that SENTIX reduces overfitting in the source domain. The experimental results also show that SENTIX requires much less labeled data, training time, and trainable parameters to match the performance of standard BERT.
In the future, we are interested in exploiting more diverse pre-training datasets (e.g., Twitter) and more kinds of sentiment knowledge. We also believe that more self-supervised objectives could be investigated for cross-domain sentiment analysis tasks.