SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis

Recently, sentiment analysis has seen remarkable advances with the help of pre-training approaches. However, sentiment knowledge, such as sentiment words and aspect-sentiment pairs, is ignored in the process of pre-training, despite the fact that it is widely used in traditional sentiment analysis approaches. In this paper, we introduce Sentiment Knowledge Enhanced Pre-training (SKEP) in order to learn a unified sentiment representation for multiple sentiment analysis tasks. With the help of automatically-mined knowledge, SKEP conducts sentiment masking and constructs three sentiment knowledge prediction objectives, so as to embed sentiment information at the word, polarity and aspect levels into the pre-trained sentiment representation. In particular, the prediction of aspect-sentiment pairs is converted into multi-label classification, aiming to capture the dependency between the words in a pair. Experiments on three kinds of sentiment tasks show that SKEP significantly outperforms a strong pre-training baseline, and achieves new state-of-the-art results on most of the test datasets. We release our code at https://github.com/baidu/Senta.


Introduction
Sentiment analysis refers to the identification of sentiment and opinion contained in input texts that are often user-generated comments. In practice, sentiment analysis involves a wide range of specific tasks (Liu, 2012), such as sentence-level sentiment classification, aspect-level sentiment classification, opinion extraction and so on. Traditional methods often study these tasks separately and design specific models for each task, based on manually-designed features (Liu, 2012) or deep learning (Zhang et al., 2018).
Recently, pre-training methods (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019) have shown their power in learning general semantic representations, and have remarkably improved most natural language processing (NLP) tasks, including sentiment analysis. These methods build unsupervised objectives at the word level, such as masking strategies (Devlin et al., 2019), next-word prediction (Radford et al., 2018) or permutation language modeling. Such word-prediction-based objectives have shown great ability to capture dependency between words and syntactic structures (Jawahar et al., 2019). However, as the sentiment information of a text is seldom explicitly studied, it is hard to expect such pre-trained general representations to deliver optimal results for sentiment analysis (Tang et al., 2014). Sentiment analysis differs from other NLP tasks in that it deals mainly with user reviews rather than news texts. There are many specific sentiment tasks, and these tasks usually depend on different types of sentiment knowledge, including sentiment words, word polarity and aspect-sentiment pairs. The importance of this knowledge has been verified by tasks at different levels, for instance, sentence-level sentiment classification (Taboada et al., 2011; Shin et al., 2017; Lei et al., 2018), aspect-level sentiment classification (Vo and Zhang, 2015; Zeng et al., 2019), opinion extraction (Li and Lam, 2017; Gui et al., 2017; Fan et al., 2019) and so on. Therefore, we assume that, by integrating this knowledge into the pre-training process, the learned representation would be more sentiment-specific and appropriate for sentiment analysis.
In order to learn a unified sentiment representation for multiple sentiment analysis tasks, we propose Sentiment Knowledge Enhanced Pre-training (SKEP), where sentiment knowledge about words, polarity, and aspect-sentiment pairs is included to guide the process of pre-training. The sentiment knowledge is first automatically mined from unlabeled data (Section 3.1). With the knowledge mined, sentiment masking (Section 3.2) removes sentiment information from input texts. Then, the pre-training model is trained to recover the sentiment information with three sentiment objectives (Section 3.3).

[Figure 1: Sentiment Knowledge Enhanced Pre-training (SKEP). SKEP contains two parts: (1) Sentiment masking recognizes the sentiment information of an input sequence based on automatically-mined sentiment knowledge, and produces a corrupted version by removing this information. (2) Sentiment pre-training objectives require the transformer to recover the removed information from the corrupted version. The three prediction objectives on top are jointly optimized: Sentiment Word (SW) prediction (on x_9), Word Polarity (WP) prediction (on x_6 and x_9), Aspect-sentiment Pair (AP) prediction (on x_1). Here, the smiley denotes positive polarity. Notably, on x_6, only WP is calculated without SW, as its original word has been predicted in the pair prediction on x_1.]
SKEP integrates different types of sentiment knowledge together and provides a unified sentiment representation for various sentiment analysis tasks. This is quite different from traditional sentiment analysis approaches, where different types of sentiment knowledge are often studied separately for specific sentiment tasks. To the best of our knowledge, this is the first work that tackles sentiment-specific representation during pre-training. Overall, our contributions are as follows:

• We propose sentiment knowledge enhanced pre-training for sentiment analysis, which provides a unified sentiment representation for multiple sentiment analysis tasks.
• Three sentiment knowledge prediction objectives are jointly optimized during pre-training so as to embed sentiment words, polarity, aspect-sentiment pairs into the representation. In particular, the pair prediction is converted into multi-label classification to capture the dependency between aspect and sentiment.
• SKEP significantly outperforms the strong pre-training method RoBERTa on three typical sentiment tasks, and achieves new state-of-the-art results on most of the test datasets.

Background: BERT and RoBERTa
BERT (Devlin et al., 2019) is a self-supervised representation learning approach for pre-training a deep transformer encoder (Vaswani et al., 2017). BERT constructs a self-supervised objective called masked language modeling (MLM) to pre-train the transformer encoder, and relies only on large-scale unlabeled data. With the help of the pre-trained transformer, downstream tasks are substantially improved by fine-tuning on task-specific labeled data. We follow the method of BERT to construct masking objectives for pre-training. BERT learns a transformer encoder that produces a contextual representation for each token of an input sequence. In practice, the first token of an input sequence is a special classification token [CLS]. In the fine-tuning step, the final hidden state of [CLS] is often used as the overall semantic representation of the input sequence.
To train the transformer encoder, MLM is proposed. Similar to a cloze test, MLM predicts the masked tokens of a sequence from their placeholders. Specifically, parts of the input tokens are randomly sampled and substituted. BERT uniformly selects 15% of input tokens. Of these sampled tokens, 80% are replaced with a special masked token [MASK], 10% are replaced with a random token, and 10% are left unchanged. After the construction of this noisy version, the MLM objective predicts the original tokens at the masked positions using the corresponding final states.
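This 80/10/10 corruption rule can be sketched in a few lines. The helper below is illustrative only: it works on token strings instead of sub-word IDs, and uses a toy vocabulary for the random-replacement branch.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "movie", "was", "great", "fun"]  # stand-in for the real vocabulary

def mlm_corrupt(tokens, mask_rate=0.15, rng=random.Random(0)):
    """BERT-style corruption: sample 15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% -> left unchanged.
    Returns the corrupted sequence and {position: original token}."""
    corrupted, targets = list(tokens), {}
    n = max(1, int(len(tokens) * mask_rate))
    for i in rng.sample(range(len(tokens)), n):
        targets[i] = tokens[i]          # original token to be predicted
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK
        elif r < 0.9:
            corrupted[i] = rng.choice(TOY_VOCAB)
        # else: keep the original token unchanged
    return corrupted, targets
```

Note that even an unchanged sampled token is still a prediction target, which forces the encoder to build a useful representation at every position.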
Most recently, RoBERTa significantly outperforms BERT through robust optimization without any change of neural structure, and has become one of the best pre-training models. RoBERTa also removes the next-sentence prediction objective of standard BERT. To verify the effectiveness of our approach, this paper uses RoBERTa as a strong baseline.

SKEP: Sentiment Knowledge Enhanced Pre-training
We propose SKEP, Sentiment Knowledge Enhanced Pre-training, which incorporates sentiment knowledge by self-supervised training. As shown in Figure 1, SKEP contains sentiment masking and sentiment pre-training objectives. Sentiment masking (Section 3.2) recognizes the sentiment information of an input sequence based on automatically-mined sentiment knowledge (Section 3.1), and produces a corrupted version by removing this information. Three sentiment pre-training objectives (Section 3.3) require the transformer to recover the sentiment information of the corrupted version. Formally, sentiment masking constructs a corrupted version X̃ of an input sequence X, guided by sentiment knowledge G. x̃_i and x_i denote the i-th tokens of X̃ and X respectively. After masking, a parallel pair (X̃, X) is obtained. Thus, the transformer encoder can be trained with sentiment pre-training objectives that are supervised by recovering sentiment information using the final states of the encoder x̃_1, ..., x̃_n.

Unsupervised Sentiment Knowledge Mining
SKEP mines sentiment knowledge from unlabeled data. As sentiment knowledge has been the central subject of extensive research, SKEP finds a way to integrate former techniques of knowledge mining with pre-training. This paper uses a simple and effective mining method based on Pointwise Mutual Information (PMI) (Turney, 2002). The PMI method depends only on a small number of sentiment seed words, and the word polarity WP(s) of each seed word s is given. It first builds a collection of candidate word pairs, where each pair contains a seed word and matches the pre-defined part-of-speech patterns of Turney (2002). Then, the co-occurrence of a word pair is calculated by PMI as follows:

PMI(w1, w2) = log [ p(w1, w2) / (p(w1) p(w2)) ]

Here, p(.) denotes a probability estimated by counting. Finally, the polarity of a candidate word w is determined by the difference between its PMI scores with all positive seeds and with all negative seeds:

WP(w) = Σ_{s: WP(s)=+} PMI(w, s) − Σ_{s: WP(s)=−} PMI(w, s)
If WP(w) of a candidate word w is larger than 0, then w is a positive word; otherwise it is negative. After mining sentiment words, aspect-sentiment pairs are extracted by simple constraints. An aspect-sentiment pair refers to the mention of an aspect and its corresponding sentiment word. Thus, a sentiment word together with its nearest noun is considered as an aspect-sentiment pair. The maximum distance between the aspect word and the sentiment word of a pair is empirically limited to no more than 3 tokens.
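The PMI-based polarity step can be sketched as follows. This is a deliberate simplification: co-occurrence is counted at the sentence level rather than within the part-of-speech-pattern windows of Turney (2002), and `mine_polarity` is an illustrative name, not the authors' code.

```python
import math
from collections import Counter
from itertools import combinations

def mine_polarity(corpus, pos_seeds, neg_seeds):
    """Sketch of PMI-based polarity mining. corpus is a list of token
    lists; co-occurrence here means appearing in the same sentence.
    Returns a function WP(w) = sum of PMI with positive seeds minus
    sum of PMI with negative seeds."""
    word_cnt, pair_cnt, n = Counter(), Counter(), 0
    for sent in corpus:
        uniq = set(sent)
        word_cnt.update(uniq)
        pair_cnt.update(frozenset(p) for p in combinations(sorted(uniq), 2))
        n += 1

    def pmi(a, b):
        joint = pair_cnt[frozenset((a, b))]
        if not joint or not word_cnt[a] or not word_cnt[b]:
            return 0.0  # unseen pair contributes nothing
        return math.log((joint / n) / ((word_cnt[a] / n) * (word_cnt[b] / n)))

    def wp(w):
        return (sum(pmi(w, s) for s in pos_seeds)
                - sum(pmi(w, s) for s in neg_seeds))

    return wp
```

A word that tends to co-occur with positive seeds ends up with WP(w) > 0 and is labeled positive, matching the decision rule above.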
Consequently, the mined sentiment knowledge G contains a collection of sentiment words with their polarity, along with a set of aspect-sentiment pairs. Our research focuses for now on demonstrating the necessity of integrating sentiment knowledge into pre-training by virtue of a relatively simple mining method. We believe that a more fine-grained method would further improve the quality of the knowledge, and we will explore this in the near future.

Sentiment Masking
Sentiment masking aims to construct a corrupted version for each input sequence where sentiment information is masked. Our sentiment masking is directed by sentiment knowledge, which is quite different from previous random word masking. This process contains sentiment detection and hybrid sentiment masking that are as follows.
Sentiment Detection with Knowledge Sentiment detection recognizes both sentiment words and aspect-sentiment pairs by matching input sequences with the mined sentiment knowledge G.
1. Sentiment Word Detection. The word detection is straightforward. If a word of an input sequence also occurs in the knowledge base G, then this word is seen as a sentiment word.
2. Aspect-Sentiment Pair Detection. The detection of an aspect-sentiment pair is similar to its mining described before. A detected sentiment word and its nearby noun word are considered as an aspect-sentiment pair candidate, and the maximum distance of these two words is limited to 3. Thus, if such a candidate is also found in mined knowledge G, then it is considered as an aspect-sentiment pair.
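The two detection rules can be sketched together. The toy noun list below stands in for a real part-of-speech tagger, and all names are illustrative:

```python
def detect(tokens, senti_words, mined_pairs, max_dist=3):
    """Sketch of knowledge-based sentiment detection.
    senti_words: {word: polarity} from the mined knowledge G.
    mined_pairs: set of (aspect, sentiment) pairs from G.
    Returns sentiment-word positions and (aspect, sentiment) position pairs."""
    NOUNS = {"product", "battery", "screen"}   # stand-in for a POS tagger
    # Rule 1: a word occurring in G is a sentiment word.
    words = [i for i, t in enumerate(tokens) if t in senti_words]
    # Rule 2: a nearby noun + sentiment word is a pair candidate;
    # it counts as a pair only if it also occurs in G.
    pairs = []
    for i in words:
        for j in range(max(0, i - max_dist), min(len(tokens), i + max_dist + 1)):
            if tokens[j] in NOUNS and (tokens[j], tokens[i]) in mined_pairs:
                pairs.append((j, i))
    return words, pairs
```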
Hybrid Sentiment Masking Sentiment detection results in three types of tokens for an input sequence: aspect-sentiment pairs, sentiment words and common tokens. The process of masking a sequence runs in the following steps:

1. Aspect-sentiment Pair Masking. At most 2 aspect-sentiment pairs are randomly selected to mask. All tokens of a pair are replaced by [MASK] simultaneously. This masking provides a way to capture the combination of an aspect word and a sentiment word.

2. Sentiment Word Masking. For the unmasked sentiment words, some are randomly selected and all of their tokens are substituted with [MASK] at the same time. The total number of tokens masked in this step is limited to less than 10%.

3. Common Token Masking. If the number of tokens masked in step 2 is insufficient, say less than 10%, the budget is filled in this step with randomly-selected common tokens. Here, random token masking is the same as in RoBERTa. [1]
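The three masking steps can be sketched as follows. The index handling and the treatment of the 10% budget are illustrative simplifications of the procedure above, and `hybrid_mask` is not the authors' implementation:

```python
import random

MASK = "[MASK]"

def hybrid_mask(tokens, pairs, senti_idx, rate=0.10, rng=random.Random(0)):
    """Illustrative sketch of SKEP's three masking steps:
    1) mask up to 2 detected aspect-sentiment pairs (both tokens at once),
    2) mask randomly-chosen sentiment words,
    3) fill the remaining ~10% token budget with random common tokens."""
    out, masked = list(tokens), set()
    for a, s in rng.sample(pairs, min(2, len(pairs))):      # step 1: pair masking
        out[a] = out[s] = MASK
        masked |= {a, s}
    budget = max(1, int(len(tokens) * rate))                # ~10% shared by steps 2 and 3
    senti = [i for i in senti_idx if i not in masked]
    rng.shuffle(senti)
    for i in senti[:budget]:                                # step 2: sentiment words
        out[i] = MASK
        masked.add(i)
    remaining = budget - min(budget, len(senti))
    common = [i for i in range(len(tokens)) if i not in masked]
    rng.shuffle(common)
    for i in common[:remaining]:                            # step 3: common tokens
        out[i] = MASK
        masked.add(i)
    return out, masked
```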

Sentiment Pre-training Objectives
Sentiment masking produces corrupted token sequences X̃, where sentiment information is substituted with masked tokens. Three sentiment objectives are defined that require the transformer encoder to recover the replaced sentiment information. The three objectives, Sentiment Word (SW) prediction L_sw, Word Polarity (WP) prediction L_wp and Aspect-sentiment Pair (AP) prediction L_ap, are jointly optimized. Thus, the overall pre-training objective L is:

L = L_sw + L_wp + L_ap

[1] For each sentence, we always mask 10% of its tokens in total at steps 2 and 3. Among these masked tokens, 79.9% are sentiment words (during step 2) and 20.1% are common words (during step 3) in our experiments.

Sentiment Word Prediction Sentiment word prediction recovers the masked tokens of sentiment words using the output vector x̃_i of the transformer encoder. x̃_i is fed into an output softmax layer, which produces a normalized probability vector ŷ_i over the entire vocabulary. The sentiment word prediction objective L_sw thus maximizes the probability of the original sentiment word x_i as follows:

ŷ_i = softmax(x̃_i W + b)

L_sw = − Σ_{i=1}^{n} m_i · y_i · log ŷ_i

Here, W and b are the parameters of the output layer. m_i = 1 if the i-th position of a sequence is a masked sentiment word, [2] otherwise it equals 0. y_i is the one-hot representation of the original token x_i. Despite a certain similarity to the MLM of BERT, our sentiment word prediction has a different purpose. Instead of predicting randomly masked tokens, this sentiment objective selects sentiment words for self-supervision. As sentiment words play a key role in sentiment analysis, the representation learned here is expected to be more suitable for sentiment analysis.
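The SW objective L_sw is a cross-entropy computed only at masked sentiment positions. A dependency-free sketch, with plain lists standing in for tensors and illustrative names throughout:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sw_loss(logits, mask_flags, targets):
    """Sentiment Word prediction loss L_sw: negative log-likelihood of the
    original sentiment word, summed over masked sentiment positions only.
    logits[i] plays the role of the output-layer scores for position i;
    mask_flags[i] is m_i and targets[i] is the original token id."""
    loss = 0.0
    for i, (z, m) in enumerate(zip(logits, mask_flags)):
        if m:  # m_i = 1 only at masked sentiment words
            loss -= math.log(softmax(z)[targets[i]])
    return loss
```

Because m_i gates the sum, positions masked as common tokens (step 3 of masking) contribute nothing to this objective.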

Word Polarity Prediction
Word polarity is crucial for sentiment analysis. For example, the traditional lexicon-based model (Turney, 2002) directly utilizes word polarity to classify the sentiment of texts. To incorporate this knowledge into the encoder, an objective called word polarity prediction L_wp is further introduced. L_wp is similar to L_sw. For each masked sentiment token x_i, L_wp calculates its polarity (positive or negative) using the final state x̃_i. The target polarity corresponds to the polarity of the original sentiment word, which can be found in the mined knowledge.
Aspect-sentiment Pair Prediction Aspect-sentiment pairs reveal more information than sentiment words do. Therefore, in order to capture the dependency between aspect and sentiment, an aspect-sentiment pair objective is proposed. Notably, the words in a pair are not mutually exclusive. This is quite different from BERT, which assumes that tokens can be independently predicted.
We thus conduct aspect-sentiment pair prediction with multi-label classification. We use the final state of the classification token [CLS], which denotes the representation of the entire sequence, to predict pairs. A sigmoid activation function is utilized, which allows multiple tokens to occur in the output at the same time. The aspect-sentiment pair objective L_ap is denoted as follows:

ŷ_a = sigmoid(x̃_1 W_ap + b_ap)

L_ap = − Σ_{a=1}^{A} y_a · log ŷ_a

Here, x̃_1 denotes the output vector of [CLS]. A is the number of masked aspect-sentiment pairs in a corrupted sequence. ŷ_a is the word probability normalized by sigmoid. y_a is the sparse representation of a target aspect-sentiment pair. Each element of y_a corresponds to one token of the vocabulary, and equals 1 if the target aspect-sentiment pair contains the corresponding token. [3] As multiple elements of y_a equal 1, the prediction here is multi-label classification. [4]
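A sketch of this multi-label pair loss, with plain floats in place of tensors and illustrative names. The key contrast with the SW objective is the per-token sigmoid, which lets several vocabulary entries be "on" at once, whereas a softmax would force them to compete:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ap_loss(cls_logits, pair_token_ids):
    """Aspect-sentiment Pair loss L_ap as multi-label classification.
    cls_logits plays the role of the [CLS]-state scores over the vocabulary;
    pair_token_ids lists, for each masked pair, the token ids it contains
    (the positions where the sparse target y_a equals 1)."""
    probs = [sigmoid(z) for z in cls_logits]    # independent per-token probabilities
    loss = 0.0
    for pair in pair_token_ids:                  # one masked pair per entry
        for tok in pair:                         # every token of the pair is a label
            loss -= math.log(probs[tok])
    return loss
```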

Fine-tuning for Sentiment Analysis
We verify the effectiveness of SKEP on three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling. On top of the pre-trained transformer encoder, an output layer is added to perform task-specific prediction. The neural network is then fine-tuned on task-specific labeled data.

Sentence-level Sentiment Classification
This task is to classify the sentiment polarity of an input sentence. The final state vector of classification token [CLS] is used as the overall representation of an input sentence. On top of the transformer encoder, a classification layer is added to calculate the sentiment probability based on the overall representation.
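This fine-tuning head can be sketched as follows. Here `encode` stands in for the pre-trained transformer (one vector per position), and the 2-class layer (W, b) is an illustrative assumption rather than the released implementation:

```python
import math

def classify_sentence(encode, tokens, W, b):
    """Sketch of sentence-level fine-tuning on top of SKEP: the final state
    of [CLS] serves as the overall sentence representation, followed by a
    softmax classification layer producing sentiment probabilities."""
    h = encode(["[CLS]"] + tokens)[0]            # [CLS] final state
    logits = [sum(w * x for w, x in zip(row, h)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)                              # stable softmax
    exps = [math.exp(z - m) for z in logits]
    return [e / sum(exps) for e in exps]         # e.g. [P(positive), P(negative)]
```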
Aspect-level Sentiment Classification This task aims to analyze the fine-grained sentiment of an aspect when given a contextual text. Thus, there are two parts in the input: the aspect description and the contextual text.

Opinion Role Labeling
This task is to detect fine-grained opinion elements, such as holders and targets, from input texts. Following SRL4ORL (Marasović and Frank, 2018), this task is converted into sequence labeling, which uses the BIOS scheme for labeling, and a CRF layer is added to predict the labels. [5]

Dataset and Evaluation
A variety of English sentiment analysis datasets are used in this paper. For aspect-level sentiment classification, SemEval 2014 Task 4 (Pontiki et al., 2014) is used. This task covers both a restaurant domain and a laptop domain, whose accuracy is evaluated separately.
For opinion role labeling, the MPQA 2.0 dataset (Wiebe et al., 2005; Wilson, 2008) is used. MPQA aims to extract the targets and the holders of opinions. Here we follow the evaluation method of SRL4ORL (Marasović and Frank, 2018), which is released and available online. 4-fold cross-validation is performed, and the F1 scores of both holder and target are reported.
To perform the sentiment pre-training of SKEP, the training part of Amazon-2 is used, which is the largest dataset listed in Table 1. Notably, pre-training only uses raw texts without any sentiment annotation. To reduce the dependency on manually-constructed knowledge and provide SKEP with the least supervision, we use only 46 sentiment seed words. Please refer to the appendix for more details about the seed words.

Experiment Setting
We use RoBERTa as our baseline, which is one of the best pre-training models. Both the base and large versions of RoBERTa are used. RoBERTa_base and RoBERTa_large contain 12 and 24 transformer layers respectively. As the pre-training method is quite costly in terms of GPU resources, most of the experiments are done on RoBERTa_base, and only the main results report the performance of RoBERTa_large.
For SKEP, the transformer encoder is first initialized with RoBERTa, and then pre-trained on the sentiment unlabeled data. An input sequence is truncated to 512 tokens. The learning rate is kept at 5e-5, and the batch size is 8192. The number of epochs is set to 3. For the fine-tuning on each dataset, we run 3 times with random seeds for each combination of parameters (Table 2), and choose the median checkpoint for testing according to the performance on the development set.

Main Results
We compare our SKEP method with the strong pre-training baseline RoBERTa and the previous SOTA. The results are shown in Table 3.
Comparing with RoBERTa, SKEP significantly and consistently improves the performance in both the base and large settings. Even on RoBERTa_large, SKEP achieves an improvement of up to 2.4 points. Across task types, SKEP achieves larger improvements on the fine-grained tasks, aspect-level classification and opinion role labeling, which are supposed to be more difficult than sentence-level classification. We attribute this to the aspect-sentiment knowledge, which is more effective for these tasks. Interestingly, "RoBERTa_base + SKEP" always outperforms RoBERTa_large, except on Amazon-2. As the large version of RoBERTa is computationally expensive, the base version of SKEP provides an efficient model for applications. Compared with previous SOTA, SKEP achieves new state-of-the-art results on almost all datasets, with a less satisfactory result only on SST-2.

[Table 5: Attention visualization samples. SST-2 sentence: "altogether, this is successful as a film, while at the same time being a most touching reconsideration of the familiar masterpiece." (RoBERTa: positive; SKEP: positive). Sem-L sentence: "I got this at an amazing price from Amazon and it arrived just in time." (RoBERTa: negative; SKEP: positive). Wavy underlines in the original mark sentiment words such as "successful", "masterpiece" and "amazing".]
Overall, through comparisons of various sentiment tasks, the results strongly verify the necessity of incorporating sentiment knowledge for pretraining methods, and also the effectiveness of our proposed sentiment pre-training method.

Detailed Analysis
Effect of Sentiment Knowledge SKEP uses additional sentiment data for further pre-training and utilizes three objectives to incorporate three types of knowledge. Table 4 compares the contributions of these factors. With further pre-training on Amazon using random sub-word masking, RoBERTa_base obtains some improvement. This proves the value of large-scale task-specific unlabeled data. However, the improvement is less evident than that of sentiment word masking, which indicates the importance of sentiment word knowledge. Further improvements are obtained when the word polarity and aspect-sentiment pair objectives are added, confirming the contribution of both types of knowledge. Comparing "+SW+WP+AP" with "+Random Token", the improvements are consistently significant on all evaluated datasets, reaching up to about 1.5 points.
Overall, from the comparison of objectives, we conclude that sentiment knowledge is helpful, and more diverse knowledge results in better performance. This also encourages us to use more types of knowledge and use better mining methods in the future.
Effect of Multi-label Optimization Multi-label classification is proposed to deal with the dependency within an aspect-sentiment pair. To confirm the necessity of capturing this dependency, we also compare it with a method where each token is predicted independently, denoted by AP-I. AP-I uses softmax for normalization, and independently predicts each word of a pair, as in sentiment word prediction. According to the last line of Table 4, which contains AP-I, predicting the words of a pair independently does not hurt the performance of sentence-level classification. This is reasonable, as the sentence-level task mainly relies on sentiment words. In contrast, on aspect-level classification and opinion role labeling, multi-label classification is effective and yields improvements of up to 0.6 points. This shows that multi-label classification does capture the dependency between aspect and sentiment better, and confirms the necessity of modeling such dependency.

Comparison of Vectors for Aspect-Sentiment Pair Prediction SKEP utilizes the sentence representation, which is the final state of the classification token [CLS], for aspect-sentiment pair prediction. We call this the Sent-Vector method. Another way is to use the concatenation of the final vectors of the two words in a pair, which we call Pair-Vector. As shown in Table 6, the performances of these two choices are very close. We suppose this is due to the robustness of the pre-training approach. As using a single vector for prediction is more efficient, we use the final state of the [CLS] token in SKEP.

Attention Visualization Table 5 shows the attention distribution of the final layer for the [CLS] token when we adopt our SKEP model to classify the input sentences. On the SST-2 example, although RoBERTa gives a correct prediction, its attention to sentiment is inaccurate. On the Sem-L case, RoBERTa fails to attend to the word "amazing", and produces a wrong prediction. In contrast, SKEP produces correct predictions and appropriate attention to sentiment information in both cases. This indicates that SKEP is more interpretable.

Related Work
Sentiment Analysis with Knowledge Various types of sentiment knowledge, including sentiment words, word polarity, aspect-sentiment pairs, have been proved to be useful for a wide range of sentiment analysis tasks.
Sentiment words with their polarity are widely used for sentiment analysis, including sentence-level sentiment classification (Taboada et al., 2011; Shin et al., 2017; Lei et al., 2018; Barnes et al., 2019), aspect-level sentiment classification (Vo and Zhang, 2015), opinion extraction (Li and Lam, 2017), emotion analysis (Gui et al., 2017; Fan et al., 2019) and so on. Lexicon-based methods (Turney, 2002; Taboada et al., 2011) directly utilize the polarity of sentiment words for classification. Traditional feature-based approaches encode sentiment word information in manually-designed features to improve supervised models (Pang et al., 2008; Agarwal et al., 2011). In contrast, deep learning approaches enhance the embedding representation with the help of sentiment words (Shin et al., 2017), or absorb sentiment knowledge through linguistic regularization (Qian et al., 2017; Fan et al., 2019).
Aspect-sentiment pair knowledge is also useful for aspect-level classification and opinion extraction. Previous works often provide weak supervision by this type of knowledge, either for aspect-level classification (Zeng et al., 2019) or for opinion extraction (Yang et al., 2017;Ding et al., 2017).
Although studies of exploiting sentiment knowledge have been made throughout the years, most of them build a specific mechanism for each sentiment analysis task, so different knowledge is adopted to support different tasks. In contrast, our method incorporates diverse knowledge in pre-training to provide a unified sentiment representation for sentiment analysis tasks.
Pre-training Approaches Pre-training methods have remarkably improved natural language processing, using self-supervised training with large-scale unlabeled data. This line of research has advanced dramatically in recent years, and various types of methods have been proposed, including ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLNet and so on. Among them, BERT pre-trains a bidirectional transformer by randomly-masked word prediction, and has shown strong performance gains. RoBERTa further improves BERT by robust optimization, and has become one of the best pre-training methods.
Inspired by BERT, some works propose fine-grained objectives beyond random word masking. SpanBERT masks spans of words at the same time. ERNIE proposes to mask entity words. On the other hand, pre-training for specific tasks is also studied. GlossBERT (Huang et al., 2019) exploits gloss knowledge to improve word sense disambiguation. SenseBERT (Levine et al., 2019) uses WordNet super-senses to improve word-in-context tasks. A different ERNIE exploits entity knowledge for entity linking and relation classification.

Conclusion
In this paper, we propose Sentiment Knowledge Enhanced Pre-training for sentiment analysis. Sentiment masking and three sentiment pre-training objectives are designed to incorporate various types of knowledge into the pre-training model. Though conceptually simple, SKEP is empirically highly effective. It significantly outperforms the strong pre-training baseline RoBERTa, and achieves new state-of-the-art results on most datasets of three typical sentiment analysis tasks. Our work verifies the necessity of utilizing sentiment knowledge for pre-training models, and provides a unified sentiment representation for a wide range of sentiment analysis tasks.
In the future, we hope to apply SKEP to more sentiment analysis tasks to further examine its generalization, and we are also interested in exploiting more types of sentiment knowledge and more fine-grained sentiment mining methods.