SentiLARE: Linguistic Knowledge Enhanced Language Representation for Sentiment Analysis

Most existing pre-trained language representation models neglect the linguistic knowledge of texts, which can promote language understanding in NLP tasks. To benefit the downstream tasks in sentiment analysis, we propose a novel language representation model called SentiLARE, which introduces word-level linguistic knowledge, including part-of-speech tags and sentiment polarity (inferred from SentiWordNet), into pre-trained models. We first propose a context-aware sentiment attention mechanism to acquire the sentiment polarity of each word with its part-of-speech tag by querying SentiWordNet. Then, we devise a new pre-training task called label-aware masked language model to construct knowledge-aware language representation. Experiments show that SentiLARE obtains new state-of-the-art performance on a variety of sentiment analysis tasks.


Introduction
Recently, pre-trained language representation models such as GPT (Radford et al., 2018), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2019) have achieved promising results in NLP tasks, including sentiment analysis (Xu et al., 2019, 2020; Yin et al., 2020). These models capture contextual information from large-scale corpora via well-designed pre-training tasks. The literature has commonly reported that pre-trained models can be used as effective feature extractors and achieve state-of-the-art performance on various downstream tasks (Wang et al., 2019a).
Despite the great success of pre-trained models, existing pre-training tasks like masked language model and next sentence prediction (Devlin et al., 2019) neglect linguistic knowledge. Such knowledge is important for some NLP tasks, particularly for sentiment analysis. For instance, existing work has shown that linguistic knowledge including part-of-speech tags (Qian et al., 2015; Huang et al., 2017) and word-level sentiment polarity (Qian et al., 2017) is closely related to the sentiment of longer texts. We argue that pre-trained models enriched with the linguistic knowledge of words will facilitate the understanding of the sentiment of whole texts, thereby resulting in better performance on sentiment analysis.
There are two major challenges in constructing knowledge-aware pre-trained language representation models that can promote the downstream tasks in sentiment analysis: 1) Knowledge acquisition across different contexts. Most of the existing work has adopted static sentiment lexicons as the linguistic resource (Qian et al., 2017), equipping each word with a fixed sentiment polarity across different contexts. However, the same word may play different sentiment roles in different contexts due to the variety of part-of-speech tags and word senses. 2) Knowledge integration into pre-trained models. Since the introduced word-level linguistic knowledge can only reflect the local sentiment role played by each word, it is important to deeply integrate knowledge into pre-trained models to construct sentence-level language representation, which can derive the global sentiment label of a whole sentence from local information. How to build the connection between sentence-level language representation and word-level linguistic knowledge is underexplored.
In this paper, we propose a novel pre-trained language representation model called SentiLARE to deal with these challenges. First, to acquire the linguistic knowledge of each word, we label the word with its part-of-speech tag, and obtain the sentiment polarity via a context-aware sentiment attention mechanism over all the matched senses in SentiWordNet (Baccianella et al., 2010). Then, to incorporate linguistic knowledge into pre-trained models, we devise a novel pre-training task called label-aware masked language model. This task involves two sub-tasks: 1) predicting the word, part-of-speech tag, and sentiment polarity at masked positions given the sentence-level sentiment label; 2) predicting the sentence-level label, the masked word, and its linguistic knowledge including part-of-speech tag and sentiment polarity simultaneously. We call the first sub-task early fusion since the sentiment labels are integrated beforehand as input embeddings, whereas in the second sub-task, the labels are used as late supervision to the model in the output layer. These two sub-tasks are expected to establish the connection between sentence-level representation and word-level linguistic knowledge, which can benefit downstream tasks in sentiment analysis. Our contributions are threefold:
• We investigate the effectiveness of incorporating linguistic knowledge into pre-trained language representation models, and we reveal that injecting such knowledge via pre-training tasks can benefit various downstream tasks in sentiment analysis.
• We propose a novel pre-trained language representation model called SentiLARE. This model derives a context-aware sentiment polarity for each word using SentiWordNet, and adopts a pre-training task named labelaware masked language model to construct sentiment-aware language representations.
• We conduct extensive experiments on sentence-level and aspect-level sentiment analysis (including extraction and classification). Results show that SentiLARE obtains new state-of-the-art performance on a variety of sentiment analysis tasks.

Related Work
General Pre-trained Language Models Recently, pre-trained language representation models including ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019) have become prevalent. These models use LSTM (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017) as the encoder to acquire contextual language representation, and explore various pre-training tasks including masked language model and next sentence prediction (Devlin et al., 2019). Thanks to the great success of BERT on various NLP tasks, many variants of BERT have been proposed, which mainly fall into four aspects: 1) Knowledge enhancement: ERNIE-Tsinghua (Zhang et al., 2019) / KnowBERT (Peters et al., 2019) explicitly introduces a knowledge graph / knowledge base to BERT, while ERNIE-Baidu (Sun et al., 2019b) designs entity-specific masking strategies during pre-training. 2) Transferability: TransBERT (Li et al., 2019) conducts supervised post-training on the pre-trained BERT with transfer tasks to get a better initialization for target tasks.
3) Hyper-parameters: RoBERTa (Liu et al., 2019) measures the impact of key hyper-parameters to improve the under-trained BERT. 4) Pre-training tasks: SpanBERT (Joshi et al., 2020) masks consecutive spans randomly instead of individual tokens, while XLNet (Yang et al., 2019) designs a training objective combining both reconstruction and autoregressive language modeling. Pre-trained Models for Sentiment Analysis Another line of work aims to build task-specific pre-trained models via post-training on the task data (Gururangan et al., 2020). For sentiment analysis, BERT-PT (Xu et al., 2019) conducts post-training on corpora from the same domain as the downstream tasks to benefit aspect-level sentiment analysis. DomBERT (Xu et al., 2020) augments the training samples from relevant domains during the pre-training phase to enhance performance on aspect-level sentiment analysis in target domains. SentiBERT (Yin et al., 2020) devises a two-level attention mechanism on top of the BERT representation to capture phrase-level compositional semantics.
Compared with the existing work on pre-trained models for sentiment analysis, our work integrates sentiment-related linguistic knowledge from SentiWordNet (Baccianella et al., 2010) into pre-trained models to construct knowledge-aware language representation, which can benefit a wide range of downstream tasks in sentiment analysis.

Figure 1: Overview of SentiLARE. This model first labels each word with its part-of-speech tag, and then uses the word and tag to match the corresponding senses in SentiWordNet. The sentiment polarity of each word is obtained by weighting the matched senses with context-aware sentiment attention. During pre-training, the model is trained based on label-aware masked language model including early fusion and late supervision. Red dotted boxes denote that the linguistic knowledge is used in the input embedding or the pre-training loss function.

Linguistic Knowledge for Sentiment Analysis Linguistic knowledge such as part of speech and word-level sentiment polarity is commonly used as external features in sentiment analysis. Part of speech is shown to facilitate the parsing of the syntactic structure of texts (Socher et al., 2013), and can also be incorporated into all layers of an RNN as tag embeddings (Qian et al., 2015). Huang et al. (2017) show that part of speech can help to learn sentiment-favorable representations.
Word-level sentiment polarity is mostly derived from sentiment lexicons (Hu and Liu, 2004; Wilson et al., 2005). Guerini et al. (2013) obtain the prior sentiment polarity by weighting the sentiment scores over all the senses of a word in SentiWordNet (Esuli and Sebastiani, 2006; Baccianella et al., 2010). Teng et al. (2016) propose a context-aware lexicon-based weighted sum model, which weights the prior sentiment scores of sentiment words to derive the sentiment label of the whole sentence. Qian et al. (2017) model the linguistic roles of sentiment, negation, and intensity words via linguistic regularizers in the training objective of LSTM.

Task Definition and Model Overview
Our task is defined as follows: given a text sequence X = (x_1, x_2, ..., x_n) of length n, our goal is to acquire the representation of the whole sequence H = (h_1, h_2, ..., h_n) ∈ R^{n×d} that captures the contextual information and the linguistic knowledge simultaneously, where d is the dimension of the representation vectors. Figure 1 shows the overview of our model, which consists of two steps: 1) acquiring the part-of-speech tag and the sentiment polarity for each word; 2) conducting pre-training via the label-aware masked language model, which contains two pre-training sub-tasks, i.e., early fusion and late supervision. Compared with existing BERT-style pre-trained models, our model enriches the input sequence with its linguistic knowledge, including part-of-speech tags and sentiment polarity, and utilizes the label-aware masked language model to capture the relationship between sentence-level language representation and word-level linguistic knowledge.

Linguistic Knowledge Acquisition
This module obtains the part-of-speech tag and the sentiment polarity for each word. The input of this module is a text sequence X = (x_1, x_2, ..., x_n), where x_i (1 ≤ i ≤ n) is a word in the vocabulary. First, our model acquires the part-of-speech tag pos_i of each word x_i via the Stanford Log-Linear Part-of-Speech Tagger. For simplicity, we only consider five POS tags: verb (v), noun (n), adjective (a), adverb (r), and others (o).
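As a concrete illustration, the five-way tag set can be derived from a fine-grained tagger's output with a simple mapping. The sketch below assumes Penn-Treebank-style tags (as produced by the Stanford tagger); the mapping code itself is our illustration, not the paper's implementation.

```python
# Map fine-grained Penn Treebank tags to the five coarse classes used by
# SentiLARE: verb (v), noun (n), adjective (a), adverb (r), others (o).
# The tagger is assumed to have produced the fine-grained tags already.

def coarse_pos(ptb_tag: str) -> str:
    if ptb_tag.startswith("VB"):
        return "v"
    if ptb_tag.startswith("NN"):
        return "n"
    if ptb_tag.startswith("JJ"):
        return "a"
    if ptb_tag.startswith("RB"):
        return "r"
    return "o"

tagged = [("the", "DT"), ("food", "NN"), ("was", "VBD"),
          ("really", "RB"), ("great", "JJ")]
coarse = [(w, coarse_pos(t)) for w, t in tagged]
```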
Then, we acquire the word-level sentiment polarity polar_i from SentiWordNet for each pair (x_i, pos_i). In SentiWordNet, we can find m different senses for the pair (x_i, pos_i), each of which contains a sense number, a positive / negative score, and a gloss:

(SN_i^{(j)}, Pscore_i^{(j)}, Nscore_i^{(j)}, G_i^{(j)}), 1 ≤ j ≤ m,

where SN indicates the rank of the sense, Pscore / Nscore is the positive / negative score assigned by SentiWordNet, and G denotes the definition (gloss) of the sense. Inspired by the existing work on inferring word-level prior polarity from SentiWordNet (Guerini et al., 2013) and unsupervised word sense disambiguation (Basile et al., 2014), we propose a context-aware attention mechanism which simultaneously considers the sense rank and the context-gloss similarity to determine the attention weight of each sense:

α_i^{(j)} = ( (1 / SN_i^{(j)}) · sim(X, G_i^{(j)}) ) / Σ_{k=1}^{m} ( (1 / SN_i^{(k)}) · sim(X, G_i^{(k)}) ),

where the factor 1 / SN_i^{(j)} approximates the impact of sense frequency, because a smaller sense rank indicates more frequent use of that sense in natural language (Guerini et al., 2013), and sim(X, G_i^{(j)}) denotes the textual similarity between the context and the gloss of each sense, which is commonly used as an important feature in unsupervised word sense disambiguation (Basile et al., 2014). To calculate the similarity between X and G_i^{(j)}, we encode them with Sentence-BERT (SBERT) (Reimers and Gurevych, 2019), which achieves state-of-the-art performance on semantic textual similarity tasks, and obtain the cosine similarity between the resulting vectors:

sim(X, G_i^{(j)}) = cos(SBERT(X), SBERT(G_i^{(j)})).

Once we obtain the attention weight of each sense, we can calculate the sentiment score of each pair (x_i, pos_i) by weighting the scores of all the senses:

s(x_i, pos_i) = Σ_{j=1}^{m} α_i^{(j)} · (Pscore_i^{(j)} − Nscore_i^{(j)}).

Finally, the word-level sentiment polarity polar_i for the pair (x_i, pos_i) is assigned Positive / Negative / Neutral when s(x_i, pos_i) is positive / negative / zero, respectively. Note that if we cannot find any sense for (x_i, pos_i) in SentiWordNet, polar_i is assigned Neutral.
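The attention-weighted polarity computation can be sketched as follows. The sense list and the context-gloss similarities are toy values here (in the model, similarities come from Sentence-BERT), and `word_polarity` is a hypothetical helper name, not the paper's code.

```python
# Sketch of the context-aware sentiment attention over SentiWordNet senses.
# Each sense is (rank, pos_score, neg_score, context_gloss_similarity); the
# similarity is treated as precomputed (SBERT cosine similarity in the model).

def word_polarity(senses):
    if not senses:
        return "neutral"          # no matched sense -> neutral by convention
    # attention weight: (1 / rank) * sim(context, gloss), normalized
    raw = [(1.0 / rank) * sim for rank, _, _, sim in senses]
    z = sum(raw)
    weights = [r / z for r in raw]
    # weighted sentiment score over all senses
    score = sum(w * (p - n) for w, (_, p, n, _) in zip(weights, senses))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# hypothetical senses for an adjective like "cool"
senses = [(1, 0.625, 0.0, 0.8),   # frequent, positive sense, similar gloss
          (2, 0.0, 0.25, 0.3)]    # rarer, negative sense, dissimilar gloss
polarity = word_polarity(senses)
```

Here the frequent, context-similar positive sense dominates the attention weights, so the word is labeled positive despite the negative minority sense.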

Pre-training Task
Given the knowledge-enhanced text sequence X̃ = {(x_1, pos_1, polar_1), ..., (x_n, pos_n, polar_n)}, the goal of the pre-training task is to construct the knowledge-aware representation vectors H = (h_1, ..., h_n) which can promote the downstream tasks in sentiment analysis. We devise a new supervised pre-training task called label-aware masked language model (LA-MLM), which introduces the sentence-level sentiment label l into the pre-training phase to capture the dependency between sentence-level language representation and individual words. It contains two separate sub-tasks: early fusion and late supervision.

Early Fusion
The purpose of early fusion is to recover the masked sequence conditioned on the sentence-level label, as shown in Figure 1. Let X̃_k denote the knowledge-enhanced text sequence with some masked tokens. We obtain the representation vectors with the input of X̃_k and the sentence-level sentiment label l:

(h^{EF}_{cls}, h^{EF}_1, ..., h^{EF}_n, h^{EF}_{sep}) = Transformer(X̃_k, l),

where h^{EF}_{cls} and h^{EF}_{sep} are the hidden states of the special tokens [CLS] and [SEP]. The input embeddings of X̃_k contain the embeddings used in BERT (Devlin et al., 2019), the part-of-speech (POS) embedding, and the word-level polarity embedding. Additionally, the embedding of the sentence-level sentiment label l is added early, to the input embeddings. The model is required to predict the word, part-of-speech tag, and word-level polarity at the masked positions individually, so the loss function is devised as follows:

L_EF = − Σ_{t=1}^{n} m_t · [ log P(x_t | X̃_k, l) + log P(pos_t | X̃_k, l) + log P(polar_t | X̃_k, l) ],

where m_t is an indicator function that equals 1 iff x_t is masked. The prediction probabilities P(x_t | X̃_k, l), P(pos_t | X̃_k, l), and P(polar_t | X̃_k, l) are calculated based on the hidden state h^{EF}_t. This sub-task explicitly exerts the impact of the global sentiment label on the words and the linguistic knowledge of words, enhancing the ability of our model to explore the complex connection among them.

Table 1: Fine-tuning setting of SentiLARE on downstream tasks. Both x_1 ... x_n and y_1 ... y_m indicate text sequences, while a_1 ... a_l denotes the aspect term / category sequence. The output hidden states are then used in the classification / regression layer.
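The early-fusion objective can be illustrated with toy prediction probabilities standing in for actual model outputs; `early_fusion_loss` is a hypothetical helper, not the paper's code.

```python
import math

# Sketch of the early-fusion loss: at each masked position t, sum the negative
# log-likelihoods of the word, POS tag, and word-level polarity, all implicitly
# conditioned on the masked sequence and the sentence-level label l. In a real
# model the probabilities come from softmax heads over the hidden state h_t.

def early_fusion_loss(positions):
    # positions: list of (is_masked, p_word, p_pos, p_polar)
    loss = 0.0
    for m_t, p_w, p_pos, p_pol in positions:
        if m_t:  # indicator m_t: only masked positions contribute
            loss -= math.log(p_w) + math.log(p_pos) + math.log(p_pol)
    return loss

# one masked position and one unmasked position with toy probabilities
toy = [(1, 0.5, 0.9, 0.8), (0, 0.1, 0.2, 0.3)]
loss = early_fusion_loss(toy)
```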

Late Supervision
The late supervision sub-task aims to predict the sentence-level label and the word information based on the hidden states at [CLS] and the masked positions, respectively, as shown in Figure 1. Similar to early fusion, the representation vectors with the input of X̃_k are obtained as follows:

(h^{LS}_{cls}, h^{LS}_1, ..., h^{LS}_n, h^{LS}_{sep}) = Transformer(X̃_k).

In this sub-task, the sentiment label l is used as a late supervision signal. Thus, the loss function to simultaneously predict the sentence-level sentiment label, words, and linguistic knowledge of words is:

L_LS = − log P(l | X̃_k) − Σ_{t=1}^{n} m_t · [ log P(x_t | X̃_k) + log P(pos_t | X̃_k) + log P(polar_t | X̃_k) ],

where the sentence-level classification probability P(l | X̃_k) is calculated based on the hidden state h^{LS}_{cls}. This sub-task enables our model to capture the implicit relationship between the sentence-level representation at [CLS] and the word-level linguistic knowledge at masked positions.
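The late-supervision objective differs only in the extra sentence-level term and the absence of conditioning on l; again sketched with toy probabilities and a hypothetical helper name.

```python
import math

# Sketch of the late-supervision loss: the sentence-level label is predicted
# from the [CLS] state (the -log p_label term), while the word / POS / polarity
# terms are the same masked-position terms as in early fusion, but without
# conditioning on the label.

def late_supervision_loss(p_label, positions):
    loss = -math.log(p_label)  # sentence-level term from h_cls
    for m_t, p_w, p_pos, p_pol in positions:
        if m_t:
            loss -= math.log(p_w) + math.log(p_pos) + math.log(p_pol)
    return loss

toy = [(1, 0.5, 0.9, 0.8)]
loss = late_supervision_loss(0.8, toy)
```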
Since the two sub-tasks are separate, we empirically provide 80% of the pre-training data to the early fusion sub-task and 20% to late supervision. As for the masking strategy, we increase the probability of masking words with positive / negative sentiment polarity from 15% (the BERT setting) to 30%, because such words are more likely to impact the sentiment of the whole text.
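The sentiment-aware masking strategy can be sketched as follows. This is a simplification: the full BERT-style 80/10/10 mask/random/keep replacement is omitted, and only the position-selection probabilities from the text above are modeled.

```python
import random

# Sketch of sentiment-aware masking: words whose acquired polarity is
# positive or negative are selected for masking with probability 0.30,
# all other words with the standard 0.15.

def mask_sequence(tokens, polarities, rng):
    masked = []
    for tok, pol in zip(tokens, polarities):
        p = 0.30 if pol in ("positive", "negative") else 0.15
        masked.append("[MASK]" if rng.random() < p else tok)
    return masked

rng = random.Random(0)
out = mask_sequence(["the", "food", "was", "great"],
                    ["neutral", "neutral", "neutral", "positive"], rng)
```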

Pre-training Dataset and Implementation
We adopted the Yelp Dataset Challenge 2019 (https://www.yelp.com/dataset/challenge) as our pre-training dataset. This dataset contains 6,685,900 reviews with 5-class review-level sentiment labels. Each review contains 127.8 words on average.
Since our method can adapt to all BERT-style pre-trained models, we used RoBERTa (Liu et al., 2019) as the base framework to construct the Transformer blocks in this paper, and discuss the generalization to other pre-trained models like BERT (Devlin et al., 2019) in the experiments below. The hyper-parameters of the Transformer blocks were set to be the same as RoBERTa-Base due to limited computational power. Considering the high cost of training from scratch, we utilized the parameters of pre-trained RoBERTa to initialize our model. We also followed RoBERTa in using a Byte-Pair Encoding vocabulary (Radford et al., 2019) of size 50,265. The maximum sequence length in the pre-training phase was 128, and the batch size was 400. We took Adam (Kingma and Ba, 2015) as the optimizer and set the learning rate to 5e-5, with a warmup ratio of 0.1. SentiLARE was pre-trained on Yelp Dataset Challenge 2019 for 1 epoch with the label-aware masked language model, which took about 20 hours on 4 NVIDIA RTX 2080 Ti GPUs.
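For reference, the pre-training setup described above can be collected into a single record. The values are the ones stated in this section; the key names are illustrative, not actual HuggingFace argument names.

```python
# Pre-training configuration of SentiLARE as described in the text.
pretrain_config = {
    "base_model": "roberta-base",       # initialized from pre-trained RoBERTa
    "vocab_size": 50265,                # Byte-Pair Encoding vocabulary
    "max_seq_length": 128,
    "batch_size": 400,
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "warmup_ratio": 0.1,
    "epochs": 1,
}
```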

Fine-tuning Setting
We fine-tuned SentiLARE on the downstream tasks including sentence-level sentiment classification, aspect-level sentiment analysis, and general text matching tasks in our experiments. We adopted the fine-tuning settings of the existing work (Devlin et al., 2019; Xu et al., 2019), and show the input format and the output hidden states of each task in Table 1. Note that the input embeddings in all the downstream tasks only contain the BERT embedding, the part-of-speech embedding, and the word-level polarity embedding. The hyper-parameters of fine-tuning on different datasets are reported in the Appendix.

Baselines
We compared SentiLARE with general pre-trained models, task-specific pre-trained models, and task-specific models without pre-training. General Pre-trained Models: We adopted BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) as general pre-trained baselines. These models achieve state-of-the-art performance on various NLP tasks. Task-specific Pre-trained Models: We used BERT-PT (Xu et al., 2019), TransBERT (Li et al., 2019), and SentiBERT (Yin et al., 2020) as task-specific pre-trained baselines. Since TransBERT is not originally designed for sentiment analysis tasks, we chose review-level sentiment classification on Yelp Dataset Challenge 2019 as the transfer task, and the downstream tasks in sentiment analysis as the target tasks. Task-specific Models without Pre-training: We also chose some task-specific baselines without pre-training for the corresponding tasks, including SC-SNN (Chen et al.) among others.

We evaluated all the pre-trained baselines based on the codes and the model parameters provided by the original papers. For a fair comparison, all the pre-trained models were set to the base version, which possess a similar number of parameters (about 110M). The experimental results are presented as mean values over 5 runs. As for the task-specific baselines without pre-training, we re-printed the results on the corresponding benchmarks from the references.

Sentence-level Sentiment Classification

We evaluated on sentence-level sentiment classification datasets including IMDB (Maas et al., 2011) and Yelp-2/5 (Zhang et al., 2015), which are widely used datasets at different scales. The detailed statistics of these datasets are shown in Table 2, including the sizes of the training / validation / test sets, the average length, and the number of classes. Since MR, IMDB, and Yelp-2/5 do not have validation sets, we randomly sampled subsets from the training sets as validation sets, and evaluated all the pre-trained models with the same data split. The results are shown in Table 3.
We can observe that SentiLARE performs better than the baselines on the sentence-level sentiment classification datasets, thereby indicating the effectiveness of our knowledge-aware representation in sentiment understanding. Compared with vanilla RoBERTa, SentiLARE enhances the performance on all the datasets significantly. This demonstrates that for sentiment analysis tasks, linguistic knowledge can be used to enhance the state-of-the-art pre-trained model via the well-designed pre-training task.

Aspect-level Sentiment Analysis
Aspect-level sentiment analysis includes aspect term extraction, aspect term sentiment classification, aspect category detection, and aspect category sentiment classification. For aspect term based tasks, we chose SemEval2014 Task 4 for the laptop (Lap14) and restaurant (Res14) domains; the statistics of these datasets are shown in Table 4. We followed the existing work (Xu et al., 2019) and left 150 examples from the training sets for validation. Since the number of examples with the conflict sentiment label is rather small, we adopted the same setting as the existing work (Tang et al., 2016; Xu et al., 2019) and dropped these examples in the aspect term / category sentiment classification tasks.
We present the results of aspect-level sentiment analysis in Table 5. We can see that SentiLARE outperforms the baselines on all four tasks, and most of the improvement margins are significant. Interestingly, in addition to aspect-level sentiment classification, our model also performs well on aspect term extraction and aspect category detection. Since aspect words are mostly nouns, part-of-speech tags may provide additional knowledge for the extraction task. In addition, aspect terms can be signaled by neighboring sentiment words. This may explain why our knowledge-aware representation helps to extract aspect terms and detect aspect categories.

Ablation Test
To study the effect of the linguistic knowledge and the label-aware masked language model, we conducted ablation tests and present the results in Table 6. Since the two sub-tasks are separate, the setting -EF / -LS indicates that all the pre-training data were fed into the late supervision / early fusion sub-task, and -EF-LS denotes that the pre-training task is changed from the label-aware masked language model to the vanilla masked language model, while the input embeddings still include the part-of-speech and word-level polarity embeddings. The -POS / -POL setting means that we removed the part-of-speech tag / word-level sentiment polarity from the input embeddings, as well as from the supervision signals of the two sub-tasks. Accordingly, -POS-POL indicates the complete removal of linguistic knowledge.
Results in Table 6 show that both the pre-training task and the linguistic knowledge contribute to the improvement over RoBERTa. Compared with early fusion, the late supervision sub-task plays a more important role in the classification tasks which depend on the global representation of the input sequence, such as SSC, ATSC, ACD, and ACSC. Intuitively, the late supervision sub-task may learn a meaningful global representation at [CLS] by simultaneously predicting the sentence-level sentiment labels and the word knowledge. Thus, it contributes more to the performance on these classification tasks.

Table 6: Ablation test on sentiment analysis tasks. EF / LS / POS / POL denotes early fusion / late supervision / part-of-speech tag / word-level polarity, respectively.
As for the impact of linguistic knowledge, the performance of SentiLARE degrades more in the setting of removing the word-level sentiment polarity. This implies that the word-level polarity can help the pre-trained model more to derive the global sentiment in the classification tasks and signal neighboring aspects in the extraction task.

Analysis on Knowledge Acquisition
To investigate whether our proposed context-aware knowledge acquisition method helps construct knowledge-aware language representation, we compared the context-aware sentiment attention described in §3.2 with a context-free prior polarity acquisition algorithm (Guerini et al., 2013). This algorithm acquires a fixed sentiment polarity for each word with its part-of-speech tag by weighting the sentiment score of each sense with the reciprocal of its sense rank, regardless of the context. All the other parts of SentiLARE remain unchanged for the comparison between these two knowledge acquisition methods. Results in Table 7 show that our context-aware method performs better on all the tasks. This demonstrates that our context-aware attention mechanism helps SentiLARE model the sentiment roles of words across different contexts, thereby leading to better knowledge-enhanced language representation.

Analysis on Knowledge Integration
To further demonstrate the importance of the label-aware masked language model, which deeply integrates linguistic knowledge into pre-trained models, we divided the test set of SST into two subsets according to the number of sentiment words (i.e., positive and negative words determined by linguistic knowledge) in the sentences. Since the sentences in SST's test set contain 6.48 sentiment words on average, we partitioned the test set into two subsets: SST-Less contains the sentences with no more than 7 sentiment words, and SST-More includes the other sentences. Intuitively, sentences with more sentiment words are likely to involve more complex sentiment expressions. We compared three models: RoBERTa, which does not use linguistic knowledge; SentiLARE-EF-LS, which simply augments input embeddings with linguistic knowledge as described in §4.6; and SentiLARE, which deeply integrates linguistic knowledge via the pre-training task. Results in Table 8 show that SentiLARE-EF-LS already outperforms RoBERTa remarkably on SST-Less by simply augmenting input features with linguistic knowledge. However, on SST-More, SentiLARE-EF-LS only obtains a marginal improvement over RoBERTa. In contrast, SentiLARE consistently outperforms both RoBERTa and SentiLARE-EF-LS, and the margin between SentiLARE and SentiLARE-EF-LS is more evident on SST-More. This indicates that our pre-training task helps integrate the local sentiment information reflected by word-level linguistic knowledge into the global language representation, and facilitates the understanding of complex sentiment expressions.
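The test-set split above can be sketched as follows, assuming the per-sentence sentiment-word counts have already been computed from the acquired word-level knowledge; the function name is illustrative.

```python
# Split sentences into SST-Less (at most `threshold` sentiment words) and
# SST-More (more than `threshold`), mirroring the analysis in the text.

def split_by_sentiment_words(sentences, counts, threshold=7):
    less, more = [], []
    for sent, c in zip(sentences, counts):
        (less if c <= threshold else more).append(sent)
    return less, more

less, more = split_by_sentiment_words(["s1", "s2", "s3"], [3, 7, 11])
```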

Analysis on Generalization Ability
Generalization to Other Pre-trained Models: To study whether the introduced linguistic knowledge and the proposed pre-training task can improve the performance of pre-trained models other than RoBERTa, we chose BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) as the base frameworks for evaluation. The experimental settings were the same as for SentiLARE based on RoBERTa. Results in Table 9 show that SentiLARE based on BERT / ALBERT outperforms vanilla BERT / ALBERT on the datasets of all the tasks, which demonstrates that our proposed method can adapt to different BERT-style pre-trained models to benefit the tasks in sentiment analysis. Generalization to Other NLP Tasks: Since sentiment is a common feature for improving text matching tasks (Cai et al., 2017; Li et al., 2019), we chose three text matching tasks to explore whether our sentiment-aware representations can also benefit them: story ending prediction, textual entailment, and semantic textual similarity. We evaluated RoBERTa and SentiLARE on SCT (Mostafazadeh et al., 2016), SICK (Marelli et al., 2014), and STSb (Cer et al., 2017) for the three tasks, respectively. The statistics of these datasets are reported in Table 10. We followed the existing work (Li et al., 2019) to preprocess the SCT dataset, and directly adopted the official versions of the SICK and STSb datasets. Results in Table 11 show that SentiLARE enhances the performance of RoBERTa on these text matching tasks. This indicates that sentiment-related linguistic knowledge can be successfully integrated into the pre-trained language representation model to not only benefit sentiment analysis tasks but also generalize to other sentiment-related NLP tasks. We will explore the generalization of our model to more NLP tasks in future work.

Conclusion
We present a novel pre-trained model called SentiLARE for sentiment analysis, which introduces linguistic knowledge from SentiWordNet via context-aware sentiment attention, and adopts the label-aware masked language model to deeply integrate knowledge into BERT-style models through pre-training tasks. Experiments show that SentiLARE outperforms state-of-the-art language representation models on various sentiment analysis tasks, and thus facilitates sentiment understanding.

A.1 Hyper-parameter Setting
We provide the hyper-parameter search space during pre-training in Table 12. Grid search was used to select hyper-parameters, and the selection criterion was the classification accuracy on the validation set when fine-tuning the pre-trained model on SST. We also provide the detailed settings of hyper-parameters during fine-tuning on the datasets of sentiment analysis, including the hyper-parameter search space in Table 13 and the best assignments in Table 14. Note that we used HuggingFace's Transformers (https://github.com/huggingface/transformers) to implement our model, so all the hyper-parameters we report are consistent with the codes of HuggingFace's Transformers. We utilized manual search to select the best hyper-parameters during fine-tuning. The number of hyper-parameter search trials for each dataset was 20. We used accuracy as the selection criterion on all the sentiment analysis tasks except aspect term extraction and aspect category detection; for these two tasks, F1 was adopted as the selection criterion.

A.2 Results on Validation Sets
In addition to the performance on the test set of each dataset which has been reported in the main paper, we also provided the validation performance on the datasets of sentence-level and aspect-level sentiment analysis in Table 15. As mentioned in Appendix A.1, accuracy and F1 were used to select the best hyper-parameters, so we reported the validation performance of all the pre-trained models on these metrics.

A.3 Runtime
The runtime of fine-tuning on different datasets of sentiment analysis was reported in Table 16. We tested all the pre-trained models on 4 NVIDIA RTX 2080 Ti GPUs.

B Case Study
To intuitively show that SentiLARE can integrate linguistic knowledge into pre-trained models to promote sentiment analysis, we provide a case and visualize the classification probability of all the prefix subsequences truncated at each position in the figure. Compared with RoBERTa, our model, enhanced with word-level linguistic knowledge, successfully captures the sentiment shift caused by the word change in this sentence, thereby determining the correct sentence-level sentiment label.

C Analysis on Textual Similarity in Knowledge Acquisition

Since Sentence-BERT is costly when calculating the textual similarity between contexts and glosses, we compared it with a lighter textual similarity algorithm (Basile et al., 2014), which computes the representation vector of a sentence by averaging the embedding vectors of its constituent words. We used 300-dimensional GloVe vectors as the word vectors to obtain the textual similarity.

Results in Table 17 show that Sentence-BERT performs better on all the tasks. Nevertheless, static word vectors are more computationally efficient for linguistic knowledge acquisition, with an acceptable performance drop.
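The lighter baseline can be sketched with toy 3-dimensional vectors standing in for 300-dimensional GloVe embeddings; the helper names are illustrative.

```python
import math

# Average-embedding similarity baseline: represent context and gloss by the
# mean of their word vectors and compare them with cosine similarity.

def avg_vector(words, emb):
    dim = len(next(iter(emb.values())))
    vecs = [emb[w] for w in words if w in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# toy embeddings standing in for GloVe vectors
emb = {"good": [1.0, 0.0, 0.0], "nice": [0.9, 0.1, 0.0], "bad": [0.0, 1.0, 0.0]}
sim = cosine(avg_vector(["good"], emb), avg_vector(["nice"], emb))
```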