SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics

We propose SentiBERT, a variant of BERT that effectively captures compositional sentiment semantics. The model combines contextualized representations with binary constituency parse trees to capture semantic composition. Comprehensive experiments demonstrate that SentiBERT achieves competitive performance on phrase-level sentiment classification. We further demonstrate that the sentiment composition learned from the phrase-level annotations on SST can be transferred to other sentiment analysis tasks as well as related tasks, such as emotion classification. Moreover, we conduct ablation studies and design visualization methods to understand SentiBERT. We show that SentiBERT is better than baseline approaches at capturing negation and contrastive relations and at modeling compositional sentiment semantics.


Introduction
Sentiment analysis is an important language processing task (Pang et al., 2002, 2008; Liu, 2012). One of the key challenges in sentiment analysis is to model compositional sentiment semantics. Take the sentence "Frenetic but not really funny." in Figure 1 as an example. The two parts of the sentence are connected by "but", which reveals the change of sentiment. Besides, the word "not" changes the sentiment of "really funny". These types of negation and contrast are often difficult to handle when the sentences are complex (Socher et al., 2013; Tay et al., 2018; Xu et al., 2019).
In general, the sentiment of an expression is determined by the meaning of its tokens and phrases and the way they are syntactically combined. Prior studies explicitly model compositional sentiment semantics over constituency structure with recursive neural networks (Socher et al., 2012, 2013). However, these models generate the representation of a parent node by aggregating local information from its child nodes, and thus overlook the rich associations in context.
Figure 1: Illustration of the challenges of learning sentiment semantic compositionality. The blue nodes represent token nodes. The colors of phrase nodes in the binary constituency tree represent the sentiment of phrases. The red boxes show that the sentiment changes from the child node to the parent node due to negation and contrast.
In this paper, we propose SentiBERT, which combines recently developed contextualized representation models (Devlin et al., 2019) with the recursive constituency tree structure to better capture compositional sentiment semantics. Specifically, we build a simple yet effective attention network for composing sentiment semantics on top of BERT (Devlin et al., 2019). During training, we follow BERT and capture contextual information through masked language modeling. In addition, we instruct the model to learn the composition of meaning by predicting the sentiment labels of the phrase nodes.
Results on phrase-level sentiment classification on the Stanford Sentiment Treebank (SST) (Socher et al., 2013) indicate that SentiBERT improves significantly over recursive networks and the baseline BERT model. As phrase-level sentiment labels are expensive to obtain, we further explore whether the compositional sentiment semantics learned from one task can be transferred to others. In particular, we find that SentiBERT trained on SST transfers well to other related tasks such as Twitter sentiment analysis (Rosenthal et al., 2017), emotion intensity classification (Mohammad et al., 2018) and contextual emotion detection (Chatterjee et al., 2019). Furthermore, we conduct comprehensive quantitative and qualitative analyses to evaluate the effectiveness of SentiBERT under various situations and to demonstrate the semantic compositionality captured by the model. The source code is available at https://github.com/WadeYin9712/SentiBERT.
Figure 2: Module I is the BERT encoder; Module II denotes the semantic composition module based on an attention mechanism; Module III is a predictor for phrase-level sentiment. The semantic composition module is a two-layer attention-based network (see Section 3.1). The first layer (Attention to Tokens) generates a representation for each phrase based on the tokens it covers, and the second layer (Attention to Children) refines the phrase representation obtained from the first layer based on its children.

Related Work
Sentiment Analysis: Various approaches have been applied to build a sentiment classifier, including feature-based methods (Hu and Liu, 2004; Pang and Lee, 2004), recursive neural networks (Socher et al., 2012, 2013; Tai et al., 2015), convolutional neural networks (Kim, 2014) and recurrent neural networks (Liu et al., 2015). Recently, pretrained language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and SentiLR (Ke et al., 2019) have achieved high performance in sentiment analysis by constructing contextualized representations. Inspired by these prior studies, we design a transformer-based neural network model that captures compositional sentiment semantics by leveraging binary constituency parse trees.
Semantic composition (Pelletier, 1994) has been widely studied in the NLP literature. For example, Mitchell and Lapata (2008) introduce operations such as addition or element-wise product to model compositional semantics. The idea of modeling semantic composition has been applied to various areas such as sentiment analysis (Socher et al., 2013; Zhu et al., 2016), semantic relatedness (Marelli et al., 2014) and capturing sememe knowledge (Qi et al., 2019). In this paper, we demonstrate that syntactic structure can be combined with contextualized representations so that semantic compositionality is better captured. Our approach resembles a few recent attempts (Harer et al., 2019; Wang et al., 2019) to integrate tree structures into self-attention. However, our design is specific to semantic composition in sentiment analysis.

Model
We introduce SentiBERT, a model that captures compositional sentiment semantics based on the constituency structures of sentences. SentiBERT consists of three modules: 1) BERT; 2) a semantic composition module based on an attention network; 3) phrase and sentence sentiment predictors. The three modules are illustrated in Figure 2 and we provide an overview below.
BERT: We incorporate BERT (Devlin et al., 2019) as the backbone to generate contextualized representations of the input sentence.
Semantic Composition Module: This module aims to obtain effective phrase representations guided by the contextualized representations and the constituency parse tree. To refine each phrase representation based on structural information and its constituents, we design a two-level attention mechanism: 1) Attention to Tokens and 2) Attention to Children.
Phrase Node Prediction: SentiBERT is supervised by phrase-level sentiment labels. We use cross-entropy as the loss function for learning the sentiment predictor.

Attention Networks for Sentiment Semantic Composition
In this section, we describe the attention networks for sentiment semantic composition in detail. We first introduce the notation. $s = [w_1, w_2, \ldots, w_n]$ denotes a sentence consisting of $n$ words. $phr = [phr_1, phr_2, \ldots, phr_m]$ denotes the phrases on the binary constituency tree of sentence $s$. $h = [h_1, h_2, \ldots, h_n]$ is the contextualized representation of the tokens after forwarding to a fully-connected layer.
Attention to Tokens: Given the contextualized representations of the tokens covered by a phrase, we first generate the phrase representation $v_i$ for phrase $i$ with an attention network, given in Eq. (1).
In Eq. (1), we first treat the averaged representation of the tokens covered by the phrase, $q_i$, as the query, and then allocate attention weights according to its correlation with each token. $a_j$ represents the weight assigned to the $j$-th token. We concatenate the weighted sum $o_i$ and $q_i$ and feed the result to feed-forward networks.
Lastly, we obtain the initial representation for the phrase, $v_i \in \mathbb{R}^d$, based on the representations of its constituent tokens. The detailed computation of the attention mechanism is given in Appendix A.1.
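As a rough illustration of the Attention to Tokens step, the following minimal sketch computes the query, the attention weights and the weighted sum for a single phrase. It rests on several assumptions that are not taken from the paper: the correlation is a plain dot product (a stand-in for the function of Appendix A.1), the weights are normalized with a softmax, and the final feed-forward networks that turn the concatenation of $o_i$ and $q_i$ into $v_i$ are omitted. The names `dot`, `softmax` and `attention_to_tokens` are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product, used here as an assumed stand-in for the correlation function."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_to_tokens(token_vecs, correlation=dot):
    """Return (q, weights, o) for one phrase, given the vectors of the tokens it covers."""
    n = len(token_vecs)
    d = len(token_vecs[0])
    # Query q: average of the token representations covered by the phrase.
    q = [sum(h[k] for h in token_vecs) / n for k in range(d)]
    # Weights a_j: normalized correlation between the query and each token.
    weights = softmax([correlation(q, h) for h in token_vecs])
    # Weighted sum o of the token representations.
    o = [sum(a * h[k] for a, h in zip(weights, token_vecs)) for k in range(d)]
    return q, weights, o


# Toy usage with two 2-dimensional token vectors.
q, weights, o = attention_to_tokens([[2.0, 0.0], [0.0, 0.0]])
print(q, weights, o)
```

In this toy input the first token is closer to the averaged query, so it receives the larger weight; in the full model, the concatenation of the weighted sum and the query would then pass through the feed-forward networks.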
Attention to Children: Furthermore, we refine the phrase representations in the second layer based on the constituency parse tree and the representations obtained in the first layer. To aggregate information along the hierarchical structure, we develop the following network. For each phrase, the attention network computes the correlation with its children in the binary constituency parse tree and with itself.
Assume that the indices of the child nodes of the $i$-th phrase are $lson$ and $rson$. The representations of the phrase and its children generated by the first layer are $v_i$, $v_{lson}$ and $v_{rson}$, respectively. We generate the attention weights $r_{lson}$, $r_{rson}$ and $r_i$ over the left child, the right child and the $i$-th phrase itself from these correlations, as given in Eq. (2). The refined representation of phrase $i$ is then computed as the weighted sum $f_i$ of the three representations under these weights. Finally, we concatenate $f_i$ and $v_i$ and feed the result to feed-forward networks with SeLU (Klambauer et al., 2017) and GeLU (Hendrycks and Gimpel, 2017) activations and layer normalization (Ba et al., 2016), similar to Joshi et al. (2020), to generate the final phrase representation $p_i \in \mathbb{R}^d$. Note that when a child of the $i$-th phrase is a token node, the attention mechanism attends to the representations of all the subtokens that the token node covers.
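The Attention to Children step can be sketched in the same spirit. This is an illustration under assumptions, not the released implementation: the correlation is again a plain dot product, the three scores are normalized with a softmax, and the final feed-forward networks with activations and layer normalization are left out. The function name `attention_to_children` is hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product, used here as an assumed stand-in for the correlation function."""
    return sum(x * y for x, y in zip(a, b))


def attention_to_children(v_i, v_lson, v_rson, correlation=dot):
    """Return (weights, f_i) for one phrase.

    weights are ordered as (r_lson, r_rson, r_i); f_i is the weighted sum of
    the left child, the right child and the phrase itself.
    """
    candidates = [v_lson, v_rson, v_i]
    # Correlation of the phrase with each child and with itself.
    scores = [correlation(v_i, c) for c in candidates]
    # Softmax normalization (an assumption of this sketch).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum f_i over the three representations.
    f_i = [sum(r * c[k] for r, c in zip(weights, candidates)) for k in range(len(v_i))]
    return weights, f_i


# Toy usage: the left child matches the phrase, the right child is orthogonal to it.
weights, f_i = attention_to_children([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(weights, f_i)
```

Including the phrase itself among the attended candidates lets the refined representation fall back on the first-layer representation when neither child is informative.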

Training Objective of SentiBERT
Inspired by BERT, the training objective of SentiBERT consists of two parts: 1) Masked Language Modeling. Some tokens are masked and the model learns to predict them. This objective allows the model to capture contextual information as in the original BERT model. 2) Phrase Node Prediction. We further train the model to predict the phrase-level sentiment labels based on the aforementioned phrase representations. This allows SentiBERT to learn to capture compositional sentiment semantics. Similar to BERT, in the transfer learning setting, the pre-trained SentiBERT model can be used to initialize the parameters of a downstream model.

Experiments
We evaluate SentiBERT on the SST dataset. We then evaluate SentiBERT in a transfer learning setting and demonstrate that the compositional sentiment semantics learned on SST can be transferred to other related tasks.

Experimental Settings
We evaluate how effectively SentiBERT captures compositional sentiment semantics on the SST dataset (Socher et al., 2013).
The SST dataset has several variants.
• SST-phrase is a 5-class classification task that requires predicting the sentiment of all phrases on a binary constituency tree. Different from Socher et al. (2013), we test the model only on phrases (non-terminal constituents) and ignore its performance on tokens.
• SST-5 is a 5-class sentiment classification task that aims at predicting the sentiment of a sentence. We use it to test whether SentiBERT learns a better sentence representation by capturing compositional sentiment semantics.
• Similar to SST-5, SST-2 and SST-3 are 2-class and 3-class sentiment classification tasks. However, the granularity of the sentiment classes is different.
We build SentiBERT on the HuggingFace library and initialize the model parameters using the pre-trained BERT-base and RoBERTa-base models, with a maximum sequence length of 128, 12 layers and an embedding dimension of 768. For training on SST-phrase, the learning rate is $2 \times 10^{-5}$, the batch size is 32 and the number of training epochs is 3. For the masking mechanism, to put emphasis on modeling sentiment, the probability of masking opinion words, which can be retrieved from SentiWordNet (Baccianella et al., 2010), is set to 20%, and the probability for all other words is 15%. For fine-tuning on downstream tasks, the learning rate is in the range $1 \times 10^{-5}$ to $1 \times 10^{-4}$, the batch size is in {16, 32} and the number of training epochs is 1 to 5. We use the Stanford CoreNLP API (Manning et al., 2014) to obtain binary constituency trees for the sentences of these tasks, to stay consistent with the settings on SST-phrase. Note that when fine-tuning on sentence-level sentiment and emotion classification tasks, the objective is to correctly label the root of the tree, instead of targeting the [CLS] token representation as in the original BERT.
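The sentiment-aware masking rates described above can be sketched as follows. This is a minimal illustration, assuming that opinion words are available as a plain set of lowercase strings (a hypothetical stand-in for a SentiWordNet lookup) and that masking decisions are made independently per token; the function name `choose_mask_positions` and its parameters are not from the paper.

```python
import random


def choose_mask_positions(tokens, opinion_words, rng, p_opinion=0.20, p_other=0.15):
    """Return the indices of tokens selected for masking.

    Opinion words are masked with probability p_opinion, all other tokens
    with probability p_other.
    """
    positions = []
    for idx, tok in enumerate(tokens):
        p = p_opinion if tok.lower() in opinion_words else p_other
        if rng.random() < p:
            positions.append(idx)
    return positions


# Toy usage; the opinion-word set below is illustrative only.
rng = random.Random(0)
tokens = ["Frenetic", "but", "not", "really", "funny"]
print(choose_mask_positions(tokens, {"frenetic", "funny"}, rng))
```

Raising the masking rate only for opinion words keeps the overall objective close to standard masked language modeling while making the model predict sentiment-bearing tokens more often.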

Effectiveness of SentiBERT
We first compare the proposed attention networks (SentiBERT w/o BERT) with the following baseline models trained on the SST-phrase corpus to evaluate the effectiveness of the architecture design: 1) Recursive NN (Socher et al., 2013); 2) GCN (Kipf and Welling, 2017); 3) Tree-LSTM (Tai et al., 2015); 4) BiLSTM (Hochreiter and Schmidhuber, 1997) w/ Tree-LSTM. To further understand the effect of using contextualized representations, we compare SentiBERT with the vanilla pre-trained BERT and with variants that combine the four mentioned baselines with BERT. The training settings remain the same as for SentiBERT. We also initialize SentiBERT with the pre-trained parameters of RoBERTa (SentiBERT w/ RoBERTa) and compare it with its variants.
As shown in Table 1, SentiBERT and SentiBERT w/ RoBERTa substantially outperform their corresponding variants and the networks built merely on the tree. Specifically, we first observe that although our attention network (SentiBERT w/o BERT) is simple, it is competitive with Recursive NN, GCN and Tree-LSTM. Besides, SentiBERT largely outperforms SentiBERT w/o BERT by leveraging contextualized representations. Moreover, the results show that SentiBERT and SentiBERT w/ RoBERTa outperform BERT and RoBERTa, indicating the importance of incorporating syntactic guidance.

Transferability of SentiBERT
Though the designed models are effective, we are curious how well the compositional sentiment semantics learned on SST transfer to other tasks. We compare SentiBERT with the published models BERT, XLNet, RoBERTa and their variants on the benchmarks mentioned in Section 4.1. Specifically, 'BERT' indicates the model trained on the raw texts of the SST dataset. 'BERT w/ Mean pooling' denotes the model trained on SST whose phrase and sentence representations are computed by mean pooling over tokens. 'BERT w/ Mean pooling' leverages only the phrases' range information rather than syntactic structural information.

Sentiment Classification Tasks
The evaluation results of sentence-level sentiment classification on the three tasks are shown in Table 2. Despite the differences among tasks and datasets, we find that SentiBERT has competitive performance compared with various baselines. SentiBERT achieves higher performance than the vanilla BERT and XLNet on tasks such as SST-3 and Twitter Sentiment Analysis. Besides, SentiBERT significantly outperforms ...

Emotion Classification Tasks
Emotion detection is different from sentiment classification, but the two tasks are related. Emotion detection aims to classify fine-grained emotions, such as happiness, fear, anger and sadness. It is challenging compared to sentiment analysis because of the variety of emotion types. We fine-tune SentiBERT and SentiBERT w/ RoBERTa on Emotion Intensity Classification and EmoContext. Table 3 shows the results on the two emotion classification tasks. Similar to the results on sentiment classification tasks, SentiBERT obtains the best results, further supporting the transferability of SentiBERT.

Analysis
We conduct experiments on SST-phrase using the BERT-base model as backbone to demonstrate the effectiveness and interpretability of the SentiBERT architecture in terms of semantic compositionality. We also explore the potential of the model when phrase-level sentiment information is lacking. To simplify the analysis of changes in sentiment polarity, we convert the 5-class labels to 3 classes: the classes 'very negative' and 'negative' are converted to 'negative'; the classes 'very positive' and 'positive' are converted to 'positive'; the class 'neutral' remains the same. The details of the statistical distribution in this part are shown in Appendix A.3. We consider the following baselines to evaluate the effectiveness of each component in SentiBERT. First, we design BERT w/ Mean pooling as a base model, to demonstrate the necessity of incorporating syntactic guidance and implementing aggregation on it. Then we compare SentiBERT with alternative aggregation approaches: Tree-LSTM, GCN and w/o Attention to Children.
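The label conversion used in this analysis is a fixed mapping, which could be written as below. The string label names follow the class names in the text; the dictionary and function names are hypothetical.

```python
# Mapping from the 5-class SST labels to the 3-class labels used in the analysis.
FIVE_TO_THREE = {
    "very negative": "negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "very positive": "positive",
}


def to_three_class(label):
    """Collapse a 5-class sentiment label into negative / neutral / positive."""
    return FIVE_TO_THREE[label]


print([to_three_class(l) for l in ["very negative", "neutral", "very positive"]])
```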

Semantic Compositionality
We investigate how effectively SentiBERT captures compositional sentiment semantics. We focus on how the representation in SentiBERT captures the sentiments when the children and parent in the constituency tree have different sentiments (i.e., a sentiment switch), as shown in the red boxes of Figure 1. Here we focus on the sentiment switches between phrases. We assume that the more sentiment switches there are, the harder the prediction is. We analyze the model under two scenarios: local difficulty and global difficulty. Local difficulty is defined as the number of sentiment switches between a phrase and its children. As we consider binary constituency trees, the maximum number of sentiment switches for each phrase is 2. Global difficulty is the number of sentiment switches in the entire constituency tree. The maximum number of sentiment switches in the test set is 23. The former is a phrase-level analysis and the latter is a sentence-level analysis.
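The two difficulty measures can be computed with a short tree traversal, sketched below. The `Node` class and both function names are hypothetical, and the sketch assumes that only switches between a phrase and its phrase children are counted (token children are skipped), which is one reading of "sentiment switches between phrases". The labels in the toy tree are illustrative and not taken from SST.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                      # sentiment label of this node
    left: Optional["Node"] = None   # None for token (leaf) nodes
    right: Optional["Node"] = None

    def is_phrase(self) -> bool:
        return self.left is not None and self.right is not None


def local_difficulty(node):
    """Number of phrase children whose sentiment differs from this phrase (0, 1 or 2)."""
    return sum(
        1
        for child in (node.left, node.right)
        if child is not None and child.is_phrase() and child.label != node.label
    )


def global_difficulty(root):
    """Total number of sentiment switches over all phrases in the tree."""
    if not root.is_phrase():
        return 0
    return local_difficulty(root) + global_difficulty(root.left) + global_difficulty(root.right)


# Toy tree for "Frenetic but not really funny" with illustrative labels.
really_funny = Node("positive", Node("neutral"), Node("positive"))
not_really_funny = Node("negative", Node("neutral"), really_funny)
but_clause = Node("negative", Node("neutral"), not_really_funny)
root = Node("negative", Node("positive"), but_clause)

print(local_difficulty(not_really_funny), global_difficulty(root))
```

In this toy tree the only switch between phrases is the one introduced by "not", so the local difficulty of "not really funny" and the global difficulty of the sentence both equal 1.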
We compare SentiBERT with the aforementioned baselines. We group all the nodes and sentences in the test set by local and global difficulty. Results are shown in Figure 3 and Figure 4. Our model achieves better performance than the baselines in all situations. Also, we find that as the difficulty increases, the gap between our model and the baselines becomes larger. In particular, when the sentiment labels of both children differ from that of the parent node (i.e., local difficulty is 2), the performance gap between SentiBERT and BERT w/ Tree-LSTM is about 7% accuracy. SentiBERT also outperforms the baseline BERT model with mean pooling by 15%. This validates the necessity of structural information for semantic composition and the effectiveness of our attention networks in leveraging the hierarchical structures.

Negation and Contrastive Relation
Next, we investigate how SentiBERT deals with negations and contrastive relation.
Negation: Since negation words such as 'no', 'n't' and 'not' cause sentiment switches, the number of negation words also reflects the difficulty of understanding a sentence and its constituents. We first group the sentences by the number of negation words, and then calculate the accuracy of the predictions on their constituents for each group. In the test set, as there are at most six negation words and the number of sentences with more than three negation words is small, we separate all the data into three groups.
Results are provided in Figure 5. We observe that SentiBERT performs the best among all the models. Similar to the trend in the local and global difficulty experiments, the gap between SentiBERT and the baselines becomes larger as the number of negation words increases. The results show the ability of SentiBERT to deal with negations.
Table 4: Evaluation for contrastive relation (%). We show the accuracy for triplets ('X but Y', 'X', 'Y'). X and Y must be phrases in our experiments.
Contrastive Relation: We evaluate the effectiveness of SentiBERT in handling contrastive relations. Here, we focus on the contrastive conjunction 'but'. We select the sentences containing the word 'but' in which the sentiments of the left and right parts differ. In our analysis, an 'X but Y' instance is counted as correct if and only if the sentiments of all the phrases in the triplet ('X but Y', 'X' and 'Y') are predicted correctly. Table 4 shows the results. SentiBERT outperforms the other variants of BERT by about 1%, demonstrating its ability to capture contrastive relations in sentences.
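The all-or-nothing scoring of triplets can be made concrete with a small sketch. The data layout (a list of gold and predicted label triples, ordered as 'X but Y', 'X', 'Y') and the function name `triplet_accuracy` are assumptions for illustration; the labels in the usage example are made up.

```python
def triplet_accuracy(examples):
    """Fraction of triplets for which all three labels are predicted correctly.

    Each example is a pair (gold, pred), where gold and pred are 3-tuples of
    labels for ('X but Y', 'X', 'Y').
    """
    if not examples:
        raise ValueError("at least one example is required")
    correct = sum(1 for gold, pred in examples if tuple(gold) == tuple(pred))
    return correct / len(examples)


# Toy usage: the second triplet gets 'X' wrong, so it does not count as correct.
examples = [
    (("negative", "positive", "negative"), ("negative", "positive", "negative")),
    (("positive", "negative", "positive"), ("positive", "neutral", "positive")),
]
print(triplet_accuracy(examples))
```

Because a single wrong phrase invalidates the whole triplet, this metric is stricter than per-phrase accuracy.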

Case Study
We showcase several examples to demonstrate how SentiBERT performs sentiment semantic composition. We observe the attention distribution among hierarchical structures. In Figure 7, we demonstrate two sentences of which the sentiments of all the phrases are predicted correctly. We also visualize the attention weights distributed to the child nodes and the phrases themselves to see which part might contribute more to the sentiment of those phrases.
SentiBERT performs well in several aspects. First, SentiBERT tends to attend to adjectives such as 'frenetic' and 'funny', which contribute to the phrases' sentiment. Second, when facing negation words, SentiBERT takes them into account, and a switch can be observed between the phrases with and without the negation word (e.g., 'not really funny' and 'really funny'). Moreover, SentiBERT can correctly analyze sentences that express different sentiments in different parts. For the first case, the model concentrates more on the part after 'but'.

Amount of Phrase-level Supervision
We are also interested in how much phrase-level supervision SentiBERT needs in order to capture semantic compositionality. We vary the amount of phrase-level annotations used in training SentiBERT. Before training, we randomly sample 0% to 100% of the labels from the SST training set, in steps of 10%. After pre-training on them, we fine-tune SentiBERT on SST-5, SST-3 and Twitter Sentiment Analysis. During fine-tuning, for the tasks that have phrase-level annotation, such as SST-5 and SST-3, we use the same phrase-level annotation as in pre-training together with the sentence-level annotation; for the tasks that do not have phrase-level annotation, we use only the sentence-level annotation. Results in Figure 6 show that with about 30%-50% of the phrase labels on SST-5 and SST-3, the model achieves results competitive with XLNet. On the Twitter Sentiment Analysis dataset, which provides no phrase-level supervision during fine-tuning, using 70%-80% of the phrase labels in pre-training makes SentiBERT competitive with XLNet.
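The label subsampling in this experiment could be set up along the following lines. This is only a sketch: the function name `sample_phrase_labels`, the use of integer phrase identifiers and the fixed seed are assumptions, and the paper does not specify how the random subset is drawn.

```python
import random


def sample_phrase_labels(phrase_ids, fraction, seed=0):
    """Return a random subset of phrase ids whose labels are kept for training."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    k = round(len(phrase_ids) * fraction)
    return set(rng.sample(list(phrase_ids), k))


# Toy usage: sweep from 0% to 100% in steps of 10% over ten phrase ids.
phrase_ids = list(range(10))
for step in range(11):
    kept = sample_phrase_labels(phrase_ids, step / 10)
    print(step * 10, len(kept))
```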
Furthermore, we find that the prediction confidence for about 40-50% of the phrase nodes in the SST-3 task is above 0.9, and the accuracy on these phrases is above 90% on the SST dataset. Considering the previous results, we speculate that if we predict phrase labels on generic texts, keep the predicted labels with high confidence and add them to the original SST training set during training, the results might be further improved.
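The speculated selection of high-confidence predictions amounts to a simple filter, sketched below. The threshold of 0.9 mirrors the confidence level mentioned above; the function name `select_confident`, the tuple layout and the example phrases are hypothetical, and this procedure is not evaluated in the paper.

```python
def select_confident(predictions, threshold=0.9):
    """Keep (phrase, label) pairs whose predicted confidence exceeds the threshold.

    predictions is a list of (phrase, label, confidence) tuples.
    """
    return [(phrase, label) for phrase, label, conf in predictions if conf > threshold]


# Toy usage with made-up predictions.
predictions = [
    ("really funny", "positive", 0.97),
    ("frenetic but", "neutral", 0.55),
    ("not really funny", "negative", 0.93),
]
print(select_confident(predictions))
```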

Conclusion
We proposed SentiBERT, an architecture designed to better capture compositional sentiment semantics. SentiBERT combines contextual information with explicit syntactic guidance for modeling semantic composition. Experiments show the effectiveness and transferability of SentiBERT. Further analysis demonstrates its interpretability and its potential with less supervision. For future work, we will extend SentiBERT to other applications involving phrase-level annotations.

A.1 Details of Correlation Computation in Attention Networks
For vectors a and b, the correlation between them is computed by a scoring function that uses SeLU (Klambauer et al., 2017) as its activation function and a constant α equal to 4. The two layers of attention networks do not share parameters.

A.2 Details of Downstream Tasks
We adopt the following tasks for the evaluation of sentence-level sentiment classification:
SST-2,3 (Socher et al., 2013): These tasks all share the text of the SST dataset and are single-sentence sentiment classification tasks, where the number in the name indicates the number of classes. Since two of the five classes in SST-5 correspond to positive sentiment and another two to negative sentiment, with the remaining one being neutral, the dataset is separated into three groups in the SST-3 task. We convert the 5-class phrase-level labels in SST-5 into three classes and leverage them in the training of the SST-3 task.
Twitter Sentiment Analysis (Rosenthal et al., 2017): Given a tweet, the model needs to decide which sentiment it expresses: positive, negative or neutral.
Emotion Intensity Classification (Mohammad et al., 2018): The task is, given a tweet and an emotion, to categorize the tweet into one of four classes of intensity that best represents the tweeter's mental state. The metric is the Pearson correlation averaged over the four subtasks, 'happiness', 'sadness', 'anger' and 'fear'.
Emotions in Textual Conversations (Chatterjee et al., 2019): In a dialogue, given a sentence with two turns of conversation, the model needs to classify the emotion expressed in the last sentence. For EmoContext, we follow the standard metrics used in Chatterjee et al. (2019) and use the F1 score on the three classes 'happy', 'sad' and 'angry', excluding the 'others' class, as the evaluation metric.
The statistics of the datasets are shown in Table 5.
Table 9: The results after incorporating token node prediction. 'Token' denotes token node prediction.

A.3 Details of Analysis Part
The distributions of nodes and sentences in terms of local difficulty, global difficulty and negation words are shown in Tables 6, 7 and 8, respectively.

A.4 Incorporating Token Node Prediction
Since the SST dataset also provides token-level sentiment labels, we combine token node prediction with the phrase node prediction objective to model compositional sentiment semantics.
Results are shown in Table 9. We observe that the results drop slightly after additionally incorporating token-level sentiment information. This may be because phrase sentiment is compositional, whereas token sentiment mainly depends on the meaning of the word itself rather than on compositional sentiment semantics. This inconsistency between the training objectives may cause the performance drop.