Understanding Pre-trained BERT for Aspect-based Sentiment Analysis

This paper analyzes the hidden representations that BERT learns when pre-trained on reviews, with respect to tasks in aspect-based sentiment analysis (ABSA). Our work is motivated by the recent progress in BERT-based language models for ABSA. However, it is not clear how the general proxy task of a (masked) language model, trained on an unlabeled corpus without annotations of aspects or opinions, can provide important features for downstream tasks in ABSA. By leveraging the annotated datasets in ABSA, we investigate both the attentions and the learned representations of BERT pre-trained on reviews. We found that BERT uses very few self-attention heads to encode context words (such as prepositions or pronouns that indicate an aspect) and opinion words for an aspect. Most features in the representation of an aspect are dedicated to the fine-grained semantics of the domain (or product category) and the aspect itself, instead of carrying summarized opinions from its context. We hope this investigation can help future research in improving self-supervised learning, unsupervised learning and fine-tuning for ABSA. The pre-trained model and code can be found at https://github.com/howardhsu/BERT-for-RRC-ABSA.


Introduction
As a form of self-supervised learning in NLP, pre-trained language models (LMs) like the masked LM in BERT (Devlin et al., 2019; Lan et al., 2019) yield significant performance gains when later fine-tuned on downstream NLP tasks. Recent studies also showed impressive results on tasks in aspect-based sentiment analysis (ABSA) (Xu et al., 2019; Sun et al., 2019a; Li et al., 2019b; Tian et al., 2020; Karimi et al., 2020), which aims to discover aspects and their associated opinions (Hu and Liu, 2004; Liu, 2012). Although there are existing studies of the hidden representations and attentions of LMs for tasks such as parsing and co-reference resolution (Adi et al., 2016; Belinkov et al., 2017; Clark et al., 2019), it is unclear how LMs capture aspects and sentiment/opinion from large-scale unlabeled texts.
This paper attempts to investigate and understand the inner workings of the pretext task of the masked language model (MLM) in the transformer and its connections with tasks in ABSA. This may benefit the following problems: (1) improving fine-tuning for ABSA, given a better understanding of the gap between pretext tasks and fine-tuning tasks; (2) more importantly, self-supervised (or unsupervised) ABSA without fine-tuning, which saves the expensive effort of annotating ABSA datasets (for a new domain).
We are particularly interested in fine-grained token-level features that are typically required by ABSA and how MLM as a general task can cover them during pre-training. Typical tasks of ABSA are: aspect extraction (AE), aspect sentiment classification (ASC) (Hu and Liu, 2004; Dong et al., 2014; Nguyen and Shirai, 2015; Li et al., 2018; Tang et al., 2016; Wang et al., 2016a; Wang et al., 2016b; Ma et al., 2017; Chen et al., 2017; Tay et al., 2018; He et al., 2018; Liu et al., 2018) and end-to-end ABSA (E2E-ABSA) (Li et al., 2019a; Li et al., 2019b). AE aims to extract aspects (e.g., "battery" in the laptop domain), ASC identifies the polarity for a given aspect (e.g., positive about battery) and E2E-ABSA is a combination of AE and ASC that detects aspects and their associated polarities simultaneously. Existing studies show that tasks in ABSA require an understanding of the interactions between aspects (e.g., "screen" in the laptop domain) and their contexts, including sentiment (e.g., "clear") (Qiu et al., 2011). As such, we believe that how the hidden representation of an aspect encodes features about being an aspect and summarizes opinions of that aspect is crucial for ABSA. This paper represents a new addition to the existing analysis of BERT in (Clark et al., 2019), which focuses on studying the behavior of BERT's hidden representation and self-attentions for general purposes. We focus on how the self-supervised training of BERT prepares fine-grained features that are important for ABSA. We leverage the annotated data of ABSA in our analysis to relate pre-trained features to labels for ABSA. Note that we do not use any of these annotations for fine-tuning or training.
Also, we do not study LMs that carry extra human supervision for sentiment analysis (such as using opinion lexicons as in (Ke et al., 2019; Tian et al., 2020)) because we are interested in the degree to which a general self-supervised task such as MLM can cover specific tasks in ABSA. Unlike (Clark et al., 2019; Sun et al., 2019b; Wan et al., 2020), we focus on the masked language model (MLM) for fine-grained token-level features instead of next sentence prediction (NSP), since the latter is not widely adopted for pre-trained models.
Our main finding is that BERT (pre-trained on reviews) encodes rich semantic knowledge about the domain and the aspect itself into the hidden representations of an aspect but uses almost no dedicated features for opinions. Inside BERT, very few self-attention heads are learned for encoding salient context words for finding an aspect or summarizing opinion words for an aspect. This suggests the pros and cons of MLM. For example, predicting a masked aspect word to learn features for an aspect is a weak self-supervised learning task for ABSA. This leads to future directions on alternative self-supervised learning tasks for ABSA.

Pre-trained LMs and Datasets
We aim to simplify our analysis by using the same latent space for multiple domains of ABSA. As such, we pre-train BERT on reviews with a large coverage of domains (product categories). The training corpus is a combination of Amazon reviews (He and McAuley, 2016) and Yelp review datasets, which gives us a review corpus of 20+ GB in size. Starting from BERT base, we further pre-train on this corpus for 4 epochs. We train it on a single TITAN RTX GPU for 10 days.
To understand the hidden representation of an aspect and its formation from self-attention, we leverage the popular SemEval-2014 Task 4 and SemEval-2016 Task 5 datasets in ABSA, which carry annotations about aspects and their associated opinions. These benchmark datasets cover the domains of Laptop and Restaurant. We sample 150 examples from each domain as the validation data for analysis. Note that we do not use any annotated data for fine-tuning or training because we are interested in how relevant the features from pre-trained BERT are to ABSA.

Analysis
To see the inner workings of masked language modeling (MLM) on reviews, we first review MLM and transformers. Then we perform two types of evaluations: self-attention of aspects and hidden representations on aspects.
Transformer (Vaswani et al., 2017) is a neural architecture that relies purely on multi-head self-attention to learn hidden representations of input texts. Unlike the classic LSTM or CNN, the connections in self-attention can be viewed as a fully-connected graph on nodes of tokens, without strong inductive bias on contexts or prior knowledge of relative positions. The transformer therefore uses positional embeddings to encode tokens at different positions. Further, multi-head attention can be viewed as typed relations that model various kinds of relations among tokens.
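To make the fully-connected view concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration under stated assumptions, not the BERT implementation: the queries, keys and values are taken to be the input vectors themselves (identity projections), and the toy 2-dimensional token vectors are made up. It shows that every token attends to every token, that each row of weights sums to 1, and that reordering the input merely reorders the output, which is why positional information has to be added to the input.

```python
import math

def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the input vectors themselves
    (identity projections); a real transformer learns these projections.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Every token scores every token: a fully-connected graph.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted average of all token vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, vectors))
                        for c in range(dim)])
    return weights, outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical toy token vectors
weights, outputs = self_attention(tokens)

# Reorder the tokens: the outputs are reordered in the same way, so the
# layer itself carries no notion of position.
shuffled = [tokens[2], tokens[0], tokens[1]]
_, shuffled_outputs = self_attention(shuffled)
```

Here `shuffled_outputs[0]` matches `outputs[2]` up to floating-point rounding, since the same token sees the same set of neighbours regardless of order.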
MLMs aim to recover texts corrupted with masked tokens. For example, from "The [MASK] is clear." in the laptop domain, one can easily guess that [MASK] is probably "screen". As a result, the transformer model must use self-attention to infer the hidden representation of "screen" from "The", "is" and "clear". Note that the first embedding layer of BERT and the MLM prediction heads are both context-independent because they just contain word embeddings, whereas the other internal layers in between are context-dependent. As such, how BERT encodes such contexts into the representation of an aspect is important for ABSA.
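The corruption step in the example above can be sketched as follows. This is a simplified illustration: the positions to mask are passed in explicitly, whereas BERT's actual pre-training selects them at random, and the helper name `mask_tokens` is hypothetical.

```python
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Return (corrupted tokens, labels) for the given positions to mask.

    The labels map each masked position to the original token that the
    model is trained to recover from the remaining context.
    """
    chosen = set(positions)
    corrupted = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(chosen)}
    return corrupted, labels

sentence = ["The", "screen", "is", "clear", "."]
corrupted, labels = mask_tokens(sentence, positions=[1])
print(" ".join(corrupted))  # The [MASK] is clear .
print(labels)               # {1: 'screen'}
```

The only training signal here is the identity of the masked word, so nothing in the objective directly asks the representation at that position to record the polarity of "clear".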

Self-Attentions of Aspects
Intuitively, self-attention can serve as a way to aggregate the representations of contextual tokens into an aspect. Following (Clark et al., 2019), we notice some attention heads exhibit general patterns such as (a) no-op on [CLS] or [SEP], (b) offsets on previous/next tokens and (c) broadcasting over the whole sentence, as in Figure 1. For pattern (a), the transformer has to take no-ops on redundant tokens because the softmax function in self-attention always normalizes to 1, even when no relevant context tokens are present. For pattern (b), self-attention needs recurrent patterns to construct local contexts from nearby tokens. Pattern (c) can be viewed as a global average pooling operation so that each token contains knowledge from the whole sequence. We are also interested in how an aspect interacts with its contexts, including both context words that indicate an aspect (e.g., "of" and "has") and opinion words (e.g., "good"). We search through all 144 heads in BERT base but only find 2 to 4 heads of such types residing in the middle layers. This is because the last layer (the prediction head of MLM) is context-independent and the word embedding of "screen" is expected to have no sentiment. This may hurt the ability of aspect-word representations to carry sentiment.
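The text does not spell out the criterion used to search the 144 heads. As one plausible, purely illustrative scoring, a head can be ranked by how much attention mass aspect tokens place on annotated opinion (or indicator) words. The function name `head_scores`, the nested-list layout `attention[layer][head][query][key]` and the toy weights below are all assumptions made for this sketch.

```python
def head_scores(attention, aspect_positions, target_positions):
    """Rank (layer, head) pairs by attention mass from aspects to targets.

    attention[layer][head][query][key] holds softmax-normalized weights.
    The score of a head is the attention that aspect tokens (queries)
    place on target tokens (keys), averaged over the aspect tokens.
    """
    scores = []
    for layer, heads in enumerate(attention):
        for head, matrix in enumerate(heads):
            mass = [sum(matrix[q][k] for k in target_positions)
                    for q in aspect_positions]
            scores.append((layer, head, sum(mass) / len(mass)))
    return sorted(scores, key=lambda item: item[2], reverse=True)

# Toy sentence: [CLS] screen is clear
# Head 0 mimics pattern (a): every token parks its attention on [CLS].
no_op_head = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]
# Head 1 is a made-up head in which "screen" attends mostly to "clear".
opinion_head = [
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.10, 0.10, 0.70],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
attention = [[no_op_head, opinion_head]]  # one layer, two heads

ranked = head_scores(attention, aspect_positions=[1], target_positions=[3])
print(ranked[0][:2])  # (0, 1)
```

Averaging such scores over many annotated sentences would surface heads that consistently link aspects to their opinion words, while no-op heads score near zero.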

Hidden Representations on Aspects
Next, we analyze aspects in the latent space. First, since tasks in ABSA are strongly domain-dependent, we are interested in how much domain knowledge is carried in aspects. With aspects under t-SNE (Maaten and Hinton, 2008) dimensionality reduction as in Figure 2(a), we notice that BERT exhibits a strong separation of domains in its latent space of aspects. This indicates that BERT devotes most of its feature dimensions to domain differences. We plot the hidden representations of the last layer for tokens about aspects and non-aspects in Figure 2(b). We can see that in the 2D space of t-SNE, aspect and non-aspect tokens occupy slightly different spaces, indicating that a great number of dimensions in the latent space are devoted to features that can be used to separate aspects from their contexts. This correlates well with existing research showing that BERT has impressive results on aspect extraction (Xu et al., 2019). We further investigate whether there exists a single general (cross-domain) neuron that separates aspects from other words. This is important for unsupervised or zero-shot learning for aspect extraction, because one could then easily adapt a pre-trained BERT to a new domain without annotated fine-tuning data. We train a probe with an L1-norm on its weights over the hidden states of aspect and non-aspect words from both domains, as depicted in Figure 4(a). Neuron 180 stands out from the other neurons. To test whether this neuron is salient or general enough, we use this neuron alone as the single feature for aspect word classification (extraction). The F1 score is 31%, whereas using all neurons reaches 79%. This implies that no salient neuron is available for general-purpose aspect extraction.
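The single-neuron test can be illustrated with a small sketch. The text does not say how the one-feature classifier was built; the version below assumes a simple threshold on the activation and reports the best F1 over all thresholds. The activations and labels are made-up toy values, not the paper's data.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive class (1 = aspect token, 0 = other token)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def single_neuron_f1(activations, labels):
    """Best F1 when predicting 'aspect' iff activation >= threshold.

    Assumption: the threshold is picked on the same data it is scored on,
    which gives an optimistic estimate; a held-out split would be stricter.
    """
    best = 0.0
    for threshold in sorted(set(activations)):
        preds = [1 if a >= threshold else 0 for a in activations]
        best = max(best, f1_score(labels, preds))
    return best

# Hypothetical activations of one neuron for six tokens, with gold labels.
activations = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]
labels = [1, 1, 0, 0, 0, 1]
best_f1 = single_neuron_f1(activations, labels)
```

Comparing this one-feature score against a probe over all dimensions is what separates a neuron that is merely the largest weight from one that is sufficient on its own.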
Further, we are interested in the relations between aspects and their associated opinions. Since an aspect may contain multiple words ("customer service") and a word may be tokenized into multiple tokens in BERT ("warrant" and "##y"), we average the representations of tokens belonging to one aspect. From Figure 3, we can see that the dimensions that dominate the hidden space of aspects encode their semantic meanings, so that aspects with similar meanings lie closer together. For example, different kinds of laptop software are close to each other. The differences in opinion do not exhibit significant impacts on the representation of an aspect, indicating that few dimensions are dedicated to opinions.
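The averaging step described above amounts to mean pooling over the token span of an aspect. A minimal sketch, with a hypothetical helper name and toy 2-dimensional hidden states standing in for BERT's 768-dimensional ones:

```python
def pool_aspect(hidden_states, start, end):
    """Average the token vectors in [start, end) into one aspect vector."""
    span = hidden_states[start:end]
    if not span:
        raise ValueError("empty aspect span")
    dim = len(span[0])
    return [sum(vec[c] for vec in span) / len(span) for c in range(dim)]

# Toy hidden states for the tokens "the", "warrant", "##y".
hidden_states = [[0.0, 1.0], [3.0, 4.0], [5.0, 6.0]]
aspect_vector = pool_aspect(hidden_states, start=1, end=3)
print(aspect_vector)  # [4.0, 5.0]
```

The same function covers multi-word aspects such as "customer service" by passing the span that covers all of their sub-word tokens.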
To verify the existence of features contributing to opinions, we train a (logistic regression) probe to classify sentiment (positive or negative) over the frozen latent space of aspects, with an L1-norm on the weights to pick salient features. We obtain an F1-score of 83% on the polarities of aspects in the test set. In Figure 4(b), one can observe that no single neuron stands out as a feature for sentiment classification on aspects, but many features are correlated with the sentiment. This is in contrast with the finding of a sentiment neuron in pre-trained causal LMs (Radford et al., 2017), and we suspect the small hidden space (768) does not allow such a single neuron to exist. This indicates that MLM is a weak task for summarizing and carrying sentiment polarities to aspects (e.g., in E2E-ABSA).
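A logistic regression probe with an L1 penalty can be sketched in plain Python as follows. The optimizer (proximal gradient descent with soft-thresholding), the hyperparameters and the toy features are assumptions made for illustration; the text only states that the probe is a logistic regression with an L1-norm on its weights over frozen representations.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, amount):
    """Shrink a weight toward zero; small weights become exactly zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def train_l1_probe(features, labels, l1=0.1, lr=0.5, steps=500):
    """Fit logistic regression with an L1 penalty on the weights.

    The features stay frozen; only the probe's weights and bias are
    updated. Each step is a gradient step on the log loss followed by
    soft-thresholding, which is what produces sparse weights.
    """
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += error * x[j] / n
            grad_b += error / n
        weights = [soft_threshold(w - lr * g, lr * l1)
                   for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Toy data: dimension 0 tracks the label, dimension 1 is uninformative.
features = [[2.0, 0.1], [1.5, -0.1], [-2.0, 0.1], [-1.5, -0.1]]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
weights, bias = train_l1_probe(features, labels)
```

On this toy data the weight on the uninformative dimension is driven to exactly zero while the informative one stays positive. Plotting the magnitudes of such weights per neuron is one way to see whether a single feature dominates or the signal is spread across many.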
We further explore the hidden representations of opinion words (or opinion terms (OT)), as in Figure 3(b). One can observe that opinion words of different polarities do not have their own sub-spaces but occupy similar spaces in general. Instead, they carry their non-opinion meanings (such as the semantics of speed in "fast") and reflect such differences in the latent space.

Summary and Open Problems
Our analysis showed that MLM tends to learn very fine-grained features and dedicates most of the aspects' features to domains and the semantics of the aspects themselves rather than to opinions. We believe using pre-trained BERT is good for AE or the extraction part of E2E-ABSA but poor for summarizing opinions in ASC or polarity detection in E2E-ABSA. End tasks may still require a good number of examples to explore the large feature space of BERT.
We believe this analysis leads to the following open problem: finding alternative self-supervised learning tasks besides MLM. We believe MLM is far from perfect as a pretext task for ABSA and that the learned features for both AE and ASC can be improved. We believe the main weakness of MLM is that, when learning representations for an aspect word, it does not need to know the sentiment in most cases, and being an aspect or not is not a strong feature for solving MLM. How to design a pretext task that learns to disentangle aspect features and opinion features from fine-grained features of other semantics is still an open problem. For aspect features, one may group reviews for the same item (e.g., a product) to encourage the signal for aspects against other items. For sentiment features, one may use ratings as a weakly supervised signal to strengthen aspect words with sentiment.