Exploiting BERT for End-to-End Aspect-based Sentiment Analysis

In this paper, we investigate the modeling power of contextualized embeddings from pre-trained language models, e.g. BERT, on the E2E-ABSA task. Specifically, we build a series of simple yet insightful neural baselines for E2E-ABSA. The experimental results show that, even with a simple linear classification layer, our BERT-based architecture can outperform state-of-the-art works. In addition, we standardize the comparative study by consistently utilizing a hold-out validation dataset for model selection, a practice largely ignored by previous works. Our work can therefore serve as a BERT-based benchmark for E2E-ABSA.


Introduction
Aspect-based sentiment analysis (ABSA) aims to discover the users' sentiment or opinion towards an aspect, usually in the form of explicitly mentioned aspect terms (Mitchell et al., 2013; Zhang et al., 2015) or implicit aspect categories (Wang et al., 2016), from user-generated natural language texts (Liu, 2012). The most popular ABSA benchmark datasets are from the SemEval ABSA challenges (Pontiki et al., 2014, 2015, 2016), where a few thousand review sentences with gold-standard aspect sentiment annotations are provided.
Table 1 summarizes three existing research problems related to ABSA. The first one is the original ABSA, aiming at predicting the sentiment polarity of the sentence towards the given aspect. Compared to this classification problem, the second one and the third one, namely Aspect-oriented Opinion Words Extraction (AOWE) (Fan et al., 2019) and End-to-End Aspect-based Sentiment Analysis (E2E-ABSA) (Ma et al., 2018a; Schmitt et al., 2018; Li et al., 2019a; Li and Lu, 2017, 2019), are related to a sequence tagging problem. Precisely, the goal of AOWE is to extract the aspect-specific opinion words from the sentence given the aspect. The goal of E2E-ABSA is to jointly detect aspect terms/categories and the corresponding aspect sentiments.
Many neural models composed of a task-agnostic pre-trained word embedding layer and a task-specific neural architecture have been proposed for the original ABSA task (i.e., aspect-level sentiment classification) (Tang et al., 2016; Wang et al., 2016; Chen et al., 2017; Liu and Zhang, 2017; Ma et al., 2017, 2018b; Majumder et al., 2018; Li et al., 2018; He et al., 2018; Xue and Li, 2018; Wang et al., 2018; Fan et al., 2018; Huang and Carley, 2018; Lei et al., 2019; Li et al., 2019b; Zhang et al., 2019), but the improvement of these models measured by accuracy or F1 score has reached a bottleneck. One reason is that the task-agnostic embedding layer, usually a linear layer initialized with Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), only provides context-independent word-level features, which is insufficient for capturing the complex semantic dependencies in the sentence. Meanwhile, the size of existing datasets is too small to train sophisticated task-specific architectures. Thus, introducing a context-aware word embedding layer pre-trained on large-scale datasets with deep LSTM (McCann et al., 2017; Peters et al., 2018; Howard and Ruder, 2018) or Transformer (Radford et al., 2018, 2019; Devlin et al., 2019) networks is a promising direction. Hu et al. (2019a) have conducted some initial attempts to couple the deep contextualized word embedding layer with downstream neural models for the original ABSA task and establish new state-of-the-art results. This encourages us to explore the potential of using such contextualized embeddings for the more difficult but practical task, i.e., E2E-ABSA (the third setting in Table 1). Note that we are not aiming at developing a task-specific architecture; instead, our focus is to examine the potential of contextualized embeddings for E2E-ABSA, coupled with various simple layers for predicting E2E-ABSA labels.
In this paper, we investigate the modeling power of BERT (Devlin et al., 2019), one of the most popular pre-trained language models armed with the Transformer (Vaswani et al., 2017), on the task of E2E-ABSA. Concretely, inspired by the investigation of E2E-ABSA in Li et al. (2019a), which predicts aspect boundaries as well as aspect sentiments using a single sequence tagger, we build a series of simple yet insightful neural baselines for the sequence labeling problem and either fine-tune the task-specific components jointly with BERT or treat BERT as a feature extractor. Besides, we standardize the comparative study by consistently utilizing the hold-out development dataset for model selection, which is ignored in most of the existing ABSA works (Tay et al., 2018).

(Note that both ABSA and AOWE assume that the aspects in a sentence are given. Such a setting makes them less practical in real-world scenarios, since manual annotation of the fine-grained aspect mentions/categories is quite expensive. Also note that Hu et al. (2019b) introduce BERT to handle the E2E-ABSA problem, but their focus is to design a task-specific architecture rather than exploring the potential of BERT.)

Model
In this paper, we focus on the aspect term-level End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) problem setting. This task can be formulated as a sequence labeling problem.
The overall architecture of our model is depicted in Figure 1. Given the input token sequence x = {x_1, ..., x_T} of length T, we first employ the BERT component with L transformer layers to calculate the corresponding contextualized representations H^L = {h^L_1, ..., h^L_T} ∈ R^{T×dim_h} for the input tokens, where dim_h denotes the dimension of the representation vector. Then, the contextualized representations are fed to the task-specific layers to predict the tag sequence y = {y_1, ..., y_T}. The possible values of the tag y_t are B-{POS,NEG,NEU}, I-{POS,NEG,NEU}, E-{POS,NEG,NEU}, S-{POS,NEG,NEU} or O, denoting the beginning of an aspect, the inside of an aspect, the end of an aspect, and a single-word aspect, with positive, negative or neutral sentiment respectively, as well as the outside of any aspect.

BERT as Embedding Layer
Compared to the traditional Word2Vec- or GloVe-based embedding layer, which only provides a single context-independent representation for each token, the BERT embedding layer takes the sentence as input and calculates the token-level representations using the information from the entire sentence. First of all, we pack the input features as E = {e_1, ..., e_T}, where e_t (t ∈ [1, T]) is the combination of the token embedding, position embedding and segment embedding corresponding to the input token x_t. Then L transformer layers are introduced to refine the token-level features layer by layer. Specifically, the representations H^l = {h^l_1, ..., h^l_T} at the l-th (l ∈ [1, L]) layer are calculated as H^l = Transformer_l(H^{l-1}), with H^0 = E. We regard H^L as the contextualized representations of the input tokens and use them to perform the predictions for the downstream task.
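The input-packing step can be sketched as follows. This is a minimal numpy illustration with randomly initialized lookup tables and a deliberately small vocabulary; in BERT itself the tables are learned, the vocabulary is much larger, and the refinement by the L transformer layers (omitted here) follows:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim_h = 6, 768                      # sequence length, BERT-base hidden size
vocab, max_pos, n_seg = 1000, 512, 2   # illustrative sizes (BERT's vocab is larger)

# Illustrative lookup tables, randomly initialized here; learned in BERT.
tok_emb = rng.normal(size=(vocab, dim_h))
pos_emb = rng.normal(size=(max_pos, dim_h))
seg_emb = rng.normal(size=(n_seg, dim_h))

token_ids = rng.integers(0, vocab, size=T)
segment_ids = np.zeros(T, dtype=int)   # single-sentence input: all segment 0

# e_t = token embedding + position embedding + segment embedding
E = tok_emb[token_ids] + pos_emb[np.arange(T)] + seg_emb[segment_ids]
```

The resulting matrix E has shape (T, dim_h) and would be fed into the first transformer layer.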

Design of Downstream Model
After obtaining the BERT representations, we design a neural layer, called the E2E-ABSA layer in Figure 1, on top of the BERT embedding layer for solving the task of E2E-ABSA. We investigate several different designs for the E2E-ABSA layer, namely a linear layer, recurrent neural networks, self-attention networks, and a conditional random fields layer.
Linear Layer The obtained token representations can be directly fed to a linear layer with softmax activation function to calculate the token-level predictions: P(y_t | x_t) = softmax(h^L_t W_o), where W_o ∈ R^{dim_h × |Y|} contains the learnable parameters of the linear layer.
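A numpy sketch of this token-level classifier, with illustrative random inputs (|Y| = 13 corresponds to the four boundary prefixes × three sentiments plus O):

```python
import numpy as np

rng = np.random.default_rng(1)
T, dim_h, num_tags = 5, 768, 13   # |Y| = 13: {B,I,E,S} x {POS,NEG,NEU} + O

H_L = rng.normal(size=(T, dim_h))                 # stand-in for BERT outputs H^L
W_o = rng.normal(size=(dim_h, num_tags)) * 0.01   # learnable linear-layer weights

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# token-level tag distributions: one row per token, one column per tag in Y
P = softmax(H_L @ W_o)
```

Each row of P is a probability distribution over the tag set, so the rows sum to one.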
Recurrent Neural Networks Considering its sequence labeling formulation, Recurrent Neural Networks (RNN) (Elman, 1990) are a natural solution for the task of E2E-ABSA. In this paper, we adopt GRU (Cho et al., 2014), whose superiority compared to LSTM (Hochreiter and Schmidhuber, 1997) and the basic RNN has been verified in Jozefowicz et al. (2015). Since directly applying an RNN on the output of the transformer, namely the BERT representation h^L_t, may lead to unstable training (Chen et al., 2018; Liu, 2019), we add additional layer normalization (Ba et al., 2016), denoted as LN, when calculating the gates. The computational formula of the task-specific hidden representation h^T_t ∈ R^{dim_h} at the t-th time step is shown below:

[r_t; z_t] = σ(LN(W_x h^L_t) + LN(W_h h^T_{t-1}))
n_t = tanh(LN(W_{xn} h^L_t) + r_t ⊙ LN(W_{hn} h^T_{t-1}))
h^T_t = (1 − z_t) ⊙ n_t + z_t ⊙ h^T_{t-1}

where σ is the sigmoid activation function, ⊙ denotes element-wise multiplication, and r_t, z_t, n_t respectively denote the reset gate, update gate and new gate. W_x, W_h ∈ R^{2·dim_h × dim_h} and W_{xn}, W_{hn} ∈ R^{dim_h × dim_h} are the parameters of the GRU. Then, the predictions are obtained by introducing a softmax layer: P(y_t | x_t) = softmax(h^T_t W_o).

Self-Attention Networks With the help of self-attention (Cheng et al., 2016; Lin et al., 2017), Self-Attention Networks (Vaswani et al., 2017; Shen et al., 2018) are another effective feature extractor apart from RNNs and CNNs. In this paper, we introduce two SAN variants to build the task-specific token representations. One variant is composed of a simple self-attention layer and a residual connection (He et al., 2016), dubbed "SAN". The computational process of SAN is H^T = LN(H^L + SLF-ATT(Q, K, V)), where SLF-ATT is identical to the self-attentive scaled dot-product attention (Vaswani et al., 2017) and Q, K, V are derived from H^L. The other variant is a transformer layer (dubbed "TFM"), which has the same architecture as the transformer encoder layer in BERT. The computational process of TFM is Ĥ^L = LN(H^L + SLF-ATT(Q, K, V)) followed by H^T = LN(Ĥ^L + FFN(Ĥ^L)), where FFN refers to the point-wise feed-forward networks (Vaswani et al., 2017). Again, a linear layer with softmax activation is stacked on the designed SAN/TFM layer to output the predictions (the same as that of the linear layer above).
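The "SAN" variant can be sketched in numpy as follows. This is our own single-head illustration: for brevity the query/key/value projection matrices are dropped (Q = K = V = H^L), which the actual model would include:

```python
import numpy as np

rng = np.random.default_rng(2)
T, dim_h = 5, 64
H_L = rng.normal(size=(T, dim_h))   # stand-in for BERT representations H^L

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sd + eps)

def slf_att(H):
    # scaled dot-product self-attention with Q = K = V = H (single head,
    # projection matrices omitted for brevity)
    scores = H @ H.T / np.sqrt(H.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # attention weights
    return A @ H

# "SAN": residual connection around self-attention, then layer normalization
H_task = layer_norm(H_L + slf_att(H_L))
```

H_task keeps the shape (T, dim_h) of the input, so the same softmax prediction layer can be stacked on top.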
Conditional Random Fields Conditional Random Fields (CRF) (Lafferty et al., 2001) are effective in sequence modeling and have been widely adopted for solving sequence labeling tasks together with neural models (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016). In this paper, we introduce a linear-chain CRF layer on top of the BERT embedding layer. Different from the above-mentioned neural models maximizing the token-level likelihood p(y_t | x_t), the CRF-based model aims to find the globally most probable tag sequence. Specifically, the sequence-level score s(x, y) and likelihood p(y | x) of y = {y_1, ..., y_T} are calculated as follows:

s(x, y) = Σ_{t=1}^{T-1} M^A_{y_t, y_{t+1}} + Σ_{t=1}^{T} M^P_{t, y_t}
p(y | x) = softmax(s(x, y))

where M^A ∈ R^{|Y|×|Y|} is the randomly initialized transition matrix modeling the dependency between adjacent predictions and M^P ∈ R^{T×|Y|} denotes the emission matrix linearly transformed from the BERT representations H^L. The softmax here is conducted over all of the possible tag sequences. As for decoding, we regard the tag sequence with the highest score as the output: y* = argmax_y s(x, y), where the solution is obtained via Viterbi search.
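Viterbi decoding over an emission matrix (M^P) and a transition matrix (M^A) can be sketched as follows; this is a minimal numpy illustration under those definitions, not the paper's implementation:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the highest-scoring tag sequence.
    emissions: (T, K) per-token tag scores (the rows of M^P);
    transitions: (K, K), transitions[i, j] = score of moving from tag i to j (M^A)."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score of a path ending in each tag
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # cand[i, j]: best path ending in tag i at t-1, then moving to tag j
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]         # backtrack from the best final tag
    for t in range(T - 1, 0, -1):
        best.append(int(back[t][best[-1]]))
    return best[::-1]
```

For example, with strongly negative self-transitions the decoder is forced to alternate tags even when the emissions are indifferent.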
Experiment

The statistics are summarized in Table 3. We use the pre-trained "bert-base-uncased" model, where the number of transformer layers L = 12 and the hidden size dim_h is 768. For the downstream E2E-ABSA component, we consistently use the single-layer architecture and set the dimension of the task-specific representation as dim_h. The learning rate is 2e-5. The batch size is set as 25 for LAPTOP and 16 for REST. We train the model up to 1500 steps. After training for 1000 steps, we conduct model selection on the development set every 100 steps according to the micro-averaged F1 score. Following these settings, we train 5 models with different random seeds and report the average results.
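The training schedule with dev-set model selection can be sketched as below; `train_step`, `dev_f1`, and `save_checkpoint` are hypothetical stand-ins for the actual training, evaluation, and checkpointing routines:

```python
def train_with_selection(train_step, dev_f1, save_checkpoint,
                         max_steps=1500, warm_steps=1000, eval_every=100):
    """Train up to max_steps; from warm_steps onward, evaluate on the dev set
    every eval_every steps and keep the checkpoint with the best micro-F1."""
    best_f1 = -1.0
    for step in range(1, max_steps + 1):
        train_step(step)
        if step >= warm_steps and step % eval_every == 0:
            f1 = dev_f1()
            if f1 > best_f1:
                best_f1 = f1
                save_checkpoint(step)
    return best_f1
```

With the paper's settings this evaluates at steps 1000, 1100, ..., 1500 and returns the best dev F1 seen.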

Main Results
From Table 2, we surprisingly find that merely introducing a simple token-level classifier, namely BERT-Linear, already outperforms the existing works without using BERT, suggesting that BERT representations, which encode the associations between any two tokens, largely alleviate the issue of context independence in the linear E2E-ABSA layer. It is also observed that slightly more powerful E2E-ABSA layers lead to much better performance, verifying the postulation that incorporating context helps sequence modeling.

Over-parameterization Issue
Although we employ the smallest pre-trained BERT model, it is still over-parameterized for the E2E-ABSA task (110M parameters), which naturally raises a question: does the BERT-based model tend to overfit the small training set? Following this question, we train BERT-GRU, BERT-TFM and BERT-CRF for up to 3000 steps on REST and observe the fluctuation of the F1 measures on the development set. As shown in Figure 2, the F1 scores on the development set are quite stable and do not decrease much as the training proceeds, which shows that the BERT-based model is exceptionally robust to overfitting.

Finetuning BERT or Not
We also study the impact of fine-tuning on the final performance. Specifically, we employ BERT to calculate the contextualized token-level representations but keep the parameters of the BERT component unchanged during the training phase. Figure 3 illustrates the comparative results between the fine-tuned BERT-based models and those keeping the BERT component fixed. Obviously, the general-purpose BERT representation is far from satisfactory for the downstream tasks, and task-specific fine-tuning is essential for exploiting the strengths of BERT to improve the performance.

Conclusion
In this paper, we investigate the effectiveness of the BERT embedding component on the task of End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA). Specifically, we explore coupling the BERT embedding component with various neural models and conduct extensive experiments on two benchmark datasets. The experimental results demonstrate the superiority of BERT-based models in capturing aspect-based sentiment and their robustness to overfitting.

Figure 1: Overview of the designed model.

Figure 2: Performances on the Dev set of REST.

Table 2: Main results. One marker denotes numbers that are officially reported; results with the other marker are retrieved from Li et al. (2019a).

Table 3: Statistics of datasets.