GRACE: Gradient Harmonized and Cascaded Labeling for Aspect-based Sentiment Analysis

In this paper, we focus on the imbalance issue, which is rarely studied in aspect term extraction and aspect sentiment classification when regarding them as sequence labeling tasks. Besides, previous works usually ignore the interaction between aspect terms when labeling polarities. We propose a GRadient hArmonized and CascadEd labeling model (GRACE) to solve these problems. Specifically, a cascaded labeling module is developed to enhance the interchange between aspect terms and improve the attention on sentiment tokens when labeling sentiment polarities. The polarity sequence is designed to depend on the generated aspect term labels. To alleviate the imbalance issue, we extend the gradient harmonized mechanism used in object detection to aspect-based sentiment analysis by adjusting the weight of each label dynamically. The proposed GRACE adopts a post-trained BERT as its backbone. Experimental results demonstrate that the proposed model achieves consistent improvements on multiple benchmark datasets and produces state-of-the-art results.


Introduction
Aspect term extraction (ATE) and aspect sentiment classification (ASC) are two fundamental, fine-grained subtasks in aspect-based sentiment analysis (ABSA). ATE is the task of extracting the aspect terms (or attributes) of an entity upon which opinions have been expressed, and ASC is the task of identifying the polarities expressed on these extracted terms in the opinion text (Hu and Liu, 2004). Consider the example in Figure 1, which contains comments that people expressed about the aspect terms "operating system" and "keyboard"; their polarities are all positive. * Work done during an internship at MSR Asia. † Corresponding author. To better serve practical applications, aspect term-polarity co-extraction, which solves ATE and ASC simultaneously, has received much attention in recent years (Luo et al., 2019b; Hu et al., 2019; Wan et al., 2020). A major challenge of aspect term-polarity co-extraction in a unified model is that ATE and ASC are different kinds of tasks: ATE is usually a sequence labeling task, whereas ASC is usually a classification task. Previous works usually transform the ASC task into sequence labeling, so that ATE and ASC share the same formulation.
There are two approaches to sequence labeling for aspect term-polarity co-extraction. As shown in Figure 1, one is the joint approach, and the other is the collapsed approach. The former jointly labels each sentence with two different tag sets: aspect term tags and polarity tags. The latter uses collapsed labels as the tag set, e.g., "B-POS" and "I-POS", in which each tag indicates both the aspect term boundary and its polarity. Besides the joint and collapsed approaches, a pipelined approach first labels the given sentence with aspect term tags, e.g., "B" and "I" (the beginning and inside of an aspect term), and then feeds the aspect terms into a classifier to obtain their corresponding polarities.
Several related works follow these approaches. Mitchell et al. (2013) and  found that the joint and collapsed approaches are superior to the pipelined approach for co-extracting named entities and their sentiments.  proposed a unified model with the collapsed approach for aspect term-polarity co-extraction. Hu et al. (2019) solved this task with a pipelined approach. Luo et al. (2019b) adopted the joint approach for such co-extraction. We follow the joint approach in this paper, and believe that it offers a clearer separation of responsibilities than the collapsed approach by learning parallel sequence labels.
However, previous works on the joint approach usually ignore the interaction between aspect terms when labeling polarities. Such an interaction is useful for identifying polarity. For instance, in Figure 1, if "operating system" is positive, "keyboard" should also be positive, because the two aspect terms are connected by the coordinating conjunction "and". Besides, almost all previous works do not address the imbalance of labels in such sequence labeling tasks. As shown in Figure 2a, the number of 'O' labels is much larger than that of 'B' and 'I', so they tend to dominate the training loss. Moreover, we find the same gradient phenomenon as  in the sequence labeling task. As shown in Figure 2b, most labels have low gradient norms, yet they have a significant impact on the global gradient because of their large number.
Considering the above issues, we propose a GRadient hArmonized and CascadEd labeling model (GRACE) in this paper, shown in Figure 3. Unlike previous works, GRACE is a cascaded labeling model, which uses the generated aspect term labels to enhance polarity labeling in a unified framework. Specifically, we use two encoder modules that share their lower layers to extract representations. One encoder module is for ATE, and the other is for ASC given the aspect term labels generated by the former. Thus, GRACE can consider the interaction between aspect terms in the ASC module through stacked Multi-Head Attention (Vaswani et al., 2017). Besides, we extend a gradient harmonized loss to address imbalanced labels in the model training phase.
Our contributions are summarized as follows: • A novel framework, GRACE, is proposed to address the aspect term-polarity co-extraction problem in an end-to-end fashion. It utilizes a cascaded labeling approach to capture the interaction between aspect terms when labeling their sentiment tags.
• The imbalance issue of labels is considered, and a gradient harmonized strategy is extended to alleviate it. We also use virtual adversarial training and post-training on domain datasets to improve co-extraction performance.
In the following, we describe the proposed framework GRACE in Section 2. The experiments are conducted in Section 3, followed by the related work in Section 4. Finally, we conclude the paper in Section 5.

Model
An overview of GRACE is given in Figure 3. It comprises two modules with shared shallow layers: one for ATE, and the other for ASC. In this section, we first formulate the co-extraction problem and then describe the framework in detail.

Problem Statement
This paper deals with aspect term-polarity co-extraction, in which the aspect terms are explicitly mentioned in the text. We solve it as two sequence labeling tasks. Formally, given a review sentence S with n words from a particular domain, denoted by S = {w_i | i = 1, . . . , n}, the objective of our task is to assign each word w_i a tag t_i^e ∈ T_e and a tag t_i^c ∈ T_c, where T_e = {B, I, O} and T_c = {POS, NEU, NEG, CON, O}. The tags 'B', 'I', and 'O' in T_e stand for the beginning of an aspect term, the inside of an aspect term, and other words, respectively. The tags POS, NEU, NEG, and CON indicate the polarity categories positive, neutral, negative, and conflict, respectively; the tag 'O' in T_c stands for other words, as in T_e. Figure 1 shows a labeling example of the joint and collapsed approaches.
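To make the two labeling schemes concrete, the following sketch illustrates the joint and collapsed tag sequences for the Figure 1 example and how aspect term-polarity pairs are recovered from the joint tags. The sentence wording and helper name are illustrative assumptions, not taken from the paper's code.

```python
# Illustrative sentence in the spirit of Figure 1; tag names follow the paper.
sentence = ["The", "operating", "system", "and", "keyboard", "are", "nice"]

# Joint approach: two parallel tag sequences.
aspect_tags   = ["O", "B",   "I",   "O", "B",   "O", "O"]   # T_e = {B, I, O}
polarity_tags = ["O", "POS", "POS", "O", "POS", "O", "O"]   # T_c = {POS, NEU, NEG, CON, O}

# Collapsed approach: one sequence of fused boundary-polarity tags.
collapsed_tags = ["O", "B-POS", "I-POS", "O", "B-POS", "O", "O"]

def pairs_from_joint(tokens, asp, pol):
    """Recover (aspect term, polarity) pairs from the joint tag sequences."""
    result, i = [], 0
    while i < len(tokens):
        if asp[i] == "B":                       # start of an aspect term
            j = i + 1
            while j < len(tokens) and asp[j] == "I":
                j += 1                          # extend over 'I' tags
            result.append((" ".join(tokens[i:j]), pol[i]))
            i = j
        else:
            i += 1
    return result

print(pairs_from_joint(sentence, aspect_tags, polarity_tags))
# → [('operating system', 'POS'), ('keyboard', 'POS')]
```

Note that both aspect terms receive the same POS label, which is exactly the consistency the joint scheme must learn to produce.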

GRACE: Gradient Harmonized and Cascaded Model
We focus on the joint labeling approach in this paper. As shown in Figure 3, the proposed GRACE contains two branches with shared shallow layers. To benefit from a pretrained model, we use BERT-Base as our backbone. The representation H^e for ATE is then generated by the pretrained BERT: where H_[1:L] denotes the representation of each layer of BERT, from the 1st layer to the L-th layer, and L is the number of BERT layers, e.g., 12 in BERT-Base. H^e ∈ R^((n+2)×h) is the representation H_L of the last layer, which contains two extra embeddings for the special tokens [CLS] and [SEP]; their labels are set to 'O' in the experiments. h is the hidden size, and n is the length of S after tokenization with the wordpiece vocabulary.

Different layers of BERT capture different levels of information, e.g., phrase-level information in the lower layers and linguistic information in the intermediate layers (Jawahar et al., 2019), while the higher layers are usually task-related. Thus, sharing BERT between the ATE and ASC tasks is a reasonable choice. We extract the representation H^c for the ASC task from the l-th layer of BERT: Thus, H_[l+1:L] is task-specific for ATE. An extreme case is l = L, where all layers are shared across both tasks. We omit an exhaustive description of BERT and refer readers to Devlin et al. (2019) for more details.

Cascaded Labeling We could do sequence labeling on H^e and H^c directly. However, H^c is not a customized feature for ASC; conversely, ASC may degrade ATE performance. One reason is the difference between ATE and ASC: the polarity of an aspect term usually does not come from the term itself. For example, the polarity of the aspect term "operating system" in Figure 1 comes from the adjective "nice"; when labeling "operating system", the model needs to attend to "nice". The other reason is that the interaction between aspect terms is ignored when labeling their sentiment labels.
For example, "operating system" and "keyboard" are connected by the coordinating conjunction "and": if "operating system" is positive, "keyboard" should be positive, too. Thus, we propose the cascaded labeling approach, which uses the generated aspect term sequence as input to generate the sentiment sequence. As shown in Figure 3, H^c is fed to a new Transformer-Decoder (Vaswani et al., 2017) as key K and value V to generate a new aspect sentiment representation G^c: where Q represents the aspect term labels generated by the ATE module (ground-truth labels in the training phase). The vocabulary size of the Transformer-Decoder's word embedding is |T_e|.
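The cascaded decoder above can be sketched as a single decoder layer in PyTorch. This is a minimal sketch under stated assumptions: the hidden size, head count, residual/LayerNorm placement, and class names are ours, not the released implementation; the first sub-layer is ordinary (unmasked) self-attention over the aspect-tag embeddings, since labeling is not autoregressive, and cross-attention uses H^c as key and value.

```python
import torch
import torch.nn as nn

class CascadedASCDecoder(nn.Module):
    """Sketch of one cascaded-labeling decoder layer (sizes and layout assumed).
    Q comes from embeddings of the ATE branch's aspect-tag sequence; K and V
    come from the shared-encoder representation H^c."""
    def __init__(self, hidden=768, heads=12, n_aspect_tags=3, n_polarity_tags=5):
        super().__init__()
        self.tag_emb = nn.Embedding(n_aspect_tags, hidden)   # vocab size |T_e|
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.ReLU(),
                                 nn.Linear(4 * hidden, hidden))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(hidden) for _ in range(3))
        self.classifier = nn.Linear(hidden, n_polarity_tags)

    def forward(self, aspect_tags, h_c):
        # Unmasked self-attention over the aspect-tag embeddings (Q).
        q = self.tag_emb(aspect_tags)
        q = self.norm1(q + self.self_attn(q, q, q, need_weights=False)[0])
        # Cross-attention: K, V are the shared BERT representation H^c.
        g = self.norm2(q + self.cross_attn(q, h_c, h_c, need_weights=False)[0])
        g = self.norm3(g + self.ffn(g))          # aspect sentiment representation G^c
        return self.classifier(g)                # per-token polarity logits

tags = torch.randint(0, 3, (2, 10))              # batch of aspect-tag sequences
h_c = torch.randn(2, 10, 768)                    # stand-in for H^c
print(CascadedASCDecoder()(tags, h_c).shape)     # torch.Size([2, 10, 5])
```

During training, `aspect_tags` would be the gold labels (teacher forcing); at inference, they are the tags predicted by the ATE branch.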
Note that the Transformer-Decoder here is not the same as the original transformer decoder. The difference is that we use Multi-Head Attention instead of Masked Multi-Head Attention as the first sub-layer, because ASC is not an autoregressive task and does not need to predict the output sequence one element at a time.

Gradient Harmonized Loss Cross entropy is used to train the model: Then the losses of the two tasks are combined into the joint loss of the entire model: where L^e and L^c denote the losses for aspect terms and polarities, respectively, and Θ represents the model parameters, containing all trainable weight matrices and bias vectors. However, two well-known disharmonies affect performance when optimizing these losses. The first is the imbalance between positive and negative examples, and the other is the imbalance between easy and hard examples . Specifically, there is an imbalance among the labels in our labeling task: as shown in Figure 2a, the label 'O' occupies a far larger proportion than the other labels. Following the work of , the easy or hard attribute of a label can be represented by the norm of its gradient, g = |p − t|, where t is the ground truth with value 0 or 1, p is the score calculated by a softmax operation, z is the logit output of the model, and L is the cross entropy (e.g., z = M_τ w_τ and p in Eq. (5), and L in Eq. (6)). Figure 2b shows the statistics of labels w.r.t. the gradient norm g. Most labels have low gradients, and they have a significant impact on the global gradient due to their large number. A natural strategy is to down-weight the loss from these labels.
We rewrite Eq. (6) following GHM-C, which was used in object detection , as follows: where g_{t_i^τ} is the gradient norm of t_i^τ calculated by Eq. (8), N_τ is the total number of labels, and ρ(g) is the gradient density: where δ(x, y) is 1 if y − ε/2 ≤ x < y + ε/2 and 0 otherwise. ρ(g) denotes the number of labels lying in the region centered at g with length ε, normalized by the valid length l_ε(g) = min(g + ε/2, 1) − max(g − ε/2, 0) of the region. To reduce complexity, the calculation of β_{t_i^τ} uses unit regions: the range of the gradient norm is divided into m = 1/ε unit regions. For the j-th unit region u_j = [(j − 1)ε, jε), the gradient density can be approximated as: where U_j denotes the number of labels lying in u_j. The calculation of ρ̂(g) assumes that the examples lying in the same unit region share the same gradient density, so it can be computed with histogram statistics.
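The unit-region approximation above can be sketched in a few lines. This is a simplified NumPy sketch, assuming the GHM-C formulation of Li et al. (gradient norm g = |p − t|, histogram density U_j · m, weight β_i = N / GD(g_i)); the function name and bin count are our own choices, and the iteration-level moving average from the next paragraph is omitted.

```python
import numpy as np

def ghm_weights(p, t, bins=24):
    """Per-label GHM-style weights. p: predicted probability of the
    ground-truth class for each label; t: 0/1 ground truth."""
    g = np.abs(p - t)                                    # gradient norm of each label
    idx = np.minimum((g * bins).astype(int), bins - 1)   # unit region index j
    counts = np.bincount(idx, minlength=bins)            # U_j via histogram statistics
    density = counts[idx] * bins                         # approx. density U_j * m
    return len(g) / density                              # beta_i = N / GD(g_i)

# Five easy labels (tiny gradient) crowd one unit region, so each is
# down-weighted; the single hard label keeps a much larger weight.
p = np.array([0.99, 0.99, 0.99, 0.99, 0.99, 0.40])
t = np.ones_like(p)
print(ghm_weights(p, t))   # → [0.05 0.05 0.05 0.05 0.05 0.25]
```

The weights would then multiply the per-label cross-entropy terms, so that the numerous easy 'O' labels no longer dominate the training loss.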
A further simplification is to compute the statistic U_j^(t) within the t-th iteration only, rather than over the whole dataset, and to use a moving average A_j^(t) to approximate the real U_j as follows: where α is a momentum parameter. The ρ̂(g) is then updated accordingly. Virtual Adversarial Training To make the model more robust to adversarial noise, we utilize virtual adversarial training (VAT) (Miyato et al., 2016), which adds small perturbations r to the input Token Embedding E during training. The additional loss is as follows: where the adversarial perturbation r is calculated by: where ε and ξ are hyperparameters, d is sampled from the normal distribution N(0, I), Θ̂ is a constant copy of the current parameters Θ, D_KL(·‖·) is the KL divergence, and p(·|·) is the model's conditional probability.
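The VAT perturbation above can be sketched as follows. This is a minimal sketch of the standard VAT recipe (Miyato et al., 2016) applied to token embeddings, not the paper's exact code: `model` is any callable mapping embeddings to per-token logits, and the single power-iteration step and hyperparameter names are assumptions consistent with the text.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, emb, eps=2.0, xi=1e-6):
    """Virtual adversarial loss on token embeddings `emb` (batch, seq, hidden).
    eps and xi mirror the paper's reported hyperparameters."""
    with torch.no_grad():
        p = F.softmax(model(emb), dim=-1)                # predictions at current Theta
    d = torch.randn_like(emb)                            # d ~ N(0, I)
    d = xi * d / d.norm(dim=-1, keepdim=True)            # small random direction
    d.requires_grad_()
    q = F.log_softmax(model(emb + d), dim=-1)
    kl = F.kl_div(q, p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]                 # direction of steepest KL increase
    r = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    q_adv = F.log_softmax(model(emb + r.detach()), dim=-1)
    return F.kl_div(q_adv, p, reduction="batchmean")     # L_VAT

model = torch.nn.Linear(8, 5)                            # toy stand-in for the tagger
loss = vat_loss(model, torch.randn(4, 10, 8))
```

The returned loss is added to the supervised objective; gradients flow to the model parameters through the perturbed forward pass while `r` itself is treated as a constant.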
Overall, the total loss of the proposed GRACE is: where L^e and L^c, calculated by Eq. (9), denote the losses for aspect terms and polarities, respectively, and L_VAT denotes the VAT loss calculated by Eq. (16). Consistent Polarity Label One issue with treating sentiment classification as polarity sequence labeling is that the generated sequence labels are not always consistent. For instance, the polarity labels may be 'POS NEG' for the aspect term 'operating system'. To solve this problem, we design a strategy on the representations of tokens within the same aspect term. From the generated ASC label sequence, we first obtain the boundaries (b_ind, e_ind) of the aspect terms according to the meaning of the labels. Then the aspect sentiment representation G^c and the classification are calculated as follows: where G^c[b_ind : e_ind] is the snippet of G^c from b_ind to e_ind (exclusive), max is a max-pooling operator along the sequence dimension, w_h is a trainable weight matrix, and f(·) is the ReLU function. We use h_i to calculate the loss as in Eq. (5) and Eq. (9). This strategy generates consistent sentiment labels, although it did not improve performance in our preliminary experiments.
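The span-level pooling step can be sketched as below. This is a sketch under stated assumptions: the function signature, the placement of the ReLU after the trainable projection `w_h`, and passing the classifier in as a module are our own choices, reconstructed from the description rather than taken from the released code.

```python
import torch
import torch.nn as nn

def span_polarity_logits(g_c, b_ind, e_ind, w_h, classifier):
    """One shared polarity prediction per aspect term.
    g_c: (seq_len, hidden) aspect sentiment representation G^c;
    [b_ind, e_ind) is one aspect-term span from the ASC label sequence."""
    span = g_c[b_ind:e_ind]                  # snippet G^c[b_ind : e_ind]
    pooled = span.max(dim=0).values          # max-pooling along the sequence dim
    h = torch.relu(w_h(pooled))              # h = f(w_h * pooled), f = ReLU
    return classifier(h)                     # single polarity distribution for the span

hidden = 16
g_c = torch.randn(10, hidden)
w_h = nn.Linear(hidden, hidden)              # trainable weight matrix w_h
clf = nn.Linear(hidden, 5)                   # |T_c| = 5 polarity classes
logits = span_polarity_logits(g_c, 2, 4, w_h, clf)
```

Because every token in the span shares one pooled prediction, a mixed sequence like 'POS NEG' can no longer be emitted for a single aspect term.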

Datasets
We evaluate the proposed model on three benchmark sentiment analysis datasets: two come from the SemEval challenges, and the third is an English Twitter dataset, as shown in Table 1. D_L contains laptop reviews from SemEval 2014 (Pontiki et al., 2014), and D_R contains restaurant reviews merged from SemEval 2014 (D_R-14), SemEval 2015 (D_R-15) (Pontiki et al., 2015), and SemEval 2016 (D_R-16) (Pontiki et al., 2016). We keep the official data splits of these datasets for the training, validation, and testing sets. The reported results on D_L and D_R are average scores of 5 runs. D_T consists of English tweets; due to the lack of a standard train-test split, we report ten-fold cross-validation results on D_T as done in (Luo et al., 2019b). The evaluation metrics are precision (P), recall (R), and F1 score based on the exact match of an aspect term and its polarity.

Post-training
Domain knowledge has proved useful for domain-specific tasks (Xu et al., 2019; Luo et al., 2019b). In this paper, we adopt Amazon reviews and Yelp reviews, which are in-domain corpora for the laptop and restaurant domains, respectively, to post-train the uncased BERT-Base for our tasks. The Amazon review dataset contains 142.8M reviews, and the Yelp review dataset contains 2.2M restaurant reviews. We combine all these reviews for post-training. The maximum sequence length of post-training is set to 320. The batch size is 4,096 for BERT-Base with gradient accumulation (every 32 steps). BERT-Base is implemented based on the transformers library with PyTorch. The masking strategy is Whole Word Masking (WWM), the same as official BERT. We use the Adam optimizer and set the learning rate to 5e-5 with 10% warmup steps.
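The Whole Word Masking strategy mentioned above can be illustrated with a short sketch. This assumes BERT's uncased wordpiece convention, where continuation pieces carry a '##' prefix; the function name and masking rate are illustrative, and real WWM additionally replaces some selections with random tokens or leaves them unchanged.

```python
import random

def whole_word_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Mask whole words over a wordpiece sequence: when any piece of a word
    is selected, every piece of that word is masked together."""
    # Group wordpiece indices into whole words using the '##' prefix.
    words, cur = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and cur:
            cur.append(i)                 # continuation piece of current word
        else:
            if cur:
                words.append(cur)
            cur = [i]                     # start of a new word
    if cur:
        words.append(cur)
    out = list(tokens)
    for word in words:                    # select at word level, mask at piece level
        if random.random() < mask_rate:
            for i in word:
                out[i] = mask_token
    return out

print(whole_word_mask(["the", "key", "##board", "is", "nice"], mask_rate=1.0))
# → ['[MASK]', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
```

The key contrast with the original BERT masking is that "key" and "##board" are always masked (or kept) together, so the model cannot trivially recover a masked piece from its neighbors within the same word.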
Post-training runs for 10 epochs on 8 NVIDIA Tesla V100 GPUs. We use fp16 to speed up training and to reduce memory usage. The post-training process takes more than 5 days.

Settings
During fine-tuning on the ATE and ASC tasks, the optimizer is Adam with 10% warmup steps. A two-stage training strategy is utilized in our cascaded labeling model. In the first stage, we fine-tune the ATE part initialized with the post-trained BERT weights; the learning rate is set to 3e-5 with batch size 32, running 5 epochs without virtual adversarial training. We then add virtual adversarial training and continue fine-tuning for 1 epoch on D_L and 3 epochs on the other datasets with learning rate 1e-5. In the second stage, we fine-tune both the ATE and ASC modules initialized with the weights from the first stage. The ASC decoder is initialized with the last corresponding layers of the ATE module. The learning rate is set to 3e-5 for the ASC part and 3e-6 for the ATE part with batch size 32, running 10 epochs. The maximum sequence length is set to 128 on all datasets. The number of unit regions m in Eq. (11) is 24, and the momentum parameter α in Eq. (13) is 0.75. ξ in Eq. (17) is set to 1e-6, and ε in Eq. (19) is set to 2. We set the shared layers l = 9, and the number of transformer layers for ASC to 2. All the above hyperparameters are tuned on the validation sets of D_L and D_R. We implement GRACE using the same library as in post-training, and all computations are done on NVIDIA Tesla V100 GPUs.

Baseline Methods
We compare our model with the following models (code and pre-trained weights of GRACE will be released at: https://github.com/ArrowLuo/GRACE): E2E-TBSA  is an end-to-end model of the collapsed approach proposed to address ATE and ASC simultaneously. DOER (Luo et al., 2019b) employs a cross-shared unit to train ATE and ASC jointly. SPAN  is a pipeline approach built on BERT-Large (SPAN_Large) to solve aspect term-sentiment pair extraction; we implement its BERT-Base version (SPAN_Base) using the available code at https://github.com/huminghao16/SpanABSA. BERT-E2E-ABSA (Li et al., 2019c) is a BERT-based benchmark for aspect term-sentiment pair extraction; we use BERT+GRU for D_L and BERT+SAN for D_R as our baselines due to their best reported performance, and we additionally produce results on D_T with BERT+SAN, keeping the settings the same as on D_R. We compare our model with the above baselines on D_L, D_R, and D_T, and with the following baselines on D_L, D_R-14, D_R-15, and D_R-16, which are the common datasets reported by their official implementations. IMN (He et al., 2019) uses an interactive architecture with multi-task learning for end-to-end ABSA tasks; it covers aspect term and opinion term extraction besides aspect-level sentiment classification. DREGCN (Liang et al., 2020a) designs a dependency-syntactic-knowledge-augmented interactive architecture with multi-task learning for end-to-end ABSA; DREGCN is short for the official DREGCN+CNN+BERT due to its better performance. WHW (Peng et al., 2020) develops a two-stage framework to address aspect term extraction, aspect sentiment classification, and opinion extraction. TAS-BERT (Wan et al., 2020) proposes a method based on BERT-Base that captures the dependence on both aspect terms and categories for sentiment prediction; TAS-BERT is short for the official TAS-BERT-SW-BIO-CRF due to its better performance.
IKTN+BERT (Liang et al., 2020b) discriminately transfers document-level linguistic knowledge to aspect term extraction, opinion term extraction, and aspect-level sentiment classification. DHGNN  presents a dynamic heterogeneous graph to jointly and explicitly model aspect extraction and sentiment detection. RACL-BERT (Chen and Qian, 2020) is built on BERT-Large and allows aspect term extraction, opinion term extraction, and aspect-level sentiment classification to work coordinately via multi-task learning and relation propagation mechanisms in a stacked multi-layer network.

Results and Analysis
Comparison Results. The comparison results are shown in Table 2 and Table 3. (Table 3: Comparison results of F1 score (%) for aspect term-polarity pair extraction on four benchmark datasets. '-' denotes unreported results. '-w/o GHL', '-w/o VAT', and '-w/o PTR' have the same meanings as in Table 2.) GRACE obtains the best F1 score across all datasets and significantly outperforms the strongest baselines in most cases on aspect term-polarity co-extraction. Compared to the state-of-the-art pipeline approach, GRACE outperforms SPAN_Base by 8.50%, 6.50%, and 2.07% on D_L, D_R, and D_T, respectively. Even compared to SPAN_Large, built on the 24-layer BERT-Large, the improvements are still 2.65%, 3.15%, and 0.59% on D_L, D_R, and D_T, respectively. This indicates that a carefully designed joint model is capable of achieving better performance than pipeline approaches on our task. Compared to other multi-task models containing additional information, e.g., opinion terms and aspect term categories, GRACE achieves absolute gains over IMN, WHW, TAS-BERT, IKTN+BERT, and RACL-BERT of at least 7.31%, 1.84%, 2.05%, and 0.81% on D_L, D_R-14, D_R-15, and D_R-16, respectively. This suggests that GRACE can extend to more ABSA tasks. Ablation Study. To study the effectiveness of the gradient harmonized loss (GHL), VAT, and post-training, we conduct ablation experiments on each of them. The results are shown in the second block of Table 2 and Table 3. The scores drop more severely without GHL than without VAT, indicating that GRACE benefits more from the gradient harmonized loss than from VAT, and that alleviating the imbalance of labels is the more important factor for the sequence labeling. The drop without post-training is the largest on all laptop and restaurant datasets, which indicates that domain-specific knowledge can massively improve performance on task-related datasets.

Results on ATE.
As an extra output of the proposed GRACE, we also compare ATE results with  Case Study. Table 5 shows some examples from BASE, GRACE without gradient harmonized loss (w/o GHL), and GRACE, sampled from D_L and D_R. As observed in the first two examples, the BASE incorrectly predicts both aspect terms and their sentiments, whereas GRACE does not. We believe the cascaded labeling strategy enables interaction between aspect terms within a sentence, which enhances the judgment of sentiment labels. The last two rows indicate that GRACE can get correct results even though CON labels are minimal in number. The reason is not only the more comprehensive information provided by the cascaded labeling strategy but also the balance of labels given by the gradient harmonized loss.

Related Work
Aspect term extraction and aspect sentiment classification are two major topics of aspect-based sentiment analysis, and each has been studied for a long time. For the ATE task, unsupervised methods such as frequent pattern mining (Hu and Liu, 2004), rule-based approaches (Qiu et al., 2011; Liu et al., 2015), and topic modeling (He et al., 2011; Chen et al., 2014), and supervised methods such as sequence-labeling-based models (Wang et al., 2016a; Yin et al., 2016; Xu et al., 2018; Luo et al., 2019a; Ma et al., 2019) are the two main directions. For the ASC task, the relation or position between the aspect terms and the surrounding context words is usually exploited (Tang et al., 2016; Laddha and Mukherjee, 2016). Besides, there are other approaches, such as convolutional neural networks (Poria et al., 2016; Li and Xue, 2018), attention networks (Wang et al., 2016b; Ma et al., 2017; He et al., 2017), memory networks , capsule networks (Chen and Qian, 2019), and graph neural networks (Wang et al., 2020). We regard ATE and ASC as two parallel sequence labeling tasks in this paper. Compared with separate methods, this approach concisely generates all aspect term-polarity pairs of the input sentence. Like our work, Mitchell et al. (2013) and  also perform two sequence labeling tasks, but they extract named entities and their sentiment classes jointly; we have a different objective and utilize a different model. Li and Lu (2017), Ma et al. (2018) and  have the same objective as us; the main difference is that their approaches belong to the collapsed approach, while ours is a joint approach. Luo et al. (2019b) use a joint approach like ours, but they focus on the interaction between the two tasks, and design extra objectives to assist the extraction. Hu et al. (2019) treat ATE as a span extraction problem and extract aspect terms and their sentiment polarities with a pipeline approach.
There are some other approaches to these two tasks (Li et al., 2019c; He et al., 2019; Liang et al., 2020a; Peng et al., 2020; Wan et al., 2020; Liang et al., 2020b; Chen and Qian, 2020). However, almost all previous models do not address the imbalance of labels in such sequence labeling tasks. To the best of our knowledge, this is the first work to alleviate the imbalance issue in ABSA.

Conclusion
In this paper, we proposed a novel framework GRACE to solve aspect term extraction and aspect sentiment classification simultaneously. The proposed framework adopted a cascaded labeling approach to enhance the interaction between aspect terms and improve the attention of sentiment tokens for each term by a multi-head attention architecture. Besides, we alleviated the imbalance issue of labels in our labeling tasks by a gradient harmonized method borrowed from object detection. The virtual adversarial training and post-training on domain datasets were also introduced to improve the extraction performance. Experimental results on three benchmark datasets verified the effectiveness of GRACE and showed that it significantly outperforms the baselines on aspect term-polarity co-extraction.