Label Correction Model for Aspect-based Sentiment Analysis

Aspect-based sentiment analysis includes opinion aspect extraction and aspect sentiment classification. Researchers have attempted to discover the relationship between these two sub-tasks and have proposed joint models for solving aspect-based sentiment analysis. However, they ignore a phenomenon: the aspect boundary label and the sentiment label of the same word can correct each other. To exploit this phenomenon, we propose a novel deep learning model named the label correction model. Specifically, given an input sentence, our model first predicts the aspect boundary label sequence and the sentiment label sequence, then re-predicts the aspect boundary (sentiment) label sequence using the embeddings of the previously predicted sentiment (aspect boundary) labels. The goal of the re-prediction operation (which can be repeated multiple times) is to use the information of the sentiment (aspect boundary) label to correct a wrong aspect boundary (sentiment) label. Moreover, we explore two ways of using label embeddings: the add and the gate mechanism. We evaluate our model on three benchmark datasets. Experimental results verify that our model achieves state-of-the-art performance compared with several baselines.


Introduction
Aspect-based sentiment analysis (ABSA) aims to extract the opinion aspects mentioned in the sentence and to predict the sentiment of each aspect (Pontiki et al., 2014). For example, in the review "Average to good thai food , but terrible delivery .", thai food and delivery are aspects, with positive and negative as their corresponding sentiments.
In the literature, ABSA is usually divided into two sub-tasks, namely, opinion aspect extraction and aspect sentiment classification. The goal of opinion aspect extraction is to extract all aspects present in a sentence. Early work (Zhuang et al., 2006) focuses on detecting pre-defined aspects in a sentence. Later work (Jakob and Gurevych, 2010; Liu et al., 2015) regards opinion aspect extraction as a sequence labeling task. The goal of aspect sentiment classification is to predict the sentiment of a given opinion aspect, and this sub-task has drawn growing research interest over the past few years (Ma et al., 2017; Wang et al., 2018; Li et al., 2019b). However, most previous work focuses on only one of the sub-tasks, which limits the practical application of ABSA. To apply the existing methods for the two sub-tasks in practice (i.e., not only extracting aspects but also predicting their sentiments), the most common way is to pipeline the methods of the two sub-tasks together.
To further promote the resolution of ABSA, researchers have attempted to discover the relationship between these two sub-tasks and have developed joint models (Mitchell et al., 2013; Zhang et al., 2015). They utilize a set of aspect boundary labels (e.g., B, I, E, S, O) and a set of sentiment labels (e.g., POS, NEG, NEU, and O) so that the models of the two sub-tasks can be jointly trained. In other words, ABSA is modeled as an extension of opinion aspect extraction, where an extra sentiment label is assigned to each word in addition to the aspect boundary label. Table 1 gives an example of this labeling scheme. However, these joint models ignore a phenomenon: the aspect boundary label and the sentiment label of the same word can correct each other.
To make use of this phenomenon to assist ABSA, in this paper, we propose a novel neural network model named the label correction model. Figure 1 shows the architecture of our model. In a joint model solution to ABSA, each word in the input sentence has two labels (i.e., an aspect boundary label and a sentiment label), so our motivation is to use one label to correct the other as much as possible. During training, our model utilizes the ground truth label sequences to improve performance directly. In particular, our model uses the ground truth sentiment label embeddings together with the hidden states to predict the aspect boundary label sequence; the prediction of the sentiment label sequence is symmetric. During testing, we first predict the aspect boundary label sequence and the sentiment label sequence of the input sentence, based solely on the hidden states. Then, these two label sequences are employed as features for the next round of predictions. This design is based on the assumption that, in the process of prediction, the proposed model is able to correct a label by adopting the other label as a feature. Note that the re-prediction operation can be repeated many times. In this work, we try two different ways to use label embeddings: the add and the gate mechanism.
The former integrates the label embeddings into the hidden states by addition. The latter, consisting of a sigmoid layer and a pointwise multiplication, optionally lets label-embedding information through.
To summarize, we make the following contributions in this paper:
• We identify a phenomenon ignored by previous researchers: in joint models for ABSA, the aspect boundary label and the sentiment label of the same word can correct each other.
• We propose a novel model that exploits this phenomenon to solve ABSA. To the best of our knowledge, ours is the first work to do so.
• Experimental results on three benchmark datasets show that our model outperforms the strong joint baselines. We conduct an ablation study to quantitatively demonstrate the effectiveness of exploiting this phenomenon.

Related Work
Aspect-based sentiment analysis is usually divided into two sub-tasks, namely, opinion aspect extraction and aspect sentiment classification.
Opinion Aspect Extraction Opinion aspect extraction is a fundamental task in ABSA that aims at extracting all aspects present in a sentence. Hu and Liu (2004) first proposed to evaluate the sentiment of different aspects in an opinion sentence, with all aspects predefined manually. However, manually defined aspects cannot cover all aspects appearing in a sentence. Therefore, many researchers turned to extracting all possible aspects in a sentence, modeling opinion aspect extraction as a sequence labeling problem. Early approaches used traditional machine learning (Jin and Ho, 2009; Jakob and Gurevych, 2010; Liu et al., 2013) to solve this sub-task. With the development of deep learning, neural-network-based methods (Liu et al., 2015; Wang et al., 2016; Xu et al., 2018) have achieved better performance on opinion aspect extraction.
Aspect Sentiment Classification Aspect sentiment classification is often cast as a multi-class classification problem, assuming that the aspects are given. Traditional approaches first manually build a set of features and then feed them to an SVM (Jiang et al., 2011; Wagner et al., 2014). Such feature-based models depend on laborious feature engineering and are labor-intensive. Therefore, recent work mainly focuses on capturing the interaction between the aspect and the sentence by utilizing various neural architectures such as LSTM (Tang et al., 2016a) with attention mechanisms (Li et al., 2018a), CNN (Xue and Li, 2018), Memory Networks (Tang et al., 2016b; Chen et al., 2017), and pre-trained models (Wang and Ren, 2020).
Aspect-based Sentiment Analysis Aspect-based sentiment analysis needs to directly predict the sentiment towards an aspect along with discovering the aspect itself. In addition to pipeline methods, previous work attempted to discover the relationship between the two sub-tasks and presented more integrated solutions to ABSA. Specifically, Mitchell et al. (2013) formulated ABSA as a sequence labeling problem and proposed to use a CRF with hand-crafted linguistic features. Zhang et al. (2015) leveraged linguistic features and word embeddings to further improve the performance of the CRF-based model. Recently, Li et al. (2019a) proposed a unified model, which contains two stacked LSTMs along with carefully designed components for maintaining sentiment consistency and improving aspect detection, and achieved state-of-the-art results. However, this line of work overlooks a phenomenon: the aspect boundary label and the sentiment label of the same word can correct each other. In this paper, we further improve performance on ABSA by taking advantage of this phenomenon.

Problem Formulation
We formulate the complete ABSA task as two sequence labeling problems and employ a set of aspect boundary labels Y^B and a set of sentiment labels Y^S to accomplish the opinion aspect extraction sub-task and the aspect sentiment classification sub-task, respectively. Here, Y^B = {B, I, E, S, O} (short for beginning, inside, ending, single token, outside) and Y^S = {POS, NEG, NEU, O} (short for positive, negative, neutral, outside). For a given input sequence X = {x_1, ..., x_T} with length T, the goal of our model is to predict the aspect boundary label sequence and the sentiment label sequence. Based on these two label sequences, we can obtain the extracted aspects along with their corresponding sentiments. Taking the sentence in Table 1 as an example, the extracted aspects are thai food and delivery, whose sentiments are positive and negative, respectively.
Figure 1 shows the architecture of our model, which consists of three components: the encoder, the add and gate mechanism, and the CRF. The encoder is BERT (Devlin et al., 2019), which maps word embeddings into contextualized hidden states using pre-trained transformer blocks (Vaswani et al., 2017). Then, the add or gate mechanism is used to fuse the hidden states and the label embeddings, aiming at using one label's information to enhance the prediction of the other. Note that we use the ground truth label sequences during training but the label sequences from the latest round of prediction during testing. Finally, to predict the sentiment label sequence and the aspect boundary label sequence, the two fused outputs are fed into two CRFs, respectively.
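To make the two-sequence formulation concrete, the following sketch (our illustration, not the authors' code) decodes a BIOES boundary sequence plus a sentiment sequence into (aspect, sentiment) pairs. How conflicting sentiment labels within one span are resolved is not specified in the paper; here we take the majority label within the span as an assumption.

```python
# Decode a BIOES boundary label sequence and a sentiment label sequence
# into (aspect, sentiment) pairs, as in the joint formulation of ABSA.
from collections import Counter

def decode(tokens, boundary, sentiment):
    aspects = []
    i = 0
    while i < len(tokens):
        if boundary[i] == "S":                      # single-token aspect
            aspects.append((tokens[i], sentiment[i]))
            i += 1
        elif boundary[i] == "B":                    # multi-token aspect B I* E
            j = i + 1
            while j < len(tokens) and boundary[j] == "I":
                j += 1
            if j < len(tokens) and boundary[j] == "E":
                span = " ".join(tokens[i:j + 1])
                # Assumption: majority sentiment label within the span wins.
                senti = Counter(sentiment[i:j + 1]).most_common(1)[0][0]
                aspects.append((span, senti))
                i = j + 1
            else:                                   # ill-formed span, skip B
                i += 1
        else:
            i += 1
    return aspects

tokens = "Average to good thai food , but terrible delivery .".split()
boundary = ["O", "O", "O", "B", "E", "O", "O", "O", "S", "O"]
sentiment = ["O", "O", "O", "POS", "POS", "O", "O", "O", "NEG", "O"]
print(decode(tokens, boundary, sentiment))
# → [('thai food', 'POS'), ('delivery', 'NEG')]
```

This reproduces the Table 1 example: thai food with positive sentiment and delivery with negative sentiment.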

Encoder
We use BERT (Devlin et al., 2019) as our text encoder. We first tokenize the sentence X using WordPiece embeddings (Wu et al., 2016) with a 30,522-token vocabulary and then form the input sequence by concatenating a [CLS] token, the tokenized sentence, and a [SEP] token. For each token x_i, we map it into vector space by summing the token, segment, and position embeddings, yielding the input embeddings H_0 ∈ R^{(T+2)×d}, where d is the hidden state dimension. Next, we use L stacked transformer blocks to project the input embeddings into a sequence of contextual hidden states H_i ∈ R^{(T+2)×d}:

H_i = Transformer_i(H_{i-1}), i ∈ [1, L]    (1)

Here, we omit an exhaustive description of the transformer block and refer readers to Vaswani et al. (2017) for more details.
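The embedding-sum step can be sketched with toy dimensions (an illustrative reconstruction, not the released implementation; real BERT uses a 30,522-piece vocabulary, d = 768, and learned rather than random embeddings):

```python
# BERT-style input embeddings: the vector at each position is the sum of
# token, segment, and position embeddings (toy sizes for illustration).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, d = 100, 16, 8

tok_emb = rng.normal(size=(vocab_size, d))
seg_emb = rng.normal(size=(2, d))       # single-sentence input -> segment 0
pos_emb = rng.normal(size=(max_pos, d))

def input_embeddings(token_ids):
    T = len(token_ids)
    H0 = (tok_emb[token_ids]
          + seg_emb[np.zeros(T, dtype=int)]
          + pos_emb[np.arange(T)])
    return H0  # shape (T, d); counting [CLS]/[SEP], this is (T+2) x d

ids = [1, 7, 42, 2]  # hypothetical ids: [CLS], two word pieces, [SEP]
print(input_embeddings(ids).shape)  # (4, 8)
```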

Add and Gate Mechanism
As the example in Table 1 shows, each word in the input sentence has two labels (i.e., aspect boundary label and sentiment label) in the joint model. Our motivation is to use one label to correct the other.
To input labels as features, we first convert each label in the two label sequences (Y^B and Y^S) into vector space, producing the boundary label embeddings H^B ∈ R^{(T+2)×d} and the sentiment label embeddings H^S ∈ R^{(T+2)×d}. We then propose two ways to use the label embeddings.
Add As the name suggests, the add is a simple way to incorporate the label embeddings directly into the hidden states:

H̃ = H^L + H^{B(S)}    (2)

Although this way is uncomplicated, it is not very effective: when our model predicts the sentiment labels, it treats the hidden states and the boundary label embeddings as equally important, and vice versa. However, this may not be the case, because the contextualized hidden states may play a greater role when predicting the sentiment labels of certain words. Taking the sentence in Table 1 as an example, when predicting the sentiment label of the word delivery, the contextual information containing the word terrible is significantly more useful than its own boundary label S. Therefore, we propose the gate mechanism to address this shortcoming.
Gate Mechanism For the label prediction of different words, the contextual hidden states and the label embeddings should play different roles. Inspired by the gates in LSTM, we propose the gate mechanism to select important information from the hidden states and the label embeddings when forecasting a label. The gate mechanism constructs tailored hidden states by considering the label embeddings. In detail, it takes the hidden states H^L and the label embeddings H^{B(S)} as input and outputs a gate matrix G^{B(S)} to select between H^L and H^{B(S)}:

G^{B(S)} = σ(W_g[H^L; H^{B(S)}] + b_g)    (3)
H̃ = G^{B(S)} ⊙ H^L + (1 − G^{B(S)}) ⊙ H^{B(S)}    (4)

where σ denotes the sigmoid activation function and ⊙ is element-wise multiplication. From Equation 3, we can see that G^{B(S)} ∈ R^{(T+2)×d} is a matrix whose values are between 0 and 1. Therefore, we can utilize ‖G^{B(S)}_i‖_2 to measure the degree of filtering and call it the 2-Norm Gate value. A high value means most of the information in H^L_i passes through the filter, so the label is predicted with more contextual information. A low value instead allows the label embeddings to pass through the filter. This illustrates that using only contextual information is not sufficient, and the information of the other label can assist and correct the prediction of a label.
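Both fusion variants can be sketched in a few lines of numpy; the gate parameterization G = σ(W[H_L; H_lab] + b) with fused output G ⊙ H_L + (1 − G) ⊙ H_lab is our reconstruction from the surrounding text, and the weights here are random stand-ins for learned parameters:

```python
# Add vs. gate fusion of contextual hidden states and label embeddings.
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 8
H_L = rng.normal(size=(T, d))    # contextual hidden states
H_lab = rng.normal(size=(T, d))  # label embeddings (boundary or sentiment)

# Add: both sources are weighted equally.
H_add = H_L + H_lab

# Gate: a learned sigmoid gate interpolates between the two sources.
W = rng.normal(size=(2 * d, d))
b = np.zeros(d)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
G = sigmoid(np.concatenate([H_L, H_lab], axis=1) @ W + b)  # values in (0, 1)
H_gate = G * H_L + (1.0 - G) * H_lab

# 2-Norm Gate value per token: high -> mostly contextual information kept,
# low -> more label-embedding information passes through.
gate_norm = np.linalg.norm(G, axis=1)
print(gate_norm.shape)  # (5,)
```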

Conditional Random Field
It has been shown that a CRF (Lafferty et al., 2001) can produce higher labeling accuracy in sequence labeling tasks because it considers the correlations between neighboring labels. Therefore, instead of decoding each label independently with a softmax layer, we predict the sentiment label sequence and the aspect boundary label sequence with two CRFs. In the training stage, we maximize the log-likelihood:

L = log P(Y^B | X) + log P(Y^S | X)    (5)

where each label sequence likelihood P(Y^{B(S)} | X) is computed by the CRF softmax over all candidate label sequences. In the test stage, the Viterbi algorithm (Forney, 1973) is used to output the optimal label sequence.
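For illustration, here is a self-contained Viterbi decoder over toy emission and transition scores; the scores are invented for the example, whereas a trained CRF learns them:

```python
# Viterbi decoding for a linear-chain CRF: given per-token emission scores
# (T x K) and a K x K label-transition matrix, return the highest-scoring
# label index sequence by dynamic programming.
import numpy as np

def viterbi(emissions, transitions):
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous label for each label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

labels = ["B", "I", "E", "S", "O"]
emissions = np.array([[3., 0., 0., 0., 1.],
                      [0., 0., 3., 0., 1.],
                      [0., 0., 0., 0., 3.]])
# Forbid impossible BIOES moves with a large negative score.
transitions = np.full((5, 5), -10.0)
transitions[0, 1] = transitions[0, 2] = 0.0   # B -> I, B -> E
transitions[1, 1] = transitions[1, 2] = 0.0   # I -> I, I -> E
transitions[2, 4] = transitions[2, 0] = 0.0   # E -> O, E -> B
transitions[3, 4] = transitions[4, 4] = 0.0   # S -> O, O -> O
transitions[4, 0] = transitions[4, 3] = 0.0   # O -> B, O -> S
print([labels[i] for i in viterbi(emissions, transitions)])  # ['B', 'E', 'O']
```

With these transition constraints, the decoder prefers the well-formed span B E over the token-wise greedy choice, which is exactly the benefit of CRF decoding over independent softmax decoding.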

Setup
Datasets We conduct experiments on three benchmark ABSA datasets. Table 2 gives the statistics of the three datasets. Laptop contains product reviews from the laptop domain, and the train-test split is the same as in the original dataset (Pontiki et al., 2014). Rest is the union of the restaurant-domain data from SemEval 2014 (Pontiki et al., 2014), 2015 (Pontiki et al., 2015), and 2016 (Pontiki et al., 2016); the new training set is obtained by merging the three years' training sets, and the new testing set is built in the same way. Twitter is built by Mitchell et al. (2013) and consists of twitter posts. For Laptop and Rest, we use 10% randomly held-out training data as the development set. For Twitter, we follow previous work (Zhang et al., 2015; Li et al., 2019a) and report ten-fold cross-validation results, as there is no standard train-test split.

Model Settings
We use the publicly available BERT code to implement our encoder. The hyperparameters (e.g., WordPiece vocabulary size, encoder hidden size, and learning rate) and optimizer of our model are the same as those of BERT, and we employ the uncased pre-trained model to initialize our WordPiece embeddings and the encoder's parameters. We set the dimensions of the boundary label embeddings and sentiment label embeddings to 768; these two label embeddings are randomly initialized from N(0, 1) and learned from scratch. The batch size is 16, and dropout (Srivastava et al., 2014) is applied.
Metrics We adopt precision (P), recall (R), and F1 score as the evaluation metrics. We obtain the extracted aspects and the corresponding sentiments from the aspect boundary label sequence and the sentiment label sequence, respectively. An extracted aspect is considered correct only if it exactly matches a gold aspect and its corresponding sentiment is the same as the gold sentiment label.
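The exact-match metric can be sketched as follows; matching on (span text, sentiment) pairs is a simplification of our own, since evaluation scripts typically match on token or character offsets:

```python
# Precision / recall / F1 over exact-match (aspect, sentiment) pairs:
# a prediction counts as a true positive only if both the aspect span
# and its sentiment match a gold pair.
def prf1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("thai food", "POS"), ("delivery", "NEG")]
pred = [("thai food", "POS"), ("delivery", "POS")]  # wrong sentiment
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```

Note that an aspect with the right boundary but the wrong sentiment counts as an error, which is why the metric rewards models that get both label sequences right.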

Baselines
Models solving ABSA can generally be divided into three categories: pipeline, joint, and collapsed.
The pipeline model first detects aspects from the input text and then predicts sentiments over aspects.
The joint model jointly extracts aspects and predicts their sentiments using a set of aspect boundary labels as well as a set of sentiment labels. The collapsed model uses a set of collapsed labels (e.g., B-POS) to directly indicate the boundaries of targeted sentiment. We compare our model with the following methods:
• CRF-{pipeline, joint, collapsed} (Mitchell et al., 2013) are three kinds of approaches under the Conditional Random Fields framework.
• NN-CRF-{pipeline, joint, collapsed} (Zhang et al., 2015) enhance the CRF framework by introducing a fully connected layer to consolidate the linguistic features and word embeddings.
• LSTM-CRF (Lample et al., 2016) consists of an LSTM and a CRF without feature engineering. We treat it as a collapsed model and run the officially released code to produce results.
• HAST-TNet is the pipeline of HAST (Li et al., 2018b) and TNet (Li et al., 2018a), the current state-of-the-art models on aspect boundary detection and aspect sentiment classification, respectively. We use the officially released code to produce the results.
• UNIFIED (Li et al., 2019a) is the current state-of-the-art model on ABSA. This collapsed model enhanced with multi-task learning contains two stacked LSTMs.
• BERT-CLS treats ABSA as a multi-class classification task, classifying each token in the input text into one of the collapsed labels.
• BERT-CRF-{joint, collapsed} are similar to NN-CRF, except that BERT replaces the word embeddings and linear layers as the feature extractor.

Results
For simplicity, we denote our label correction model as LCM. Here, LCM-Add adds the label embeddings to the hidden states, and LCM-Gate uses the gate mechanism to optionally let label-embedding information through.

Table 3: Experimental results. The results of the models marked with '†' are reproduced by us with the officially released code by the authors; the results of the other marked models are copied from Li et al. (2019a). Average results over three runs with random initialization are reported. State-of-the-art results are marked in bold.

Main Results

Table 3 shows our comparisons with the baselines on ABSA. The experimental results suggest that our model consistently exhibits the best F1 scores on the three datasets and significantly outperforms the baselines. Five main observations can be drawn from Table 3. First, although NN-CRF, which uses word embeddings, defeats CRF, it is not strong; consequently, we add BERT-CRF as another baseline. As shown in Table 3, BERT-CRF achieves much better results than {NN, LSTM}-CRF on all datasets, indicating that the contextual representations produced by BERT for each token are quite powerful, which is why we chose BERT as the encoder. Second, compared with HAST-TNet, the pipeline of two state-of-the-art models, our model achieves 6.57%, 3.85%, and 4.63% absolute gains on Laptop, Rest, and Twitter respectively, suggesting that a carefully designed joint model can be more effective than a pipeline model for the ABSA task. Third, although the BERT-{CLS, CRF} baselines are competitive with the previous best approach UNIFIED, they are all beaten by our model. For example, LCM-Gate achieves 4.83%, 2.3%, and 4.78% absolute gains on the three datasets compared with BERT-CLS, indicating the effectiveness of our model. Fourth, among the joint methods, LCM-{Add, Gate} shows the best performance, illustrating not only the richness of the features extracted by the encoder but also the efficacy of utilizing the phenomenon that the aspect boundary label and the sentiment label of the same word can correct each other. Last but not least, we notice that the improvement of our model on the Rest dataset is marginal in contrast with the state-of-the-art baseline UNIFIED. The small gap is reasonable, since the Rest dataset contains many informal reviews that weaken the modeling power of all models; a similar observation holds when comparing UNIFIED and HAST-TNet.
Ablation Study To prove the usefulness of applying one label to correct the other in the joint model, we perform an ablation study. If our model directly employs the hidden states (H^L in Equation 1) to predict the sentiment label sequence and the aspect boundary label sequence, it is equivalent to BERT-CRF-joint; consequently, we choose BERT-CRF-joint as the ablation baseline. As shown in Table 3, compared with LCM-{Add, Gate}, BERT-CRF-joint performs poorly on all three datasets. For example, compared with LCM-Gate, BERT-CRF-joint loses 3.27% F1 score on the Laptop dataset. This confirms that considering the other label's embeddings when predicting labels helps improve performance. Furthermore, we observe that LCM-Gate is better than LCM-Add on all three datasets; for instance, in terms of F1 score, LCM-Gate gains about 0.22% absolute over LCM-Add on the Laptop dataset. This indicates that the gate mechanism, which optionally lets label-embedding information through, is more effective than direct addition.

Discussions
Investigation on the Impact of the Number of Repeated Inferences As mentioned above, the inference operation (i.e., the rectangle surrounded by the red dotted line in Figure 1) of our model can be repeated multiple times. Here, we investigate the effect of the number of repeated inferences (N) on performance. We vary the value of N in the set {1, 2, 3, 4} and plot the corresponding F1 scores of LCM on the three datasets. The results are illustrated in Figure 2. As shown in Figure 2, both models (LCM-Add and LCM-Gate) achieve good performance on Laptop and Rest when N is 2, and on Twitter when N is 3. This shows that beyond a certain number of inferences, further repetition does not necessarily improve performance, and the model's ability to correct wrong labels may reach a limit. Moreover, we find that setting N too large may induce the model to overcorrect labels. It is worth noting that although the performance of both models shows a slight rising trend on the Twitter dataset as N increases, we do not increase N further, trading off effectiveness against efficiency.
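The repeated-inference loop can be sketched structurally as below; `predict_boundary` and `predict_sentiment` are hypothetical stubs standing in for the two CRF heads, and the choice that both re-predictions within a round consume the previous round's outputs (a simultaneous update) is our assumption, not stated in the paper:

```python
# Structural sketch of test-time inference: round 0 predicts each label
# sequence from hidden states alone; each later round feeds the other
# label sequence back in as a feature. The stubs below are toy rules,
# not trained models.
def predict_boundary(hidden, sentiment_labels=None):
    # Stub rule: positive hidden score -> aspect; a sentiment label of O
    # "corrects" the boundary label to O.
    if sentiment_labels is None:
        return ["S" if h > 0 else "O" for h in hidden]
    return ["O" if s == "O" else b
            for b, s in zip(predict_boundary(hidden), sentiment_labels)]

def predict_sentiment(hidden, boundary_labels=None):
    if boundary_labels is None:
        return ["POS" if h > 0.5 else "O" for h in hidden]
    return ["O" if b == "O" else s
            for s, b in zip(predict_sentiment(hidden), boundary_labels)]

def infer(hidden, n_rounds=2):
    boundary = predict_boundary(hidden)
    sentiment = predict_sentiment(hidden)
    for _ in range(n_rounds - 1):  # N - 1 re-prediction rounds
        boundary, sentiment = (predict_boundary(hidden, sentiment),
                               predict_sentiment(hidden, boundary))
    return boundary, sentiment

print(infer([0.9, 0.2, -1.0], n_rounds=2))
# → (['S', 'O', 'O'], ['POS', 'O', 'O'])
```

In the toy run, the second token's spurious boundary label S is corrected to O in round 2 using its sentiment label O, mirroring the correction behavior the paper describes.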
Performance against Training Data Size Our motivation is to use the sentiment (aspect boundary) label to correct the aspect boundary (sentiment) label in the joint model. Intuitively, the improvement from our model should be more significant when the joint model predicts more wrong labels. To verify this, we conduct experiments with different amounts of training data. Figure 3 shows the performance of the baselines and our model under three training settings (30%, 60%, and 90% of the original training data). As shown in Figure 3, LCM-{Add, Gate} consistently outperform BERT-{CLS, CRF} under the same amount of training data on the Laptop and Rest datasets, illustrating the superiority of our model. We observe that the performance gap becomes more obvious as the size of the training data decreases. Concretely, using 30% of the original training data of the Laptop dataset, LCM-Add achieves an F1 score of 51.72%, higher than BERT-CRF trained on the same amount of data. This demonstrates that when the training data is small, the joint model has a larger error space, resulting in more mistaken labels; in contrast, our model can use the label embeddings to reduce the error space and correct the mistaken labels.

Gate Mechanism Visualization
To confirm that our model can select the valuable sentiment (aspect boundary) label information for predicting the aspect boundary (sentiment) label, we visualize the 2-Norm Gate value (see the Gate Mechanism subsection) in Figure 4. Our observations are as follows. First, when a token is an aspect word, its 2-Norm Gate value is relatively small. The underlying reason may be that using only contextual information is not sufficient, so the gate allows the label information to pass through the filter. Second, different aspect words have different 2-Norm Gate values (e.g., compare the 2-Norm Gate values of log and battery in Figure 4(a)). The reason is that if an aspect word appears more frequently in the dataset, its contextual information can correctly predict its label with little help from the label information. Finally, the same aspect word can have two different 2-Norm Gate values (e.g., compare the Left and Right 2-Norm Gate values of log in Figure 4(a)). Here, the y-axis is the 2-Norm Gate value, and Left (Right) denotes the 2-Norm Gate value of the gate mechanism when predicting the sentiment (aspect boundary) label (see Figure 1). In the Laptop sentence, log on, wifi connection, and battery life are aspects, with positive as their corresponding sentiments. In the Rest sentence, array of sushi is an aspect, with positive as its corresponding sentiment.
We assume that when the sentiment or aspect boundary label of an aspect word is easy to predict, it can be inferred correctly from the contextual information alone. For example, log is very close to its sentiment cue pleased, so its sentiment label can easily be inferred from the contextual information.
Table 4 presents some qualitative cases sampled from the ablation baseline (i.e., BERT-CRF-joint) and LCM-{Add, Gate}. As observed in the first and second sentences, BERT-CRF-joint correctly predicts the aspect boundary (sentiment) but fails to forecast the right sentiment (aspect boundary). By contrast, LCM, which incorporates the boundary (sentiment) label embeddings into the hidden states to correct the prediction of the sentiment (boundary) label, correctly handles these two cases, suggesting that our idea of using one label to correct the other improves the performance of ABSA. Besides, we find that LCM-Add may abandon the correction halfway. For example, in the fourth sentence, it successfully corrects the boundary label of harumi with the sentiment label O but fails to correct the sentiment label of of with the boundary label I. We attribute this to LCM-Add's inability to selectively utilize the label information. In contrast, LCM-Gate performs better and corrects labels more thoroughly, indicating that the gate mechanism, which optionally lets the label information through, is more reasonable and effective than pure addition.
Table 4: Case Study. The "Aspect" column contains the results from the aspect boundary label sequence. The "Sentiment" column presents the sentiment of the aspect, coming from the sentiment label sequence. The marker denotes an incorrect prediction.

Conclusions
In this paper, we investigate the aspect-based sentiment analysis task, which can be formulated as two sequence labeling problems with a set of aspect boundary labels and a set of sentiment labels. We observe that the aspect boundary label and the sentiment label of the same word can correct each other, and thus propose the label correction model, which exploits this observation to improve performance. Moreover, we introduce two ways (the add and the gate mechanism) to make use of label information. We evaluate our proposed model on three benchmark datasets, and extensive experiments show that it achieves superior performance.