Learning to Detect Opinion Snippet for Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) aims to predict the sentiment polarity towards a particular aspect in a sentence. Recently, this task has been widely addressed with the neural attention mechanism, which computes attention weights to softly select words for generating aspect-specific sentence representations. The attention is expected to concentrate on opinion words for accurate sentiment prediction. However, attention is prone to be distracted by noisy or misleading words, or by opinion words of other aspects. In this paper, we propose an alternative hard-selection approach, which determines the start and end positions of the opinion snippet and selects the words between these two positions for sentiment prediction. Specifically, we learn deep associations between the sentence and aspect, and the long-term dependencies within the sentence, by leveraging the pre-trained BERT model. We further detect the opinion snippet by self-critical reinforcement learning. Experimental results demonstrate the effectiveness of our method and show that our hard-selection approach outperforms soft-selection approaches when handling multi-aspect sentences.


Introduction
Aspect-based sentiment analysis (Pang and Lee, 2008; Liu, 2012) is a fine-grained sentiment analysis task which has gained much attention from research and industry. It aims at predicting the sentiment polarity of a particular aspect of the text. With the rapid development of deep learning, this task has been widely addressed by attention-based neural networks (Wang et al., 2016; Ma et al., 2017; Cheng et al., 2017; Tay et al., 2018; Wang et al., 2018a). To name a few, Wang et al. (2016) learn to attend on different parts of the sentence given different aspects, and then generate aspect-specific sentence representations for sentiment prediction. Tay et al. (2018) learn to attend on correct words based on associative relationships between sentence words and a given aspect. These attention-based methods have brought remarkable performance improvement to the ABSA task.

Figure 1: Example of attention visualization. The attention weights of the aspect place are from the model ATAE-LSTM (Wang et al., 2016), a typical attention mechanism used for soft-selection.
Previous attention-based methods can be categorized as soft-selection approaches, since the attention weights scatter across the whole sentence and every word is taken into consideration with a different weight. This usually results in attention distraction (Li et al., 2018b), i.e., attending on noisy or misleading words, or on opinion words of other aspects. Take Figure 1 as an example: for the aspect place in the sentence "the food is usually good but it certainly is not a relaxing place to go", we visualize the attention weights from the model ATAE-LSTM (Wang et al., 2016). As we can see, the words "good" and "but" dominate the attention weights. However, "good" describes the aspect food rather than place, and "but" is not closely related to place either. The true opinion snippet "certainly is not a relaxing place" receives low attention weights, leading to a wrong prediction for the aspect place.
Therefore, we propose an alternative hard-selection approach that determines two positions in the sentence and selects the words between them as the opinion expression of a given aspect. This is based on the observation that opinion words of a given aspect are usually distributed consecutively as a snippet (Wang and Lu, 2018). As a consecutive whole, the opinion snippet can gain enough attention weight and avoid being distracted by noisy or misleading words, or by distant opinion words of other aspects. We then predict the sentiment polarity of the given aspect based on the average of the extracted opinion snippet. The explicit selection of the opinion snippet brings another advantage: it can serve as a justification of our sentiment predictions, making our model more interpretable.
To accurately determine the two positions of the opinion snippet of a particular aspect, we first model the deep associations between the sentence and aspect, and the long-term dependencies within the sentence, by BERT (Devlin et al., 2018), a pre-trained language model that achieves exciting results in many natural language tasks. Second, with the contextual representations from BERT, the two positions are sequentially determined by self-critical reinforcement learning. The reason for using reinforcement learning is that we do not have the ground-truth positions of the opinion snippet, but only the polarity of the corresponding aspect. The extracted opinion snippet is then used for sentiment classification. The details are described in the model section.
The main contributions of our paper are as follows: • We propose a hard-selection approach to address the ABSA task. Specifically, our method determines two positions in the sentence to detect the opinion snippet towards a particular aspect, and then uses the framed content for sentiment classification. Our approach can alleviate the attention distraction problem in previous soft-selection approaches.
• We model deep associations between the sentence and aspect, and the long-term dependencies within the sentence by BERT. We then learn to detect the opinion snippet by self-critical reinforcement learning.
• The experimental results demonstrate the effectiveness of our method, and our approach significantly outperforms soft-selection approaches in handling multi-aspect sentences.

Related Work
Traditional machine learning methods for aspect-based sentiment analysis focus on extracting a set of features to train sentiment classifiers (Ding et al., 2009; Boiy and Moens, 2009; Jiang et al., 2011), which is usually labor-intensive. With the development of deep learning technologies, the neural attention mechanism (Bahdanau et al., 2014) has been widely adopted to address this task (Tang et al., 2015; Wang et al., 2016; Tang et al., 2016; Ma et al., 2017; Chen et al., 2017; Cheng et al., 2017; Li et al., 2018a; Wang et al., 2018a; Tay et al., 2018; Hazarika et al., 2018; Majumder et al., 2018; Fan et al., 2018; Wang et al., 2018b). Wang et al. (2016) propose attention-based LSTM networks which attend on different parts of the sentence for different aspects. Ma et al. (2017) utilize interactive attention to capture the deep associations between the sentence and the aspect. Hierarchical models (Cheng et al., 2017; Li et al., 2018a; Wang et al., 2018a) are also employed to capture multiple levels of emotional expression for more accurate prediction, due to the complexity of sentence structure and semantic diversity. Tay et al. (2018) learn to attend based on associative relationships between sentence words and the aspect. All these methods use normalized attention weights to softly select words for generating aspect-specific sentence representations, while the attention weights scatter across the whole sentence and can easily result in attention distraction. Wang and Lu (2018) propose a hard-selection method that learns segmentation attention, which can effectively capture the structural dependencies between the target and the sentiment expressions with a linear-chain conditional random field (CRF) layer. However, it can only address aspect-term level sentiment prediction, which requires annotations for aspect terms. In comparison, our method can handle both aspect-term level and aspect-category level sentiment prediction by detecting the opinion snippet.

Model
We first formulate the problem. Given a sentence S = {w_1, w_2, ..., w_N} and an aspect A = {a_1, a_2, ..., a_M}, the ABSA task is to predict the sentiment of A. In our setting, the aspect can be either aspect terms or an aspect category. As aspect terms, A is a snippet of words in S, i.e., a sub-sequence of the sentence, while as an aspect category, A represents a semantic category with M = 1, containing just an abstract token.

Figure 2: Network Architecture. We leverage BERT to model the relationships between sentence words and a particular aspect. The sentence and aspect are packed together into a single sequence and fed into BERT, in which E represents the input embedding and T_i represents the contextual representation of token i. With the contextual representations from BERT, the start and end positions are sequentially sampled, and then the framed content is used for sentiment prediction. Reinforcement learning is adopted to solve the non-differentiable problem of sampling.
In this paper, we propose a hard-selection approach to solve the ABSA task. Specifically, we first learn to detect the corresponding opinion snippet O = {w_l, w_{l+1}, ..., w_r}, where 1 ≤ l ≤ r ≤ N, and then use O to predict the sentiment of the given aspect. The network architecture is shown in Figure 2.

Word-Aspect Fusion
Accurately modeling the relationships between sentence words and an aspect is the key to the success of the ABSA task. Many methods have been developed to model word-aspect relationships. Wang et al. (2016) simply concatenate the aspect embedding with the input word embeddings and sentence hidden representations for computing aspect-specific attention weights. Ma et al. (2017) learn the aspect and sentence interactively by using two attention networks. Tay et al. (2018) adopt circular convolution of vectors for performing the word-aspect fusion.
In this paper, we employ BERT (Devlin et al., 2018) to model the deep associations between the sentence words and the aspect. BERT is a powerful pre-trained model which has achieved remarkable results in many NLP tasks. The architecture of BERT is a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017), which uses the self-attention mechanism to capture complex interactions and dependencies between terms within a sequence. To leverage BERT to model the relationships between the sentence and the aspect, we pack the sentence and aspect together into a single sequence and then feed it into BERT, as shown in Figure 2. With this sentence-aspect concatenation, both the word-aspect associations and word-word dependencies are modeled interactively and simultaneously. With the contextual token representations T_S = T_{[1:N]} ∈ R^{N×H} of the sentence, where N is the sentence length and H is the hidden size, we can then determine the start and end positions of the opinion snippet in the sentence.
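As an illustration, the sentence-aspect packing follows BERT's standard sentence-pair input format. The sketch below (a hypothetical helper in plain Python, not the paper's actual code) shows the packed token sequence and segment ids:

```python
def pack_sequence(sentence_tokens, aspect_tokens):
    """Pack a sentence and an aspect into a single BERT input sequence.

    Follows the standard BERT sentence-pair format [CLS] A [SEP] B [SEP],
    with segment id 0 for the sentence part and 1 for the aspect part.
    """
    tokens = ["[CLS]"] + sentence_tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]
    segment_ids = [0] * (len(sentence_tokens) + 2) + [1] * (len(aspect_tokens) + 1)
    return tokens, segment_ids

tokens, segments = pack_sequence(["the", "food", "is", "good"], ["food"])
# tokens:   ['[CLS]', 'the', 'food', 'is', 'good', '[SEP]', 'food', '[SEP]']
# segments: [0, 0, 0, 0, 0, 0, 1, 1]
```

Inside BERT's self-attention layers, every sentence token can then attend to the aspect tokens (and vice versa), which is what lets word-aspect and word-word relationships be modeled jointly.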

Soft-Selection Approach
To fairly compare the performance of soft-selection and hard-selection approaches, we use the same word-aspect fusion results T_S from BERT. We implement the attention mechanism by adopting an approach similar to that of Lin et al. (2017):

α = softmax(v_1^T tanh(W_1 T_S^T)),  g = α T_S,   (1)

where v_1 ∈ R^H and W_1 ∈ R^{H×H} are the parameters. The normalized attention weights α are used to softly select words from the whole sentence and generate the final aspect-specific sentence representation g. Then we make the sentiment prediction as follows:

ŷ = softmax(W_2 g + b),   (2)

where W_2 ∈ R^{C×H} and b ∈ R^C are the weight matrix and bias vector, respectively. ŷ is the probability distribution over C polarities. The polarity with the highest probability is selected as the prediction.
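A minimal NumPy sketch of this soft-selection step, with random values standing in for trained parameters and for the BERT representations T_S:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_select(T_S, v1, W1, W2, b):
    """Soft-selection sketch: attention-weighted pooling over the token
    representations T_S (N x H), followed by a linear sentiment classifier."""
    scores = v1 @ np.tanh(W1 @ T_S.T)   # (N,) unnormalized attention scores
    alpha = softmax(scores)             # normalized weights over all words
    g = alpha @ T_S                     # aspect-specific sentence vector (H,)
    y_hat = softmax(W2 @ g + b)         # distribution over C polarities
    return alpha, y_hat

rng = np.random.default_rng(0)
N, H, C = 6, 8, 3
T_S = rng.normal(size=(N, H))
alpha, y_hat = soft_select(T_S, rng.normal(size=H),
                           rng.normal(size=(H, H)),
                           rng.normal(size=(C, H)), rng.normal(size=C))
```

Note that alpha always spreads some probability mass over every word of the sentence, which is exactly the property that makes soft-selection vulnerable to attention distraction.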

Hard-Selection Approach
Our proposed hard-selection approach determines the start and end positions of the opinion snippet and selects the words between these two positions for sentiment prediction. Since we do not have the ground-truth opinion snippet, but only the polarity of the corresponding aspect, we adopt reinforcement learning (Williams, 1992) to train our model. To make sure that the end position comes after the start position, we determine the start and end sequentially as a sequence training problem (Rennie et al., 2017). The parameters of the network, Θ, define a policy p_Θ that outputs an action, i.e., the prediction of a position. For simplicity, we only generate two actions, determining the start and end positions respectively. After determining the start position, the "state" is updated, and the end is conditioned on the start. Specifically, we define a start vector s ∈ R^H and an end vector e ∈ R^H. Similar to prior work (Devlin et al., 2018), the probability of a word being the start of the opinion snippet is computed as the dot product between its contextual token representation and s, followed by a softmax over all words of the sentence: β_l = softmax(T_S s).
We then sample the start position l from the multinomial distribution β_l. To guarantee that the end comes after the start, the end is sampled only from the part of the sentence after the start. Therefore, the state is updated by the slicing operation T_S^r = T_S[l:]. As with the start position, the end position r is sampled from the distribution β_r = softmax(T_S^r e). We then have the opinion snippet T_O = T_S[l:r] to predict the sentiment polarity of the given aspect in the sentence. The probabilities of the start position being l and the end position being r are p(l) = β_l[l] and p(r) = β_r[r], respectively.
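The sequential sampling of start and end can be sketched as follows (a NumPy illustration of the two-step policy; the suffix slicing is what guarantees r ≥ l):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_snippet(T_S, s, e, rng):
    """Sample a start position l from softmax(T_S s), then sample the end
    position only from the suffix T_S[l:], so the end never precedes the start.
    Returns (l, r, log p(l) + log p(r)) for the REINFORCE update."""
    beta_l = softmax(T_S @ s)               # distribution over start positions
    l = rng.choice(len(beta_l), p=beta_l)
    T_S_r = T_S[l:]                         # state update: slice off the prefix
    beta_r = softmax(T_S_r @ e)             # distribution over end positions
    r = l + rng.choice(len(beta_r), p=beta_r)
    log_p = np.log(beta_l[l]) + np.log(beta_r[r - l])
    return l, r, log_p

rng = np.random.default_rng(0)
T_S = rng.normal(size=(10, 4))              # toy contextual representations
l, r, log_p = sample_snippet(T_S, rng.normal(size=4), rng.normal(size=4), rng)
```

At test time the same two steps are performed with argmax instead of sampling, matching the greedy inference described later.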

Reward
After obtaining the opinion snippet T_O by sampling the start and end positions, we compute the final representation g_o as the average of the snippet, g_o = avg(T_O). Then equation 2 with different weights is applied to compute the sentiment prediction ŷ_o. The cross-entropy loss function is employed for computing the reward:

R = Σ_c y[c] log ŷ_o[c],

where c is the index of the polarity class and y is the ground truth.
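A small sketch of the reward computation, assuming the reward is the log-likelihood of the gold polarity (i.e., the negative cross-entropy), so a confident correct prediction yields a reward close to zero and a wrong one a strongly negative reward:

```python
import numpy as np

def snippet_reward(T_O, W2, b, gold_class):
    """Average the snippet's token vectors (g_o = avg(T_O)), classify with a
    linear layer as in equation 2 (with its own weights), and return the
    log-probability of the gold polarity as the reward."""
    g_o = T_O.mean(axis=0)
    logits = W2 @ g_o + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(np.log(probs[gold_class]))

rng = np.random.default_rng(1)
T_O = rng.normal(size=(3, 4))          # a sampled 3-token snippet, H = 4
R = snippet_reward(T_O, rng.normal(size=(3, 4)), rng.normal(size=3), 0)
```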

Self-Critical Training
In this paper, we use reinforcement learning to learn the start and end positions. The goal of training is to minimize the negative expected reward:

L(Θ) = -E_{(l,r)∼p_Θ}[R],
where Θ denotes all the parameters in our architecture, including the base BERT model, the position selection parameters {s, e}, and the parameters for sentiment prediction and thus for reward calculation. The state in our method is the combination of the sentence and the aspect. For each state, the action space is every position of the sentence.
To reduce the variance of the gradient estimation, the reward is compared against a reference reward, or baseline, R_b (Rennie et al., 2017). With the likelihood-ratio trick, the gradient of the objective function can be written as

∇_Θ L(Θ) ≈ -(R - R_b) ∇_Θ (log p(l) + log p(r)).
The baseline R_b is computed based on the snippet determined by the baseline policy, which selects the start and end positions greedily by the argmax operation on the softmax results. As shown in Figure 2, the reward R is calculated on the sampled snippet, while the baseline R_b is computed on the greedily selected snippet. Note that in the test stage, the snippet is determined by argmax for inference.
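The self-critical update can be sketched as a per-example loss (a minimal illustration of the gradient estimator above; in practice the log-probabilities would be differentiable tensors):

```python
def self_critical_loss(log_p, reward_sampled, reward_greedy):
    """Self-critical REINFORCE sketch: the reward of the greedily decoded
    snippet serves as the baseline R_b. Sampled snippets that beat the greedy
    policy get a positive advantage (minimizing this loss raises their
    log-probability); worse ones are suppressed.

    log_p is log p(l) + log p(r) for the sampled start/end positions."""
    advantage = reward_sampled - reward_greedy   # R - R_b
    return -advantage * log_p

# Sampled snippet beats the greedy baseline: positive advantage.
loss_good = self_critical_loss(-2.0, -0.2, -0.9)   # -(0.7) * (-2.0) = 1.4
# Sampled snippet is worse than the baseline: negative advantage.
loss_bad = self_critical_loss(-2.0, -0.9, -0.2)    # -(-0.7) * (-2.0) = -1.4
```

The appeal of the self-critical baseline is that it needs no learned value function: the model's own greedy decoding sets the bar that sampled explorations must clear.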

Experiments
In this section, we compare our hard-selection model with various baselines. To assess the ability of alleviating the attention distraction, we further conduct experiments on a simulated multi-aspect dataset in which each sentence contains multiple aspects.

Datasets
We use the same datasets as Tay et al. (2018), which are already processed into token lists and released on GitHub. The datasets are from SemEval 2014 task 4 (Pontiki et al., 2014) and SemEval 2015 task 12 (Pontiki et al., 2015), respectively. For the aspect-term level sentiment classification task (denoted by T), we use the Laptops and Restaurants datasets.

Implementation Details
Our proposed models are implemented in PyTorch. We utilize the bert-base-uncased model, which contains 12 layers and about 110M parameters. The hidden dimension H is 768. The BERT model is initialized from the pre-trained model; other parameters are initialized by sampling from the normal distribution N(0, 0.02). In our experiments, the batch size is 32. The reported results are the test scores obtained after fine-tuning for 7 epochs with a learning rate of 5e-5.

Compared Models
• LSTM: it uses the average of all hidden states as the sentence representation for sentiment prediction. In this model, aspect information is not used.
• TD-LSTM (Tang et al., 2015): it employs two LSTMs and both of their outputs are applied to predict the sentiment polarity.
• AT-LSTM (Wang et al., 2016): it utilizes the attention mechanism to produce an aspect-specific sentence representation. This method is a kind of soft-selection approach.
• ATAE-LSTM (Wang et al., 2016): it also uses the attention mechanism. The difference with AT-LSTM is that it concatenates the aspect embedding to each word embedding as the input to LSTM.
• AF-LSTM(CORR) (Tay et al., 2018): it adopts circular correlation to capture the deep fusion between sentence words and the aspect, which can learn rich, higher-order relationships between words and the aspect.
• AF-LSTM(CONV) (Tay et al., 2018): compared with AF-LSTM(CORR), this method applies circular convolution of vectors for performing word-aspect fusion to learn relationships between sentence words and the aspect.
• BERT-Original: it makes sentiment prediction by directly using the final hidden vector C from BERT with the sentence-aspect pair as input.

Our Models
• BERT-Soft: as described in Section 3.2, the contextual token representations from BERT are processed by self attention mechanism (Lin et al., 2017) and the attention-weighted sentence representation is utilized for sentiment classification.
• BERT-Hard: as described in Section 3.3, it takes the same input as BERT-Soft. It is called a hard-selection approach since it employs reinforcement learning techniques to explicitly select the opinion snippet corresponding to a particular aspect for sentiment prediction.

Experimental Results
In this section, we evaluate the performance of our models by comparing them with various baseline models. Experimental results are shown in Table 2, in which 3-way represents 3-class sentiment classification (positive, negative, and neutral) and Binary denotes binary sentiment prediction (positive and negative). The best score in each column is marked in bold. Baseline results are reported from Tay et al. (2018). The Avg column presents macro-averaged results across all the datasets.
Firstly, we observe that BERT-Original, BERT-Soft, and BERT-Hard outperform all soft attention baselines (in the first part of Table 2), which demonstrates the effectiveness of fine-tuning the pre-trained model on the aspect-based sentiment classification task.
Considering the average score across the eight settings, BERT-Original outperforms AF-LSTM(CONV) by 6.46%, BERT-Soft by 6.47%, and BERT-Hard by 7.19%. Secondly, we compare the three BERT-related methods. The performance of BERT-Original and BERT-Soft is similar in terms of average scores. The reason may be that the original BERT already models the deep relationships between the sentence and the aspect; BERT-Original can thus be regarded as a soft-selection approach like BERT-Soft. We also observe that the snippet selection by reinforcement learning improves over the soft-selection approaches in almost all settings. However, the improvement of BERT-Hard over BERT-Soft is marginal: the average score of BERT-Hard is better by 0.68%, with per-setting improvements between 0.36% and 1.49%, while on the Laptop dataset BERT-Hard is slightly weaker than BERT-Soft. The main reason is that the datasets contain only a small portion of multi-aspect sentences with different polarities. The distraction of attention does not greatly impact sentiment prediction in single-aspect sentences or in multi-aspect sentences whose aspects share the same polarity.

Experimental Results on Multi-Aspect Sentences
On the one hand, the attention distraction issue becomes worse in multi-aspect sentences: in addition to noisy and misleading words, the attention is also prone to be distracted by opinion words of other aspects in the sentence. On the other hand, attention distraction impacts the performance of sentiment prediction more in multi-aspect sentences than in single-aspect sentences. Hence, we evaluate our models on a test dataset containing only multi-aspect sentences. A multi-aspect sentence can be categorized along two dimensions: the Number of aspects, and the Polarity dimension, which indicates whether the sentiment polarities of all aspects are the same. In the Number dimension, we categorize multi-aspect sentences as 2-3 or More: 2-3 refers to sentences with two or three aspects, while More refers to sentences with more than three aspects. The statistics of the original dataset show that there are many more sentences with 2-3 aspects than with More aspects. In the Polarity dimension, multi-aspect sentences can be categorized as Same or Diff: Same indicates that all aspects in the sentence have the same sentiment polarity, while Diff indicates that the aspects have different polarities.
Multi-aspect test set. To evaluate the performance of our models on multi-aspect sentences, we construct a new multi-aspect test set by selecting all multi-aspect sentences from the original training, development, and test sets of the Restaurants term-level task. The details are shown in Table 3.
Multi-aspect training set. Since we use all multi-aspect sentences for testing, we need to generate "virtual" multi-aspect sentences for training. The simulated multi-aspect training set includes the original single-aspect sentences and newly constructed multi-aspect sentences, which are generated by concatenating multiple single-aspect sentences with different aspects. We keep each subtype balanced in the new training set (see Table 4). Among the three sentiment polarities, Neutral has the fewest single-aspect sentences, so we randomly select the same number of Positive and Negative sentences. We then construct multi-aspect sentences by combining single-aspect sentences in different combinations of polarities. The naming of the combinations is straightforward: for example, 2P-1N indicates a sentence with two positive aspects and one negative aspect, and P-N-Nu means the three aspects are positive, negative, and neutral respectively. For simplicity, we only construct 2-asp and 3-asp sentences, which are also the majority in the original dataset.
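The construction of virtual multi-aspect sentences can be sketched as follows (the data layout and helper name are hypothetical illustrations; the paper only specifies concatenating single-aspect sentences with different aspects):

```python
def make_multi_aspect(examples, combo):
    """Build one 'virtual' multi-aspect example by concatenating
    single-aspect sentences.

    examples: dict mapping polarity ('P', 'N', 'Nu') to a list of
              (sentence, aspect) pairs (hypothetical layout).
    combo:    the polarity pattern, e.g. ('P', 'N') for one positive and
              one negative aspect (matching the paper's 'P-N' naming).
    """
    pools = {p: iter(examples[p]) for p in examples}
    picked = [next(pools[p]) for p in combo]
    # Concatenate the sentences; each original aspect keeps its own label.
    sentence = " ".join(sent for sent, _ in picked)
    aspects = [(asp, pol) for (_, asp), pol in zip(picked, combo)]
    return sentence, aspects

examples = {
    "P": [("the food is great", "food")],
    "N": [("the service is slow", "service")],
}
sentence, aspects = make_multi_aspect(examples, ("P", "N"))
# sentence: "the food is great the service is slow"
# aspects:  [("food", "P"), ("service", "N")]
```

Each constructed sentence then yields one training instance per aspect, so the model must learn to locate the snippet belonging to the queried aspect rather than any opinion expression in the sentence.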
Results and Discussions. The results on different types of multi-aspect sentences are shown in Table 5. The performance of BERT-Hard is better than that of BERT-Original and BERT-Soft over all types of multi-aspect sentences. BERT-Hard outperforms BERT-Soft by 2.11% when the aspects have the same sentiment polarity. For multi-aspect sentences with different polarities, the improvements are more significant: BERT-Hard outperforms BERT-Soft by 7.65% on the whole Diff subset, and by 5.07% and 12.83% on the 2-3 and More types respectively, which demonstrates the ability of our model to handle sentences with more aspects. Notably, BERT-Soft has the poorest performance on the Diff subset among the three methods, which indicates that soft attention is more likely to cause attention distraction.
Intuitively, when multiple aspects in a sentence share the same sentiment polarity, the model can still predict correctly to some extent even if the attention is distracted to opinion words of other aspects. In such sentences, the impact of attention distraction is not obvious and is difficult to detect. However, when the aspects have different sentiment polarities, attention distraction leads to catastrophic prediction errors, which clearly decrease classification accuracy. As shown in Table 5, the accuracy on Diff is much worse than on Same for all three methods, meaning that the Diff type is difficult to handle. Even so, the significant improvement proves that our hard-selection method can alleviate attention distraction to a certain extent. For soft-selection methods, attention distraction is inevitable because attention weights are computed for every single word, and noisy or irrelevant words can seize more attention weight than the ground-truth opinion words. Our method considers the opinion snippet as a consecutive whole, which is more resistant to attention distraction.

Visualization
In this section, we visualize the attention weights for BERT-Soft and the opinion snippets for BERT-Hard. As demonstrated in Figure 3, the multi-aspect sentence "the appetizers are ok, but the service is slow" belongs to the category Diff. Firstly, the attention weights of BERT-Soft scatter over the whole sentence and can attend to irrelevant words. For the aspect service, BERT-Soft attends to the word "ok" with a relatively high score even though it does not describe service; the same problem occurs for the aspect appetizers. Furthermore, the attention distraction can cause prediction errors: for the aspect appetizers, "but" and "slow" gain high attention scores and cause the wrong sentiment prediction Negative.
Secondly, our proposed method BERT-Hard can detect the opinion snippet for a given aspect. As illustrated in Figure 3, the opinion snippets are selected by BERT-Hard accurately. In the sentence "the appetizers are ok, but the service is slow", BERT-Hard can exactly locate the opinion snippets "ok" and "slow" for the aspect appetizers and service respectively.
Finally, we enumerate some opinion snippets detected by BERT-Hard in Table 6. Our method can precisely detect snippets even for latent opinion expressions and alleviate the influence of noisy words. For instance, "cannot be beat for the quality" is hard to predict with soft attention because the sentiment polarity is flipped by the negation word "cannot". Our method selects the whole snippet without bias toward any single word, and in this way the attention distraction can be alleviated. We also list some inaccurate snippets in Table 7: some meaningless words around the true snippet are included, such as "are", "and" and "at". These words do not affect the final prediction. A possible explanation for these inaccurate words is that the true snippets are unlabeled, and our method predicts them only from the supervisory signal of the sentiment labels.

[Table 6: Examples of detected opinion snippets, e.g., "very good", "extremely tasty", "cannot be beat for the quality" (positive); "not great", "bland", "would never go there", "not good" (negative).]

Conclusion
In this paper, we propose a hard-selection approach for aspect-based sentiment analysis, which determines the start and end positions of the opinion snippet for a given input aspect. The deep associations between the sentence and aspect, and the long-term dependencies within the sentence, are taken into consideration by leveraging the pre-trained BERT model. With the hard selection of the opinion snippet, our approach can alleviate the attention distraction problem of traditional attention-based soft-selection methods. Experimental results demonstrate the effectiveness of our method. In particular, our hard-selection approach significantly outperforms soft-selection approaches when handling multi-aspect sentences with different sentiment polarities.

Acknowledgement
This work is supported by National Science and Technology Major Project, China (Grant No. 2018YFB0204304).