Tasty Burgers, Soggy Fries: Probing Aspect Robustness in Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) aims to predict the sentiment towards a specific aspect in the text. However, existing ABSA test sets cannot be used to probe whether a model can distinguish the sentiment of the target aspect from that of the non-target aspects. To solve this problem, we develop a simple but effective approach to enrich ABSA test sets. Specifically, we generate new examples to disentangle the confounding sentiments of the non-target aspects from the target aspect's sentiment. Based on the SemEval 2014 dataset, we construct the Aspect Robustness Test Set (ARTS) as a comprehensive probe of the aspect robustness of ABSA models. By human evaluation, over 92% of the data in ARTS shows high fluency and the desired sentiment on all aspects. Using ARTS, we analyze the robustness of nine ABSA models, and observe, surprisingly, that their accuracy drops by up to 69.73%. Our code and new test set are available at https://github.com/zhijing-jin/ARTS_TestSet.


Introduction
Aspect-based sentiment analysis (ABSA) is an advanced sentiment analysis task that aims to classify the sentiment towards a specific aspect (e.g., burgers or fries in the review "Tasty burgers, and crispy fries."). The key to a strong ABSA model is that it is sensitive only to the sentiment words of the target aspect, and therefore is not interfered with by the sentiment of any non-target aspect. Although state-of-the-art models have shown high accuracy on existing test sets, we still question their robustness. Specifically, given the prerequisite that a model outputs the correct sentiment polarity for a test example, we ask the following questions: (Q1) If we reverse the sentiment polarity of the target aspect, can the model change its prediction accordingly?
(Q2) If the sentiments of all non-target aspects become opposite to the target one, can the model still make the correct prediction?
(Q3) If we add more non-target aspects with sentiments opposite to the target one, can the model still make the correct prediction?
A robust ABSA model should both meet the prerequisite and have affirmative answers to all the questions above. For example, if a model makes the correct sentiment classification (i.e., positive) for burgers in the original sentence "Tasty burgers, and crispy fries", it should flip its prediction (to negative) when seeing the new context "Terrible burgers, but crispy fries". Hence, these questions together form a probe to verify if an ABSA model has high aspect robustness.
Unfortunately, existing ABSA datasets have very limited capability to probe aspect robustness. For example, the Twitter dataset (Dong et al., 2014) has only one aspect per sentence, so the model does not need to discriminate against non-target aspects. In the most widely used SemEval 2014 Laptop and Restaurant datasets (Pontiki et al., 2014), for 83.9% and 79.6% of samples in the test sets, the sentiments of the target aspect and all non-target aspects are the same. Hence, we cannot decide whether models that make correct classifications attend only to the target aspect, because they may also wrongly look at the non-target aspects, which are confounding factors. Only a small portion of the test set can be used to answer the questions proposed in the beginning. Moreover, when we test on the subset of the test set (59 samples in Laptop, and 122 samples in Restaurant) where the target aspect sentiment differs from all non-target aspect sentiments (so that the confounding factor is disentangled), the best model (Xu et al., 2019a) performs notably worse, suggesting that previous models may over-rely on the confounding non-target aspects, but not necessarily on the target aspect only. However, no datasets can be used to analyze the aspect robustness more in depth.

Table 1: The three generation strategies, applied to the original sentence "Tasty burgers, and crispy fries." (target aspect: burgers).
Q1 REVTGT: Reverse the sentiment of the target aspect. "Terrible burgers, but crispy fries."
Q2 REVNON: Reverse the sentiment of the non-target aspects with originally the same sentiment as the target. "Tasty burgers, but soggy fries."
Q3 ADDDIFF: Add aspects with the opposite sentiment from the target aspect. "Tasty burgers, crispy fries, but poorest service ever!"

We develop an automatic generation framework that takes as input the original test samples from SemEval 2014 and applies the three generation strategies shown in Table 1. Samples generated by REVTGT, REVNON, and ADDDIFF can be used to answer questions (Q1)-(Q3), respectively. The generated new samples largely overlap with the content and aspect terms of the original samples, but manage to disentangle the confounding sentiment polarity of the non-target aspects from that of the target, as shown in the examples in Table 1. In this way, we produce an "all-rounded" test set that can check whether a model robustly captures the target sentiment instead of using other irrelevant clues.
We enrich the laptop dataset by 294%, from 638 to 1,877 samples, and the restaurant dataset by 315%, from 1,120 to 3,530 samples. By human evaluation, more than 92% of the new Aspect Robustness Test Set (ARTS) shows high fluency and the desired sentiment on all aspects. Using our new test set, we analyze the aspect robustness of nine existing models. Experimental results show that their performance degrades by up to 55.64% on Laptop and 69.73% on Restaurant.
The contributions of our paper are as follows:
1. We develop simple but effective automatic generation methods that produce new test samples (with over 92% accuracy by human evaluation) to challenge aspect robustness.
2. We construct ARTS, a new test set targeting the aspect robustness of ABSA models, and propose a new metric, the Aspect Robustness Score.
3. We probe the aspect robustness of nine models, and reveal up to a 69.73% performance drop on ARTS compared with the original test set.
4. We provide solutions to enhance the aspect robustness of ABSA models (Section 5).

Data Generation
As shown in Table 1, we aim to build a systematic method to generate all possible aspect-related alterations, in order to remove the confounding factors in the existing ABSA data. In the following, we introduce the three generation strategies.

REVTGT
The first strategy is to generate sentences that reverse the original sentiment of the target aspect. The word spans of each aspect's sentiment in the SemEval 2014 data are provided by Fan et al. (2019). We design two methods to reverse the sentiment, plus an additional step of conjunction adjustment on top of the two methods to polish the resulting sentence.

Table 2: Examples of the three REVTGT methods.
Flip Opinion Words: "It's light and easy to transport." → "It's heavy and difficult to transport."
Add Negation: "The menu changes seasonally." → "The menu does not change seasonally."
Adjust Conjunctions: "The food is good, and the decor is nice." → "The food is good, but the decor is nasty."

Flip Opinion Words Suppose we have the sentence "Tasty burgers and crispy fries," where the sentiment term for the target aspect is Tasty. We aim to generate a new sentence that flips the sentiment of Tasty. A baseline approach is antonym replacement by looking up WordNet (Miller, 1995). However, due to polysemy, the simple lookup is very likely to derive an inappropriate antonym and cause incompatibility with the context. Among the retrieved set of antonyms, we only keep words with the same Part-of-Speech (POS) tag as the original, using the stanza package, which takes the context into account with a state-of-the-art neural network-based model. Lastly, in the case of multiple antonyms, we prioritize the words that are already in the existing vocabulary, and then randomly select an antonym from the candidate set.

Table 3: Example of REVNON generation (target aspect: food).
Original sentence & sentiment: "It has great food and a reasonable price, but the service is poor." (food:+ price:+ service:−)
REVNON, flip same-sentiment non-target aspects (and adjust conjunctions): "It has great food but an unreasonable price, and the service is poor."
Exaggerate opposite-sentiment non-target aspects: "It has great food but an unreasonable price, and the service is extremely poor." (food:+ price:− service:−− overall:−)
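The antonym-selection logic described above (keep only same-POS candidates, prefer in-vocabulary words, otherwise pick at random) can be sketched as below. This is a simplified illustration: the function name is ours, and the candidate dictionary is assumed to come from an upstream WordNet lookup plus a POS tagger such as stanza, which are abstracted away here.

```python
import random

def pick_antonym(word, pos, candidates, vocab, seed=0):
    """Select an antonym for `word` from pre-retrieved candidates.

    `candidates` maps each candidate antonym to its POS tag. We keep
    only candidates whose POS matches the original word's, prefer those
    already in the dataset vocabulary, and break remaining ties randomly.
    Returns None when no same-POS antonym exists (fall back to negation).
    """
    same_pos = [w for w, p in candidates.items() if p == pos]
    if not same_pos:
        return None  # no usable antonym: fall back to the Add Negation method
    in_vocab = [w for w in same_pos if w in vocab]
    pool = in_vocab if in_vocab else same_pos
    return random.Random(seed).choice(sorted(pool))
```

For example, flipping "light" (ADJ) with candidates {"heavy": ADJ, "dark": ADJ} and a vocabulary containing "heavy" yields "heavy", matching the preference for in-vocabulary words.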
Add Negation The above strategy of flipping by antonym is constrained by the availability of antonyms. For the cases without suitable antonyms, including long phrases, we add negation according to linguistic features. In most cases, the sentiment expression is an adjective or verb term, so we simply add negation (i.e., "not") in front of it. If the sentiment term is not an adjective or verb, we add negation to its closest verb. For example, in Table 2, there are no available antonyms for "change" in the original example "The menu changes seasonally.", so we simply negate it as "The menu does not change seasonally."

Adjust Conjunctions As pinpointed in Section 1, 79.6∼83.9% of the test examples have the same sentiment for all aspects. A possible result of reversing one aspect's sentiment is that the other aspects' sentiments become opposite to the altered one, so we need to adjust the conjunctions for language fluency. If the two closest surrounding sentiments of a conjunction word have the same polarity, then cumulative conjunctions such as "and" should be applied; otherwise, we should adopt adversative conjunctions such as "but." In the example in Table 2, after flipping the sentiment, we derive the sentence "The food is good, and the decor is nasty," which is very unnatural, so we replace the conjunction "and" with "but," generating "The food is good, but the decor is nasty."
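The negation rule can be sketched as a small function. This is a naive illustration under strong assumptions: POS tags are given (e.g., from stanza), and verb lemmatization is approximated by stripping a trailing "s", which only handles regular third-person-singular forms like "changes" → "change".

```python
def add_negation(tokens, pos_tags, idx):
    """Negate the sentiment term at position `idx` (a naive sketch).

    Adjectives get a plain "not" in front; otherwise we negate the
    closest verb as "does not <lemma>", approximating the lemma by
    stripping a final "s" (a real system needs a lemmatizer).
    """
    tokens = list(tokens)
    if pos_tags[idx] == "ADJ":
        tokens.insert(idx, "not")
    else:
        # negate the closest verb (the term itself if it is a verb)
        verbs = [i for i, p in enumerate(pos_tags) if p == "VERB"]
        v = min(verbs, key=lambda i: abs(i - idx))
        lemma = tokens[v][:-1] if tokens[v].endswith("s") else tokens[v]
        tokens[v:v + 1] = ["does", "not", lemma]
    return " ".join(tokens)
```

On the paper's example, `add_negation(["The","menu","changes","seasonally"], ["DET","NOUN","VERB","ADV"], 2)` returns "The menu does not change seasonally".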

REVNON
Changing the target sentiment by REVTGT can test if a model is sensitive enough towards the target-aspect sentiment, but we need to further complement this probe by perturbing the sentiments of the non-target aspects (REVNON). As shown in Table 3, for all the non-target aspects with the same sentiment as the target aspect's, we reverse their sentiments using the same method as REVTGT.
For all the remaining non-target aspects, whose sentiments are already opposite to the target sentiment, we exaggerate the extent by randomly adding an adverb (e.g., "very," "really," and "extremely") from a dictionary of adverbs of degree collected from the training set. The resulting test example is a solid probe of ABSA quality, because only the target aspect has the desired sentiment, and all non-target aspects have been flipped to, or exaggerated with, the opposite sentiment.
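The exaggeration step is a simple insertion, sketched below. The three adverbs are the examples named above; the full dictionary is mined from the training set in the paper, and the function name is ours.

```python
import random

# Example degree adverbs; the paper collects a larger dictionary from the training set.
DEGREE_ADVERBS = ["very", "really", "extremely"]

def exaggerate(tokens, opinion_idx, rng=None):
    """Insert a random degree adverb right before the opinion word."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    tokens.insert(opinion_idx, rng.choice(DEGREE_ADVERBS))
    return " ".join(tokens)
```

For example, `exaggerate(["the","service","is","poor"], 3)` yields "the service is very/really/extremely poor", depending on the random draw.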

ADDDIFF
The first two strategies, REVTGT and REVNON, have explored how sentiment changes of existing aspects can challenge an ABSA model; ADDDIFF further investigates whether adding more non-target aspects can confuse the model. Moreover, the existing SemEval 2014 test sets have only 2 aspects per sentence on average, but real-world applications can involve more aspects. With these motivations, we develop ADDDIFF as follows.
We first form a set of aspect expressions, AspectSet, by extracting all aspect expressions from the entire dataset. Specifically, for each example in the dataset, we first identify each sentiment term (e.g., "reasonable" in "Food at a reasonable price") and then extract its linguistic branch as the aspect expression (e.g., "at a reasonable price") using a pretrained constituency parser (Joshi et al., 2018). Table 4 shows several examples from AspectSet in the restaurant domain.
Using the AspectSet, we randomly sample 1-3 aspects that are not mentioned in the original test sample and whose sentiments differ from the target aspect's, and then append these to the end of the original example. For example, ADDDIFF turns "Great food and best of all GREAT beer!" into "Great food and best of all GREAT beer, but management is less than accommodating, music is too heavy, and service is severely slow."

Table 4: Example entries of AspectSet in the restaurant domain.
Positive: "staff is friendly and knowledgeable"; "desserts are out of this world"; "texture is a velvety"
Negative: "service is severely slow"; "dining experience is miserable"; "tables are uncomfortably close"

Overview
Our source data is the most widely used ABSA dataset, the SemEval 2014 Laptop and Restaurant Reviews (Pontiki et al., 2014). We follow Wang et al. (2016), Ma et al. (2017), and Xu et al. (2019a) to remove samples with conflicting polarity and only keep positive, negative, and neutral labels. We use the train-dev split of Xu et al. (2019a). The resulting Laptop dataset has 2,163 training, 150 validation, and 638 test instances, and Restaurant has 3,452 training, 150 validation, and 1,120 test instances.
Building upon the original SemEval 2014 data, we generate enriched test sets of 1,877 samples (294% of the original size) in the laptop domain and 3,530 samples (315%) in the restaurant domain, using the generation methods introduced in Section 2. The statistics of our ARTS test set are in Table 5.

Quality Inspection
We conduct human evaluation to validate the generation quality of our ARTS dataset on two criteria:
1. Fluency: Does the generated sentence maintain the fluency of the source sentence?
2. Sentiment Correctness: Does the sentiment of each aspect have the desired polarity?
• REVTGT: Is the target sentiment reversed?
• REVNON: For non-target aspects with originally the same sentiment as the target, are they reversed? For the rest, are they exaggerated?
• ADDDIFF: Is the target sentiment unchanged?
Each task is completed by two native-speaker judges. We first calculate the inter-annotator agreement rate, and then resolve the divergent opinions on samples that they disagree on. We accept the samples that both judges consider correct or that are resolved as correct after our check. Finally, we ask the annotators to fix the rejected samples with minimal edits that do not change the aspect term or the sentence meaning, but satisfy both criteria.
(Footnote 4: http://alt.qcri.org/semeval2014/task4/)
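The acceptance bookkeeping above reduces to two small computations, sketched here with hypothetical function names: the agreement rate between the two judges, and the indices of samples that need resolution.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of samples on which the two judges give the same label."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def disagreements(labels_a, labels_b):
    """Indices of samples the judges disagree on, to be resolved manually."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

For instance, judgments [ok, ok, bad, ok] versus [ok, bad, bad, ok] give an agreement rate of 0.75, with sample 1 sent to resolution.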
Fluency Check The human evaluation shows that the generated samples largely maintain fluency; together with sentiment correctness, over 92% of the new test set passes the checks.

Dataset Analysis
After checking the quality of our enriched ARTS test set, we analyze the dataset characteristics and make comparisons with the original test sets.
For general statistics, we can see from Table 6 that the sentence length in the new test set is on average 4 words longer than the original, and the vocabulary is also larger by around two hundred words. For the label distribution, the new test set has more instances of every label, and in particular balances the positive-to-negative ratio.

Regarding the aspect-related challenge, the new test set, first of all, has more aspects per sentence than the original. Our test set also features higher disentanglement of the target aspect from the non-target aspects that share its sentiment: the portion of samples with at least one non-target aspect whose sentiment differs from the target's is 59∼67%, on average 45% higher than in the original test sets. The portion of the most challenging samples, where all non-target aspects have sentiments different from the target's, is on average 30% higher in the new test set than in the original. The average number of non-target aspects with opposite sentiments per sample in the new test set is on average 5 times that of the original set.
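The disentanglement statistics above can be computed per sample as sketched below; the function name and the (target, non-target sentiments) tuple representation are our assumptions.

```python
def disentanglement_stats(samples):
    """Compute the two disentanglement portions over multi-aspect samples.

    Each sample is (target_sentiment, [non_target_sentiments]). Returns
    (fraction with at least one differing non-target aspect,
     fraction where all non-target aspects differ from the target),
    over the samples that have at least one non-target aspect.
    """
    multi = [s for s in samples if s[1]]  # keep samples with non-target aspects
    at_least_one = all_diff = 0
    for tgt, others in multi:
        diffs = [o != tgt for o in others]
        at_least_one += any(diffs)
        all_diff += all(diffs)
    n = len(multi) or 1
    return at_least_one / n, all_diff / n
```

For example, with samples ("pos", ["pos","neg"]), ("pos", ["neg"]), ("neg", ["neg"]), the portions are 2/3 and 1/3.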

Aspect Robustness Score (ARS)
As mentioned in Section 1, a model is considered to have high aspect robustness if it satisfies both the prerequisite and all three questions (Q1)-(Q3). We therefore propose a novel metric, the Aspect Robustness Score (ARS), which counts the correct classification of a source example and all of its variations (REVTGT, REVNON, and ADDDIFF) as one unit of correctness, and then applies the standard calculation of accuracy. Note that the three variations correspond to questions (Q1)-(Q3), respectively.
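The ARS computation can be sketched as follows. The id scheme (a shared prefix before "#" grouping a source example with its variations) is an illustrative assumption; any grouping key works.

```python
def aspect_robustness_score(predictions, gold):
    """ARS: a unit (source example + its REVTGT/REVNON/ADDDIFF variants)
    counts as correct only if every member is classified correctly.

    `predictions` and `gold` map example ids to labels; ids sharing the
    prefix before '#' (e.g. "42#orig", "42#revtgt") form one unit --
    this id convention is an assumption for the sketch.
    """
    units = {}
    for ex_id, label in gold.items():
        unit = ex_id.split("#")[0]
        units.setdefault(unit, []).append(predictions.get(ex_id) == label)
    correct = sum(all(flags) for flags in units.values())
    return correct / len(units)
```

A single wrong variant thus zeroes out its whole unit, which is what makes ARS stricter than plain accuracy.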

Evaluating ABSA Models
We use our enriched test set as a comprehensive test on the aspect robustness of ABSA models.

Models
For a comprehensive overview of the ABSA field, we conduct extensive experiments on models with a variety of neural network architectures.
TD-LSTM: (Tang et al., 2016a) uses two Long Short-Term Memory Networks (LSTM) to encode the preceding and following contexts of the target aspect (inclusive) and concatenate the last hidden states of the two LSTMs to make the sentiment classification.
AttLSTM: Wang et al. (2016) apply an attention-based LSTM to the concatenation of the aspect embedding and the word embedding of each token.
GatedCNN: Xue and Li (2018) use a gated Convolutional Neural Network (CNN) that applies a Tanh-ReLU gating mechanism to the CNN-encoded text with aspect embeddings.
MemNet: Tang et al. (2016b) apply multiple hops of attention over the context word embeddings, treated as an external memory, to build the aspect-specific representation.

CapsBERT: Jiang et al. (2019) encode the sentence and the aspect term with BERT, and then feed the representation into a Capsule Network to predict the polarity.
BERT-Sent: For more in-depth analysis, we also implement a sentence classification baseline that only feeds the sentence without aspect information into BERT, and directly predicts the sentiment.

Implementation Details
For all existing models, we use the authors' official implementations. For our self-proposed BERT-Sent, we use the Adam optimizer with a learning rate of 5e-5, weight decay of 0.01, batch size of 32, l2 regularization with λ = 10^-4, and train for 50 epochs.

Results on ARTS
We list the accuracy of the nine models on the Laptop and Restaurant test sets in Table 7, where the accuracy on ARTS is calculated using ARS. (CapsBERT hand-crafts its Capsule Guided Routing specifically for the restaurant domain, so it fails significantly outside that domain.) The sentence classifier BERT-Sent drops the most, by up to ↓45.93%, and CapsBERT also drops by up to ↓39.26%. The last subset, ADDDIFF, causes most non-BERT models to drop significantly, indicating that these models are not robust against an increased number of non-target aspects, which should be irrelevant.

Variations of Generation Strategies
Combining Multiple Strategies Each sample in the ARTS test set is generated by one of the three strategies. However, it is also worth exploring whether combining several strategies can make a more challenging probe of the aspect robustness of ABSA models. As a case study, we analyze model robustness against test samples generated by the combination REVNON+ADDDIFF. Comparing the performance decrease caused by REVNON+ADDDIFF in Table 8 with that of REVNON and ADDDIFF alone in Table 7, we can see that the accuracy of each model decreases by a much larger extent on REVNON+ADDDIFF than on either strategy by itself.
ADDDIFF with More Aspects Some strategies, such as ADDDIFF, can be parameterized by k, the number of additional non-target aspects to be added. We select three models (the best, the worst, and an average-performing one) and plot their accuracy on test samples generated by ADDDIFF(k) on Laptop in Figure 1. As k grows, the test samples become more difficult: the sentence classification baseline BERT-Sent drops drastically, BERT-PT remains high, and GCN lies in the middle.

How to Effectively Model the Aspect?
An important use of ARTS is to understand which model components are key to aspect robustness. We list the aspect-specific mechanisms of all models in ascending order of their ARS on the Laptop dataset in Table 9. Among the BERT-based models, BERT-PT, which is further trained on large review corpora, achieves the best accuracy and aspect robustness. More complicated structures such as CapsBERT underperform the basic BERT by 25.08%.
Among the non-BERT models, the aspect position-aware models TD-LSTM and GCN are the most robust, as they have a stronger sense of the location of the target aspect in a sentence. On the contrary, the other models with poorer robustness (9.87%∼16.93% in Table 9) only use mechanisms such as aspect-based attention, or concatenating the aspect embedding to the word embedding.

Table 9: Models in ascending order of their ARS on Laptop. We list their aspect-specific mechanisms, including concatenating the aspect and word embeddings (Asp+W Emb), a position-aware mechanism for aspects (Posi-Aware), and attention using the aspect (Asp Att). We highlight Posi-Aware, as it is the mechanism most related to aspect robustness for non-BERT models.
To summarize, the main takeaways are • For BERT models, additional pretraining is the most effective. • For non-BERT models, explicit positionaware designs lead to more aspect robustness.

Does Better Training Help?
The following three settings explore whether better training can improve the aspect robustness.
Training and Testing on MAMS A recent dataset, Multi-Aspect Multi-Sentiment (MAMS) (Jiang et al., 2019), is collected from the same data source as the SemEval 2014 Restaurant dataset (Ganu et al., 2009). However, the sentences are more complicated, each having at least two aspects with different sentiment polarities.  Training on Adversarial Samples Adversarial training is also a good way to enhance models' aspect robustness. We conducted adversarial training on the Laptop and Restaurant datasets, and analyze its effect in Table 10b. In both domains, adversarial training (Adv→N) leads to significant performance improvement than only training on the original datasets (O→N). On the Restaurant datasets, adversarial training is even more effective than training on MAMS, because our generated samples comprehensively covered all possible perturbations of the non-target aspects, and naturally collected datasets might not be comparable.

Error Analysis
We analyze the error types in the subset that was fixed by the human judges. The two most significant error types are wrong antonyms (∼2%), such as "the weight of the laptop is light→dark", and negations that cause grammatical errors (∼1.1%). In future work, the latter can be fixed by applying a grammatical error correction system on top of our generation. Also, REVTGT and REVNON cannot be applied to the 1.4∼6.6% of samples with complicated sentiment expressions that rely on commonsense. For example, "a 2-hour wait" is negative but too difficult to alter in our current generation framework; it needs more advanced models such as text style transfer (Shen et al., 2017; Jin et al., 2019b).

Related Work
Robustness in NLP Robustness in NLP has attracted extensive attention in recent work (Hsieh et al., 2019). As a popular method to probe the robustness of models, adversarial text generation has become an emerging research field in NLP. Techniques include adding extraneous text to the input (Jia and Liang, 2016), character-level noise (Belinkov and Bisk, 2018; Ebrahimi et al., 2018), and word replacement (Alzantot et al., 2018; Jin et al., 2019a). Using these adversarial generation techniques, new adversarial test sets have been proposed for tasks such as paraphrasing (Zhang et al., 2019b) and entailment (Glockner et al., 2018).
Aspect-Based Sentiment Analysis ABSA has emerged as an active research area recently. Early works hand-craft sentiment lexicons and syntactic features for rule-based classifiers (Vo and Zhang, 2015; Kiritchenko et al., 2014). Recent neural network-based models use architectures such as LSTMs (Tang et al., 2016a), CNNs (Xue and Li, 2018), attention mechanisms (Wang et al., 2016), Capsule Networks (Jiang et al., 2019), and the pretrained model BERT (Xu et al., 2019a). Similar to the motivation of our paper, some work speculates that the current ABSA datasets might degenerate into sentence-level sentiment classification (Xu et al., 2019b).

Conclusion
In this paper, we proposed a simple but effective mechanism to generate test samples to probe the aspect robustness of the models. We enhanced the original SemEval 2014 test sets by 294% and 315% in laptop and restaurant domains. Using our new test set, we probed the aspect robustness of nine ABSA models, and discussed model designs and better training that can improve the robustness.