A Challenge Dataset and Effective Models for Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) has attracted increasing attention recently due to its broad applications. In existing ABSA datasets, most sentences contain only one aspect or multiple aspects with the same sentiment polarity, which makes the ABSA task degenerate to sentence-level sentiment analysis. In this paper, we present a new large-scale Multi-Aspect Multi-Sentiment (MAMS) dataset, in which each sentence contains at least two different aspects with different sentiment polarities. The release of this dataset would push forward the research in this field. In addition, we propose simple yet effective CapsNet and CapsNet-BERT models which combine the strengths of recent NLP advances. Experiments on our new dataset show that the proposed models significantly outperform the state-of-the-art baseline methods.


Introduction
Aspect-based sentiment analysis (ABSA) aims at identifying the sentiment polarity towards a specific aspect in a sentence. A target aspect refers to a word or a phrase describing an aspect of an entity. For example, in the sentence "The decor is not special at all but their amazing food makes up for it", there are two aspect terms, "decor" and "food", which are associated with negative and positive sentiment respectively.
Recently, neural network methods have dominated the study of ABSA, since these methods can be trained end-to-end and automatically learn important features. Wang et al. (2016) proposed to learn an embedding vector for each aspect and used these aspect embeddings to calculate attention weights that capture the information relevant to the given aspect. Tang et al. (2016b) developed a deep memory network that computes the importance of each context word and the text representation with multiple attention layers. Ma et al. (2017) introduced interactive attention networks (IAN) to interactively learn attention in contexts and targets, generating the representations for target and context words separately. Xue and Li (2018) proposed to extract sentiment features with convolutional neural networks and to selectively output aspect-related features for classification with gating mechanisms. Subsequently, Transformer (Vaswani et al., 2017) and BERT-based methods (Devlin et al., 2018) have shown high potential on the ABSA task. There are also several studies that attempt to simulate the process of human reading cognition to further improve the performance of ABSA (Lei et al., 2019).
So far, several ABSA datasets have been constructed, including the SemEval-2014 Restaurant Review dataset, the Laptop Review dataset (Pontiki et al., 2014) and the Twitter dataset (Dong et al., 2014). Although these three datasets have since become the benchmark datasets for the ABSA task, most sentences in them contain only one aspect or multiple aspects with the same sentiment polarity (see Table 1), which makes aspect-based sentiment analysis degenerate to sentence-level sentiment analysis.
* Min Yang is the corresponding author.
1 Data and code can be found at: https://github.com/siatnlp/MAMS-for-ABSA
Based on our empirical observation, sentence-level sentiment classifiers that do not consider aspects can still achieve results competitive with many recent ABSA methods (see TextCNN and LSTM in Table 3). On the other hand, even advanced ABSA methods trained on these datasets can hardly distinguish the sentiment polarities towards different aspects in sentences that contain multiple aspects and multiple sentiments.

Dataset     Task  Classes  Size  MM size
Restaurant  ATSA  4        4827  1283
Restaurant  ACSA  4        4738  454
Laptop      ATSA  4        3012  604
Twitter     ATSA  3        6940  6

Table 1: Statistics of existing datasets for ABSA. Size and MM size represent the total number of instances and the number of multi-aspect multi-sentiment instances in the dataset. Each multi-aspect multi-sentiment instance contains multiple aspects with different sentiment polarities.
With the goal of advancing and facilitating research in the field of aspect-based sentiment analysis, in this paper, we present a new Multi-Aspect Multi-Sentiment (MAMS) dataset. In the MAMS dataset, each sentence contains at least two aspects with different sentiment polarities, making the proposed dataset more challenging than existing ABSA datasets. A model that considers merely the sentence-level sentiment will fail to achieve good performance on MAMS. We empirically evaluate the state-of-the-art ABSA methods on the MAMS dataset; their poor results demonstrate that MAMS is more challenging than the SemEval-2014 Restaurant Review dataset.
We analyze the properties of recent ABSA methods, and propose new capsule networks (denoted as CapsNet and CapsNet-BERT) to model the complicated relationship between aspects and contexts, which combine the strengths of recent NLP advances. Experimental results show that the proposed methods achieve significantly better results than the state-of-the-art baseline methods on MAMS and SemEval-14 Restaurant datasets.
Our main contributions are summarized as follows: (1) We manually annotate a large-scale multi-aspect multi-sentiment dataset, which prevents ABSA from degenerating to sentence-level sentiment analysis. Its release would push forward the research of ABSA. (2) We propose a novel capsule network based model to learn the complicated relationship between aspects and contexts. (3) Experimental results show that the proposed method achieves significantly better results than the state-of-the-art baseline methods.
[...] York dataset collected by (Ganu et al., 2009). We split each document in the corpus into a few sentences, and remove the sentences consisting of more than 70 words.

Data Annotation
We create two versions of MAMS dataset for two subtasks of aspect-based sentiment analysis: aspect-term sentiment analysis (ATSA) and aspect-category sentiment analysis (ACSA). For ATSA, we invited three experienced researchers who work on natural language processing (NLP) to extract aspect terms in the sentences and label the sentiment polarities with respect to the aspect terms. The sentences that consist of only one aspect term or multiple aspects with the same sentiment polarities are deleted. We also provide the start and end positions in a sentence for each aspect term.
For ACSA, we pre-defined eight coarse aspect categories: food, service, staff, price, ambience, menu, place and miscellaneous. Five aspect categories are adopted in SemEval-2014 Restaurant Review Dataset. We add three more aspect categories to deal with some confusing situations. Three experienced NLP researchers were asked to identify the aspect categories described in given sentences and determine the sentiment polarities towards these aspect categories. We only keep the sentences which consist of at least two unique aspect categories with different sentiment polarities.
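The retention rule described above (drop sentences longer than 70 words, keep only sentences with at least two aspects carrying different polarities) can be sketched as a simple filter. This is a minimal illustration, not the authors' annotation pipeline: the function name, the (aspect, polarity) tuple schema, and whitespace tokenization are all assumptions.

```python
from typing import List, Tuple

# Hypothetical schema: each annotation is an (aspect, polarity) pair.
Annotation = Tuple[str, str]


def is_mams_sentence(sentence: str,
                     annotations: List[Annotation],
                     max_words: int = 70) -> bool:
    """Return True if the sentence satisfies the MAMS retention rule."""
    # Whitespace tokenization is an assumption made for this sketch.
    if len(sentence.split()) > max_words:
        return False
    aspects = {aspect for aspect, _ in annotations}
    polarities = {polarity for _, polarity in annotations}
    # At least two distinct aspects and at least two distinct polarities.
    return len(aspects) >= 2 and len(polarities) >= 2


example = "The decor is not special at all but their amazing food makes up for it"
keep = is_mams_sentence(example, [("decor", "negative"), ("food", "positive")])
```

The same check applies to both subtasks: for ATSA the aspect is a term, for ACSA it is one of the eight categories.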

Methodology
We use $D$ to denote the collection of sentences in the training data. Given a sentence $S = \{w^s_1, \ldots, w^s_n\}$, an aspect term $A^t = \{w^a_1, \ldots, w^a_m\}$ or an aspect category $A^c$, aspect-level sentiment classification aims to predict the sentiment polarity $y \in \{1, \ldots, C\}$ of sentence $S$ with respect to $A^t$ or $A^c$. Here, $w$ denotes a specific word, $n$ and $m$ are the lengths of the sentence and the aspect term, and $C$ represents the number of sentiment categories.
As illustrated in Figure 1, the proposed model consists of an embedding layer, an encoding layer, a primary capsule layer and a category capsule layer.

Embedding Layer
In the embedding layer, we convert the sentence $S$ into word embeddings $E = [E_1, \ldots, E_n]$. For the ACSA task, the aspect category embedding $a$ is randomly initialized and learned during training, while for the ATSA task, the aspect embedding $a$ is computed by average pooling over the aspect word embeddings. We obtain the aspect-aware sentence embedding $E^{sa}$ by concatenating the aspect embedding $a$ with each word embedding in $S$:
$$E^{sa}_i = [E_i ; a], \quad i = 1, \ldots, n$$
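As a rough illustration of this layer, the sketch below builds the ATSA aspect embedding by average pooling and appends the aspect vector to every word embedding. The function names and array shapes are assumptions made for the example; NumPy stands in for whatever framework the original implementation uses.

```python
import numpy as np


def atsa_aspect_embedding(aspect_word_emb: np.ndarray) -> np.ndarray:
    """ATSA: average pooling over aspect word embeddings, (m, d) -> (d,)."""
    return aspect_word_emb.mean(axis=0)


def aspect_aware_embedding(word_emb: np.ndarray,
                           aspect_emb: np.ndarray) -> np.ndarray:
    """Concatenate the aspect vector onto each word embedding.

    word_emb:   (n, d_w) sentence word embeddings
    aspect_emb: (d_a,)   aspect embedding
    returns:    (n, d_w + d_a)
    """
    n = word_emb.shape[0]
    tiled = np.tile(aspect_emb, (n, 1))  # repeat the aspect vector per word
    return np.concatenate([word_emb, tiled], axis=1)


E = np.zeros((5, 4))                                   # 5 words, 4-dim embeddings
a = atsa_aspect_embedding(np.array([[1.0, 3.0], [3.0, 5.0]]))
E_sa = aspect_aware_embedding(E, a)                    # shape (5, 6)
```

For ACSA, the aspect vector would instead come from a trainable lookup table indexed by category.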

Primary Capsule Layer
In the primary capsule layer, we obtain the primary capsules $P = [p_1, \ldots, p_n]$ and the aspect capsule $c$ through a linear transformation followed by the squashing activation:
$$p_i = \mathrm{squash}(W_p h_i + b_p)$$
$$c = \mathrm{squash}(W_a a + b_a)$$
where $h_i$ is the encoding-layer representation of the $i$-th word, $a$ is the aspect embedding, and $W_p$, $b_p$, $W_a$ and $b_a$ are learnable parameters. The squash function is defined as:
$$\mathrm{squash}(x) = \frac{\|x\|^2}{1 + \|x\|^2} \frac{x}{\|x\|}$$
Aspect Aware Normalization. Due to the variable lengths of sentences, the number of primary capsules sent to the upper-layer capsules varies from sentence to sentence, leading to an unstable training procedure. Extremely long sentences make the squash activation saturate and result in high confidence for all categories, while very short sentences lead to low confidence for all categories. To alleviate this problem, we propose aspect aware normalization, which uses the aspect capsule to select important primary capsules and normalizes the primary capsule weights $u$ by:
$$u_i = \frac{\exp(p_i^\top W_n c)}{\sum_{j=1}^{n} \exp(p_j^\top W_n c)}$$
where $W_n$ is a learnable parameter.
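A small numerical sketch may make the length-normalization argument concrete. It implements the standard squash activation of Sabour et al. (2017) and, as an assumption, a softmax over a bilinear score between each primary capsule and the aspect capsule; the function names and the `eps` stabilizer are illustrative and not taken from the paper.

```python
import numpy as np


def squash(x: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Squash along the last axis: keep direction, map the norm into [0, 1)."""
    sq_norm = np.sum(x * x, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * x / np.sqrt(sq_norm + eps)


def aspect_aware_weights(P: np.ndarray, c: np.ndarray,
                         W_n: np.ndarray) -> np.ndarray:
    """Softmax over words of an (assumed) bilinear score p_i^T W_n c.

    P: (n, d) primary capsules, c: (d,) aspect capsule, W_n: (d, d).
    """
    scores = P @ W_n @ c             # one score per primary capsule, shape (n,)
    scores = scores - scores.max()   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()


rng = np.random.default_rng(0)
P = squash(rng.normal(size=(7, 4)))
c = squash(rng.normal(size=4))
u = aspect_aware_weights(P, c, np.eye(4))
```

Because the weights sum to one regardless of the number of words, the total contribution passed to the upper layer no longer grows with sentence length.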
Capsule Guided Routing. The original dynamic routing mechanism (Sabour et al., 2017) suffers from inefficient training due to its iterative routing procedure. Moreover, no upper-layer information is used to guide the routing process, which makes dynamic routing work like a self-directed process.
Instead of iteratively computing coupling coefficients between primary capsules and category capsules during the routing process, we design a capsule-guided routing mechanism, which leverages prior knowledge about the sentiment categories to guide the routing process effectively and efficiently. Specifically, we use a set of sentiment capsules to store prior knowledge about the sentiment categories. Let $G \in \mathbb{R}^{C \times d}$ be the sentiment matrix, which is initialized with the averaged embeddings of sentiment words, where $C$ is the number of sentiment categories and $d$ is the dimension of the sentiment embedding. We obtain the sentiment capsules $Z = [z_1, \ldots, z_C]$ by applying the squash activation over the rows of the sentiment matrix, $z_j = \mathrm{squash}(G_j)$, and compute the routing weight $w_{ij}$ between primary capsule $p_i$ and sentiment capsule $z_j$ from the similarity between the two capsules.
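The sketch below illustrates one way such routing weights could be computed in a single pass. The exact similarity function is not recoverable from the text here, so the bilinear form with a learnable matrix `W_r` and the softmax over sentiment categories are assumptions, as are the function names and the toy seed words.

```python
import numpy as np


def squash(x: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Squash along the last axis (Sabour et al., 2017)."""
    sq_norm = np.sum(x * x, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * x / np.sqrt(sq_norm + eps)


def init_sentiment_matrix(embeddings: dict, seed_words: list) -> np.ndarray:
    """Row k of G is the mean embedding of the seed words of category k."""
    return np.stack([np.mean([embeddings[w] for w in words], axis=0)
                     for words in seed_words])


def routing_weights(P: np.ndarray, G: np.ndarray,
                    W_r: np.ndarray) -> np.ndarray:
    """Similarity-based routing weights, shape (n, C).

    P: (n, d_p) primary capsules, G: (C, d) sentiment matrix,
    W_r: (d_p, d) assumed bilinear map (hypothetical parameter).
    """
    Z = squash(G)                                  # sentiment capsules
    scores = P @ W_r @ Z.T                         # (n, C) similarities
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)        # softmax over categories


# Toy 2-dimensional embeddings and seed words, purely illustrative.
emb = {"good": np.array([1.0, 0.0]), "great": np.array([3.0, 0.0]),
       "bad": np.array([0.0, 2.0])}
G = init_sentiment_matrix(emb, [["good", "great"], ["bad"]])
rng = np.random.default_rng(0)
w = routing_weights(rng.normal(size=(6, 4)), G, rng.normal(size=(4, 2)))
```

Unlike dynamic routing, nothing here is iterated: the weights are obtained from one matrix product, which is where the efficiency gain comes from.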

Category Capsule Layer
Based on the normalization weights $u_i$ and the routing weights $w_{ij}$, the final category capsules $V = [v_1, \ldots, v_C]$ can be calculated as:
$$v_j = \mathrm{squash}\Big(s \sum_{i=1}^{n} w_{ij} u_i p_i\Big)$$
where $s$ is a learnable scale parameter that scales the connection weights to a suitable level. Following (Sabour et al., 2017), we use the margin loss as the loss function for aspect-based sentiment classification:
$$L = \sum_{k=1}^{C} T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$$
where $T_k = 1$ if and only if category $k$ is present, $m^+$ and $m^-$ are the margin hyper-parameters, and $\lambda$ controls the loss for absent categories. In our experiments, $m^+$, $m^-$ and $\lambda$ are set to 0.9, 0.1 and 0.6, respectively.
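To make the aggregation and the loss concrete, here is a small NumPy sketch. The margin loss follows the form given by Sabour et al. (2017) with the hyper-parameter values stated above; the aggregation as a scaled, doubly weighted sum of primary capsules is an assumed reading of the description, and all function names are illustrative.

```python
import numpy as np


def squash(x: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Squash along the last axis (Sabour et al., 2017)."""
    sq_norm = np.sum(x * x, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * x / np.sqrt(sq_norm + eps)


def category_capsules(P: np.ndarray, u: np.ndarray, w: np.ndarray,
                      s: float = 1.0) -> np.ndarray:
    """Weighted sum of primary capsules per category, then squash.

    P: (n, d) primary capsules, u: (n,) normalization weights,
    w: (n, C) routing weights, s: scale. Returns (C, d).
    """
    weighted = (w * u[:, None]).T @ P   # (C, d): sum_i w_ij * u_i * p_i
    return squash(s * weighted)


def margin_loss(V: np.ndarray, target: int, m_pos: float = 0.9,
                m_neg: float = 0.1, lam: float = 0.6) -> float:
    """Margin loss over category capsule lengths for one example."""
    lengths = np.linalg.norm(V, axis=-1)            # (C,)
    T = np.zeros_like(lengths)
    T[target] = 1.0                                 # one-hot gold category
    present = T * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - T) * np.maximum(0.0, lengths - m_neg) ** 2
    return float(np.sum(present + absent))


V = np.array([[0.95, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.05, 0.0]])
loss_correct = margin_loss(V, target=0)   # long capsule matches the gold label
loss_wrong = margin_loss(V, target=1)     # long capsule is an absent category
```

The predicted polarity is the category whose capsule has the largest length, so the loss pushes the gold capsule above 0.9 and the others below 0.1.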

CapsNet-BERT
To utilize the features learned from large-scale corpora, we design the CapsNet-BERT model, which combines the strengths of BERT and capsule networks. We replace the embedding layer and encoding layer of CapsNet with pre-trained BERT. CapsNet-BERT takes "[CLS] sentence [SEP] aspect [SEP]" as input and computes deep representations of the sentence and the aspect with pre-trained BERT. We then feed the sentence and aspect representations into the capsule layers and predict the corresponding sentiment polarities.
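The input format can be illustrated with a few lines of plain Python. The helper name is hypothetical, and the segment ids follow the usual BERT sentence-pair convention (0 for the first segment including its [SEP], 1 for the second); whether the released implementation assigns them this way is an assumption. Real usage would also apply WordPiece tokenization, which is omitted here.

```python
from typing import List, Tuple


def build_bert_input(sentence_tokens: List[str],
                     aspect_tokens: List[str]) -> Tuple[List[str], List[int]]:
    """Assemble "[CLS] sentence [SEP] aspect [SEP]" with segment ids."""
    tokens = ["[CLS]"] + sentence_tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the sentence and the first [SEP];
    # segment 1 covers the aspect and the final [SEP].
    segment_ids = ([0] * (len(sentence_tokens) + 2)
                   + [1] * (len(aspect_tokens) + 1))
    return tokens, segment_ids


tokens, segment_ids = build_bert_input("the food is great".split(), ["food"])
```

The segment boundary also tells the downstream capsule layers which output positions belong to the sentence and which to the aspect.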

Experimental Setup
Experimental Data. To evaluate the effectiveness of our model, we conduct experiments on the two MAMS datasets and the SemEval-14 Restaurant Review dataset (Pontiki et al., 2014). All models share the same data pre-processing procedure and use the same pre-trained word embeddings.

Implementation Details
In all experiments, we use 300-dimensional word vectors pre-trained by GloVe (Pennington et al., 2014) to initialize the word embedding vectors for non-BERT models. The capsule size is set to 300. The batch sizes are set to 64 and 32 for CapsNet and CapsNet-BERT respectively. We use the Adam optimizer (Kingma and Ba, 2015) to train our models. The learning rates are set to 0.0003 and 0.00003 for CapsNet and CapsNet-BERT respectively. We run all models five times and report the average results on the test sets. We tune the hyper-parameters of all baselines on the validation set.
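For reference, the settings listed above can be collected into a small configuration object. The class and field names are hypothetical and chosen only for this sketch; the values are the ones stated in the text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int
    learning_rate: float
    glove_dim: Optional[int]   # GloVe dimension; None for BERT-based models
    capsule_size: int = 300
    optimizer: str = "adam"
    num_runs: int = 5


CAPSNET = TrainConfig(batch_size=64, learning_rate=3e-4, glove_dim=300)
CAPSNET_BERT = TrainConfig(batch_size=32, learning_rate=3e-5, glove_dim=None)


def average_over_runs(scores: List[float], config: TrainConfig) -> float:
    """Average a test metric over the configured number of runs."""
    if len(scores) != config.num_runs:
        raise ValueError("expected one score per run")
    return mean(scores)
```

Reporting the mean over several runs reduces the effect of random initialization on the comparison between models.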

Experimental Results and Analysis
Experimental results are reported in Table 3, from which we draw the following conclusions. First, sentence-level sentiment classifiers (TextCNN and LSTM) achieve competitive results on the SemEval-14 Restaurant Review dataset but perform poorly on the MAMS datasets. This verifies that the MAMS datasets alleviate the task degeneration problem encountered in the Restaurant dataset for ABSA. Second, most advanced and complex ABSA methods, which achieve impressive results on the Restaurant dataset, perform poorly on the MAMS-small dataset. This verifies that MAMS-small is more challenging than the SemEval-14 Restaurant Review dataset. Third, attention-based models that do not effectively model word order (e.g., MemNet and AEN) perform worst on MAMS, since they lose word order information and cannot identify which part of the context describes the given aspect. Fourth, CapsNet outperforms non-BERT baselines on 4 of 6 datasets, showing the potential of applying capsule networks to the aspect-based sentiment analysis task. In addition, CapsNet-BERT performs significantly better than other models including BERT, indicating that combining capsule networks with BERT yields additional improvement over vanilla BERT.

Ablation Study
To analyze the effectiveness of the proposed capsule-guided routing mechanism, we conduct an ablation study that replaces capsule-guided routing with dynamic routing in both CapsNet and CapsNet-BERT, resulting in CapsNet-DR and CapsNet-BERT-DR. From Table 3 (last two rows) we can see that capsule-guided routing boosts the performance of CapsNet and CapsNet-BERT on all the datasets.

Conclusion
In this paper, we present MAMS, a challenge dataset for aspect-based sentiment analysis, in which each sentence contains multiple aspects with different sentiment polarities. The proposed MAMS dataset prevents aspect-level sentiment classification from degenerating to sentence-level sentiment classification, which might push forward research on aspect-based sentiment analysis. In addition, we propose a simple yet effective capsule network that significantly outperforms the compared methods.