ASAP: A Chinese Review Dataset Towards Aspect Category Sentiment Analysis and Rating Prediction

Sentiment analysis has attracted increasing attention in e-commerce, where the sentiment polarities underlying user reviews are of great value for business intelligence. Aspect category sentiment analysis (ACSA) and review rating prediction (RP) are two essential tasks for detecting fine- and coarse-grained sentiment polarities. ACSA and RP are highly correlated and usually employed jointly in real-world e-commerce scenarios. However, most public datasets are constructed for ACSA and RP separately, which may limit further exploitation of both tasks. To address this problem and advance related research, we present ASAP, a large-scale Chinese restaurant review dataset including 46,730 genuine reviews from a leading online-to-offline (O2O) e-commerce platform in China. Besides a 5-star scale rating, each review is manually annotated with its sentiment polarities towards 18 pre-defined aspect categories. We hope the release of the dataset will shed some light on the field of sentiment analysis. Moreover, we propose an intuitive yet effective joint model for ACSA and RP. Experimental results demonstrate that the joint model outperforms state-of-the-art baselines on both tasks.


Introduction
With the rapid development of e-commerce, the massive user reviews available on e-commerce platforms are becoming valuable resources for both customers and merchants. Aspect-based sentiment analysis (ABSA) on user reviews is a fundamental and challenging task which attracts interest from both academia and industry (Hu and Liu, 2004; Ganu et al., 2009; Jo and Oh, 2011; Kiritchenko et al., 2014). According to whether the aspect terms are explicitly mentioned in the text, ABSA can be further divided into aspect term sentiment analysis (ATSA) and aspect category sentiment analysis (ACSA); we focus on the latter, which is more widely used in industry. Specifically, given the review "Although the fish is delicious, the waiter is horrible!", the ACSA task aims to infer that the sentiment polarity over the aspect category food is positive while the opinion over the aspect category service is negative. The user interfaces of e-commerce platforms are more intelligent than ever before with the help of ACSA techniques. For example, Figure 1 presents the detail page of a coffee shop on a popular e-commerce platform in China. The upper aspect-based sentiment text-boxes display the aspect categories (e.g., food, sanitation) mentioned frequently in user reviews and the aggregated sentiment polarities on these aspect categories (orange represents positive and blue represents negative). Customers can focus on the corresponding reviews by clicking the aspect-based sentiment text-boxes they care about (e.g., the orange filled text-box "卫生条件好" (good sanitation)). Our user survey based on 7,824 valid questionnaires shows that 80.08% of customers agree that the aspect-based sentiment text-boxes are helpful for their decision-making on restaurant choices. Besides, merchants can keep track of their cuisines and service quality with the help of the aspect-based sentiment text-boxes.

(* Equal contribution. † Corresponding author.)
Most Chinese e-commerce platforms such as Taobao 1, Dianping 2, and Koubei 3 deploy similar user interfaces to improve user experience.
Users also publish overall 5-star scale ratings together with their reviews. Figure 1 displays a sample 5-star rating for the coffee shop. In comparison to fine-grained aspect sentiment, the overall review rating is usually a coarse-grained synthesis of the opinions on multiple aspects. Rating prediction (RP) (Jin et al., 2016; Wu et al., 2019a), which aims to predict the "seeing stars" of reviews, also has wide applications. For example, to keep the aspect-based sentiment text-boxes accurate, unreliable reviews should be removed before ACSA algorithms are applied. Given a user review, we can predict a rating for it based on the overall sentiment polarity underlying the text. As long as the review is reliable, its predicted rating should be consistent with its ground-truth rating; if the predicted rating and the user rating disagree explicitly, the reliability of the review is doubtful. Figure 2 demonstrates an example of a low-reliability review. In summary, RP can help merchants detect unreliable reviews. Therefore, both ACSA and RP are of great importance for business intelligence in e-commerce, and the two tasks are highly correlated and complementary: ACSA predicts the fine-grained sentiment polarities on different aspect categories, while RP predicts the user's overall feeling from the review content.
As far as we know, existing public datasets are constructed for ACSA and RP separately, which limits further joint exploration of the two tasks. To address this problem and advance related research, this paper presents a large-scale Chinese restaurant review dataset for Aspect category Sentiment Analysis and rating Prediction, denoted as ASAP for short. All the reviews in ASAP are collected from the aforementioned e-commerce platform. There are 46,730 restaurant reviews, each attached with a 5-star scale rating. Each review is manually annotated according to its sentiment polarities towards 18 fine-grained aspect categories. To the best of our knowledge, ASAP is the largest Chinese review dataset targeting both the ACSA and RP tasks.
We implement several state-of-the-art (SOTA) baselines for ACSA and RP and evaluate their performance on ASAP. To make a fair comparison, we also perform ACSA experiments on the widely used SemEval-2014 restaurant review dataset (Pontiki et al., 2014). Since BERT (Devlin et al., 2018) has achieved great success in several natural language understanding tasks including sentiment analysis (Sun et al., 2019; Jiang et al., 2019), we propose a joint model that employs the fine-to-coarse semantic capability of BERT. Our joint model outperforms the competing baselines on both tasks.

Figure 2: A content-rating disagreement case. The review holds a 2-star rating while all the mentioned aspects are super positive.
Our main contributions can be summarized as follows.
(1) We present a large-scale Chinese review dataset towards aspect category sentiment analysis and rating prediction, named ASAP, including as many as 46,730 real-world restaurant reviews annotated with 18 pre-defined aspect categories. Our dataset has been released at https://github.com/Meituan-Dianping/asap. (2) We explore the performance of widely used models for ACSA and RP on ASAP. (3) We propose a joint learning model for the ACSA and RP tasks. Our model achieves the best results on both the ASAP and SemEval RESTAURANT datasets.

Related Work and Datasets
Aspect Category Sentiment Analysis. ACSA (Zhou et al., 2015; Movahedi et al., 2019; Ruder et al., 2016; Hu et al., 2018) aims to predict the sentiment polarities of all aspect categories mentioned in a text. The series of SemEval datasets, consisting of user reviews from e-commerce websites, have been widely used and have pushed forward related research (Ma et al., 2017; Sun et al., 2019; Jiang et al., 2019). The SemEval-2014 task-4 dataset (SE-ABSA14) (Pontiki et al., 2014) is composed of laptop and restaurant reviews. The restaurant subset includes 5 aspect categories (i.e., Food, Service, Price, Ambience and Anecdotes/Miscellaneous) and 4 polarity labels (i.e., Positive, Negative, Conflict and Neutral); the laptop subset is not suitable for ACSA. The SemEval-2015 task-12 dataset (SE-ABSA15) builds upon SE-ABSA14 and defines its aspect category as a combination of an entity type and an attribute type (e.g., Food#Style Options). The SemEval-2016 task-5 dataset (SE-ABSA16) (Pontiki et al., 2016) extends SE-ABSA15 to new domains and new languages other than English. MAMS (Jiang et al., 2019) tailors SE-ABSA14 to make it more challenging: each sentence contains at least two aspects with different sentiment polarities.
Compared with the prosperity of English resources, high-quality Chinese datasets are not as rich. "ChnSentiCorp" (Tan and Zhang, 2008), "IT168TEST" (Zagibalov and Carroll, 2008), "Weibo" 4 and "CTB" (Li et al., 2014) are 4 popular Chinese datasets for general sentiment analysis; however, aspect category information is not annotated in these datasets. Zhao et al. (2014) present two Chinese ABSA datasets for consumer electronics (mobile phones and cameras). Nevertheless, these two datasets contain only 400 documents (∼4,000 sentences), in which each sentence mentions at most one aspect category. The BDCI 5 automobile opinion mining and sentiment analysis dataset (Dai et al., 2019) contains 8,290 user reviews from the automobile industry with 10 pre-defined categories. Peng et al. (2017) summarize available Chinese ABSA datasets; however, most of them are constructed through rule-based or machine-learning-based approaches, which inevitably introduce additional noise. ASAP excels over the above Chinese datasets in both quantity and quality.

Rating Prediction. Rating prediction (RP) aims to predict the "seeing stars" of reviews, i.e., their overall ratings. In comparison to fine-grained aspect sentiment, the overall review rating is usually a coarse-grained synthesis of the opinions on multiple aspects. Ganu et al. (2009); Li et al. (2011); Chen et al. (2018) formulate this task as a text classification or regression problem. Considering the importance of opinions on multiple aspects in reviews, recent years have seen numerous works (Jin et al., 2016; Wu et al., 2019a) utilizing aspect information to improve rating prediction performance. This trend also inspires the motivation of ASAP.
Most RP datasets are crawled from real-world review websites and created specifically for RP. The Amazon Product Review dataset (McAuley and Leskovec, 2013), containing product reviews and metadata from Amazon, has been widely used for RP. Another popular English dataset comes from the Yelp Dataset Challenge 2017 6, which includes reviews of local businesses in 12 metropolitan areas across 4 countries. Openrice 7 is a Chinese RP dataset composed of 168,142 reviews. Neither the English nor the Chinese datasets annotate fine-grained aspect category sentiment polarities.

Data Construction & Curation
We collect reviews from one of the most popular O2O e-commerce platforms in China, which allows users to publish coarse-grained star ratings and write fine-grained reviews of the restaurants (or places of interest) they have visited. In their reviews, users comment on multiple aspects either explicitly or implicitly, including ambience, price, food, service, and so on.
First, we randomly retrieve a large volume of user reviews from popular restaurants holding more than 50 user reviews each. Then, four pre-processing steps are performed to ensure the ethics, quality, and reliability of the reviews. (1) User information (e.g., user IDs, usernames, avatars, and post times) is removed due to privacy considerations. (2) Short reviews with fewer than 50 Chinese characters, as well as lengthy reviews with more than 1,000 Chinese characters, are filtered out. (3) If the ratio of non-Chinese characters within a review exceeds 70%, the review is discarded. (4) To detect low-quality reviews (e.g., advertising texts), we build a BERT-based classifier with an accuracy of 97% on a held-out test set. Reviews detected as low-quality by the classifier are also discarded.
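Filtering rules (2) and (3) above are simple character-level checks. The following is a minimal sketch of how they could be implemented; the thresholds come from the text, while the function names and the CJK range used to identify Chinese characters are our own assumptions, not part of the released ASAP tooling.

```python
import re

# Hypothetical re-implementation of filtering rules (2) and (3).
CJK = re.compile(r'[\u4e00-\u9fff]')  # CJK Unified Ideographs

def count_chinese(text: str) -> int:
    """Number of Chinese characters in the review."""
    return len(CJK.findall(text))

def keep_review(text: str) -> bool:
    """Apply the length and character-ratio filters to a single review."""
    n_zh = count_chinese(text)
    # (2) Drop reviews shorter than 50 or longer than 1000 Chinese characters.
    if n_zh < 50 or n_zh > 1000:
        return False
    # (3) Drop reviews whose non-Chinese character ratio exceeds 70%.
    if len(text) > 0 and (len(text) - n_zh) / len(text) > 0.7:
        return False
    return True
```

Rule (1) (stripping user metadata) and rule (4) (the BERT-based quality classifier) operate on structured records and model predictions, so they are omitted from this character-level sketch.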

Aspect Categories
Since the reviews already hold users' star ratings, this section mainly introduces our annotation details for ACSA. In the SE-ABSA14 restaurant dataset (denoted RESTAURANT for simplicity), there are 5 coarse-grained aspect categories: food, service, price, ambience and miscellaneous.
After an in-depth analysis of the collected reviews, we find the aspect categories mentioned by users are rather diverse and fine-grained. Take the text "...The restaurant holds a high-end decoration but is quite noisy since a wedding ceremony was being held in the main hall... (...环境看起来很高大上的样子，但是因为主厅在举办婚礼非常混乱，感觉特别吵...)" in Table 3 for example: the reviewer actually expresses opposite sentiment polarities on two fine-grained aspect categories related to ambience. The restaurant's decoration is very high-end (Positive), while it is very noisy due to an ongoing ceremony (Negative). Therefore, we summarize the frequently mentioned aspects and refine the 5 coarse-grained categories into 18 fine-grained categories. We replace miscellaneous with location since users often review a restaurant's location (e.g., whether it is easy to reach by public transportation). We denote each aspect category in the form "Coarse-grained Category#Fine-grained Category", such as "Food#Taste" and "Ambience#Decoration". The full list of aspect categories and their definitions is given in Table 1.

Annotation Guidelines & Process
Bearing in mind the 18 pre-defined aspect categories, assessors are asked to annotate the sentiment polarities towards the aspect categories mentioned in each review. Given a review, when an aspect category is mentioned either explicitly or implicitly, the sentiment polarity over that aspect category is labeled as 1 (Positive), 0 (Neutral), or −1 (Negative), as shown in Table 3.
We hire 20 vendor assessors, 2 project managers, and 1 expert reviewer to perform the annotation. Each assessor attends a training session to ensure an intact understanding of the annotation guidelines. Three rounds of annotation are conducted sequentially. First, we randomly split the whole dataset into 10 groups, and every group is assigned to 2 assessors who annotate it independently. Second, each group is split into 2 subsets according to the annotation results, denoted Sub-Agree and Sub-Disagree: Sub-Agree comprises the examples on which the 2 assessors agree, and Sub-Disagree comprises the examples on which they disagree. Sub-Agree is reviewed by assessors from other groups; examples that prove controversial during this review are treated as difficult cases. Sub-Disagree is reviewed by the 2 project managers independently, who then discuss to reach an agreed annotation; examples that cannot be resolved after discussion are also treated as difficult cases. Third, for each group, the difficult cases from both subsets are delivered to the expert reviewer for a final decision. Examples of difficult cases and the corresponding annotation guidelines are shown in Table 2.
Finally, the ASAP corpus consists of 46,730 real-world user reviews, which we randomly split into a training set (36,850), a validation set (4,940) and a test set (4,940). Table 3 presents an example review of ASAP with corresponding annotations on the 18 aspect categories. Figure 3 presents the distribution of the 18 aspect categories in ASAP. Because ASAP concentrates on the restaurant domain, 94.7% of reviews mention Food#Taste, as expected. Users also pay great attention to aspect categories such as Service#Hospitality, Price#Level and Ambience#Decoration. This distribution illustrates an advantage of ASAP: users' fine-grained preferences can reflect the pros and cons of restaurants more precisely.
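The random train/validation/test split described above is straightforward to reproduce. Below is a minimal sketch; the seed and the helper function are our own (the ASAP release ships pre-split files, so this is purely illustrative).

```python
import random

def split_dataset(reviews, sizes=(36850, 4940, 4940), seed=42):
    """Randomly split reviews into train/validation/test sets.

    `sizes` matches the ASAP split reported in the paper; the seed is
    a hypothetical choice for reproducibility.
    """
    assert sum(sizes) == len(reviews)
    shuffled = reviews[:]
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:sizes[0]]
    valid = shuffled[sizes[0]:sizes[0] + sizes[1]]
    test = shuffled[sizes[0] + sizes[1]:]
    return train, valid, test
```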

Dataset Analysis
The statistics of ASAP are presented in Table 4. We also include a tailored SE-ABSA14 RESTAURANT dataset for reference. Please note that we remove reviews holding aspect categories with the sentiment polarity "conflict" from the original RESTAURANT dataset.
Compared with RESTAURANT, ASAP excels in the quantity of training instances, which supports the exploration of recent data-intensive deep neural models. ASAP is a review-level dataset, while RESTAURANT is a sentence-level dataset. The average length of reviews in ASAP is much longer, so the reviews tend to contain richer aspect information: reviews in ASAP contain 5.8 aspect categories on average, 4.7 times as many as in RESTAURANT. Both review-level ACSA and RP are more challenging than their sentence-level counterparts. Take the review in Table 3 for example: it contains several sentiment polarities towards multiple aspect categories. In addition to aspect category sentiment annotations, ASAP also includes overall user ratings for reviews. With the help of ASAP, ACSA and RP can be further optimized either separately or jointly.

Problem Formulation
We use D to denote the collection of user reviews in the training data. Given a review R consisting of a sequence of words {w_1, w_2, ..., w_Z}, where Z denotes the length of R, ACSA aims to predict the sentiment polarity y_i ∈ {Positive, Neutral, Negative} of R with respect to each mentioned aspect category a_i, i ∈ {1, 2, ..., N}. N is the number of pre-defined aspect categories (i.e., 18 in this paper). Suppose K aspect categories are mentioned in R. We define a mask vector [p_1, p_2, ..., p_N] to indicate the occurrence of aspect categories: when the aspect category a_i is mentioned in R, p_i = 1, otherwise p_i = 0. In terms of RP, the task is to predict the 5-star rating score g, which represents the overall rating of the given review R.
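As a concrete illustration of the mask vector, the sketch below builds [p_1, ..., p_N] and the label list from a review's annotations. The annotation format and the abbreviated category list are hypothetical, not the released ASAP schema.

```python
ASPECTS = [  # hypothetical, abbreviated subset of the 18 categories
    "Location#Transportation", "Food#Taste", "Service#Hospitality",
    "Price#Level", "Ambience#Decoration", "Ambience#Noise",
]

def build_mask_and_labels(annotations: dict):
    """annotations maps each mentioned category to its polarity in {1, 0, -1}.

    Returns the mask vector [p_1..p_N] and the label list; unmentioned
    categories get a placeholder label that the mask (p_i = 0) filters out.
    """
    mask = [1 if a in annotations else 0 for a in ASPECTS]
    labels = [annotations.get(a, 0) for a in ASPECTS]  # 0 is a placeholder when masked
    return mask, labels
```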

Joint Model
Given a user review, ACSA focuses on predicting its underlying sentiment polarities on different aspect categories, while RP focuses on predicting the user's overall feelings from the review content. We reckon these two tasks are highly correlated and better performance could be achieved by considering them jointly.
The advent of BERT has established the success of the "pre-training and then fine-tuning" paradigm for NLP tasks, and BERT-based models have achieved impressive results on ACSA (Sun et al., 2019; Jiang et al., 2019). Review rating prediction can be deemed a single-sentence classification (regression) task, which can also be addressed with BERT. Therefore, we propose a joint learning model that addresses ACSA and RP in a multi-task learning manner. Our joint model employs the fine-to-coarse semantic representation capability of the BERT encoder. Figure 4 illustrates the framework of our joint model.

ACSA As shown in Figure 4, the token embeddings of the input review are generated by a shared BERT encoder. Briefly, let H ∈ R^{d×Z} be the matrix consisting of the token embedding vectors {h_1, ..., h_Z} that BERT produces, where d is the size of the hidden layers and Z is the length of the given review. Since different aspect category information is dispersed across the content of R, we add an attention-pooling layer to aggregate the related token embeddings dynamically for every aspect category; it helps the model focus on the tokens most related to the target aspect category:

α_i = softmax(w_i^⊤ H),    r_i = H α_i^⊤,

where w_i ∈ R^d is trainable and r_i ∈ R^d. Here α_i is a vector consisting of the attention weights of all tokens, which selectively attends to the regions of the aspect-category-related tokens, and r_i is the attentive representation of the review with respect to the i-th aspect category a_i, i ∈ {1, 2, ..., N}. Then we have

ŷ_i = softmax(W_i^q r_i + b_i^q),

where W_i^q ∈ R^{C×d} and b_i^q ∈ R^C are trainable parameters of the softmax layer, and C is the number of labels (i.e., 3 in our task). Hence, the ACSA loss for a given review R is defined as

loss_ACSA = −∑_{i=1}^{N} p_i ∑_{c=1}^{C} y_{i,c} log ŷ_{i,c}.

If the aspect category a_i is not mentioned in R, y_i is set to a random value. The mask p_i serves as a gate function which filters out these random y_i and ensures that only the mentioned aspect categories participate in the calculation of the loss.

Table 2: Difficult cases encountered during annotation and the corresponding guidelines.
- Sentiment drifting: "I used to like the food of this restaurant, but the taste is not as expected today." — When sentiment drifts over time within a review, the most recent sentiment polarity is adopted.
- Implicit sentiment polarity: "The restaurant was far worse than the dining hall of any five-star hotel, considering that a meal for two people only costs 500 CNY in a five-star hotel." — Some reviewers express their polarities implicitly instead of stating their feelings directly; the implicit sentiment polarity is adopted.
- Conflicting opinions: "这道菜有点咸，但是味道很赞。" (This dish was a bit salty, but it tasted great.) — When multiple sentiment polarities exist toward the same aspect category, the dominant sentiment is chosen.
- Mild sentiment: "饭菜还可以，不过也算不上特别好吃。" (The food was okay, but nothing great.) — The Neutral label applies to mildly positive or mildly negative sentiment. Annotation: (Food#Taste, 0).
- Mentions of other restaurants: "上次去的一家店很难吃，今天来了这家新的，感觉很好吃。" (The food at the shop I visited last time was very bad; today I came to this new one and it tasted very good.) — When a review mentions restaurants visited in the past, we only annotate the restaurant being reviewed. Annotation: (Food#Taste, 1).

Table 3 (example review, translated): "With convenient traffic, the restaurant holds a high-end decoration but is quite noisy, because a wedding ceremony was being held in the main hall. Though impressed by its delicate decoration and grand appearance, we had to wait for a while at the weekend. However, considering its high price level, the taste is below expectation. We ordered the Kung Pao Prawn; the taste was acceptable and the serving size generous, but the shrimp was not fresh. In terms of service, you cannot expect too much due to the massive number of customers there. By the way, the complimentary fruit cup was nice. Generally speaking, it is a typical wedding banquet restaurant rather than a comfortable place to meet with friends."

Figure 4: The framework of the proposed joint learning model. The right part of the dotted vertical line predicts multiple aspect category sentiment polarities, while the left part predicts the review rating.
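The attention-pooling step (alpha_i = softmax(w_i^T H), r_i = H alpha_i^T) can be sketched numerically as follows. This is a pure-Python toy illustration: in the actual model H comes from BERT and w_i is learned, whereas here both are small hand-written matrices.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(H, w):
    """alpha = softmax(w^T H); r = H alpha^T.

    H is a d x Z matrix (d rows of length Z), w a length-d vector.
    Returns the attention weights over tokens and the attentive
    review representation r (length d).
    """
    d, Z = len(H), len(H[0])
    scores = [sum(w[k] * H[k][t] for k in range(d)) for t in range(Z)]  # w^T H
    alpha = softmax(scores)                                             # weights over tokens
    r = [sum(alpha[t] * H[k][t] for t in range(Z)) for k in range(d)]   # H alpha^T
    return alpha, r
```

With a large score on the first token, alpha concentrates there and r approaches that token's embedding column, which is exactly the "focus on the most related tokens" behavior described above.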

Rating Prediction
Since the objective of RP is to predict the review rating based on the review content, we adopt the [CLS] embedding h_[cls] ∈ R^d that BERT produces as the representation of the input review, where d is the size of the hidden layers in the BERT encoder.
Hence, the RP loss for a given review R is defined as the squared error between the predicted rating ĝ, obtained by a linear projection of h_[cls], and the ground-truth rating g:

loss_RP = (ĝ − g)^2.

The final loss of our joint model is:
loss = loss_ACSA + loss_RP
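Under the assumptions above (masked cross-entropy for ACSA, squared error for RP), the combined objective for one review can be sketched as follows. The function signature is our own; in the real model the inputs come from the BERT-based heads.

```python
import math

def joint_loss(probs, labels, mask, pred_rating, gold_rating):
    """loss = loss_ACSA + loss_RP for one review.

    probs:  per-aspect predicted distributions over C = 3 labels (softmax outputs)
    labels: per-aspect gold label index (arbitrary when mask[i] == 0)
    mask:   p_i in {0, 1}, gating the unmentioned aspect categories
    """
    # Masked cross-entropy: only mentioned aspects (p_i = 1) contribute.
    loss_acsa = -sum(
        p * math.log(dist[y])
        for p, dist, y in zip(mask, probs, labels)
    )
    # Squared error between predicted and gold star rating.
    loss_rp = (pred_rating - gold_rating) ** 2
    return loss_acsa + loss_rp
```

Because softmax outputs are strictly positive, the log term is well defined even for the masked (p_i = 0) aspects whose labels are placeholders.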

Experiments
We perform an extensive set of experiments to evaluate the performance of our joint model on ASAP and RESTAURANT (Pontiki et al., 2014). Ablation studies are also conducted to probe the interactive influence between ACSA and RP.

ACSA
Baseline Models We implement several ACSA baselines for comparison. According to the structures of their encoders, these models are classified into non-BERT-based and BERT-based models. Non-BERT-based models include TextCNN (Kim, 2014), BiLSTM+Attn (Zhou et al., 2016), ATAE-LSTM and CapsNet (Sabour et al., 2017). BERT-based models include vanilla BERT (Devlin et al., 2018), QA-BERT (Sun et al., 2019) and CapsNet-BERT (Jiang et al., 2019).

Implementation Details For non-BERT-based models, we initialize the inputs with pre-trained embeddings. For the Chinese ASAP, we utilize Jieba 8 to segment Chinese texts and adopt the Tencent Chinese word embeddings (Song et al., 2018) composed of 8,000,000 words. For the English RESTAURANT, we adopt 300-dimensional word embeddings pre-trained by GloVe (Pennington et al., 2014). For BERT-based models, we adopt the 12-layer Google BERT-Base 9 to encode the inputs.
The batch sizes are set to 32 and 16 for non-BERT-based and BERT-based models, respectively. The Adam optimizer (Kingma and Ba, 2014) is employed with β1 = 0.9 and β2 = 0.999. The maximum sequence length is set to 512 and the number of epochs to 3. The learning rates are set to 0.001 and 0.00005 for non-BERT-based and BERT-based models, respectively. All models are trained on a single NVIDIA Tesla V100 32GB GPU.

Evaluation Metrics Following the settings of RESTAURANT, we adopt Macro-F1 and Accuracy (Acc) as evaluation metrics.

8 https://github.com/fxsjy/jieba
9 https://github.com/google-research/bert
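For reference, Macro-F1 averages the per-class F1 scores with equal weight, so minority polarity classes count as much as the majority class. A minimal sketch of both metrics (equivalent in spirit to scikit-learn's `f1_score(average='macro')`, which the paper may or may not have used):

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Macro-averaged F1: per-class F1, then the unweighted mean over classes."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but gold was g
            fn[g] += 1  # missed gold class g
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def accuracy(gold, pred):
    """Fraction of predictions matching the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```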

Experimental Results & Analysis
We report the performance of the aforementioned models on ASAP and RESTAURANT in Table 5. Generally, BERT-based models outperform non-BERT-based models on both datasets. The two variants of our joint model perform better than vanilla BERT, QA-BERT and CapsNet-BERT, which demonstrates the advantages of joint learning. Given a user review, vanilla BERT, QA-BERT and CapsNet-BERT treat the pre-defined aspect categories independently, while our joint model combines them in a multi-task learning framework. On one hand, the encoder-sharing setting enables knowledge transfer among different aspect categories. On the other hand, our joint model is more efficient than its competitors, especially when the number of aspect categories is large. The ablation without RP (i.e., joint model (w/o RP)) still outperforms all other baselines; the introduction of RP to ACSA brings only marginal improvement. This is reasonable considering that the essential objective of RP is to estimate the overall sentiment polarity rather than fine-grained sentiment polarities.
We visualize the attention weights produced by our joint model on the example of Table 3 in Figure 5. Figure 5 shows the attention weights for 3 of the aspect categories; the intensity of the color represents the magnitude of the attention weight, i.e., the relatedness of a token to the given aspect category. Clearly, our joint model focuses on the tokens most related to each aspect category across the review.

Figure 5: Attention visualization example. We show attention weights for 3 aspect categories for brevity. The red text span "..With convenient traffic.. (..交通还挺方便的..)" is related to Location#Transportation. The blue text span "..the restaurant holds a high-end decoration..Impressed by its delicate decoration and grand appearance though.. (..环境看起来高大上的样子..装修还不错，很精致的装修..)" is related to Ambience#Decoration. The green text span "..but quite noisy.. (..特别吵感觉..)" is related to Ambience#Noise. The intensity of the color represents the magnitude of the attention weight.

Rating Prediction
We compare several RP models on ASAP, including TextCNN (Kim, 2014), BiLSTM+Attn (Zhou et al., 2016) and ARP (Wu et al., 2019b). The data pre-processing and implementation details are identical to the ACSA experiments.

Evaluation Metrics We adopt Mean Absolute Error (MAE) and Accuracy (computed by mapping the predicted rating score to the nearest category) as evaluation metrics.

Experimental Results & Analysis The experimental results of the compared RP models are shown in Table 6.

Table 6: RP results on ASAP.
  TextCNN (Kim, 2014)              MAE .5814   Acc 52.99%
  BiLSTM+Attn (Zhou et al., 2016)  MAE .5737   Acc 54.38%
  ARP (Wu et al., 2019b)           MAE .5620   Acc 54.76%
  Joint Model (w/o ACSA)           MAE .4421   Acc 60.08%
  Joint Model                      MAE .4266   Acc 61.26%

Our joint model, which combines ACSA and RP, outperforms the other models considerably. On one hand, the performance improvement is expected since our joint model is built upon BERT. On the other hand, the ablation of ACSA (i.e., joint model (w/o ACSA)) degrades RP performance on both metrics. We can conclude that the fine-grained aspect category sentiment prediction of a review indeed helps the model predict its overall rating more accurately.
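The two RP metrics are easy to state precisely. A minimal sketch (the rounding of the predicted score to the nearest star, clamped to the 1-5 range, is our reading of "mapping the predicted rating score to the nearest category"):

```python
def mae(preds, golds):
    """Mean Absolute Error between predicted scores and gold star ratings."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)

def nearest_category_accuracy(preds, golds):
    """Round each predicted score to the nearest star in 1..5, then compare."""
    mapped = [min(5, max(1, round(p))) for p in preds]
    return sum(m == g for m, g in zip(mapped, golds)) / len(golds)
```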
This section conducted preliminary experiments evaluating classical ACSA and RP models on the proposed ASAP dataset. We believe there is still much room for improvement on both tasks, which we leave for future work.

Conclusion
This paper presents ASAP, a large-scale Chinese restaurant review dataset for aspect category sentiment analysis (ACSA) and rating prediction (RP). ASAP consists of 46,730 restaurant user reviews with star ratings from a leading e-commerce platform in China. Each review is manually annotated according to its sentiment polarities on 18 fine-grained aspect categories. Besides evaluating ACSA and RP models on ASAP separately, we also propose a joint model to address ACSA and RP synthetically, which outperforms other state-of-the-art baselines considerably. We hope the release of ASAP will push forward related research and applications.